vibe.review.parsing.ir

Intermediate Representation (IR) for the parsing pipeline.

Provides traceability through:

  • Span model: every node carries source location references
  • Provenance log: transformation events with rule, inputs, outputs, confidence
  • Stable IDs: deterministic hashing with content + position + ancestor context

Design based on designdocs/parsing-architecture.md section 2.

BBox

Bounding box in points (72 points = 1 inch).

width

width: float

Width in points.

height

height: float

Height in points.

center

center: tuple[float, float]

Center point (x, y).

overlaps

overlaps(other: BBox) -> bool

Check if this bbox overlaps with another.

contains

contains(other: BBox) -> bool

Check if this bbox fully contains another.

merge

merge(other: BBox) -> BBox

Return bbox that encompasses both.

to_dict

to_dict() -> dict[str, float]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, float]) -> BBox

Create from dictionary.
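
The geometry behind the members above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the field names x0, y0, x1, y1, the dictionary keys, and the choice to treat boxes that only touch at an edge as non-overlapping are all assumptions, since this reference does not list BBox's attributes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Illustrative bounding box in points; field names are assumed."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def overlaps(self, other: BBox) -> bool:
        # Boxes overlap unless one lies entirely to one side of the other.
        # Assumption: boxes sharing only an edge do not count as overlapping.
        return not (
            self.x1 <= other.x0 or other.x1 <= self.x0
            or self.y1 <= other.y0 or other.y1 <= self.y0
        )

    def contains(self, other: BBox) -> bool:
        # True when every edge of `other` lies inside or on this box.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)

    def merge(self, other: BBox) -> BBox:
        # Smallest box that encloses both inputs.
        return BBox(min(self.x0, other.x0), min(self.y0, other.y0),
                    max(self.x1, other.x1), max(self.y1, other.y1))

    def to_dict(self) -> dict[str, float]:
        return {"x0": self.x0, "y0": self.y0, "x1": self.x1, "y1": self.y1}

    @classmethod
    def from_dict(cls, d: dict[str, float]) -> BBox:
        return cls(d["x0"], d["y0"], d["x1"], d["y1"])


a = BBox(72.0, 72.0, 216.0, 144.0)    # 2 in wide, 1 in tall
b = BBox(144.0, 100.0, 288.0, 200.0)
merged = a.merge(b)                   # BBox(72.0, 72.0, 288.0, 200.0)
```

Under these assumptions, merged.contains(a) and merged.contains(b) both hold, and BBox.from_dict(a.to_dict()) round-trips to an equal box.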

Span

Source location reference for traceability.

Every IR node carries one or more spans indicating where in the source document its content originated.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, Any]) -> Span

Create from dictionary.

ProvenanceEvent

Transformation audit trail entry.

Records which rule fired, what inputs it consumed, what outputs it produced, and the confidence level. Used for debugging and understanding why the parser made specific decisions.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, Any]) -> ProvenanceEvent

Create from dictionary.

IRNode

Base class for all IR nodes in the parsing pipeline.

Every node has:

  • A stable, deterministic ID for diffing and citation
  • Source spans for traceability back to the original document
  • Provenance log of transformations that created/modified it
  • Arbitrary metadata for layer-specific information

add_provenance

add_provenance(event: ProvenanceEvent) -> None

Add a provenance event to this node's history.

add_span

add_span(span: Span) -> None

Add a source span to this node.

get_pages

get_pages() -> set[int]

Get all page numbers this node spans.

get_merged_bbox

get_merged_bbox() -> BBox | None

Get bounding box that encompasses all spans on the same page.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, Any]) -> IRNode

Create from dictionary.
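
How add_span, get_pages, and get_merged_bbox might fit together is sketched below. Everything here is hypothetical: the stand-in classes SketchSpan and SketchNode, the page and bbox fields, and the plain-tuple box are assumptions made for illustration. In particular, returning None when the spans fall on more than one page is only one possible reading of "all spans on the same page"; the real method may behave differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# (x0, y0, x1, y1) in points; a simplified stand-in for the BBox class.
Box = tuple[float, float, float, float]


@dataclass
class SketchSpan:
    page: int   # assumed field: 1-based page number
    bbox: Box   # assumed field: location on that page


@dataclass
class SketchNode:
    spans: list[SketchSpan] = field(default_factory=list)

    def add_span(self, span: SketchSpan) -> None:
        self.spans.append(span)

    def get_pages(self) -> set[int]:
        # Distinct pages touched by any span of this node.
        return {s.page for s in self.spans}

    def get_merged_bbox(self) -> Box | None:
        # Assumed reading: a merged box only makes sense on a single page,
        # so return None for zero spans or for spans on several pages.
        if len(self.get_pages()) != 1:
            return None
        boxes = [s.bbox for s in self.spans]
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))


node = SketchNode()
node.add_span(SketchSpan(page=3, bbox=(72.0, 100.0, 300.0, 120.0)))
node.add_span(SketchSpan(page=3, bbox=(72.0, 120.0, 280.0, 140.0)))
single_page_box = node.get_merged_bbox()   # (72.0, 100.0, 300.0, 140.0)
```

Adding a span from page 4 to the same node would make get_pages() return {3, 4} and, under the assumption above, get_merged_bbox() return None.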

generate_stable_id

generate_stable_id(content: str, page: int, position_hint: tuple[float, float] | None = None, ancestor_ids: list[str] | None = None, node_type: str = 'node') -> str

Generate a deterministic ID for stable diffing and citation.

The ID is based on:

  • Content hash (primary disambiguation)
  • Position hint (page, x0, y0) for same-content disambiguation
  • Ancestor context (last 3 ancestors) for hierarchical uniqueness

Format: {type_prefix}-{content_hash}-{position_hash}

Example: "par-a1b2c3d4-e5f6"

Parameters:
  • content (str) –

    Text content of the node

  • page (int) –

    Page number (1-based)

  • position_hint (tuple[float, float] | None, default: None ) –

    (x0, y0) position in points, or None

  • ancestor_ids (list[str] | None, default: None ) –

    List of ancestor node IDs (parent, grandparent, etc.)

  • node_type (str, default: 'node' ) –

    Type prefix for the ID (e.g., "heading", "paragraph")

Returns:
  • str

    Stable deterministic ID string
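
One way to build an ID of this shape is sketched below. It is an illustration under stated assumptions, not the function's actual code: the name sketch_stable_id, the use of SHA-256, the 8- and 4-digit truncations (chosen to match the example above), rounding coordinates to 0.1 pt, taking the last three entries of ancestor_ids, and deriving the prefix from the first three letters of node_type are all guesses.

```python
from __future__ import annotations

import hashlib


def sketch_stable_id(
    content: str,
    page: int,
    position_hint: tuple[float, float] | None = None,
    ancestor_ids: list[str] | None = None,
    node_type: str = "node",
) -> str:
    """Hypothetical stand-in for generate_stable_id; details are assumed."""
    # Content part: first 8 hex digits of SHA-256 (algorithm and length assumed).
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:8]

    # Position part: page, optional (x0, y0), and up to three ancestor IDs.
    parts = [str(page)]
    if position_hint is not None:
        x0, y0 = position_hint
        parts += [f"{x0:.1f}", f"{y0:.1f}"]  # 0.1 pt rounding is assumed
    parts += (ancestor_ids or [])[-3:]       # "last 3" taken literally
    position_hash = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:4]

    # Prefix: first three letters of node_type (assumed mapping).
    return f"{node_type[:3]}-{content_hash}-{position_hash}"


first = sketch_stable_id("Hello", page=1, position_hint=(72.0, 700.0),
                         node_type="paragraph")
again = sketch_stable_id("Hello", page=1, position_hint=(72.0, 700.0),
                         node_type="paragraph")
moved = sketch_stable_id("Hello", page=2, position_hint=(72.0, 700.0),
                         node_type="paragraph")
```

Because nothing but the arguments feeds the hashes, first and again are identical across runs and machines, which is what makes the ID usable for diffing. moved shares the content part with first, and only its position part depends on the changed page.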

merge_provenance

merge_provenance(nodes: list[IRNode], rule_id: str, rule_name: str, layer: str) -> list[ProvenanceEvent]

Create provenance events for a merge operation.

When multiple nodes are merged into one, this creates the appropriate provenance tracking.

Parameters:
  • nodes (list[IRNode]) –

    Source nodes being merged

  • rule_id (str) –

    ID of the rule performing the merge

  • rule_name (str) –

    Human-readable rule name

  • layer (str) –

    Pipeline layer ("extraction", "layout", "structure", "semantic")

Returns: