vibe.review.parsing.structure.adapters

Adapters for non-PDF document formats.

Markdown, DOCX, and HTML documents enter the pipeline at the structure layer, bypassing extraction and layout. Each adapter converts its input directly to a DocumentStructure.
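As a rough illustration of the routing described above, the sketch below picks an adapter by file suffix. The suffix table and the `pick_adapter_name` helper are hypothetical and not part of this module; only the three adapter class names come from this page.

```python
from pathlib import Path
from typing import Optional

# Hypothetical suffix table; the real pipeline may detect formats differently.
ADAPTER_BY_SUFFIX = {
    ".md": "MarkdownAdapter",
    ".markdown": "MarkdownAdapter",
    ".docx": "DocxAdapter",
    ".html": "HtmlAdapter",
    ".htm": "HtmlAdapter",
}


def pick_adapter_name(path: str) -> Optional[str]:
    """Return the adapter class name for a path, or None for other formats.

    None means the document is not handled by an adapter (for example a PDF,
    which would go through extraction and layout first).
    """
    return ADAPTER_BY_SUFFIX.get(Path(path).suffix.lower())
```

A caller could branch on the result: a class name means the file can go straight to the structure layer, and `None` means it needs the full pipeline.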

MarkdownAdapter

Convert Markdown documents to DocumentStructure.

Parses Markdown syntax and creates structured blocks for:
  • Headings (ATX style: # to ######)
  • Paragraphs
  • Lists (ordered and unordered)
  • Code blocks (fenced)
  • Blockquotes
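The block types listed above can be recognized with a simple line-oriented pass. The following is a minimal sketch under that assumption, not the adapter's actual implementation: `classify_blocks` and its `(kind, text)` tuples are hypothetical stand-ins for whatever block objects DocumentStructure holds.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # fenced code block delimiter
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")  # ATX headings, # to ######
ORDERED = re.compile(r"^\d+[.)]\s+")
UNORDERED = re.compile(r"^[-*+]\s+")


def classify_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown text into (kind, text) tuples, one per block."""
    blocks: List[Tuple[str, str]] = []
    paragraph: List[str] = []    # lines of the paragraph being accumulated
    fence_lines: List[str] = []  # lines inside the current fenced block
    in_fence = False

    def flush_paragraph() -> None:
        if paragraph:
            blocks.append(("paragraph", " ".join(paragraph)))
            paragraph.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if in_fence:
                # Closing fence: emit the collected code verbatim.
                blocks.append(("code", "\n".join(fence_lines)))
                fence_lines = []
            else:
                flush_paragraph()
            in_fence = not in_fence
            continue
        if in_fence:
            fence_lines.append(line)
            continue
        if not stripped:
            flush_paragraph()  # a blank line ends a paragraph
            continue
        heading = HEADING.match(stripped)
        if heading:
            flush_paragraph()
            level = len(heading.group(1))
            blocks.append((f"heading{level}", heading.group(2)))
        elif ORDERED.match(stripped):
            flush_paragraph()
            blocks.append(("ordered_item", ORDERED.sub("", stripped)))
        elif UNORDERED.match(stripped):
            flush_paragraph()
            blocks.append(("unordered_item", UNORDERED.sub("", stripped)))
        elif stripped.startswith(">"):
            flush_paragraph()
            blocks.append(("blockquote", stripped.lstrip(">").strip()))
        else:
            paragraph.append(stripped)
    flush_paragraph()
    return blocks
```

The sketch ignores nesting, setext headings, and indented code blocks; a real parser needs to handle those cases.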

adapt

adapt(content: str, source_path: str | None = None) -> DocumentStructure

Convert Markdown content to DocumentStructure.

Parameters:
  • content (str) –

    Raw Markdown text.

  • source_path (str | None, default: None) –

    Path to source file.

Returns:
  • DocumentStructure –

    The DocumentStructure built from the Markdown content.

DocxAdapter

Convert DOCX documents to DocumentStructure.

Uses python-docx to parse Word documents and creates structured blocks based on paragraph styles and content.

Extracts Word outline numbering from heading styles to create properly numbered clause blocks that the semantic layer can detect.
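To make the numbering idea concrete, here is a small sketch of how outline labels can be derived from a sequence of heading levels. It does not use python-docx and is not the adapter's code: `outline_numbers` is a hypothetical helper, and it assumes the levels have already been read from heading styles such as "Heading 1" and "Heading 2". Skipped levels are shown as 0, which is a choice made for this sketch only.

```python
from typing import Iterable, List


def outline_numbers(levels: Iterable[int]) -> List[str]:
    """Turn heading levels (1 = top level) into labels such as '2.1.1'."""
    counters: List[int] = []  # counters[i] is the current number at level i + 1
    labels: List[str] = []
    for level in levels:
        if level < 1:
            raise ValueError("heading levels start at 1")
        # Returning to a shallower level discards the deeper counters.
        del counters[level:]
        # Entering a deeper level starts its counters at 0.
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        labels.append(".".join(str(n) for n in counters))
    return labels
```

For example, the levels `[1, 2, 2, 1]` produce the labels `1`, `1.1`, `1.2`, and `2`, which is the kind of clause prefix a later layer can match on.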

__init__

__init__() -> None

Initialize the DocxAdapter.

adapt

adapt(path: Path, source_path: str | None = None) -> DocumentStructure

Convert DOCX file to DocumentStructure.

Parameters:
  • path (Path) –

    Path to DOCX file.

  • source_path (str | None, default: None) –

    Path for metadata (defaults to path).

Returns:
  • DocumentStructure –

    The DocumentStructure built from the DOCX file.

HtmlAdapter

Convert HTML documents to DocumentStructure.

Uses BeautifulSoup to parse HTML and creates structured blocks based on HTML elements.
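The element-to-block mapping can be illustrated without third-party dependencies. The sketch below swaps in the standard library's `html.parser` for BeautifulSoup so that it runs anywhere; the `BlockCollector` class, the `html_blocks` helper, and the set of tags treated as blocks are all assumptions made for illustration, not the adapter's actual behavior.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Assumed set of block-level tags; the real adapter may handle more or fewer.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "blockquote", "pre"}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for the outermost block-level elements."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, str]] = []
        self._tag: Optional[str] = None  # block element currently open
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the outermost block tag opens a block; nested ones are ignored.
        if tag in BLOCK_TAGS and self._tag is None:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace (this also flattens <pre> content).
            text = " ".join("".join(self._text).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def html_blocks(html: str) -> List[Tuple[str, str]]:
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks
```

Inline markup such as `<b>` contributes its text to the enclosing block, so `<p>Some <b>bold</b> text.</p>` becomes a single `p` block.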

adapt

adapt(content: str, source_path: str | None = None) -> DocumentStructure

Convert HTML content to DocumentStructure.

Parameters:
  • content (str) –

    HTML string.

  • source_path (str | None, default: None) –

    Path to source file.

Returns:
  • DocumentStructure –

    The DocumentStructure built from the HTML content.