Creating a Golden Record for Parsed Documents

This document describes the methodology for creating a "golden record" - an ideal reference version of a parsed document that represents what the parsing pipeline should produce.

Golden records use a streamlined YAML format (.golden.yml) that focuses on the essential structure without internal tracking fields. The format uses nested children arrays for hierarchy rather than parent_id references.

Purpose

Golden records serve as:

  1. Test fixtures - Verify parsing pipeline correctness
  2. Training data - Examples for improving parsing rules
  3. Documentation - Show expected output format for document types
  4. Regression detection - Catch parsing degradation over time

Prerequisites

Before creating a golden record, you need:

  1. Access to the source document (PDF, DOCX, etc.)
  2. The current parsed output from the pipeline (or view via parsing visualizer)
  3. Understanding of the parsing pipeline architecture (see doc/architecture/parsing-pipeline.md)
  4. Knowledge of the document type (contract, technical spec, policy, etc.)

Step-by-Step Process

Step 1: Examine the Source Document

Read through the entire source document to understand:

  • Document structure: Title page, sections, appendices, etc.
  • Numbering schemes: How clauses/sections are numbered (1.1, 1.2 vs 1, 2, 3)
  • Visual layout: Single column, two-column, tables
  • Document type: Contract, terms & conditions, specification
  • Language: Swedish, English, etc.
  • Metadata: Date, version, parties, location

For the Avanix document, I identified:

  • Title page (page 1) with company logo and document title
  • Main agreement (pages 1-2) with clauses 1.1-1.15
  • Five appendices (Bilaga 1-5) on pages 3-16
  • Three-column layout in Bilaga 2 and 3 (identified from bbox x-coordinates clustering around 36, 213, and 390)
  • Swedish language
  • Document dated 2021-03-19, version 2021:1
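
The column detection mentioned above can be approximated by grouping the left edges of text blocks. The following is a minimal sketch, not the pipeline's actual logic: the function name, the 20-point tolerance, and the sample coordinates are assumptions made for illustration.

```python
def cluster_column_starts(x_values, tolerance=20.0):
    """Group left-edge x-coordinates into columns and return one start per column."""
    clusters = []  # each cluster is a list of nearby x values
    for x in sorted(x_values):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)   # close to the previous value: same column
        else:
            clusters.append([x])     # gap larger than tolerance: new column
    # Use the smallest x in each cluster as that column's left edge.
    return [cluster[0] for cluster in clusters]


# Hypothetical bbox left edges (x0) for the text blocks on one page.
left_edges = [36.0, 36.4, 213.1, 212.8, 390.0, 37.2, 390.5]
columns = cluster_column_starts(left_edges)
print(len(columns), columns)
```

Indented paragraphs or list items can produce extra clusters, so treat the result as a hint to confirm against the rendered page rather than as a definitive column count.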

Step 2: Analyze the Current Parsed Output

Read the existing .json file and identify issues:

Common Issues to Look For

  1. Merged content: Multiple numbered items incorrectly combined into one part

    • Example: Clause 1.9 contained text from 1.10 and 1.11
  2. Truncated content: Definitions or paragraphs cut off mid-sentence

    • Example: "Avtal" avser huvuddokumentet inklusive (incomplete)
  3. Incorrect part types: Headings marked as paragraphs, clauses marked as annexes

    • Example: part_type: "annex" for items that are just clauses
  4. Wrong parent relationships: Incorrect parent_id values

    • Example: Clause under wrong section
  5. Reading order issues: Multi-column text interleaved incorrectly

    • Common in two-column and three-column PDFs where columns are read in wrong order
  6. Missing structure: Subheadings, list items, or sections not detected

    • Example: "BAKGRUND" and "UPPDRAGETS ART OCH OMFATTNING" not captured as headings
  7. Duplicate content: Same text appearing in multiple parts

  8. Split clauses: Clause title and body split into separate section + paragraph parts

    • Example: "§ 2" as a section, then body text as separate paragraph children
    • These should be combined into a single clause with all content

Step 3: Design the Correct Structure

Create an outline of what the document structure should look like:

Document Title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
├── Section: Avtalskonstruktion och definitioner
│   ├── Clause 1.1: Parternas överenskommelse...
│   ├── Clause 1.2: Avtalsdokument / Abonnemangsavtal
│   ├── ...
│   └── Clause 1.11: Begrepp och definitioner...
├── Section: Omfattning och specifikation
│   ├── Clause 1.12: Detta Avtal omfattar...
│   └── ...
├── Annex 1: IT/IS-TJÄNSTER, SERVICENIVÅER OCH ERSÄTTNING
│   ├── Heading: IT/IS-TJÄNSTER OCH SERVICENIVÅER
│   ├── Heading: BAKGRUND
│   └── ...
├── Annex 2: ALLMÄNNA VILLKOR (2021:1)
│   ├── Section 1: DEFINITIONER M.M.
│   │   ├── Definition: "Avtal"
│   │   ├── Definition: "Avtalsdagen"
│   │   └── ...
│   ├── Section 2: OMFATTNING OCH UTFÖRANDE
│   │   ├── Clause 2.1: Ändring av IT/IS-tjänster
│   │   └── ...
│   └── ...
└── ...
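
Once a golden record exists, an outline like the one above can be regenerated from the nested parts and compared by eye with the source document. This is a small sketch that assumes parts are plain dicts as loaded from the YAML; the `outline` helper is hypothetical and not part of the project's tooling.

```python
def outline(parts, depth=0):
    """Return one indented line per part: type, number (if any), first content line."""
    lines = []
    for part in parts:
        label = part["type"]
        if part.get("number"):
            label += " " + part["number"]
        content = part.get("content") or ""
        title = content.splitlines()[0] if content else ""
        lines.append("  " * depth + f"{label}: {title}".rstrip())
        # Children are indented one level deeper than their parent.
        lines.extend(outline(part.get("children", []), depth + 1))
    return lines


parts = [
    {
        "type": "annex",
        "number": "BILAGA 2",
        "content": "ALLMÄNNA VILLKOR (2021:1)",
        "children": [
            {
                "type": "clause",
                "number": "1",
                "content": "DEFINITIONER M.M.",
                "children": [
                    {"type": "definition", "term": "Avtal",
                     "content": '"Avtal" avser huvuddokumentet.'},
                ],
            }
        ],
    }
]
print("\n".join(outline(parts)))
```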

Step 4: Create the Golden Record YAML

4.1 Document Metadata

Add comprehensive doc_metadata at the root level:

doc_metadata:
  source: pdf
  page_count: 16
  language: sv
  layout: two_column
  title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
  date: '2021-03-19'
  version: '2021:1'
parts:
  # ... parts go here

4.2 Part Structure

Each part uses a minimal set of fields:

| Field | Required | Description |
|---|---|---|
| type | Yes | One of: clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell |
| content | Yes | The text content (number prefix extracted to number field) |
| number | No | For clauses, list items, annexes: the number/marker as it appears in the source document (e.g., '1.1', '§ 2', 'a)', '(i)', 'BILAGA 1', 'Article 7'). Different source documents use different conventions - capture what the document uses. For unnumbered list items (bullets), omit this field. For annexes without an explicit number, use number: ''. |
| term | No | For definitions only: the defined term |
| children | No | Nested parts (replaces parent_id references) |

Example clause with children:

- type: clause
  number: '2'
  content: |
    OMFATTNING OCH UTFÖRANDE

    Leverantören ska utföra avtalade IT/IS-tjänsterna i enlighet med bestämmelserna
    i detta Avtal och med den skicklighet och omsorg som Kunden har anledning att
    förvänta av motsvarande leverantör i branschen.
  children:
  - type: clause
    number: '2.1'
    content: |
      Ändring av IT/IS-tjänster

      Ändringar av IT/IS-tjänsternas karaktär eller utförande får ske endast efter
      skriftlig överenskommelse mellan parterna.
  - type: clause
    number: '2.2'
    content: |
      Tilläggstjänster

      Utöver de IT/IS-tjänster som regleras i Avtalet per Avtalsdagen, kan Kunden
      även beställa tillkommande konsulttjänster och/eller projekt av Leverantören.

Example definition:

- type: definition
  term: Avtal
  content: |
    "Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor som
    benämns i huvuddokumentet.

Example list with nested items:

- type: clause
  number: '3.2'
  content: Kundens åtaganden
  children:
  - type: list_item
    number: 'a)'
    content: |
      Kunden ska lämna Leverantören tillgång och tillträde till bl.a.
      lokaler, utrustning, om det krävs för utförandet.
    children:
    - type: list_item
      number: 'i)'
      content: det krävs för Leverantörens utförande av sina åtaganden
    - type: list_item
      number: 'ii)'
      content: det inte är oförenligt med tvingande lagstiftning
  - type: list_item
    number: 'b)'
    content: Kunden ska säkerställa att Leverantören kan nyttja programvara.

4.3 Fields NOT Used in Golden Records

The streamlined YAML format deliberately excludes fields that are internal to the parsing pipeline:

  • part_id - Generated at runtime
  • parent_id - Replaced by children nesting
  • part_metadata - Fields moved to part level or dropped
  • bbox - Spatial info not needed for structure validation
  • sequence - Order determined by YAML list position
  • char_count - Computed from content
  • page_number - Not needed for structure validation
  • source - Per-part pipeline stage tracking (not to be confused with doc_metadata.source)
  • level - Implicit from nesting depth
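
When starting from pipeline output, the excluded fields can be dropped mechanically. The sketch below assumes each part is a dict that already carries a nested `children` list; the helper name is hypothetical. It only touches parts, so `doc_metadata.source` is unaffected.

```python
# Per-part fields that the golden format leaves out (from the list above).
INTERNAL_FIELDS = {
    "part_id", "parent_id", "part_metadata", "bbox", "sequence",
    "char_count", "page_number", "source", "level",
}


def strip_internal_fields(part):
    """Return a copy of the part, and of its children, without internal fields."""
    cleaned = {
        key: value
        for key, value in part.items()
        if key not in INTERNAL_FIELDS and key != "children"
    }
    children = [strip_internal_fields(child) for child in part.get("children", [])]
    if children:
        cleaned["children"] = children
    return cleaned


raw_part = {
    "part_id": "p1", "type": "clause", "number": "1.1", "content": "Text",
    "bbox": [36, 100, 200, 120], "page_number": 1,
    "children": [
        {"part_id": "p2", "parent_id": "p1", "type": "list_item",
         "content": "Item", "level": 2},
    ],
}
print(strip_internal_fields(raw_part))
```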

4.4 Type Mappings

The YAML format uses simplified type names:

| JSON/Legacy Type | YAML Type | Notes |
|---|---|---|
| section | clause | Numbered sections are clauses |
| annex_heading | annex | Appendix/bilaga titles |
| clause | clause | Unchanged |
| heading | heading | Unchanged |
| paragraph | paragraph | Unchanged |
| definition | definition | Unchanged |
| list_item | list_item | Unchanged |
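
Expressed as code, the mapping only needs entries for the two renamed types; everything else passes through. The names below are illustrative and not taken from the converter.

```python
# Only the renamed legacy types need an entry; all others are unchanged.
LEGACY_TO_YAML_TYPE = {
    "section": "clause",
    "annex_heading": "annex",
}


def to_yaml_type(legacy_type):
    """Map a JSON/legacy part type to its YAML golden-record type."""
    return LEGACY_TO_YAML_TYPE.get(legacy_type, legacy_type)


print(to_yaml_type("section"), to_yaml_type("definition"))
```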

4.5 Part Types to Use

| Part Type | When to Use |
|---|---|
| heading | Section/chapter headings, unnumbered titles (including document titles) |
| annex | Appendix/annex/bilaga titles (use number field for "BILAGA 1", etc.). The number field is required - use number: '' for annexes without an explicit identifier. |
| clause | Numbered contract clauses/sections with their body text (§ 2, 1.1, 2.3.4, Article 4) |
| paragraph | Unnumbered body text not belonging to a clause |
| definition | Defined terms with their definitions (use term field) |
| list_item | Bullet points or numbered list items (use number for markers like "a)", "i)") |
| table | Tabular data with rows and columns (requirements tables, pricing tables, etc.) |
| table_row | A row within a table (child of table) |
| table_cell | A cell within a table row (child of table_row) |

4.6 Clauses vs Sections

Important distinction: A clause is a numbered contractual provision that includes both its title/heading AND its body text as a single unit. Do NOT split a clause into separate heading + paragraph parts.

Incorrect (splitting clause into separate parts):

- type: heading
  content: § 2 Licensupplåtelsens omfattning
- type: paragraph
  content: CAB upplåter härmed till Abonnenten...
- type: paragraph
  content: Licensupplåtelsen omfattar även...

Correct (single clause with all content):

- type: clause
  number: '§ 2'
  content: |-
    Licensupplåtelsens omfattning

    CAB upplåter härmed till Abonnenten rätten att på i detta avtal angivna
    villkor nyttja den vid var tid gällande versionen av CAB Plan systemet,
    med de funktioner och tilläggssystem samt tilläggstjänster som kunden
    har beställt.

    Licensupplåtelsen omfattar även de nya utgåvor, andra uppdateringar,
    eventuellt övriga anpassningar och förändringar av CAB Plan och upplåtna
    komponenter som CAB under avtalstiden framställer och distribuerar enligt
    detta avtal.

Key points:

  • The number goes in the number field, not in content
  • The number includes any markers used in the document (e.g., '§ 2', not just '2')
  • The content includes the clause title followed by all body paragraphs, separated by blank lines
  • Use YAML literal block style (|- or |) for multi-line content
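
The key points can be illustrated with a helper that turns a split heading plus its body paragraphs into a single clause. This is only a sketch: the function name is hypothetical, and the regular expression assumes three marker shapes ('§ 2', 'Article 7', and dotted numbers such as '2.3.4'); real documents may need more.

```python
import re

# Assumed marker shapes: '§ 2', 'Article 7', or dotted numbers like '1.1' / '2.3.4'.
NUMBER_PREFIX = re.compile(
    r"^(§\s*\d+|Article\s+\d+|\d+(?:\.\d+)*\.?)\s+(.*)$", re.DOTALL
)


def merge_into_clause(heading_text, paragraphs):
    """Build one clause part from a numbered heading and its body paragraphs."""
    match = NUMBER_PREFIX.match(heading_text.strip())
    if match is None:
        raise ValueError(f"no clause number found in {heading_text!r}")
    number, title = match.groups()
    # Title first, then each body paragraph, separated by blank lines.
    content = "\n\n".join([title.strip(), *(p.strip() for p in paragraphs)])
    return {"type": "clause", "number": number, "content": content}


clause = merge_into_clause(
    "§ 2 Licensupplåtelsens omfattning",
    ["CAB upplåter härmed till Abonnenten...", "Licensupplåtelsen omfattar även..."],
)
print(clause["number"])
print(clause["content"])
```

The marker is kept exactly as written in the heading, so '§ 2' stays '§ 2' rather than being normalized to '2'.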

4.7 Nested Clauses (Sub-clauses)

Clauses can have sub-clauses. Use children arrays for nesting:

- type: clause
  number: '§ 9'
  content: |-
    Avtalstid

    Detta avtal träder i kraft den dag båda parter undertecknat avtalet
    och gäller till och med den sista december därpå följande kalenderår.
  children:
  - type: clause
    number: '9.1'
    content: |-
      Förlängning

      Om inte uppsägning sker senast en månad före avtalstidens utgång
      förlängs avtalet med ett år i sänder.
  - type: clause
    number: '9.2'
    content: |-
      Uppsägning

      Uppsägning av avtal ska ske skriftligen.

Note that:

  • Both § 9 and 9.1 are clause parts (not separate heading + paragraph)
  • A top-level clause can have substantial body content AND sub-clauses as children
  • The numbering scheme can vary: "§ 9" for top-level, "9.1" for sub-clause
  • Nesting depth implies level (no explicit level field needed)

Clauses with title only (when all content is in sub-clauses):

If a clause consists only of a title followed immediately by sub-clauses with no additional body text, the content field contains only the title:

- type: clause
  number: '§ 6'
  content: Tillgänglighet
  children:
  - type: clause
    number: '6.1'
    content: |-
      CAB garanterar att CAB Plan driftas på en server...
  - type: clause
    number: '6.2'
    content: |-
      Abonnenten är å sin sida ansvarig för att anslutning kan ske...

4.8 Article-Based Numbering

Some contracts (especially international ones) use "Article X" as the numbering scheme. The same conventions apply - Articles are clauses:

Example from Council of Europe IT Development Contract:

- type: clause
  number: Article 7
  content: Obligations of the Service Provider
  children:
  - type: clause
    number: '7.1'
    content: |-
      Provision of Services and Deliverables

      The Service Provider undertakes to provide to the Council of Europe
      all the Services and Deliverables described in the Tender file.

      It shall hand over to the Council of Europe all the Deliverables,
      in the format and on the media indicated and in compliance with
      the imperative deadlines prescribed.
  - type: clause
    number: '7.2'
    content: |-
      Obligation to provide advice, information and warnings

      The Service Provider recognises that it is subject to a general
      obligation to provide advice, and particularly to provide information
      and make recommendations, to the Council of Europe.

Note:

  • "Article 7" is the full number in the number field
  • Sub-clauses use plain numbers ("7.1", "7.2") as they appear in the source
  • The title goes at the start of content, not duplicated in number

4.9 Document Metadata Fields

Document-level metadata goes in the doc_metadata section at the root. The schema is designed to contain only fields that are unambiguous and determinable by rule-based parsing (not requiring AI interpretation).

Source Document Properties

These are mechanical properties of the source file itself:

| Field | Type | Required | Description |
|---|---|---|---|
| source | enum | Yes | Source format: pdf, html, md, or docx |
| page_count | integer | If PDF | Number of pages (only required/meaningful for PDFs) |
| language | string | Yes | ISO 639-1 code (en, sv, etc.) |
| layout | enum | Yes | single_column, two_column, or mixed |
| numbering_style | enum | No | Primary clause numbering: numeric (1, 2, 3), decimal (1.1, 1.2), alpha (a, b, c), roman (i, ii, iii), section (§ 1, § 2), article (Article 1), or mixed |
| ocr | bool or list | No | OCR indicator: true if entire document was OCR'd, or list of page numbers that were OCR'd (e.g., [1, 5, 6]) |

Extracted Document Identity

These are text strings that appear verbatim in the document in designated locations:

| Field | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Document title from heading or title page |
| date | string | No | Date string as it appears in document (preserve original format) |
| version | string | No | Version identifier if explicitly stated |
| reference | string | No | Document reference number/ID if stated |

Fields NOT to Include

The following fields require interpretation and should not be included:

  • document_type - Requires classification (contract vs. terms vs. specification)
  • parties - Requires role identification and entity extraction
  • governing_law, jurisdiction, forum - Requires understanding clause semantics
  • publisher, organization - May not be explicitly stated
  • contract_period, start_date, end_date - Requires date interpretation in context
  • Domain-specific fields (procurement_*, legal_references, etc.)

Example (PDF source):

doc_metadata:
  # Source document properties (always present)
  source: pdf
  page_count: 16
  language: sv
  layout: two_column
  numbering_style: decimal
  ocr: [3, 4, 5]  # pages 3-5 were scanned images

  # Extracted identity (present if found in document)
  title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
  date: '2021-03-19'
  version: '2021:1'
  reference: DNR-2021-001

Example (HTML source):

doc_metadata:
  source: html
  language: en
  layout: single_column
  title: Terms of Service
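
The rules for doc_metadata described in this section are simple enough to check mechanically. The sketch below assumes the metadata has already been loaded into a dict; the function and constant names are hypothetical, and the check is independent of the project's linter.

```python
ALLOWED_SOURCES = {"pdf", "html", "md", "docx"}
ALLOWED_LAYOUTS = {"single_column", "two_column", "mixed"}
# Named fields that need interpretation and should stay out of doc_metadata.
INTERPRETIVE_FIELDS = {
    "document_type", "parties", "governing_law", "jurisdiction", "forum",
    "publisher", "organization", "contract_period", "start_date", "end_date",
}


def check_doc_metadata(meta):
    """Return a list of problems found in a doc_metadata dict (empty if none)."""
    problems = []
    for field in ("source", "language", "layout", "title"):
        if not meta.get(field):
            problems.append(f"missing required field: {field}")
    if meta.get("source") and meta["source"] not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {meta['source']}")
    if meta.get("layout") and meta["layout"] not in ALLOWED_LAYOUTS:
        problems.append(f"unknown layout: {meta['layout']}")
    if meta.get("source") == "pdf" and not isinstance(meta.get("page_count"), int):
        problems.append("page_count is required for PDF sources")
    for field in sorted(INTERPRETIVE_FIELDS.intersection(meta)):
        problems.append(f"interpretive field should be omitted: {field}")
    return problems


html_meta = {"source": "html", "language": "en",
             "layout": "single_column", "title": "Terms of Service"}
print(check_doc_metadata(html_meta))
```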

Step 5: Handle Special Cases

Multi-Column Layouts

Note the layout in doc_metadata if relevant:

doc_metadata:
  layout: two_column

Reading order should follow columns left-to-right, each column top-to-bottom. The golden record captures the correct reading order as the sequence of parts in the YAML.
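
As an illustration of that ordering rule, the sketch below sorts the text blocks of a single page by column and then by vertical position. The block dicts with `x0`/`y0` keys, the function name, and the assumption that y grows downward are all hypothetical; blocks that span several columns (for example a full-width heading) are not handled.

```python
def reading_order(blocks, column_starts):
    """Sort blocks column by column (left to right), top to bottom within a column."""
    starts = sorted(column_starts)

    def column_of(x0):
        # Assign the block to the column whose left edge is nearest.
        return min(range(len(starts)), key=lambda i: abs(starts[i] - x0))

    # Assumes y0 grows downward; flip the sign of y0 for bottom-left origins.
    return sorted(blocks, key=lambda b: (column_of(b["x0"]), b["y0"]))


blocks = [
    {"text": "B1", "x0": 213.0, "y0": 50.0},
    {"text": "A2", "x0": 36.0, "y0": 120.0},
    {"text": "A1", "x0": 36.5, "y0": 50.0},
    {"text": "C1", "x0": 390.0, "y0": 50.0},
]
ordered = reading_order(blocks, column_starts=[36, 213, 390])
print([b["text"] for b in ordered])
```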

Definitions

Include the complete definition text with the term field:

- type: definition
  term: Avtal
  content: |-
    "Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor som
    benämns i huvuddokumentet.

Multiple definitions under a heading:

- type: clause
  number: '1'
  content: DEFINITIONER M.M.
  children:
  - type: paragraph
    content: Följande definierade begrepp gäller mellan Leverantören och Kunden.
  - type: definition
    term: Avtal
    content: '"Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor.'
  - type: definition
    term: Avtalsdagen
    content: '"Avtalsdagen" avser den dag Leverantören och Kunden ingår Avtal.'
  - type: definition
    term: IT/IS-tjänst
    content: '"IT/IS-tjänst" avser de tjänster som Leverantören tillhandahåller.'

Definitions with embedded lists (uncommon but valid):

Some definitions may contain enumerated components as children:

- type: definition
  term: Phase(s)
  content: |-
    "Phase(s)" means the different phases of the Project as described in the Tender file:
  children:
    - type: list_item
      content: 'Phase 1: drawing up of the Detailed Specifications'
    - type: list_item
      content: 'Phase 2: validation of the Detailed Specifications'

Numbered vs Unnumbered Lists

List items may or may not have a number field:

  • Numbered lists: Include the marker in the number field (e.g., 'a)', '(i)', '1.')
  • Unnumbered lists (bullet points): Omit the number field entirely

Example numbered list:

- type: list_item
  number: 'a)'
  content: First item with explicit marker
- type: list_item
  number: 'b)'
  content: Second item with explicit marker

Example unnumbered list (bullets):

- type: list_item
  content: First bullet point
- type: list_item
  content: Second bullet point

The number field captures what appears in the source document. Different documents use different conventions ('a)', '(a)', '(i)', '1.', etc.); use whatever the source document uses.

Nested Lists

Use children to establish list hierarchy:

- type: clause
  number: '13.5'
  content: 'CAB ansvarar ej för fel eller otillgänglighet orsakat av:'
  children:
  - type: list_item
    number: 'a)'
    content: felaktigt utnyttjande av CAB Plan;
  - type: list_item
    number: 'b)'
    content: ändringar eller ingrepp i utrustning i strid med instruktioner;
  - type: list_item
    number: 'c)'
    content: användning av eller fel i utrustning som tillhandahålles av kunden;

Deeply nested lists:

- type: clause
  number: '3.2'
  content: |-
    Kundens åtaganden

    För att Leverantören ska kunna utföra sina åtaganden enligt detta Avtal
    ska Kunden ansvara för följande:
  children:
  - type: list_item
    number: 'a)'
    content: 'Kunden ska lämna Leverantören tillgång och tillträde, om:'
    children:
    - type: list_item
      number: 'i)'
      content: det krävs för Leverantörens utförande av sina åtaganden
    - type: list_item
      number: 'ii)'
      content: det inte är oförenligt med tvingande lagstiftning
  - type: list_item
    number: 'b)'
    content: Kunden ska säkerställa att Leverantören kan nyttja programvara.

Tables

Tables use a three-level hierarchy: table → table_row → table_cell.

Table fields:

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table |
| content | string | No | Optional table description/caption |
| columns | list | No | Column header names (extracted from first row if has_header: true) |
| has_header | bool | No | Whether the first row is a header row (default: false) |
| children | list | Yes | List of table_row parts |

Table row fields:

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table_row |
| row_index | int | No | 0-based row index in table |
| is_header | bool | No | Whether this row is a header row |
| children | list | Yes | List of table_cell parts |

Table cell fields:

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table_cell |
| content | string | Yes | Cell text content |
| column_index | int | No | 0-based column index |
| column_name | string | No | Column header name (from first row or columns) |
| rowspan | int | No | Number of rows this cell spans (default: 1) |
| colspan | int | No | Number of columns this cell spans (default: 1) |
| children | list | No | Nested content (paragraph, list_item) for complex cells |

Example: Requirements table

- type: table
  columns: ['#', 'Krav', 'ISO kapitel', 'ISO kravområde']
  has_header: true
  children:
    - type: table_row
      row_index: 0
      is_header: true
      children:
        - type: table_cell
          column_index: 0
          column_name: '#'
          content: '#'
        - type: table_cell
          column_index: 1
          column_name: Krav
          content: Krav
        - type: table_cell
          column_index: 2
          column_name: ISO kapitel
          content: ISO kapitel
        - type: table_cell
          column_index: 3
          column_name: ISO kravområde
          content: ISO kravområde

    - type: table_row
      row_index: 1
      children:
        - type: table_cell
          column_index: 0
          column_name: '#'
          content: '3501'
        - type: table_cell
          column_index: 1
          column_name: Krav
          content: |
            Leverantören ska för de delar av verksamheten som berörs i
            leveransen ha ett ledningssystem för informationssäkerhet (LIS)
            som baseras på SS-EN ISO/IEC27001:2017 eller motsvarande.
        - type: table_cell
          column_index: 2
          column_name: ISO kapitel
          content: A.6.1 Intern organisation
        - type: table_cell
          column_index: 3
          column_name: ISO kravområde
          content: A.6.1.1 Informationssäkerhetsroller och ansvar
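
Writing this structure by hand is repetitive, so a small helper can generate it from rows of cell text. The following is a sketch under the assumption that the table has no merged cells; `build_table` is a hypothetical name, and rowspan/colspan would still need to be added manually.

```python
def build_table(rows, has_header=True):
    """Build a table part (table -> table_row -> table_cell) from rows of cell text."""
    columns = list(rows[0]) if has_header and rows else []
    table = {"type": "table", "has_header": has_header, "children": []}
    if columns:
        table["columns"] = columns
    for row_index, row in enumerate(rows):
        cells = []
        for column_index, text in enumerate(row):
            cell = {"type": "table_cell", "column_index": column_index, "content": text}
            if column_index < len(columns):
                cell["column_name"] = columns[column_index]
            cells.append(cell)
        row_part = {"type": "table_row", "row_index": row_index, "children": cells}
        if has_header and row_index == 0:
            row_part["is_header"] = True  # the first row doubles as the header row
        table["children"].append(row_part)
    return table


table = build_table([
    ["#", "Krav"],
    ["3501", "Example requirement text"],
])
print(table["columns"], len(table["children"]))
```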

Tables with merged cells:

Use rowspan and colspan for cells that span multiple rows or columns:

- type: table_cell
  column_index: 0
  content: "Spans two columns"
  colspan: 2

Tables with nested content in cells:

Cells can contain paragraphs or list items:

- type: table_cell
  column_index: 1
  content: "Requirements include:"
  children:
    - type: list_item
      content: First requirement
    - type: list_item
      content: Second requirement

Step 6: Validate the Golden Record

After creating the golden record:

  1. Check completeness: All content from source document is present
  2. Check accuracy: Content matches source exactly (no OCR errors, no truncation)
  3. Check structure: Hierarchy is correct (proper nesting via children)
  4. Check types: All parts use valid types: clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell
  5. Check numbers: Numbers quoted as strings (e.g., '1.1' not 1.1)
  6. Check content: No empty content fields, no duplicated content
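
Checks 4 to 6 lend themselves to a quick script. The sketch below is independent of the project's linter and makes its own assumptions: parts are dicts as loaded from YAML, table and table_row parts may omit content, and exact duplicate text outside tables is flagged for review rather than treated as an error. Quoting matters because an unquoted number such as 1.10 is loaded by YAML as the float 1.1, which loses the trailing zero.

```python
VALID_TYPES = {
    "clause", "heading", "paragraph", "definition", "list_item",
    "annex", "table", "table_row", "table_cell",
}
CONTENT_OPTIONAL = {"table", "table_row"}
TABLE_TYPES = {"table", "table_row", "table_cell"}


def check_parts(parts, path="parts", seen=None):
    """Return a list of problems for checks 4-6, walking children recursively."""
    problems = []
    seen = {} if seen is None else seen  # content text -> where it first appeared
    for index, part in enumerate(parts):
        where = f"{path}[{index}]"
        part_type = part.get("type")
        if part_type not in VALID_TYPES:
            problems.append(f"{where}: invalid type {part_type!r}")
        if "number" in part and not isinstance(part["number"], str):
            problems.append(f"{where}: number must be a quoted string, got {part['number']!r}")
        raw = part.get("content")
        text = raw.strip() if isinstance(raw, str) else ""
        if part_type not in CONTENT_OPTIONAL and not text:
            problems.append(f"{where}: empty content")
        if text and part_type not in TABLE_TYPES:
            if text in seen:
                problems.append(f"{where}: same content as {seen[text]}")
            else:
                seen[text] = where
        problems.extend(check_parts(part.get("children", []), f"{where}.children", seen))
    return problems


sample = [
    {"type": "clause", "number": 1.1, "content": "Avtalstid"},
    {"type": "section", "number": "2", "content": ""},
    {"type": "paragraph", "content": "Avtalstid"},
]
for problem in check_parts(sample):
    print(problem)
```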

Validation with Linter

Use the golden record linter to validate:

python tools/lint-golden-yaml.py path/to/file.golden.yml

The linter checks:

  • Part types are from allowed set
  • Required fields present per type
  • Number fields are quoted strings
  • Content is not empty
  • Children types are valid for parent types

To auto-fix common issues:

python tools/lint-golden-yaml.py path/to/file.golden.yml --fix

This will:

  • Merge deprecated types (section → clause, annex_heading → annex)
  • Quote unquoted number fields
  • Rewrap long content lines to 80 columns using literal block style
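
For reference, rewrapping to 80 columns while keeping the blank lines between title and body paragraphs could look like the sketch below. It uses the standard library's textwrap and is not the linter's actual implementation; the function name is hypothetical.

```python
import textwrap


def rewrap_content(content, width=80):
    """Rewrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = content.split("\n\n")
    wrapped = [
        # Collapse existing line breaks inside the paragraph, then refill it.
        textwrap.fill(" ".join(p.split()), width=width,
                      break_long_words=False, break_on_hyphens=False)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


text = "Avtalstid\n\n" + "ord " * 30
print(rewrap_content(text))
```

Words are never split, so a single word longer than the width still produces an over-long line.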

Step 7: Document Issues Found

Create notes about parsing issues discovered, which can inform rule improvements:

## Issues Found in Current Parsing

1. **Merged clauses**: 1.9-1.11 merged into single part
2. **Truncated definitions**: Only first few words captured
3. **Missing subheadings**: BAKGRUND, UPPDRAGETS ART not detected
4. **Wrong hierarchy**: Section 10 TVIST had wrong parent
5. **Layout confusion**: Three-column reading order incorrect in Bilaga 2

Naming Convention

Golden records should use the naming pattern:

{original-filename}.golden.yml

Example:

  • Source: Allmanna-leveransvilkor_2021-03-19-1.pdf
  • Golden: Allmanna-leveransvilkor_2021-03-19-1.pdf.golden.yml

For multi-language documents, include the language code:

  • Point and click avtal CAB Plan Finland (Sv) 2018-05-25.sv.pdf.golden.yml
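
The basic pattern, appending .golden.yml to the full source filename including its extension, can be expressed with pathlib. The helper name is hypothetical, and the sketch does not attempt to insert a language code.

```python
from pathlib import Path


def golden_path_for(source_path):
    """Return the golden record path that sits next to the given source file."""
    source = Path(source_path)
    # Keep the original extension: doc.pdf -> doc.pdf.golden.yml
    return source.with_name(source.name + ".golden.yml")


print(golden_path_for("Allmanna-leveransvilkor_2021-03-19-1.pdf"))
```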

Using Golden Records for Testing

Golden records can be used in tests to validate parsing output structure:

import yaml

# `pipeline` is the project's parsing pipeline; import it from wherever it
# lives in your codebase.


def flatten_parts(parts):
    """Return nested golden parts as a flat list in reading order (depth-first)."""
    flat = []
    for part in parts:
        flat.append(part)
        # Read children without removing them, so the loaded record is not mutated
        flat.extend(flatten_parts(part.get("children", [])))
    return flat


def test_parsing_matches_golden_record():
    source_path = "example_contracts/doc.pdf"
    golden_path = "example_contracts/doc.pdf.golden.yml"

    # Parse the document
    result = pipeline.parse(source_path)

    # Load golden record
    with open(golden_path, encoding="utf-8") as f:
        golden = yaml.safe_load(f)

    # Flatten nested structure for comparison
    expected_parts = flatten_parts(golden["parts"])

    # zip() stops at the shorter sequence, so compare the part counts first
    assert len(result.parts) == len(expected_parts)

    # Compare key fields
    for actual, expected in zip(result.parts, expected_parts):
        assert actual.content == expected["content"]
        assert actual.part_type == expected["type"]
        if "number" in expected:
            assert actual.part_metadata.get("number") == expected["number"]

Checklist

Before finalizing a golden record:

  • All pages of source document reviewed
  • All text content captured (no missing sections)
  • Content not duplicated across parts
  • Part types appropriate for content (clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell)
  • Clauses include both title and body text (not split into separate heading + paragraphs)
  • Numbers in number field include markers as they appear in source (e.g., '§ 2', 'Article 4')
  • Number fields are quoted strings (e.g., '1.1' not 1.1)
  • Definitions have term field with the defined term
  • Definitions include complete text (not truncated)
  • Hierarchy correct via children arrays (not parent_id)
  • Nesting depth appropriate for document structure
  • Numbered list items have number field for markers as they appear in source (e.g., 'a)', '(i)')
  • Unnumbered list items (bullets) omit the number field
  • Annexes have number field (use number: '' for unnumbered annexes)
  • Tables use table → table_row → table_cell hierarchy
  • Table cells have content field (required)
  • Table cells have column_index and optionally column_name for position tracking
  • Merged cells use rowspan/colspan fields
  • Document metadata complete (doc_metadata section)
  • Multi-column layout noted in doc_metadata.layout if relevant
  • Reading order correct (columns read left-to-right, top-to-bottom)
  • Linter passes: python tools/lint-golden-yaml.py path/to/file.golden.yml

Converting from Legacy JSON Format

If you have existing .golden.json files, convert them to YAML:

python tools/convert-golden-json-to-yaml.py path/to/file.golden.json
# Or convert all in a directory:
python tools/convert-golden-json-to-yaml.py example_contracts/

This will:

  1. Merge section and clause types into clause
  2. Extract numbers from content to number field
  3. Convert flat parent_id structure to nested children
  4. Move part_metadata.term to top-level term (for definitions)
  5. Remove internal fields: bboxes, part_ids, char_count, sequence, source
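
Step 3 of the conversion, turning flat parent_id references into nested children, can be sketched as follows. This is not the converter's code: it assumes every part has a unique part_id, that top-level parts have parent_id set to None (or missing), and that the flat list is already in reading order.

```python
def nest_parts(flat_parts):
    """Convert a flat list with parent_id references into nested children lists."""
    by_id = {}
    for part in flat_parts:
        # Copy each part without the linking fields, which the golden format omits.
        by_id[part["part_id"]] = {
            key: value for key, value in part.items()
            if key not in ("part_id", "parent_id")
        }
    roots = []
    for part in flat_parts:
        node = by_id[part["part_id"]]
        parent_id = part.get("parent_id")
        if parent_id is None:
            roots.append(node)
        else:
            # Appending in list order preserves reading order among siblings.
            by_id[parent_id].setdefault("children", []).append(node)
    return roots


flat = [
    {"part_id": "p1", "parent_id": None, "type": "clause", "content": "Avtalstid"},
    {"part_id": "p2", "parent_id": "p1", "type": "clause", "content": "Förlängning"},
    {"part_id": "p3", "parent_id": None, "type": "paragraph", "content": "Text"},
]
print(nest_parts(flat))
```

A parent_id that points at a missing part raises KeyError, which is usually the desired signal that the legacy record needs manual repair.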