Creating a Golden Record for Parsed Documents
This document describes the methodology for creating a "golden record" - an ideal reference version of a parsed document that represents what the parsing pipeline should produce.
Golden records use a streamlined YAML format (.golden.yml) that focuses on the essential structure without internal tracking fields. The format uses nested children arrays for hierarchy rather than parent_id references.
Purpose
Golden records serve as:
- Test fixtures - Verify parsing pipeline correctness
- Training data - Examples for improving parsing rules
- Documentation - Show expected output format for document types
- Regression detection - Catch parsing degradation over time
Prerequisites
Before creating a golden record, you need:
- Access to the source document (PDF, DOCX, etc.)
- The current parsed output from the pipeline (or view via parsing visualizer)
- Understanding of the parsing pipeline architecture (see doc/architecture/parsing-pipeline.md)
- Knowledge of the document type (contract, technical spec, policy, etc.)
Step-by-Step Process
Step 1: Examine the Source Document
Read through the entire source document to understand:
- Document structure: Title page, sections, appendices, etc.
- Numbering schemes: How clauses/sections are numbered (1.1, 1.2 vs 1, 2, 3)
- Visual layout: Single column, two-column, tables
- Document type: Contract, terms & conditions, specification
- Language: Swedish, English, etc.
- Metadata: Date, version, parties, location
For the Avanix document, I identified:
- Title page (page 1) with company logo and document title
- Main agreement (pages 1-2) with clauses 1.1-1.15
- Five appendices (Bilaga 1-5) on pages 3-16
- Three-column layout in Bilaga 2 and 3 (identified from bbox x-coordinates clustering around 36, 213, and 390)
- Swedish language
- Document dated 2021-03-19, version 2021:1
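The column count above was read off the bbox x-coordinates. A minimal sketch of that kind of clustering follows; the function name cluster_columns, the gap threshold of 50, and the sample coordinates are assumptions for illustration, not part of the pipeline.

```python
def cluster_columns(x_starts, gap=50.0):
    """Group left-edge x-coordinates into columns and return each column's mean x.

    Two neighbouring values (after sorting) belong to the same column when
    they differ by at most `gap`. The threshold is an assumed value.
    """
    columns = []  # each entry is the list of x values in one column
    for x in sorted(x_starts):
        if columns and x - columns[-1][-1] <= gap:
            columns[-1].append(x)
        else:
            columns.append([x])
    return [sum(col) / len(col) for col in columns]


# Hypothetical left edges of text blocks on one page
x_starts = [36.0, 36.4, 213.2, 35.8, 390.1, 212.9, 389.7]
centers = cluster_columns(x_starts)
print(len(centers))  # number of detected columns
```

With real data the threshold would need tuning against the page width and the gutter size of the document at hand.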
Step 2: Analyze the Current Parsed Output
Read the existing .json file and identify issues:
Common Issues to Look For
- Merged content: Multiple numbered items incorrectly combined into one part
  - Example: Clause 1.9 contained text from 1.10 and 1.11
- Truncated content: Definitions or paragraphs cut off mid-sentence
  - Example: "Avtal" avser huvuddokumentet inklusive (incomplete)
- Incorrect part types: Headings marked as paragraphs, clauses marked as annexes
  - Example: part_type: "annex" for items that are just clauses
- Wrong parent relationships: Incorrect parent_id values
  - Example: Clause under wrong section
- Reading order issues: Multi-column text interleaved incorrectly
  - Common in two-column and three-column PDFs where columns are read in the wrong order
- Missing structure: Subheadings, list items, or sections not detected
  - Example: "BAKGRUND" and "UPPDRAGETS ART OCH OMFATTNING" not captured as headings
- Duplicate content: Same text appearing in multiple parts
- Split clauses: Clause title and body split into separate section + paragraph parts
  - Example: "§ 2" as a section, then body text as separate paragraph children
  - These should be combined into a single clause with all content
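Two of these issues (merged content and duplicate content) lend themselves to a quick mechanical pre-check before reading the output by hand. The sketch below assumes the parsed output has been loaded as a flat list of dicts with a content key; the helper names and the decimal-number pattern are hypothetical.

```python
import re
from collections import Counter

# Matches a decimal clause number such as "1.10" at the start of a line.
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)+)\s", re.MULTILINE)


def find_merged_parts(parts):
    """Return parts whose content has more than one line starting with a clause number."""
    return [p for p in parts if len(CLAUSE_START.findall(p["content"])) > 1]


def find_duplicate_content(parts):
    """Return content strings that occur in more than one part."""
    counts = Counter(p["content"].strip() for p in parts)
    return [text for text, n in counts.items() if n > 1]


# Hypothetical parsed output
parts = [
    {"content": "1.9 Foo\n1.10 Bar\n1.11 Baz"},
    {"content": "1.12 Single"},
    {"content": "1.12 Single"},
]
merged = find_merged_parts(parts)
duplicates = find_duplicate_content(parts)
```

A hit is only a hint: a clause that legitimately quotes another clause number at the start of a line would also be flagged, so each result still needs checking against the source.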
Step 3: Design the Correct Structure
Create an outline of what the document structure should look like:
Document Title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
├── Section: Avtalskonstruktion och definitioner
│ ├── Clause 1.1: Parternas överenskommelse...
│ ├── Clause 1.2: Avtalsdokument / Abonnemangsavtal
│ ├── ...
│ └── Clause 1.11: Begrepp och definitioner...
├── Section: Omfattning och specifikation
│ ├── Clause 1.12: Detta Avtal omfattar...
│ └── ...
├── Annex 1: IT/IS-TJÄNSTER, SERVICENIVÅER OCH ERSÄTTNING
│ ├── Heading: IT/IS-TJÄNSTER OCH SERVICENIVÅER
│ ├── Heading: BAKGRUND
│ └── ...
├── Annex 2: ALLMÄNNA VILLKOR (2021:1)
│ ├── Section 1: DEFINITIONER M.M.
│ │ ├── Definition: "Avtal"
│ │ ├── Definition: "Avtalsdagen"
│ │ └── ...
│ ├── Section 2: OMFATTNING OCH UTFÖRANDE
│ │ ├── Clause 2.1: Ändring av IT/IS-tjänster
│ │ └── ...
│ └── ...
└── ...
Step 4: Create the Golden Record YAML
4.1 Document Metadata
Add comprehensive doc_metadata at the root level:
doc_metadata:
  source: pdf
  page_count: 16
  language: sv
  layout: two_column
  title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
  date: '2021-03-19'
  version: '2021:1'
parts:
  # ... parts go here
4.2 Part Structure
Each part uses a minimal set of fields:
| Field | Required | Description |
|---|---|---|
| type | Yes | One of: clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell (the table types have additional fields, described under Tables) |
| content | Yes | The text content (number prefix extracted to number field). Optional for table parts and not used for table_row parts. |
| number | No | For clauses, list items, annexes: the number/marker as it appears in the source document (e.g., '1.1', '§ 2', 'a)', '(i)', 'BILAGA 1', 'Article 7'). Different source documents use different conventions - capture what the document uses. For unnumbered list items (bullets), omit this field. For annexes without an explicit number, use number: ''. |
| term | No | For definitions only: the defined term |
| children | No | Nested parts (replaces parent_id references) |
Example clause with children:
- type: clause
  number: '2'
  content: |
    OMFATTNING OCH UTFÖRANDE
    Leverantören ska utföra avtalade IT/IS-tjänsterna i enlighet med bestämmelserna
    i detta Avtal och med den skicklighet och omsorg som Kunden har anledning att
    förvänta av motsvarande leverantör i branschen.
  children:
    - type: clause
      number: '2.1'
      content: |
        Ändring av IT/IS-tjänster
        Ändringar av IT/IS-tjänsternas karaktär eller utförande får ske endast efter
        skriftlig överenskommelse mellan parterna.
    - type: clause
      number: '2.2'
      content: |
        Tilläggstjänster
        Utöver de IT/IS-tjänster som regleras i Avtalet per Avtalsdagen, kan Kunden
        även beställa tillkommande konsulttjänster och/eller projekt av Leverantören.
Example definition:
- type: definition
  term: Avtal
  content: |
    "Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor som
    benämns i huvuddokumentet.
Example list with nested items:
- type: clause
  number: '3.2'
  content: Kundens åtaganden
  children:
    - type: list_item
      number: 'a)'
      content: |
        Kunden ska lämna Leverantören tillgång och tillträde till bl.a.
        lokaler, utrustning, om det krävs för utförandet.
      children:
        - type: list_item
          number: 'i)'
          content: det krävs för Leverantörens utförande av sina åtaganden
        - type: list_item
          number: 'ii)'
          content: det inte är oförenligt med tvingande lagstiftning
    - type: list_item
      number: 'b)'
      content: Kunden ska säkerställa att Leverantören kan nyttja programvara.
4.3 Fields NOT Used in Golden Records
The streamlined YAML format deliberately excludes fields that are internal to the parsing pipeline:
- part_id - Generated at runtime
- parent_id - Replaced by children nesting
- part_metadata - Fields moved to part level or dropped
- bbox - Spatial info not needed for structure validation
- sequence - Order determined by YAML list position
- char_count - Computed from content
- page_number - Not needed for structure validation
- source - Per-part pipeline stage tracking (not to be confused with doc_metadata.source)
- level - Implicit from nesting depth
4.4 Type Mappings
The YAML format uses simplified type names:
| JSON/Legacy Type | YAML Type | Notes |
|---|---|---|
| section | clause | Numbered sections are clauses |
| annex_heading | annex | Appendix/bilaga titles |
| clause | clause | Unchanged |
| heading | heading | Unchanged |
| paragraph | paragraph | Unchanged |
| definition | definition | Unchanged |
| list_item | list_item | Unchanged |
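The excluded fields from 4.3 and the type mapping above can be applied together when simplifying a single legacy part. The following is a sketch only: simplify_part is a hypothetical helper, and it assumes a legacy part is a dict whose type lives under part_type. It does not lift number or term out of part_metadata.

```python
# Legacy type names that change in the YAML format (from the mapping table).
LEGACY_TYPE_MAP = {"section": "clause", "annex_heading": "annex"}

# Pipeline-internal fields that golden records leave out (from section 4.3).
INTERNAL_FIELDS = {
    "part_id", "parent_id", "part_metadata", "bbox", "sequence",
    "char_count", "page_number", "source", "level",
}


def simplify_part(legacy):
    """Return a golden-record style dict for one legacy part (children not handled)."""
    legacy_type = legacy["part_type"]
    kept = {
        key: value
        for key, value in legacy.items()
        if key not in INTERNAL_FIELDS and key != "part_type"
    }
    return {"type": LEGACY_TYPE_MAP.get(legacy_type, legacy_type), **kept}


# Hypothetical legacy part
legacy = {
    "part_id": "p1",
    "part_type": "section",
    "content": "DEFINITIONER M.M.",
    "bbox": [36, 100, 200, 120],
    "level": 2,
}
simplified = simplify_part(legacy)
```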
4.5 Part Types to Use
| Part Type | When to Use |
|---|---|
| heading | Section/chapter headings, unnumbered titles (including document titles) |
| annex | Appendix/annex/bilaga titles (use number field for "BILAGA 1", etc.). The number field is required - use number: '' for annexes without an explicit identifier. |
| clause | Numbered contract clauses/sections with their body text (§ 2, 1.1, 2.3.4, Article 4) |
| paragraph | Unnumbered body text not belonging to a clause |
| definition | Defined terms with their definitions (use term field) |
| list_item | Bullet points or numbered list items (use number for markers like "a)", "i)") |
| table | Tabular data with rows and columns (requirements tables, pricing tables, etc.) |
| table_row | A row within a table (child of table) |
| table_cell | A cell within a table row (child of table_row) |
4.6 Clauses vs Sections
Important distinction: A clause is a numbered contractual provision that includes both its title/heading AND its body text as a single unit. Do NOT split a clause into separate heading + paragraph parts.
Incorrect (splitting clause into separate parts):
- type: heading
  content: § 2 Licensupplåtelsens omfattning
- type: paragraph
  content: CAB upplåter härmed till Abonnenten...
- type: paragraph
  content: Licensupplåtelsen omfattar även...
Correct (single clause with all content):
- type: clause
  number: '§ 2'
  content: |-
    Licensupplåtelsens omfattning

    CAB upplåter härmed till Abonnenten rätten att på i detta avtal angivna
    villkor nyttja den vid var tid gällande versionen av CAB Plan systemet,
    med de funktioner och tilläggssystem samt tilläggstjänster som kunden
    har beställt.

    Licensupplåtelsen omfattar även de nya utgåvor, andra uppdateringar,
    eventuellt övriga anpassningar och förändringar av CAB Plan och upplåtna
    komponenter som CAB under avtalstiden framställer och distribuerar enligt
    detta avtal.
Key points:
- The number goes in the number field, not in content
- The number includes any markers used in the document (e.g., '§ 2', not just '2')
- The content includes the clause title followed by all body paragraphs, separated by blank lines
- Use YAML literal block style (|- or |) for multi-line content
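Separating the marker from the title can be sketched with a regular expression. The pattern below covers only the marker styles mentioned in this document ("§ 2", "Article 7", "BILAGA 1", decimal numbers); split_number and the pattern itself are illustrative assumptions, not the converter's actual logic.

```python
import re

# Marker styles taken from the examples in this document; extend as needed.
NUMBER_PREFIX = re.compile(
    r"^(§\s*\d+|Article\s+\d+|BILAGA\s+\d+|\d+(?:\.\d+)*\.?)\s+(.*)$",
    re.DOTALL,
)


def split_number(text):
    """Return (number, content); number is None when no known marker leads the text."""
    match = NUMBER_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)


print(split_number("§ 2 Licensupplåtelsens omfattning"))
print(split_number("BAKGRUND"))
```

The marker is returned exactly as written, so '§ 2' stays '§ 2' rather than being reduced to '2'. Text that merely begins with digits (a year, for example) would be misread as a number, which is one reason the result should be reviewed by hand.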
4.7 Nested Clauses (Sub-clauses)
Clauses can have sub-clauses. Use children arrays for nesting:
- type: clause
  number: '§ 9'
  content: |-
    Avtalstid
    Detta avtal träder i kraft den dag båda parter undertecknat avtalet
    och gäller till och med den sista december därpå följande kalenderår.
  children:
    - type: clause
      number: '9.1'
      content: |-
        Förlängning
        Om inte uppsägning sker senast en månad före avtalstidens utgång
        förlängs avtalet med ett år i sänder.
    - type: clause
      number: '9.2'
      content: |-
        Uppsägning
        Uppsägning av avtal ska ske skriftligen.
Note that:
- Both § 9 and 9.1 are clause parts (not separate heading + paragraph)
- A top-level clause can have substantial body content AND sub-clauses as children
- The numbering scheme can vary: "§ 9" for top-level, "9.1" for sub-clause
- Nesting depth implies level (no explicit level field needed)
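Because level is implicit, a consumer that needs it can recover it while walking the tree. A small sketch, assuming the golden record has been loaded into plain dicts and lists; walk is a hypothetical helper name.

```python
def walk(parts, level=0):
    """Yield (level, part) pairs depth-first; level 0 is the top of the parts list."""
    for part in parts:
        yield level, part
        yield from walk(part.get("children", []), level + 1)


# Abbreviated version of the § 9 example above
parts = [
    {
        "type": "clause",
        "number": "§ 9",
        "content": "Avtalstid",
        "children": [
            {"type": "clause", "number": "9.1", "content": "Förlängning"},
            {"type": "clause", "number": "9.2", "content": "Uppsägning"},
        ],
    }
]
levels = [(level, part["number"]) for level, part in walk(parts)]
```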
Clauses with title only (when all content is in sub-clauses):
If a clause consists only of a title followed immediately by sub-clauses with no additional body text, the content field contains only the title:
- type: clause
  number: '§ 6'
  content: Tillgänglighet
  children:
    - type: clause
      number: '6.1'
      content: |-
        CAB garanterar att CAB Plan driftas på en server...
    - type: clause
      number: '6.2'
      content: |-
        Abonnenten är å sin sida ansvarig för att anslutning kan ske...
4.8 Article-Based Numbering
Some contracts (especially international ones) use "Article X" as the numbering scheme. The same conventions apply - Articles are clauses:
Example from Council of Europe IT Development Contract:
- type: clause
  number: Article 7
  content: Obligations of the Service Provider
  children:
    - type: clause
      number: '7.1'
      content: |-
        Provision of Services and Deliverables
        The Service Provider undertakes to provide to the Council of Europe
        all the Services and Deliverables described in the Tender file.
        It shall hand over to the Council of Europe all the Deliverables,
        in the format and on the media indicated and in compliance with
        the imperative deadlines prescribed.
    - type: clause
      number: '7.2'
      content: |-
        Obligation to provide advice, information and warnings
        The Service Provider recognises that it is subject to a general
        obligation to provide advice, and particularly to provide information
        and make recommendations, to the Council of Europe.
Note:
- "Article 7" is the full number in the number field
- Sub-clauses use plain numbers ("7.1", "7.2") as they appear in the source
- The title goes at the start of content, not duplicated in number
4.9 Document Metadata Fields
Document-level metadata goes in the doc_metadata section at the root. The schema is designed to contain only fields that are unambiguous and determinable by rule-based parsing (not requiring AI interpretation).
Source Document Properties
These are mechanical properties of the source file itself:
| Field | Type | Required | Description |
|---|---|---|---|
| source | enum | Yes | Source format: pdf, html, md, or docx |
| page_count | integer | If PDF | Number of pages (only required/meaningful for PDFs) |
| language | string | Yes | ISO 639-1 code (en, sv, etc.) |
| layout | enum | Yes | single_column, two_column, or mixed |
| numbering_style | enum | No | Primary clause numbering: numeric (1, 2, 3), decimal (1.1, 1.2), alpha (a, b, c), roman (i, ii, iii), section (§ 1, § 2), article (Article 1), or mixed |
| ocr | bool or list | No | OCR indicator: true if entire document was OCR'd, or list of page numbers that were OCR'd (e.g., [1, 5, 6]) |
Extracted Document Identity
These are text strings that appear verbatim in the document in designated locations:
| Field | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Document title from heading or title page |
| date | string | No | Date string as it appears in document (preserve original format) |
| version | string | No | Version identifier if explicitly stated |
| reference | string | No | Document reference number/ID if stated |
Fields NOT to Include
The following fields require interpretation and should not be included:
- document_type - Requires classification (contract vs. terms vs. specification)
- parties - Requires role identification and entity extraction
- governing_law, jurisdiction, forum - Requires understanding clause semantics
- publisher, organization - May not be explicitly stated
- contract_period, start_date, end_date - Requires date interpretation in context
- Domain-specific fields (procurement_*, legal_references, etc.)
Example (PDF source):
doc_metadata:
  # Source document properties (always present)
  source: pdf
  page_count: 16
  language: sv
  layout: two_column
  numbering_style: decimal
  ocr: [3, 4, 5]  # pages 3-5 were scanned images
  # Extracted identity (present if found in document)
  title: AVTAL IS/IT-TJÄNSTER – ALLMÄNNA LEVERANSVILKOR
  date: '2021-03-19'
  version: '2021:1'
  reference: DNR-2021-001
Example (HTML source):
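For an HTML source the main difference is that page_count does not apply. The sketch below expresses a hypothetical HTML-source doc_metadata as a Python dict and checks it against the required fields from the tables above. The check_doc_metadata function and every value in html_meta are invented for illustration; this is not the project's linter.

```python
ALLOWED_SOURCES = {"pdf", "html", "md", "docx"}
ALLOWED_LAYOUTS = {"single_column", "two_column", "mixed"}


def check_doc_metadata(meta):
    """Return a list of problems found in a doc_metadata mapping (empty if none)."""
    problems = []
    for field in ("source", "language", "layout", "title"):
        if field not in meta:
            problems.append(f"missing required field: {field}")
    if "source" in meta and meta["source"] not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {meta['source']}")
    if "layout" in meta and meta["layout"] not in ALLOWED_LAYOUTS:
        problems.append(f"unknown layout: {meta['layout']}")
    # page_count is only required when the source is a PDF.
    if meta.get("source") == "pdf" and "page_count" not in meta:
        problems.append("page_count is required for PDF sources")
    return problems


# Hypothetical HTML-source metadata: no page_count, since it only applies to PDFs.
html_meta = {
    "source": "html",
    "language": "en",
    "layout": "single_column",
    "title": "Example Terms of Service",
}
print(check_doc_metadata(html_meta))
```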
Step 5: Handle Special Cases
Multi-Column Layouts
Note the layout in doc_metadata if relevant:
doc_metadata:
  layout: two_column
Reading order should follow columns left-to-right, each column top-to-bottom. The golden record captures the correct reading order as the sequence of parts in the YAML.
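The reading-order rule can be sketched as a sort key. The code assumes each text block is a dict with page, x0 (left edge) and y0 (top edge, increasing downwards), and that the column left edges are already known; reading_order, the block layout and the tolerance are illustrative assumptions.

```python
import bisect


def reading_order(blocks, column_edges, tolerance=5.0):
    """Sort blocks by page, then column (left to right), then top to bottom.

    column_edges is the sorted list of left edges, one per column. A block is
    assigned to the right-most column whose edge is at or left of its x0,
    allowing `tolerance` units of jitter.
    """
    def column_of(block):
        index = bisect.bisect_right(column_edges, block["x0"] + tolerance) - 1
        return max(index, 0)

    return sorted(blocks, key=lambda b: (b["page"], column_of(b), b["y0"]))


# Hypothetical blocks on one three-column page, deliberately out of order
blocks = [
    {"page": 1, "x0": 213.0, "y0": 100.0, "text": "C"},
    {"page": 1, "x0": 36.0, "y0": 300.0, "text": "B"},
    {"page": 1, "x0": 35.8, "y0": 100.0, "text": "A"},
    {"page": 1, "x0": 390.0, "y0": 50.0, "text": "D"},
]
ordered = [b["text"] for b in reading_order(blocks, [36.0, 213.0, 390.0])]
```

Full-width elements such as a title that spans all columns are not handled by this key and would need to be separated out first.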
Definitions
Include the complete definition text with the term field:
- type: definition
  term: Avtal
  content: |-
    "Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor som
    benämns i huvuddokumentet.
Multiple definitions under a heading:
- type: clause
  number: '1'
  content: DEFINITIONER M.M.
  children:
    - type: paragraph
      content: Följande definierade begrepp gäller mellan Leverantören och Kunden.
    - type: definition
      term: Avtal
      content: '"Avtal" avser huvuddokumentet inklusive samtliga angivna bilagor.'
    - type: definition
      term: Avtalsdagen
      content: '"Avtalsdagen" avser den dag Leverantören och Kunden ingår Avtal.'
    - type: definition
      term: IT/IS-tjänst
      content: '"IT/IS-tjänst" avser de tjänster som Leverantören tillhandahåller.'
Definitions with embedded lists (uncommon but valid):
Some definitions may contain enumerated components as children:
- type: definition
  term: Phase(s)
  content: |-
    "Phase(s)" means the different phases of the Project as described in the Tender file:
  children:
    - type: list_item
      content: 'Phase 1: drawing up of the Detailed Specifications'
    - type: list_item
      content: 'Phase 2: validation of the Detailed Specifications'
Numbered vs Unnumbered Lists
List items may or may not have a number field:
- Numbered lists: Include the marker in the number field (e.g., 'a)', '(i)', '1.')
- Unnumbered lists (bullet points): Omit the number field entirely
Example numbered list:
- type: list_item
  number: 'a)'
  content: First item with explicit marker
- type: list_item
  number: 'b)'
  content: Second item with explicit marker
Example unnumbered list (bullets):
- type: list_item
  content: First bullet item
- type: list_item
  content: Second bullet item
The number field captures what appears in the source document. Different documents use different conventions ('a)', '(a)', '(i)', '1.', etc.) - use whatever the source document uses.
Nested Lists
Use children to establish list hierarchy:
- type: clause
  number: '13.5'
  content: 'CAB ansvarar ej för fel eller otillgänglighet orsakat av:'
  children:
    - type: list_item
      number: 'a)'
      content: felaktigt utnyttjande av CAB Plan;
    - type: list_item
      number: 'b)'
      content: ändringar eller ingrepp i utrustning i strid med instruktioner;
    - type: list_item
      number: 'c)'
      content: användning av eller fel i utrustning som tillhandahålles av kunden;
Deeply nested lists:
- type: clause
  number: '3.2'
  content: |-
    Kundens åtaganden
    För att Leverantören ska kunna utföra sina åtaganden enligt detta Avtal
    ska Kunden ansvara för följande:
  children:
    - type: list_item
      number: 'a)'
      content: 'Kunden ska lämna Leverantören tillgång och tillträde, om:'
      children:
        - type: list_item
          number: 'i)'
          content: det krävs för Leverantörens utförande av sina åtaganden
        - type: list_item
          number: 'ii)'
          content: det inte är oförenligt med tvingande lagstiftning
    - type: list_item
      number: 'b)'
      content: Kunden ska säkerställa att Leverantören kan nyttja programvara.
Tables
Tables use a three-level hierarchy: table → table_row → table_cell.
Table fields:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table |
| content | string | No | Optional table description/caption |
| columns | list | No | Column header names (extracted from first row if has_header: true) |
| has_header | bool | No | Whether the first row is a header row (default: false) |
| children | list | Yes | List of table_row parts |
Table row fields:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table_row |
| row_index | int | No | 0-based row index in table |
| is_header | bool | No | Whether this row is a header row |
| children | list | Yes | List of table_cell parts |
Table cell fields:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be table_cell |
| content | string | Yes | Cell text content |
| column_index | int | No | 0-based column index |
| column_name | string | No | Column header name (from first row or columns) |
| rowspan | int | No | Number of rows this cell spans (default: 1) |
| colspan | int | No | Number of columns this cell spans (default: 1) |
| children | list | No | Nested content (paragraph, list_item) for complex cells |
Example: Requirements table
- type: table
  columns: ['#', 'Krav', 'ISO kapitel', 'ISO kravområde']
  has_header: true
  children:
    - type: table_row
      row_index: 0
      is_header: true
      children:
        - type: table_cell
          column_index: 0
          column_name: '#'
          content: '#'
        - type: table_cell
          column_index: 1
          column_name: Krav
          content: Krav
        - type: table_cell
          column_index: 2
          column_name: ISO kapitel
          content: ISO kapitel
        - type: table_cell
          column_index: 3
          column_name: ISO kravområde
          content: ISO kravområde
    - type: table_row
      row_index: 1
      children:
        - type: table_cell
          column_index: 0
          column_name: '#'
          content: '3501'
        - type: table_cell
          column_index: 1
          column_name: Krav
          content: |
            Leverantören ska för de delar av verksamheten som berörs i
            leveransen ha ett ledningssystem för informationssäkerhet (LIS)
            som baseras på SS-EN ISO/IEC27001:2017 eller motsvarande.
        - type: table_cell
          column_index: 2
          column_name: ISO kapitel
          content: A.6.1 Intern organisation
        - type: table_cell
          column_index: 3
          column_name: ISO kravområde
          content: A.6.1.1 Informationssäkerhetsroller och ansvar
Tables with merged cells:
Use rowspan and colspan for cells that span multiple rows or columns:
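One way to see what rowspan and colspan mean is to expand a table into a dense grid in which a spanning cell's text is repeated in every position it covers. The sketch below works on a golden-record table loaded as dicts; expand_table and the sample table are hypothetical and only illustrate the semantics of the two fields.

```python
def expand_table(table):
    """Expand a table part into a list of rows, repeating the text of spanning cells."""
    grid = {}  # (row, column) -> cell content
    for r, row in enumerate(table["children"]):
        col = 0
        for cell in row["children"]:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, col) in grid:
                col += 1
            rowspan = cell.get("rowspan", 1)
            colspan = cell.get("colspan", 1)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, col + dc)] = cell["content"]
            col += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical table: "Service" spans two rows, "Level" spans two columns.
table = {
    "type": "table",
    "children": [
        {"type": "table_row", "children": [
            {"type": "table_cell", "content": "Service", "rowspan": 2},
            {"type": "table_cell", "content": "Level", "colspan": 2},
        ]},
        {"type": "table_row", "children": [
            {"type": "table_cell", "content": "Basic"},
            {"type": "table_cell", "content": "Plus"},
        ]},
    ],
}
grid = expand_table(table)
```

In the golden record itself the spanning cell appears once, in the row and column where it starts; rows that it extends into simply omit it.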
Tables with nested content in cells:
Cells can contain paragraphs or list items:
- type: table_cell
  column_index: 1
  content: "Requirements include:"
  children:
    - type: list_item
      content: First requirement
    - type: list_item
      content: Second requirement
Step 6: Validate the Golden Record
After creating the golden record:
- Check completeness: All content from source document is present
- Check accuracy: Content matches source exactly (no OCR errors, no truncation)
- Check structure: Hierarchy is correct (proper nesting via children)
- Check types: All parts use valid types: clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell
- Check numbers: Numbers quoted as strings (e.g., '1.1' not 1.1)
- Check content: No empty content fields, no duplicated content
Validation with Linter
Use the golden record linter to validate:
python tools/lint-golden-yaml.py path/to/file.golden.yml
The linter checks:
- Part types are from allowed set
- Required fields present per type
- Number fields are quoted strings
- Content is not empty
- Children types are valid for parent types
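To make these checks concrete, here is a small stand-alone sketch of a few of them. It is not the code of tools/lint-golden-yaml.py; lint_parts, its messages, and the exact rules are assumptions based on the field tables in this document, and the parent/child type check is left out.

```python
ALLOWED_TYPES = {
    "clause", "heading", "paragraph", "definition", "list_item",
    "annex", "table", "table_row", "table_cell",
}


def lint_parts(parts, path="parts"):
    """Return a list of problem strings for a nested list of parts."""
    problems = []
    for i, part in enumerate(parts):
        where = f"{path}[{i}]"
        ptype = part.get("type")
        if ptype not in ALLOWED_TYPES:
            problems.append(f"{where}: unknown type {ptype!r}")
        # An unquoted 1.1 in YAML loads as a float, so a non-string is a quoting error.
        if "number" in part and not isinstance(part["number"], str):
            problems.append(f"{where}: number must be a quoted string")
        if ptype == "annex" and "number" not in part:
            problems.append(f"{where}: annex requires a number field")
        if ptype == "definition" and not part.get("term"):
            problems.append(f"{where}: definition requires a term field")
        if ptype not in {"table", "table_row"} and not str(part.get("content", "")).strip():
            problems.append(f"{where}: content is empty")
        problems.extend(lint_parts(part.get("children", []), f"{where}.children"))
    return problems


print(lint_parts([{"type": "clause", "number": 1.1, "content": "X"}]))
```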
To auto-fix common issues:
This will:
- Merge deprecated types (section → clause, annex_heading → annex)
- Quote unquoted number fields
- Rewrap long content lines to 80 columns using literal block style
Step 7: Document Issues Found
Create notes about parsing issues discovered, which can inform rule improvements:
## Issues Found in Current Parsing
1. **Merged clauses**: 1.9-1.11 merged into single part
2. **Truncated definitions**: Only first few words captured
3. **Missing subheadings**: BAKGRUND, UPPDRAGETS ART not detected
4. **Wrong hierarchy**: Section 10 TVIST had wrong parent
5. **Layout confusion**: Three-column reading order incorrect in Bilaga 2
Naming Convention
Golden records should use the naming pattern:
{source_filename}.golden.yml
Example:
- Source: Allmanna-leveransvilkor_2021-03-19-1.pdf
- Golden: Allmanna-leveransvilkor_2021-03-19-1.pdf.golden.yml
For multi-language documents, include the language code:
Point and click avtal CAB Plan Finland (Sv) 2018-05-25.sv.pdf.golden.yml
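In test code, the golden path can be derived from the source path rather than typed twice. A minimal sketch; golden_path_for is a hypothetical helper name.

```python
from pathlib import Path


def golden_path_for(source_path):
    """Return the golden record path for a source file: full filename plus '.golden.yml'."""
    source = Path(source_path)
    return source.with_name(source.name + ".golden.yml")


print(golden_path_for("example_contracts/doc.pdf"))
```

The source extension is kept, so a PDF and a DOCX with the same stem get distinct golden records.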
Using Golden Records for Testing
Golden records can be used in tests to validate parsing output structure:
import yaml

# NOTE: `pipeline` is the project's parsing pipeline object. Import it from
# the appropriate module in your codebase before running this test.


def test_parsing_matches_golden_record():
    source_path = "example_contracts/doc.pdf"
    golden_path = "example_contracts/doc.pdf.golden.yml"

    # Parse the document
    result = pipeline.parse(source_path)

    # Load golden record (explicit encoding, since content is often non-ASCII)
    with open(golden_path, encoding="utf-8") as f:
        golden = yaml.safe_load(f)

    # Flatten nested structure depth-first without mutating the loaded data
    def flatten_parts(parts, flat_list=None):
        if flat_list is None:
            flat_list = []
        for part in parts:
            flat_list.append({k: v for k, v in part.items() if k != "children"})
            flatten_parts(part.get("children", []), flat_list)
        return flat_list

    expected_parts = flatten_parts(golden["parts"])

    # zip() stops at the shorter sequence, so compare the counts first
    assert len(result.parts) == len(expected_parts)

    # Compare key fields
    for actual, expected in zip(result.parts, expected_parts):
        assert actual.content == expected["content"]
        assert actual.part_type == expected["type"]
        if "number" in expected:
            assert actual.part_metadata.get("number") == expected["number"]
Checklist
Before finalizing a golden record:
- All pages of source document reviewed
- All text content captured (no missing sections)
- Content not duplicated across parts
- Part types appropriate for content (clause, heading, paragraph, definition, list_item, annex, table, table_row, table_cell)
- Clauses include both title and body text (not split into separate heading + paragraphs)
- Numbers in number field include markers as they appear in source (e.g., '§ 2', 'Article 4')
- Number fields are quoted strings (e.g., '1.1' not 1.1)
- Definitions have term field with the defined term
- Definitions include complete text (not truncated)
- Hierarchy correct via children arrays (not parent_id)
- Nesting depth appropriate for document structure
- Numbered list items have number field for markers as they appear in source (e.g., 'a)', '(i)')
- Unnumbered list items (bullets) omit the number field
- Annexes have number field (use number: '' for unnumbered annexes)
- Tables use table → table_row → table_cell hierarchy
- Table cells have content field (required)
- Table cells have column_index and optionally column_name for position tracking
- Merged cells use rowspan/colspan fields
- Document metadata complete (doc_metadata section)
- Multi-column layout noted in doc_metadata.layout if relevant
- Reading order correct (columns read left-to-right, top-to-bottom)
- Linter passes: python tools/lint-golden-yaml.py path/to/file.golden.yml
Converting from Legacy JSON Format
If you have existing .golden.json files, convert them to YAML:
python tools/convert-golden-json-to-yaml.py path/to/file.golden.json
# Or convert all in a directory:
python tools/convert-golden-json-to-yaml.py example_contracts/
This will:
- Merge section and clause types into clause
- Extract numbers from content to number field
- Convert flat parent_id structure to nested children
- Move part_metadata.term to top-level term (for definitions)
- Remove internal fields: bboxes, part_ids, char_count, sequence, source
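The parent_id to children step is the least obvious of these, so a sketch may help. It assumes each legacy part is a dict with part_id and parent_id keys and that the flat list is already in reading order; nest_parts is a hypothetical function, not the code of the conversion script.

```python
def nest_parts(flat_parts):
    """Build nested children lists from a flat list of parts with part_id/parent_id."""
    # First pass: copy every part without its id fields, indexed by part_id.
    nodes = {
        part["part_id"]: {
            k: v for k, v in part.items() if k not in ("part_id", "parent_id")
        }
        for part in flat_parts
    }
    # Second pass: attach each node to its parent, or to the root list.
    roots = []
    for part in flat_parts:
        node = nodes[part["part_id"]]
        parent = nodes.get(part.get("parent_id"))
        if parent is None:
            roots.append(node)
        else:
            parent.setdefault("children", []).append(node)
    return roots


# Hypothetical flat legacy parts
flat = [
    {"part_id": "a", "parent_id": None, "type": "clause", "number": "§ 9", "content": "Avtalstid"},
    {"part_id": "b", "parent_id": "a", "type": "clause", "number": "9.1", "content": "Förlängning"},
    {"part_id": "c", "parent_id": "a", "type": "clause", "number": "9.2", "content": "Uppsägning"},
]
nested = nest_parts(flat)
```

Parts whose parent_id does not match any part_id end up at the top level, which makes dangling references easy to spot during review.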