Specification
Structuring rules
Once items are extracted, structuring rules decide how they are named, scored, deduplicated, merged and promoted into higher-order sections.
ID conventions
- All ids are kebab-case slugs of ASCII letters, digits and underscores.
- Each section has a prefix: ent_, con_, pri_, heu_, dec_, proc_, pat_, ant_, cau_, ctx_, ift_, exc_, mm_, pb_, qa_, chk_, atm_.
- Ids must be unique within their section and stable across re-compilations of the same source.
- The package_id is global; the recommended pattern is <source-slug>-<version>.
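As an illustration, the conventions above could be checked with a small validator. This is a minimal sketch, not the compiler's own check: the slug pattern is an assumption (a listed prefix followed by lowercase ASCII letters and digits separated by single hyphens or underscores), and the function names is_valid_id and find_duplicates are hypothetical.

```python
import re

# Section prefixes as listed in the specification.
PREFIXES = (
    "ent_", "con_", "pri_", "heu_", "dec_", "proc_", "pat_", "ant_", "cau_",
    "ctx_", "ift_", "exc_", "mm_", "pb_", "qa_", "chk_", "atm_",
)

# Assumed slug shape: runs of lowercase letters/digits joined by "-" or "_".
_SLUG = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")


def is_valid_id(item_id: str) -> bool:
    """Return True if item_id is a known prefix followed by a non-empty slug."""
    for prefix in PREFIXES:
        if item_id.startswith(prefix):
            return _SLUG.fullmatch(item_id[len(prefix):]) is not None
    return False


def find_duplicates(ids):
    """Return the set of ids that occur more than once within one section."""
    seen, dupes = set(), set()
    for item_id in ids:
        if item_id in seen:
            dupes.add(item_id)
        seen.add(item_id)
    return dupes
```

For example, is_valid_id("ent_acme-corp") is accepted, while an id with an unknown prefix or an empty slug is rejected.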
Confidence scoring
Confidence is a float in [0, 1] with two decimal places. The protocol defines five bands; see also the Protocol page.
- 0.90 – 1.00: the source states the claim directly and unambiguously.
- 0.75 – 0.89: the source supports the claim with mild interpretation.
- 0.50 – 0.74: inferred by combining two or more explicit statements.
- 0.25 – 0.49: weak inference; agents should treat as a hypothesis.
- 0.00 – 0.24: extractor uncertain; emit only if useful, and mark source_basis: uncertain.
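A consumer might map a confidence value to its band as follows. This is a sketch only: the band names ("direct", "supported", "combined", "weak", "uncertain") are hypothetical labels chosen for this example; the specification defines the numeric ranges, not the names.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence in [0, 1] to one of the five bands.

    Band names are illustrative labels, not part of the specification.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    c = round(confidence, 2)  # the spec uses two decimal places
    if c >= 0.90:
        return "direct"
    if c >= 0.75:
        return "supported"
    if c >= 0.50:
        return "combined"
    if c >= 0.25:
        return "weak"
    return "uncertain"
```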
Normalization
- Strings are NFC-normalized; whitespace collapsed to single spaces.
- Languages follow BCP-47 (en, pt-BR, …).
- Domains use the closed enum from the schema; novel topics go in subdomains.
- Singular labels for entities and concepts; verb-led labels for procedures and playbooks.
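The string rule can be sketched with the Python standard library. The function name normalize_string is hypothetical, and trimming leading and trailing whitespace is an assumption; the specification only states that whitespace is collapsed to single spaces.

```python
import re
import unicodedata


def normalize_string(text: str) -> str:
    """NFC-normalize text and collapse whitespace runs to single spaces."""
    nfc = unicodedata.normalize("NFC", text)
    # Assumption: leading/trailing whitespace is also stripped.
    return re.sub(r"\s+", " ", nfc).strip()
```

With this sketch, a decomposed "e" plus combining acute accent becomes the single precomposed character, and tabs or newlines between words become one space.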
Deduplication and merging
When two extractions produce overlapping items (e.g. one item per chunk in the reduce phase):
- Merge by canonical id. If ids differ but labels are aliases, prefer the higher-confidence record and add the other label to aliases.
- When confidences disagree, keep the higher value and lower it by the standard deviation between sources, capped at the higher of the two.
- When source-basis disagrees, downgrade conservatively: explicit + inferred ⇒ inferred.
- Rich sections (procedures, playbooks, if_then_rules) are deduplicated on a normalized fingerprint of their action sequence, not just on label.
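The merge rules for two overlapping records might look like the sketch below. Several details are assumptions made for illustration: the population standard deviation (statistics.pstdev) is used because the text does not say which estimator applies; the ordering explicit < inferred < uncertain generalizes the single documented case; and the record fields label and confidence, like all function names, are hypothetical.

```python
from statistics import pstdev

# Assumed ordering from most to least certain; only
# explicit + inferred => inferred is stated in the specification.
_BASIS_RANK = {"explicit": 0, "inferred": 1, "uncertain": 2}


def merge_confidence(a: float, b: float) -> float:
    """Keep the higher value, lowered by the spread between the two sources."""
    higher = max(a, b)
    adjusted = higher - pstdev([a, b])  # assumption: population std. deviation
    # Clamp to [0, higher] and keep two decimal places.
    return round(min(max(adjusted, 0.0), higher), 2)


def merge_source_basis(a: str, b: str) -> str:
    """Downgrade conservatively: the less certain basis wins."""
    return a if _BASIS_RANK[a] >= _BASIS_RANK[b] else b


def merge_records(a: dict, b: dict) -> dict:
    """Merge two records describing the same item."""
    winner, loser = (a, b) if a["confidence"] >= b["confidence"] else (b, a)
    merged = dict(winner)
    aliases = list(winner.get("aliases", []))
    if loser["label"] != winner["label"] and loser["label"] not in aliases:
        aliases.append(loser["label"])  # keep the losing label as an alias
    merged["aliases"] = aliases
    merged["confidence"] = merge_confidence(a["confidence"], b["confidence"])
    merged["source_basis"] = merge_source_basis(a["source_basis"], b["source_basis"])
    return merged
```

Under these assumptions, merging confidences 0.90 and 0.70 gives a spread of 0.10 and a merged confidence of 0.80.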
Promotion (v1.2)
After reduce, the compiler walks atomic_units and retrieval_chunks and promotes statements into higher-order sections:
- Conditionals ("if X then Y", "when X, do Y") → if_then_rules.
- Multi-step actions (numbered or sequenced) → playbooks.
- Stated failure modes ("avoid", "do not", "common mistake") → anti_patterns.
Promotion is rejected when the candidate fails the language filter, the completeness check or the truncation filter. Rejections are counted in result.promotion.rejected.
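The routing of a statement to a target section can be approximated with a few patterns. This is a rough sketch, not the compiler's actual detector: the regular expressions, the precedence order (conditional, then multi-step, then failure mode) and the function name promotion_target are all assumptions, and the language, completeness and truncation filters are not modeled.

```python
import re
from typing import Optional

# Illustrative patterns only; the real detection rules are not specified here.
_CONDITIONAL = re.compile(r"^\s*(?:if\b.+\bthen\b|when\b.+,)", re.IGNORECASE)
_NUMBERED_STEP = re.compile(r"^\s*\d+[.)]\s+", re.MULTILINE)
_FAILURE_MODE = re.compile(r"\b(?:avoid|do not|common mistake)\b", re.IGNORECASE)


def promotion_target(statement: str) -> Optional[str]:
    """Return the section a statement would be promoted to, or None."""
    if _CONDITIONAL.search(statement):
        return "if_then_rules"
    # Treat two or more numbered lines as a multi-step action.
    if len(_NUMBERED_STEP.findall(statement)) >= 2:
        return "playbooks"
    if _FAILURE_MODE.search(statement):
        return "anti_patterns"
    return None
```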
Source traceability
Every non-synthesized item must have at least one matching source_traceability entry whose extraction_type equals the item's source_basis. Synthesized items (e.g. retrieval chunks) may omit it. The v1.3.1 pipeline rebuilds this section automatically after sanitation, so stale entries pointing at removed items never reach the final package. It now also propagates source_record_id when the source is record-oriented (JSONL, json_array, FAQ, normative texts), so consumers can pinpoint the exact record an item came from.
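A consumer could verify the traceability rule with a check like the one below. It is a sketch under stated assumptions: each traceability entry is assumed to reference its item through an item_id field, and synthesized items are assumed to carry source_basis set to "synthesized". Neither detail is confirmed by this page, and the function name is hypothetical.

```python
def missing_traceability(items, traceability):
    """Return ids of non-synthesized items lacking a matching entry.

    An entry matches when it references the item (assumed field: item_id)
    and its extraction_type equals the item's source_basis.
    """
    covered = {(e["item_id"], e["extraction_type"]) for e in traceability}
    return [
        item["id"]
        for item in items
        # Assumption: synthesized items are flagged via source_basis.
        if item["source_basis"] != "synthesized"
        and (item["id"], item["source_basis"]) not in covered
    ]
```

An empty result means every non-synthesized item is traceable; any returned id points at an item whose entry is missing or whose extraction_type disagrees with its source_basis.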
Auditability is a feature