Structuring rules

Once items are extracted, structuring rules decide how they are named, scored, deduplicated, merged and promoted into higher-order sections.

ID conventions

  • All ids are snake_case slugs of ASCII letters, digits and underscores.
  • Each section has a prefix: ent_, con_, pri_, heu_, dec_, proc_, pat_, ant_, cau_, ctx_, ift_, exc_, mm_, pb_, qa_, chk_, atm_.
  • Ids must be unique within their section and stable across re-compilations of the same source.
  • The package_id is global; recommended pattern is <source-slug>-<version>.
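The conventions above can be checked mechanically. The following is a minimal sketch, not the compiler's own validator: the function names is_valid_id and find_duplicates are hypothetical, and the character set is taken directly from the first bullet.

```python
import re

# Section prefixes as listed in the ID conventions.
PREFIXES = (
    "ent_", "con_", "pri_", "heu_", "dec_", "proc_", "pat_", "ant_", "cau_",
    "ctx_", "ift_", "exc_", "mm_", "pb_", "qa_", "chk_", "atm_",
)

# ASCII letters, digits and underscores only, at least one character.
_SLUG = re.compile(r"[A-Za-z0-9_]+")


def is_valid_id(item_id: str) -> bool:
    """Return True if the id has a known prefix followed by a non-empty slug."""
    for prefix in PREFIXES:
        if item_id.startswith(prefix):
            return _SLUG.fullmatch(item_id[len(prefix):]) is not None
    return False


def find_duplicates(ids):
    """Return the sorted ids that occur more than once within one section."""
    seen, dupes = set(), set()
    for item_id in ids:
        (dupes if item_id in seen else seen).add(item_id)
    return sorted(dupes)
```

Stability across re-compilations is not something a validator can check on a single package; it would need a comparison between two compilations of the same source.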

Confidence scoring

Confidence is a float in [0, 1] with two decimal places. The protocol defines five bands; see also the Protocol page.

  • 0.90 – 1.00 — the source states the claim directly and unambiguously.
  • 0.75 – 0.89 — the source supports the claim with mild interpretation.
  • 0.50 – 0.74 — inferred by combining two or more explicit statements.
  • 0.25 – 0.49 — weak inference; agents should treat as a hypothesis.
  • 0.00 – 0.24 — extractor uncertain; emit only if useful, mark source_basis: uncertain.
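As an illustration, the five bands can be expressed as a small lookup. This is a sketch only: the band names returned below are hypothetical labels chosen for readability, not identifiers defined by the protocol.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence in [0, 1] to one of the five bands (names are illustrative)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    c = round(confidence, 2)  # confidences carry two decimal places
    if c >= 0.90:
        return "direct"               # stated directly and unambiguously
    if c >= 0.75:
        return "mild_interpretation"  # supported with mild interpretation
    if c >= 0.50:
        return "combined_inference"   # inferred from two or more statements
    if c >= 0.25:
        return "weak_inference"       # treat as a hypothesis
    return "uncertain"                # emit only if useful
```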

Normalization

  • Strings are NFC-normalized; whitespace collapsed to single spaces.
  • Languages follow BCP-47 (en, pt-BR, …).
  • Domains use the closed enum from the schema; novel topics go in subdomains.
  • Singular labels for entities and concepts; verb-led labels for procedures and playbooks.
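The string rule above maps directly onto the Python standard library. A minimal sketch follows; normalize_text is a hypothetical name, and trimming leading and trailing whitespace is an assumption that goes slightly beyond "collapsed to single spaces".

```python
import unicodedata


def normalize_text(value: str) -> str:
    """NFC-normalize a string and collapse whitespace runs to single spaces."""
    nfc = unicodedata.normalize("NFC", value)
    # str.split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace (the trimming is an assumption here).
    return " ".join(nfc.split())
```

For example, "e" followed by a combining acute accent (U+0301) becomes the single code point U+00E9, so two visually identical labels compare equal after normalization.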

Deduplication and merging

When two extractions produce overlapping items (e.g. one item per chunk in the reduce phase):

  • Merge by canonical id. If ids differ but labels are aliases, prefer the higher-confidence record and add the other label to aliases.
  • When confidences disagree, start from the higher value and subtract the standard deviation of the sources' confidences; the merged value never exceeds the higher of the two.
  • When source-basis disagrees, downgrade conservatively: explicit + inferred → inferred.
  • Rich sections (procedures, playbooks, if_then_rules) are deduplicated on a normalized fingerprint of their action sequence, not just on label.
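The merge rules above leave a few details open, so the sketch below states its assumptions explicitly: it uses the population standard deviation (the text does not say population or sample), floors the result at 0, ranks "uncertain" below "inferred", and fingerprints an action sequence with SHA-256 over case-folded, whitespace-collapsed steps. All function names are hypothetical.

```python
import hashlib
import statistics
import unicodedata
from typing import Sequence

# Assumed ordering from strongest to weakest; the text only states
# explicit + inferred -> inferred.
_BASIS_RANK = {"explicit": 0, "inferred": 1, "uncertain": 2}


def merge_confidence(a: float, b: float) -> float:
    """Keep the higher confidence, lowered by the spread between sources."""
    higher = max(a, b)
    spread = statistics.pstdev([a, b])     # assumption: population std dev
    merged = min(higher, higher - spread)  # never above the higher value
    return round(max(0.0, merged), 2)      # assumption: floor at 0


def merge_source_basis(a: str, b: str) -> str:
    """Return the weaker (more conservative) of two source-basis values."""
    return max(a, b, key=_BASIS_RANK.__getitem__)


def action_fingerprint(steps: Sequence[str]) -> str:
    """Hash a normalized action sequence so relabeled duplicates still collide."""
    normalized = [
        " ".join(unicodedata.normalize("NFC", step).casefold().split())
        for step in steps
    ]
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
```

Under these assumptions, merging confidences 0.90 and 0.70 yields 0.80, because the population standard deviation of two values is half their difference.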

Promotion (v1.2)

After reduce, the compiler walks atomic_units and retrieval_chunks and promotes statements into higher-order sections:

  • Conditionals ("if X then Y", "when X, do Y") → if_then_rules.
  • Multi-step actions (numbered or sequenced) → playbooks.
  • Stated failure modes ("avoid", "do not", "common mistake") → anti_patterns.

Promotion is rejected when the candidate fails the language filter, the completeness check or the truncation filter. Rejections are counted in result.promotion.rejected.
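To make the three promotion mappings concrete, here is a rough pattern-based sketch. It is an assumption-heavy illustration rather than the compiler's actual logic: the regular expressions, the precedence order and the name promotion_target are all hypothetical, and the language, completeness and truncation filters are not modeled.

```python
import re
from typing import Optional

# Hypothetical cue patterns derived from the examples in the list above.
_CONDITIONAL = re.compile(r"^\s*(?:if|when)\b", re.IGNORECASE)
_FAILURE = re.compile(r"\b(?:avoid|do not|common mistake)\b", re.IGNORECASE)
_STEP = re.compile(r"^\s*\d+[.)]\s+\S", re.MULTILINE)


def promotion_target(statement: str) -> Optional[str]:
    """Return the higher-order section a statement would be promoted to, if any."""
    # Assumed precedence: numbered sequences, then conditionals, then failure modes.
    if len(_STEP.findall(statement)) >= 2:
        return "playbooks"
    if _CONDITIONAL.search(statement):
        return "if_then_rules"
    if _FAILURE.search(statement):
        return "anti_patterns"
    return None
```

A statement that matches none of the cues returns None and stays only in its original section.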

Source traceability

Every non-synthesized item must have at least one matching source_traceability entry whose extraction_type equals the item's source_basis. Synthesized items (e.g. retrieval chunks) may omit it. The v1.3.1 pipeline rebuilds this section automatically after sanitation, so stale entries pointing at removed items never reach the final package. It now also propagates source_record_id when the source is record-oriented (JSONL, json_array, FAQ, normative texts), so consumers can pinpoint the exact record an item came from.
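The matching requirement can be checked with a short pass over the package. This is a sketch under assumptions: items and traceability entries are modeled as plain dicts, the item_id key on an entry and the "synthesized" value for source_basis are hypothetical, and only source_basis and extraction_type come from the text above.

```python
def missing_traceability(items, entries):
    """Return ids of non-synthesized items lacking a matching traceability entry."""
    # An entry matches when it points at the item and its extraction_type
    # equals the item's source_basis.
    covered = {(e["item_id"], e["extraction_type"]) for e in entries}
    return [
        item["id"]
        for item in items
        if item["source_basis"] != "synthesized"  # assumed marker value
        and (item["id"], item["source_basis"]) not in covered
    ]
```

An empty result means every item that needs traceability has it; a non-empty result lists the ids an auditor would have to follow up on.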

Auditability is a feature

The traceability section is what makes a CKF package auditable by humans and verifiable by other agents. Treat it as required, not optional.
