Reading Order for New Contributors¶
This page tells you which internal docs to read, and in what order, depending on what you are trying to do. The internals documents form a dependency graph: some assume knowledge from others. Start at the right entry point and you will not need to backtrack.
Entry points by role¶
If you are completely new to this codebase¶
Begin with the PDF Primer before touching any code.
- 02 — PDF Primer — What a PDF file actually is at the byte level. Cross-reference tables, object streams, content streams. You cannot reason about the parser or the redaction logic without this.
- 14 — Glossary — Keep this open as a reference tab. Every term used across the internals docs is defined here.
If you want to understand the architecture¶
Read these two documents in order before looking at any crate.
- 01 — Architecture Overview — The 30,000-foot view. Crate dependency graph, pipeline stages, design principles, error model, and what is explicitly out of scope.
- 15 — Spec to Code Map — Maps PDF specification concepts (operators, dictionaries, object types) to the Rust types and functions that implement them. Bridges the gap between the primer and the code.
If you are modifying the parser¶
The parser lives in crates/pdf_objects. Read these before changing anything there.
- 03 — Parsing Model — How bytes become tokens, how tokens become objects, how the xref table and incremental update chains are resolved.
- 04 — Object Model — The Rust type hierarchy for PDF objects, how indirect references are resolved, and how the document tree is structured in memory.
If you are working on text extraction or search¶
Text extraction depends on a correct understanding of coordinate systems. Do not skip the graphics state document.
- 05 — Graphics State — The current transformation matrix, text matrix, font size scaling, and how page-space coordinates are produced from operator arguments.
- 06 — Text System — Font loading, glyph decoding, character widths, and how individual glyphs become positioned quads in page space.
- 07 — Search Geometry — How extracted glyphs are sorted into visual reading order, how text is normalized for matching, and how substring matches are mapped back to glyph quads for the redaction pipeline.
If you are working on redaction¶
Redaction depends on correct targets and a clear understanding of the apply pipeline.
- 08 — Redaction Targets — The
NormalizedPageTargetmodel. How Rect, Quad, and QuadGroup inputs are validated and normalized into page-space quads. - 09 — Redaction Pipeline — How the apply step intersects targets with glyphs, vectors, and images; how content streams are rewritten; how annotations and metadata are stripped.
If you are working on the writer or output format¶
- 10 — Writer — Deterministic full-document serialization, xref table construction, why incremental updates are never written.
If you are working on the WASM or JavaScript layer¶
- 11 — WASM Boundary — The wasm-bindgen surface, serialization conventions, error propagation across the boundary, and how the TypeScript SDK wraps the raw WASM exports.
If you need to understand what is and is not supported¶
- 13 — Limitations — The explicit list of PDF features that are not supported, and why. Each limitation notes whether it is a deliberate scope decision or a known gap.
- 12 — Security Model — The threat model for redaction correctness. What "redacted" means in this engine, what attacks are in scope, and what the engine does not protect against.
Essential reading for everyone¶
16 — Top 10 Decisions — Regardless of which area you are working in, read this document. It describes the ten most important implementation decisions made during the design of this engine. Every maintainer is expected to understand all ten before merging significant changes.
Visual map of documentation dependencies¶
The arrows point from "required reading" to "depends on it". A document with multiple incoming arrows requires all of its prerequisites before it will make sense.
02-pdf-primer ──────────────────────────────────────────────┐
│ │
▼ ▼
14-glossary 01-architecture-overview
│
▼
15-spec-to-code
│
┌──────────────────┤
│ │
▼ ▼
03-parsing-model 05-graphics-state
│ │
▼ ▼
04-object-model 06-text-system
│
▼
07-search-geometry
│
┌──────────────────┘
│
▼
08-redaction-targets
│
▼
09-redaction-pipeline
│
▼
10-writer
│
┌───────────┤
│ │
▼ ▼
11-wasm-boundary 12-security-model
│
▼
13-limitations
16-top-ten-decisions ← read at any point; no prerequisites
Document index¶
| # | Title | Primary audience |
|---|---|---|
| 00 | Reading Order for New Contributors | Everyone |
| 01 | Architecture Overview | Everyone |
| 02 | PDF Primer | New contributors |
| 03 | Parsing Model | Parser contributors |
| 04 | Object Model | Parser contributors |
| 05 | Graphics State | Text / redaction contributors |
| 06 | Text System | Text contributors |
| 07 | Search Geometry | Text contributors |
| 08 | Redaction Targets | Redaction contributors |
| 09 | Redaction Pipeline | Redaction contributors |
| 10 | Writer | Writer / serialization contributors |
| 11 | WASM Boundary | WASM / JS contributors |
| 12 | Security Model | Everyone |
| 13 | Limitations | Everyone |
| 14 | Glossary | Everyone (reference) |
| 15 | Spec to Code Map | Everyone |
| 16 | Top Ten Decisions | Everyone |