Parsing Model¶
This document is a deep dive into how pdf_objects parses a raw PDF byte slice into a structured ParsedDocument. It covers the entry point, each stage of the pipeline, the cursor abstraction, stream handling, and document building — plus the reasoning behind every significant design decision.
1. Entry point¶
pdf_objects::parse_pdf is the single public entry point for parsing. It accepts a raw byte slice and returns a ParsedDocument that owns the parsed object table, trailer, and page tree. Every other parsing function in the crate is an implementation detail called transitively from here.
2. Parsing pipeline¶
Parsing proceeds in five ordered stages. Each stage depends on the output of the previous one.
Stage 1: parse_header¶
Checks that the file begins with %PDF- and extracts the version string (e.g., 1.7 or 2.0). The version is stored on the resulting document but is not used to gate behavior at this stage — the engine treats all version strings uniformly unless a later stage encounters a version-specific construct it cannot handle.
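The header check can be sketched as follows. Function name and return type are illustrative stand-ins, not the crate's actual signatures:

```rust
/// Stage 1 sketch: verify the %PDF- magic and extract the version string.
/// Returns None if the header is missing (a hypothetical simplification;
/// the real parser returns a structured error).
fn parse_header(bytes: &[u8]) -> Option<String> {
    let rest = bytes.strip_prefix(b"%PDF-")?;
    // The version is the run of digits and dots immediately after the magic.
    let version: Vec<u8> = rest
        .iter()
        .copied()
        .take_while(|b| b.is_ascii_digit() || *b == b'.')
        .collect();
    String::from_utf8(version).ok()
}
```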
Stage 2: find_startxref¶
Scans backwards from the end of the file for the startxref keyword, then reads the decimal integer that follows it on the same or next line. This integer is the byte offset of the cross-reference table (or xref stream, which this engine rejects — see Stage 3).
Scanning backwards rather than forwards is mandated by the PDF specification: the startxref entry is always near the end of the file, and scanning from the beginning would be fragile in the presence of binary content that happens to contain the ASCII bytes of startxref.
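The backward scan can be sketched like this. The name and error handling are assumptions for illustration, not the crate's real API:

```rust
/// Locate the last `startxref` keyword in `bytes` and return the decimal
/// offset that follows it.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    let needle = b"startxref";
    // rposition finds the LAST occurrence, i.e. the one nearest the end of
    // the file, which is the one that governs the current revision.
    let pos = bytes.windows(needle.len()).rposition(|w| w == needle)?;
    // The byte offset follows on the same or next line, after whitespace.
    let digits: Vec<u8> = bytes[pos + needle.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```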
Stage 3: parse_xref_table¶
This is the most structurally complex stage. It resolves the entire cross-reference chain and merges all revisions into a single flat object table.
Cycle detection. A BTreeSet<usize> of visited byte offsets is maintained across the chain walk. If parse_xref_table is asked to follow a Prev pointer to an offset it has already visited, it returns an error rather than looping forever. This protects against both accidentally malformed and deliberately adversarial PDFs.
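The chain walk with cycle detection reduces to the following shape. This is a minimal sketch: the `prev_of` closure stands in for reading the trailer's Prev key at each offset, which the real code does by parsing the section:

```rust
use std::collections::BTreeSet;

/// Follow a chain of xref offsets, erroring out on any revisit.
fn walk_chain(
    start: usize,
    prev_of: impl Fn(usize) -> Option<usize>,
) -> Result<Vec<usize>, String> {
    let mut visited = BTreeSet::new();
    let mut order = Vec::new();
    let mut next = Some(start);
    while let Some(offset) = next {
        // insert() returns false if the offset was already present:
        // a Prev pointer has looped back on itself.
        if !visited.insert(offset) {
            return Err(format!("xref cycle at offset {offset}"));
        }
        order.push(offset);
        next = prev_of(offset);
    }
    Ok(order)
}
```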
Section parsing. For each xref table offset (starting from startxref, then following each Prev), parse_xref_section reads:
- One or more xref subsections, each with a starting object number and entry count.
- Each 20-byte entry: in-use entries (n) record the byte offset of the object; free entries (f) are noted but not materialized into objects.
- The trailer dictionary that follows all subsections.
Merge policy. The table is built as a BTreeMap<ObjectRef, XRefEntry> using BTreeMap::entry().or_insert(). Because the chain is walked newest-revision-first (from the most recent startxref backward through Prev), the first time any given ObjectRef is inserted it comes from the newest revision. or_insert preserves that first-seen entry and ignores subsequent (older) entries for the same ref. This correctly implements incremental update semantics: newer revisions win.
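The newest-wins semantics fall out of `or_insert` directly. A minimal sketch, with tuples standing in for the crate's ObjectRef and XRefEntry types:

```rust
use std::collections::BTreeMap;

// Illustrative stand-in: (object number, generation).
type ObjectRef = (u32, u16);

/// Merge revisions visited newest-first. For each ref, the first
/// insertion (from the newest revision) wins; later ones are ignored.
fn merge_revisions(revisions: &[Vec<(ObjectRef, u64)>]) -> BTreeMap<ObjectRef, u64> {
    let mut table = BTreeMap::new();
    for revision in revisions {
        for &(obj, offset) in revision {
            // or_insert preserves the first-seen entry for this key.
            table.entry(obj).or_insert(offset);
        }
    }
    table
}
```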
Xref streams rejected. If the trailer contains an XRefStm key, the engine returns Err(PdfError::Unsupported). Xref streams (PDF 1.5+) use a compressed binary format that would require a separate decoding path and would complicate the bootstrap problem described in Stage 4. Failing explicitly here is safer than silently misreading the object table.
Output. Stage 3 returns the merged BTreeMap<ObjectRef, XRefEntry> and the newest trailer dictionary.
Stage 4: Parse each in-use object¶
For each in-use entry in the merged xref table, parse_indirect_object is called at the recorded byte offset. It reads:
1. The object number and generation number from the N G obj header.
2. The object body — either a PdfValue or a stream (dictionary followed by stream ... endstream).
3. The endobj keyword.
The resulting PdfObject is inserted into the object table keyed by its ObjectRef.
Stream parsing at this stage only reads and stores the raw (still-encoded) bytes. Decompression happens lazily on demand, never during the initial parse pass.
Stage 5: build_document¶
Validates the object table and page tree structure, then builds the final ParsedDocument. Details in section 5 below.
3. Cursor-based parsing¶
All low-level parsing is done through a Cursor struct:
The cursor holds a reference to the original byte slice and a mutable position. There are no heap allocations for intermediate tokens; the cursor reads directly from the input.
skip_ws_and_comments¶
Advances past any combination of:
- Space (0x20), tab (0x09), carriage return (0x0D), line feed (0x0A), form feed (0x0C), NUL (0x00)
- % comment lines — consumes everything up to and including the next line ending
PDF defines whitespace broadly and allows comments almost anywhere between tokens. Handling them in a single method called before every token read keeps the rest of the parsing code clean.
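The cursor and this method can be sketched as follows; field and method names are assumptions mirroring the description above, not the crate's actual definitions:

```rust
/// Minimal cursor sketch: a borrowed byte slice plus a mutable position.
struct Cursor<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn skip_ws_and_comments(&mut self) {
        while self.pos < self.bytes.len() {
            match self.bytes[self.pos] {
                // PDF whitespace: NUL, tab, LF, FF, CR, space.
                0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20 => self.pos += 1,
                b'%' => {
                    // Comment: consume up to the next line ending; the
                    // EOL itself is then eaten by the whitespace arm.
                    while self.pos < self.bytes.len()
                        && !matches!(self.bytes[self.pos], b'\r' | b'\n')
                    {
                        self.pos += 1;
                    }
                }
                _ => break,
            }
        }
    }
}
```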
parse_value¶
The main dispatch method. Peeks at the first non-whitespace byte and routes to the appropriate sub-parser:
| First byte(s) | Target |
|---|---|
| t, f | Boolean (true, false) |
| n | Null |
| ( | Literal string |
| < followed by < | Dictionary |
| < | Hex string |
| [ | Array |
| / | Name |
| 0–9, -, +, . | Number or indirect reference |
parse_name¶
Reads a /-prefixed PDF name. Handles #XX hex escapes: a # followed by two hex digits is decoded to the corresponding byte value. This allows names to contain bytes that would otherwise be syntactically significant (e.g., /Type#20Name → /Type Name).
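The escape decoding can be sketched as a standalone function. This is a simplified sketch over the name body (the part after the /); the real parser also stops at delimiter bytes:

```rust
/// Decode the #XX hex escapes in a PDF name body.
/// Returns None on a truncated or malformed escape.
fn decode_name(raw: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < raw.len() {
        if raw[i] == b'#' {
            // A '#' must be followed by exactly two hex digits.
            let hex = raw.get(i + 1..i + 3)?;
            let s = std::str::from_utf8(hex).ok()?;
            out.push(u8::from_str_radix(s, 16).ok()?);
            i += 3;
        } else {
            out.push(raw[i]);
            i += 1;
        }
    }
    Some(out)
}
```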
parse_literal_string¶
Reads a (...) string with the following escape handling:
- \n → LF, \r → CR, \t → tab, \b → backspace, \f → form feed
- \( → (, \) → ), \\ → \
- \ddd (1–3 octal digits) → the corresponding byte value
- \ followed by a line ending → line continuation (the newline is discarded)
- Nested parentheses are tracked with a depth counter; the string ends at the unmatched ).
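The depth-counter logic can be sketched as follows. This is a deliberately simplified version handling nesting and the single-character escapes; octal escapes and line continuations are omitted for brevity:

```rust
/// Parse a (...) literal string starting at the opening paren.
fn parse_literal_string(bytes: &[u8]) -> Option<Vec<u8>> {
    if bytes.first() != Some(&b'(') {
        return None;
    }
    let mut out = Vec::new();
    let mut depth = 1; // track nested parentheses
    let mut i = 1;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                let esc = *bytes.get(i + 1)?;
                out.push(match esc {
                    b'n' => b'\n',
                    b'r' => b'\r',
                    b't' => b'\t',
                    b'b' => 0x08,
                    b'f' => 0x0C,
                    other => other, // covers \(, \), and \\
                });
                i += 2;
            }
            b'(' => { depth += 1; out.push(b'('); i += 1; }
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(out); // the unmatched ) ends the string
                }
                out.push(b')');
                i += 1;
            }
            other => { out.push(other); i += 1; }
        }
    }
    None // unterminated string
}
```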
parse_hex_string¶
Reads a <...> hex string. Whitespace within the angle brackets is filtered out (PDF allows it). If the number of hex digits is odd, a trailing 0 is appended before decoding, producing the correct behavior per the spec.
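Both rules (whitespace filtering and odd-digit padding) fit in a few lines. A sketch over the bracket contents:

```rust
/// Decode the body of a <...> hex string (the bytes between the brackets).
fn decode_hex_string(body: &[u8]) -> Option<Vec<u8>> {
    // PDF allows whitespace between hex digits; strip it first.
    let mut digits: Vec<u8> = body
        .iter()
        .copied()
        .filter(|b| !b.is_ascii_whitespace())
        .collect();
    if digits.len() % 2 == 1 {
        digits.push(b'0'); // spec: an odd final digit is padded with 0
    }
    digits
        .chunks(2)
        .map(|pair| {
            let s = std::str::from_utf8(pair).ok()?;
            u8::from_str_radix(s, 16).ok()
        })
        .collect() // Option<Vec<u8>>: None if any pair is not hex
}
```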
parse_number_or_reference¶
PDF's indirect reference syntax (N G R) is ambiguous at the point of reading the first integer: it could be a standalone integer or the start of a three-token reference. This method resolves the ambiguity speculatively:
1. Parse the first integer N.
2. Save the cursor position.
3. Skip whitespace; try to parse a second integer G.
4. Skip whitespace; check for R.
5. If all three succeed → return PdfValue::Reference(N, G).
6. Otherwise → restore the saved position and return PdfValue::Integer(N) (or PdfValue::Number if a decimal point was present).
The rollback is cheap (one integer assignment) and avoids the need for a lookahead buffer or a separate tokenization pass.
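The save-and-restore pattern looks like this in sketch form. The enum and helper names are illustrative; `pos` plays the cursor role, and restoring it is the one-assignment rollback described above:

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
    Reference(i64, i64),
}

/// Speculatively try N G R; on failure, roll back to just after N.
fn parse_number_or_reference(bytes: &[u8], pos: &mut usize) -> Option<Value> {
    let n = parse_integer(bytes, pos)?;
    let saved = *pos; // save the cursor position
    skip_ws(bytes, pos);
    if let Some(g) = parse_integer(bytes, pos) {
        skip_ws(bytes, pos);
        if bytes.get(*pos) == Some(&b'R') {
            *pos += 1;
            return Some(Value::Reference(n, g)); // all three tokens matched
        }
    }
    *pos = saved; // rollback: a single integer assignment
    Some(Value::Integer(n))
}

fn skip_ws(bytes: &[u8], pos: &mut usize) {
    while bytes.get(*pos).map_or(false, |b| b.is_ascii_whitespace()) {
        *pos += 1;
    }
}

fn parse_integer(bytes: &[u8], pos: &mut usize) -> Option<i64> {
    let start = *pos;
    while bytes.get(*pos).map_or(false, |b| b.is_ascii_digit()) {
        *pos += 1;
    }
    if *pos == start {
        return None;
    }
    std::str::from_utf8(&bytes[start..*pos]).ok()?.parse().ok()
}
```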
4. Stream parsing¶
A PDF stream object is a dictionary followed by the keyword stream, raw bytes, then endstream. The parser must know how many bytes to read, which creates a subtle bootstrap problem.
Length resolution¶
The stream dictionary contains a Length entry giving the byte count of the raw data. The ideal approach would be to resolve Length as an indirect reference (many PDFs write Length N 0 R). But at the time each stream is being parsed, the object table is not yet fully built — we are currently in the process of building it. Resolving Length as a reference would require looking up an object that may not have been parsed yet.
The parser therefore reads Length only as a direct integer literal. If Length is a reference, the direct integer read fails and the parser falls through to the fallback.
Fallback: scan for endstream¶
When Length is missing, wrong, or an unresolvable reference, the parser scans forward byte-by-byte for the literal sequence endstream. This is slower but handles the large fraction of real-world PDFs that use a reference for Length or have an inaccurate Length value.
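The fallback scan can be sketched as follows. Trimming the EOL immediately before the keyword is an assumption about how the boundary is normalized, not confirmed crate behavior:

```rust
/// Scan forward from the data start for the literal `endstream` and
/// return the raw bytes before it.
fn scan_for_endstream(bytes: &[u8], data_start: usize) -> Option<&[u8]> {
    let needle = b"endstream";
    let rel = bytes[data_start..]
        .windows(needle.len())
        .position(|w| w == needle)?;
    let mut end = data_start + rel;
    // The spec places an EOL before `endstream`; exclude it from the data.
    if end > data_start && bytes[end - 1] == b'\n' { end -= 1; }
    if end > data_start && bytes[end - 1] == b'\r' { end -= 1; }
    Some(&bytes[data_start..end])
}
```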
consume_stream_line_break¶
The PDF spec requires exactly one line ending (CR, LF, or CRLF) between the stream keyword and the first byte of data. The parser consumes exactly that sequence before recording the data start offset. Consuming more or fewer bytes here would shift the recorded data start, corrupting the first bytes of every stream.
5. Document building (build_document)¶
Encryption check¶
If the trailer dictionary contains an Encrypt key, build_document immediately returns Err(PdfError::Unsupported). There is no attempt to parse encrypted content. Attempting to process encrypted PDF structure as if it were plaintext would produce garbage results with no error signal — explicit rejection is the only safe behavior.
Page tree traversal: collect_pages¶
PDF pages are organized in a Pages tree of arbitrary depth. collect_pages recursively walks the tree and flattens it into an ordered Vec<Page>, carrying inherited properties downward at each level.
Inheritance. The following properties are inherited from ancestor nodes if not defined on the leaf:
- Resources — font and graphics state dictionaries
- MediaBox — the nominal page rectangle
- CropBox — defaults to MediaBox when absent
- Rotate — page rotation in degrees; defaults to 0
Depth limit. MAX_PAGE_TREE_DEPTH = 64. If the recursion depth exceeds this value, collect_pages returns an error. A deeply nested page tree is almost certainly either malformed or adversarial; 64 levels is far beyond what any real document requires.
Cycle detection. A BTreeSet<ObjectRef> of visited page tree nodes is passed through the recursion. If a node references an object that is already in the visited set, traversal stops with an error. This prevents infinite loops on PDFs with cyclic Kids arrays.
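The traversal with both guards can be sketched as follows. The toy Node enum stands in for the crate's real page dictionaries, and the inherited-attribute plumbing is omitted for brevity:

```rust
use std::collections::{BTreeMap, BTreeSet};

const MAX_PAGE_TREE_DEPTH: usize = 64;

/// Illustrative stand-in for Pages/Page dictionaries, keyed by object number.
enum Node {
    Pages { kids: Vec<u32> },
    Page,
}

/// Depth-first flatten with the depth limit and cycle check described above.
fn collect_pages(
    objects: &BTreeMap<u32, Node>,
    node: u32,
    depth: usize,
    visited: &mut BTreeSet<u32>,
    out: &mut Vec<u32>,
) -> Result<(), String> {
    if depth > MAX_PAGE_TREE_DEPTH {
        return Err("page tree too deep".into());
    }
    // insert() returns false on a revisit: a cyclic Kids array.
    if !visited.insert(node) {
        return Err(format!("cycle at object {node}"));
    }
    match objects.get(&node).ok_or("missing object")? {
        Node::Page => out.push(node),
        Node::Pages { kids } => {
            for &kid in kids {
                collect_pages(objects, kid, depth + 1, visited, out)?;
            }
        }
    }
    Ok(())
}
```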
6. Step-by-step walkthrough: parsing a tiny PDF¶
Consider this minimal valid PDF (simplified, showing structure):
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>
endobj
xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
trailer
<< /Size 4 /Root 1 0 R >>
startxref
210
%%EOF
Step 1 (parse_header): The cursor reads %PDF-1.4. Version string "1.4" is recorded. Position advances past the newline.
Step 2 (find_startxref): The parser scans backward from the end. It finds startxref followed by 210. Byte offset 210 is the xref table start.
Step 3 (parse_xref_table): The parser positions the cursor at byte 210 and reads:
- Keyword xref
- Subsection header 0 4 — starting object number 0, four entries
- Four 20-byte entries:
- Object 0, gen 65535: free (type f) — ignored
- Object 1, gen 0: in-use at byte 9 (n)
- Object 2, gen 0: in-use at byte 58 (n)
- Object 3, gen 0: in-use at byte 115 (n)
- Trailer dictionary << /Size 4 /Root 1 0 R >>
No Prev key in the trailer, so the chain walk stops. The merged xref table has three in-use entries. The trailer is stored.
Step 4 (object parsing): For each in-use entry:
- Byte 9: cursor reads 1 0 obj, then dictionary << /Type /Catalog /Pages 2 0 R >>, then endobj. Stored as ObjectRef { 1, 0 } → PdfObject::Value(PdfValue::Dictionary(...)).
- Byte 58: object 2 — the Pages node.
- Byte 115: object 3 — the Page leaf, containing a MediaBox array.
Step 5 (build_document): No Encrypt key. Trailer has /Root 1 0 R. Object 1 is resolved; its /Type is /Catalog and /Pages points to object 2. collect_pages starts at object 2:
- Object 2 is a Pages node with /Kids [3 0 R] and /Count 1.
- Recursion enters object 3. It is a Page leaf.
- MediaBox [0 0 612 792] is read directly from object 3.
- CropBox defaults to MediaBox. Rotate defaults to 0.
- One Page is appended to the result vector.
ParsedDocument is returned with one page, version "1.4", and three objects in the table.
7. Design rationale¶
Cursor-based rather than tokenizer-based. A separate tokenizer would produce an intermediate Vec<Token> before any structure is parsed. The cursor approach parses structure directly from bytes in a single pass. The code is simpler, the allocation profile is lower, and there is no impedance mismatch between the token stream and the byte-level details (like stream data offsets) that must be tracked precisely.
Speculative reference parsing. The three-token N G R syntax is not prefixed by a unique sigil, so the parser cannot know at byte 0 of the value whether it is reading an integer or a reference. The speculative rollback strategy handles this without backtracking buffers or two-pass parsing.
Length hint with scan fallback. Real-world PDFs routinely use Length N 0 R in stream dictionaries. Requiring a directly-encoded integer would reject those PDFs outright. The scan fallback is slower but makes the parser useful on documents generated by the majority of PDF producers.
Incremental update flattening at parse time. The xref chain walk and or_insert merge policy collapse all revisions into a single object table before any consumer sees the document. This means the rest of the engine — extraction, redaction, serialization — never needs to reason about revision history. The tradeoff is that revision history cannot be recovered after parsing; that is acceptable for a redaction engine, which only cares about the current logical state of the document.
8. What breaks if you change this carelessly¶
| Change | Consequence |
|---|---|
| Remove xref cycle detection (BTreeSet<usize>) | Infinite loop on any PDF with a self-referential Prev pointer |
| Change or_insert to insert | Old revisions overwrite new ones; deleted or replaced objects reappear |
| Resolve Length as an indirect reference during parse | Deadlock or panic: the object table is not built yet |
| Remove the endstream fallback | PDFs where Length is a reference (common) fail to parse; PDFs with an inaccurate Length (also common) yield corrupt stream data |
| Remove MAX_PAGE_TREE_DEPTH | Stack overflow on deeply nested or cyclically structured page trees |
| Remove page tree cycle detection (BTreeSet<ObjectRef>) | Infinite recursion on cyclic Kids arrays |
| Accept XRefStm without implementing xref stream decoding | Object table is silently incomplete; missing objects appear as errors later |