Writing and Saving Sanitized PDFs
1. Overview
The output pipeline is deliberately simple. `pdf_writer::save_document()` wraps `pdf_objects::serialize_pdf()`. The writer produces a clean, single-revision PDF with a classic xref table.
2. Serialization process (serialize_pdf)
- Write header: `%PDF-<version>\n%\xFF\xFF\xFF\xFF\n` (high bytes trigger binary detection in transfer tools)
- Write each object in `BTreeMap` order (deterministic by object number)
- For streams: update `Length` to current `data.len()`, ensure newline before `endstream`
- Build single xref table with byte offsets
- Write trailer dictionary, crucially removing the `Prev` and `XRefStm` keys
- Write `startxref` and `%%EOF`
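The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `serialize_pdf` code: the `Objects` alias, the `serialize_sketch` name, the pre-serialized object bodies, and the fixed generation number 0 are all assumptions made for the example, and stream `Length` handling is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the document's object table: object number ->
/// already-serialized object body (generation number assumed to be 0).
type Objects = BTreeMap<u32, Vec<u8>>;

/// Sketch of a single-revision writer with a classic xref table.
fn serialize_sketch(version: &str, objects: &Objects, root: u32) -> Vec<u8> {
    let mut out = Vec::new();

    // Header, followed by a comment line of four high bytes.
    out.extend_from_slice(format!("%PDF-{version}\n").as_bytes());
    out.extend_from_slice(b"%\xFF\xFF\xFF\xFF\n");

    // BTreeMap iterates in ascending key order, so objects are written
    // sorted by object number. Record each byte offset as we go.
    let mut offsets = BTreeMap::new();
    for (num, body) in objects {
        offsets.insert(*num, out.len());
        out.extend_from_slice(format!("{num} 0 obj\n").as_bytes());
        out.extend_from_slice(body);
        out.extend_from_slice(b"\nendobj\n");
    }

    // One xref subsection covering object numbers 0..size.
    // Each entry is exactly 20 bytes. Free-list chaining is simplified:
    // every unused number gets the same free entry.
    let xref_pos = out.len();
    let size = objects.keys().next_back().map_or(1, |n| n + 1);
    out.extend_from_slice(format!("xref\n0 {size}\n").as_bytes());
    for num in 0..size {
        let entry = match offsets.get(&num) {
            Some(off) => format!("{off:010} 00000 n \n"),
            None => "0000000000 65535 f \n".to_string(),
        };
        out.extend_from_slice(entry.as_bytes());
    }

    // Fresh trailer: only Root and Size, so no Prev or XRefStm can survive.
    out.extend_from_slice(
        format!("trailer\n<< /Root {root} 0 R /Size {size} >>\n").as_bytes(),
    );
    out.extend_from_slice(format!("startxref\n{xref_pos}\n%%EOF\n").as_bytes());
    out
}
```

Because the trailer in this sketch is built from scratch rather than copied, there is nothing to strip; a writer that starts from the original trailer has to remove the keys explicitly.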
3. Value serialization
- Numbers: zero fractional part → no decimal point; otherwise 6 decimal places with trailing zeros trimmed
- Names: `#XX` hex escapes for non-printable bytes, delimiter characters, and `#` itself
- Strings: escape `(`, `)`, `\`; named escapes for control characters; octal for non-printable bytes
- Dictionaries: keys written in BTree order (alphabetical)
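A small sketch of the number, name, and string rules may make them concrete. The function names `format_number`, `escape_name`, and `escape_string` are hypothetical and the edge-case handling (for example, very large numbers or values that round to zero) is an assumption, not a description of the real serializer.

```rust
/// Integers print without a decimal point; other values print with
/// 6 decimal places and trailing zeros trimmed.
fn format_number(x: f64) -> String {
    if x.fract() == 0.0 {
        // Assumes the value fits in an i64.
        format!("{}", x as i64)
    } else {
        let s = format!("{x:.6}");
        s.trim_end_matches('0').trim_end_matches('.').to_string()
    }
}

/// Write a name with #XX escapes for '#', delimiter characters, and any
/// byte outside the printable range 0x21..=0x7E (which includes space).
fn escape_name(name: &[u8]) -> String {
    let mut out = String::from("/");
    for &b in name {
        let is_delim = b"()<>[]{}/%".contains(&b);
        if b == b'#' || is_delim || !(0x21..=0x7E).contains(&b) {
            out.push_str(&format!("#{b:02X}"));
        } else {
            out.push(b as char);
        }
    }
    out
}

/// Write a literal string: backslash-escape parentheses and backslash,
/// use named escapes for common control characters, and three-digit
/// octal for every other non-printable byte.
fn escape_string(s: &[u8]) -> Vec<u8> {
    let mut out = vec![b'('];
    for &b in s {
        match b {
            b'(' | b')' | b'\\' => {
                out.push(b'\\');
                out.push(b);
            }
            b'\n' => out.extend_from_slice(b"\\n"),
            b'\r' => out.extend_from_slice(b"\\r"),
            b'\t' => out.extend_from_slice(b"\\t"),
            0x08 => out.extend_from_slice(b"\\b"),
            0x0C => out.extend_from_slice(b"\\f"),
            0x20..=0x7E => out.push(b),
            _ => out.extend_from_slice(format!("\\{b:03o}").as_bytes()),
        }
    }
    out.push(b')');
    out
}
```

Under these assumptions, `format_number(1.5)` gives `1.5`, `format_number(3.0)` gives `3`, and `escape_name(b"A B#")` gives `/A#20B#23`.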
4. Why incremental updates are flattened
The writer always produces a single-revision document. This is critical for security:
- `Prev` and `XRefStm` removal prevents readers from following the chain to old revisions
- A redacted document with old revisions would leak the unredacted content to any reader that follows the `Prev` pointer
- Single revision = simpler output, smaller file, no revision history
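As an illustration of the key removal, here is a minimal sketch that assumes the trailer is modelled as a `BTreeMap` from key name to an already-serialized value. The `flatten_trailer` and `write_trailer` names and this representation are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical trailer model: key name (without the slash) -> serialized value.
type Trailer = BTreeMap<String, String>;

/// Return a copy of the trailer with the keys that point at earlier
/// revisions or at a hybrid xref stream removed.
fn flatten_trailer(original: &Trailer) -> Trailer {
    let mut trailer = original.clone();
    trailer.remove("Prev");
    trailer.remove("XRefStm");
    trailer
}

/// Serialize the trailer dictionary; BTreeMap iteration gives sorted keys.
fn write_trailer(trailer: &Trailer) -> String {
    let mut out = String::from("trailer\n<<");
    for (key, value) in trailer {
        out.push_str(&format!(" /{key} {value}"));
    }
    out.push_str(" >>\n");
    out
}
```

Dropping the keys only hides the old revisions from readers; the bytes of those revisions are gone because the writer emits every object afresh into a new file rather than appending to the old one.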
5. Why new content streams are uncompressed
The redaction engine writes new content stream bytes without re-compression (no FlateDecode filter applied). Reasons:
- Simplicity: avoids re-encoding complexity
- Debuggability: uncompressed streams are human-readable during development and debugging
- The original compressed streams are replaced entirely, not patched
- Trade-off: slightly larger output files
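The sketch below shows one way a replaced stream could be written out uncompressed. The `Stream` struct, the `replace_content` and `write_stream` names, and the removal of `Filter` and `DecodeParms` from the dictionary are assumptions for illustration; the source only states that no FlateDecode filter is applied and that `Length` is updated at write time.

```rust
use std::collections::BTreeMap;

/// Hypothetical stream model: dictionary entries plus raw data bytes.
struct Stream {
    dict: BTreeMap<String, String>,
    data: Vec<u8>,
}

/// Replace the stream contents wholesale with uncompressed bytes.
/// Assumption: filter entries are dropped so the dictionary matches the data.
fn replace_content(stream: &mut Stream, new_data: Vec<u8>) {
    stream.data = new_data;
    stream.dict.remove("Filter");
    stream.dict.remove("DecodeParms");
}

/// Serialize the stream, setting Length from the current data and making
/// sure a newline precedes `endstream`.
fn write_stream(stream: &Stream) -> Vec<u8> {
    let mut dict = stream.dict.clone();
    dict.insert("Length".to_string(), stream.data.len().to_string());

    let mut out = b"<<".to_vec();
    for (key, value) in &dict {
        out.extend_from_slice(format!(" /{key} {value}").as_bytes());
    }
    out.extend_from_slice(b" >>\nstream\n");
    out.extend_from_slice(&stream.data);
    // The separating newline is not counted in Length.
    if !stream.data.ends_with(b"\n") {
        out.push(b'\n');
    }
    out.extend_from_slice(b"endstream");
    out
}
```

Recomputing `Length` at write time means a stale value from the original compressed stream can never reach the output.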
6. The pdf_writer crate
Currently a one-function passthrough (`save_document` → `serialize_pdf`). It exists as a named boundary:
- Allows future enhancements (object renumbering, compression, linearization) without touching `pdf_objects`
- Keeps the dependency graph clean: callers depend on `pdf_writer`, not on `pdf_objects` serialization internals
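The boundary can be pictured with two modules standing in for the two crates. Everything here apart from the names `save_document` and `serialize_pdf` is assumed: the `Document` type, the signatures, and the placeholder serializer body are hypothetical.

```rust
/// Stand-in for the pdf_objects crate (types and signature are assumed).
mod pdf_objects {
    pub struct Document {
        pub version: String,
    }

    /// Placeholder serializer; the real one writes objects, xref, and trailer.
    pub fn serialize_pdf(doc: &Document) -> Vec<u8> {
        format!("%PDF-{}\n%%EOF\n", doc.version).into_bytes()
    }
}

/// Stand-in for the pdf_writer crate: a one-function passthrough.
mod pdf_writer {
    use super::pdf_objects::{self, Document};

    /// Callers use this entry point, so later steps (renumbering,
    /// compression, linearization) could be added here without changing them.
    pub fn save_document(doc: &Document) -> Vec<u8> {
        pdf_objects::serialize_pdf(doc)
    }
}
```

With this shape, a caller would write `pdf_writer::save_document(&doc)` and never name the serializer directly.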
7. What would break
| Omission | Consequence |
|---|---|
| Not removing `Prev` | Old unredacted content accessible via the `Prev` chain: security violation |
| Not updating `Length` | Readers cannot find `endstream`; file corrupted |
| `HashMap` instead of `BTreeMap` | Non-deterministic output order; tests become flaky |
| Not flattening to single revision | Pre-redaction content survives in the file |