Security and Correctness Model
1. Core security principle
A visible black rectangle is not a redaction unless the underlying content is removed or neutralized at the byte level.
2. What this engine guarantees
- Targeted text bytes are physically removed from or replaced in the content stream
- In Redact mode: kern compensation preserves layout, and an overlay covers the resulting gap
- In Strip/Erase modes: bytes are removed without overlay
- Metadata and attachments can be stripped
- Output is a single-revision PDF: no old content is accessible via a `Prev` chain
- `FileAttachment` annotations are always removed regardless of their position
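A caller might want a coarse, independent sanity check of the single-revision guarantee. The sketch below is not part of the engine: `count_occurrences` and `looks_single_revision` are hypothetical helper names, and the check is only a heuristic. It assumes a non-linearized file (linearized PDFs legitimately carry two `%%EOF` markers) and can be fooled by the marker strings appearing inside uncompressed stream data.

```rust
/// Counts occurrences of `needle` in `haystack` by sliding a window over the bytes.
fn count_occurrences(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() || haystack.len() < needle.len() {
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}

/// Heuristic only: a file written as a single revision normally has exactly
/// one `startxref` keyword and exactly one `%%EOF` marker. Each incremental
/// update appends another pair.
fn looks_single_revision(pdf: &[u8]) -> bool {
    count_occurrences(pdf, b"%%EOF") == 1 && count_occurrences(pdf, b"startxref") == 1
}
```

A failing check is a reason to inspect the file; a passing check is not proof that no earlier revision survives.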
3. What this engine does NOT guarantee
- Complete redaction of all copies of targeted content (text may appear in bookmarks, outlines, or destinations not parsed by this engine)
- Redaction of content inside Form XObjects (hard error if present on targeted pages)
- Redaction of content in unsupported font encodings
- Protection against PDF recovery or forensics on the original file
4. Defensive design choices
- Operator whitelist: unknown operators on redacted pages cause hard errors rather than silently passing through
- Explicit unsupported errors: encrypted PDFs, xref streams, unknown filters, and non-Identity-H encodings all fail explicitly
- Decompression bomb protection: 256 MiB limit on decoded stream size
- Page tree depth limit: `MAX_PAGE_TREE_DEPTH = 64` prevents stack overflow from malformed trees
- Cycle detection: applied in page tree traversal, `Prev` chain following, and reachable-ref collection
- Conservative annotation removal: annotations without a `Rect` are removed (except Links)
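Two of the limits above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the engine's code: `Node`, `WalkError`, `collect_pages`, `read_bounded`, and `MAX_DECODED_SIZE` are hypothetical names, and only the values 64 and 256 MiB come from the text.

```rust
use std::collections::{HashMap, HashSet};
use std::io::{self, Read};

/// Both limits are taken from the list above.
const MAX_PAGE_TREE_DEPTH: usize = 64;
const MAX_DECODED_SIZE: u64 = 256 * 1024 * 1024;

/// Hypothetical, simplified page tree node, keyed by object number.
enum Node {
    Pages(Vec<u32>), // intermediate node holding child object numbers
    Page,            // leaf page
}

#[derive(Debug, PartialEq)]
enum WalkError {
    TooDeep,
    Cycle(u32),
    Missing(u32),
}

/// Collects leaf pages in document order, rejecting deep or cyclic trees.
/// Any revisit counts as a cycle, so a page shared by two parents is also rejected.
fn collect_pages(
    objects: &HashMap<u32, Node>,
    id: u32,
    depth: usize,
    visited: &mut HashSet<u32>,
    out: &mut Vec<u32>,
) -> Result<(), WalkError> {
    if depth > MAX_PAGE_TREE_DEPTH {
        return Err(WalkError::TooDeep);
    }
    // `insert` returns false when the id was already present.
    if !visited.insert(id) {
        return Err(WalkError::Cycle(id));
    }
    match objects.get(&id).ok_or(WalkError::Missing(id))? {
        Node::Page => out.push(id),
        Node::Pages(kids) => {
            for &kid in kids {
                collect_pages(objects, kid, depth + 1, visited, out)?;
            }
        }
    }
    Ok(())
}

/// Reads at most `limit` bytes from a decoder and errors if more are available.
fn read_bounded<R: Read>(decoder: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so an oversized stream is detectable.
    decoder.take(limit.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decoded stream exceeds size limit",
        ));
    }
    Ok(buf)
}
```

In this sketch the depth check and the visited set are deliberately independent: the first bounds recursion on a long acyclic chain, the second stops a short loop well before the depth limit is reached.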
5. The "fail explicitly" philosophy
Every unsupported feature returns PdfError::Unsupported or PdfError::UnsupportedOption. The engine never silently degrades. This is critical for redaction: silent degradation could mean unredacted content passes through to the output file without the caller being aware.
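As a rough sketch of what failing explicitly can look like, the example below rejects anything it does not recognize. Only the variant names `PdfError::Unsupported` and `PdfError::UnsupportedOption` come from the text; their `String` payloads, the helper names `check_filter` and `check_encoding`, and the choice of `FlateDecode` as the sole accepted filter are assumptions made for illustration.

```rust
/// Hypothetical shape: the text names these variants but not their payloads.
#[derive(Debug, PartialEq)]
enum PdfError {
    Unsupported(String),
    #[allow(dead_code)]
    UnsupportedOption(String),
}

/// Accepts only filters that are explicitly handled.
/// The accepted set here is an assumption, not the engine's real list.
fn check_filter(name: &str) -> Result<(), PdfError> {
    match name {
        "FlateDecode" => Ok(()),
        other => Err(PdfError::Unsupported(format!("filter {}", other))),
    }
}

/// Accepts only the Identity-H encoding, mirroring the rule stated earlier.
fn check_encoding(name: &str) -> Result<(), PdfError> {
    if name == "Identity-H" {
        Ok(())
    } else {
        Err(PdfError::Unsupported(format!("encoding {}", name)))
    }
}
```

The point of the catch-all arm is that a new or misspelled name can never fall through to a default "pass it along" path.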
6. Known security-relevant limitations
- `v` and `y` Bezier curves: path bounds may be underestimated because these curves are not fully accumulated
- Quad intersection uses AABB approximation: for rotated quads, narrow slivers may be missed
- No ToUnicode for simple fonts: non-ASCII text in Type1/TrueType fonts appears as replacement characters and cannot be searched or redacted by text search
- Text in invisible mode (`Tr=3`): included in glyphs for redaction but excluded from search results. This is correct behavior, since you must be able to redact what you cannot see
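The invisible-text rule in the last item can be illustrated with a small sketch. The `Glyph` struct, its field names, and both functions are hypothetical; only the rule itself (render mode 3 is redactable but not searchable) comes from the text.

```rust
/// Hypothetical glyph record; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    ch: char,
    render_mode: u8, // value of the `Tr` operator in effect for this glyph
}

/// Text rendering mode 3 draws nothing on the page.
const INVISIBLE: u8 = 3;

/// Every glyph is a redaction candidate, visible or not.
fn redaction_candidates(glyphs: &[Glyph]) -> Vec<&Glyph> {
    glyphs.iter().collect()
}

/// Only glyphs outside invisible mode contribute to search text.
fn search_text(glyphs: &[Glyph]) -> String {
    glyphs
        .iter()
        .filter(|g| g.render_mode != INVISIBLE)
        .map(|g| g.ch)
        .collect()
}
```

With glyphs `a` (mode 0), `b` (mode 3), and `c` (mode 0), the search text built here would be `ac`, while all three glyphs remain available to a redaction rectangle.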
7. Why it was coded this way
- Whitelist over blacklist: an unknown operator might carry redactable content; passing it through blindly is unsafe
- Fail-explicit over fail-soft: for a redaction tool, silent failure is a security vulnerability, not a graceful degradation
- Conservative annotation removal: an annotation without geometric overlap may still contain sensitive information in its metadata
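The whitelist argument can be made concrete with a minimal sketch. The operator subset in `ALLOWED_OPERATORS` and the function `check_operators` are assumptions for illustration; the text does not list the engine's real whitelist or its error type.

```rust
/// Illustrative subset of content-stream operators, not the engine's real list.
const ALLOWED_OPERATORS: &[&str] = &[
    "BT", "ET", "Tf", "Tj", "TJ", "Td", "Tm", "Tr", "q", "Q", "cm", "re", "f",
];

/// Returns an error naming the first operator that is not on the whitelist.
fn check_operators(ops: &[&str]) -> Result<(), String> {
    for op in ops {
        if !ALLOWED_OPERATORS.iter().any(|allowed| *allowed == *op) {
            return Err(format!("unsupported operator: {}", op));
        }
    }
    Ok(())
}
```

A blacklist would invert the test and accept anything not explicitly named, so an operator the author never considered would pass through unexamined.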
8. What would break
| Change | Consequence |
|---|---|
| Switching to an operator blacklist | Unknown operators pass through; potential data leak |
| Allowing Form XObjects to pass through | Content inside them escapes redaction |
| Not stripping `Prev` from saved files | Entire pre-redaction document accessible via `Prev` chain |
| Not removing `FileAttachment` annotations | Attached files survive redaction intact |