Redaction Workflow
End-to-end pipeline
- Parse the PDF structure
- Traverse the page tree and normalize page boxes
- Parse page content streams into operations
- Extract glyph geometry and searchable text
- Normalize authoring input into canonical page-space targets
- Remove or neutralize intersecting text, vectors, images, and annotations
- Paint visible redaction fills
- Save a new deterministic PDF
Manual rectangles
Manual rectangle authoring is a UI convenience layer. The engine still receives canonical page-space targets, not DOM coordinates.
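As a rough illustration of what normalizing a UI rectangle into page space can involve, the sketch below converts a CSS-pixel rectangle into PDF points. The names `PageRect` and `dom_rect_to_page_rect` are hypothetical and not part of the engine's API. It assumes the rectangle is measured from the top-left corner of the rendered page, that the page is unrotated, and that `scale` is the number of CSS pixels per PDF point.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PageRect:
    """Axis-aligned rectangle in PDF page space (points, y grows upward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def dom_rect_to_page_rect(
    left: float,
    top: float,
    width: float,
    height: float,
    *,
    scale: float,
    page_box: Tuple[float, float, float, float],
) -> PageRect:
    """Map a CSS-pixel rectangle onto the page box (llx, lly, urx, ury).

    Hypothetical helper: assumes an unrotated page rendered at a uniform
    scale, with (left, top) measured from the page's top-left corner.
    """
    llx, _lly, _urx, ury = page_box
    x0 = llx + left / scale
    x1 = llx + (left + width) / scale
    # DOM y grows downward, PDF y grows upward, so flip against the top edge.
    y1 = ury - top / scale
    y0 = ury - (top + height) / scale
    return PageRect(x0, y0, x1, y1)


rect = dom_rect_to_page_rect(
    100, 50, 200, 40, scale=2.0, page_box=(0.0, 0.0, 612.0, 792.0)
)
```

Doing this conversion in the UI layer keeps the engine independent of zoom level and scroll position. A rotated page would need an additional transform that this sketch leaves out.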
Search-driven redaction
Search works in visual glyph order and returns quad groups. These can be passed directly into apply_redactions.
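To make the idea of quad groups concrete, here is a minimal sketch of a search over glyphs that are already in visual order. The `Glyph` type and the `search_quads` and `box_quad` functions are illustrative assumptions, not the engine's real search API. The sketch also assumes each glyph maps to exactly one character, which does not hold for ligatures.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Quad = Tuple[Point, Point, Point, Point]


@dataclass(frozen=True)
class Glyph:
    """One glyph in visual order (hypothetical shape for this sketch)."""
    char: str  # assumed to be exactly one character
    quad: Quad


def box_quad(x: float, y: float, w: float, h: float) -> Quad:
    """Build an axis-aligned quad, counter-clockwise from bottom-left."""
    return ((x, y), (x + w, y), (x + w, y + h), (x, y + h))


def search_quads(glyphs: List[Glyph], query: str) -> List[List[Quad]]:
    """Return one group of glyph quads per occurrence of `query`."""
    text = "".join(g.char for g in glyphs)
    groups: List[List[Quad]] = []
    start = text.find(query)
    while query and start != -1:
        groups.append([g.quad for g in glyphs[start:start + len(query)]])
        start = text.find(query, start + 1)  # also reports overlapping hits
    return groups


# Five glyphs spelling "ab ab", each 10 points wide on one baseline.
glyphs = [Glyph(c, box_quad(10.0 * i, 0.0, 10.0, 12.0)) for i, c in enumerate("ab ab")]
groups = search_quads(glyphs, "ab")
```

Each inner list is one match, so a caller can redact a single occurrence or all of them. The exact shape that apply_redactions expects is not shown here.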
Redaction modes
The mode field on RedactionPlan controls the visual and structural output:
| Mode | Bytes removed | Overlay painted | Surrounding text |
|---|---|---|---|
| strip | yes | no | shifts to fill gap |
| redact | yes (blank space) | yes | stays in place |
| erase | yes (blank space) | no | stays in place |
redact is the default when mode is omitted. The fill color for the overlay defaults to black and can be overridden via fill_color / fillColor.
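The mode table and its defaults can be modelled roughly as follows. This is a sketch only: `RedactionPlanSketch`, `ModeEffects`, and `MODE_EFFECTS` are hypothetical names, and representing `fill_color` as an RGB tuple in the 0 to 1 range is an assumption rather than the documented type.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Tuple


class ModeEffects(NamedTuple):
    bytes_removed: bool
    overlay_painted: bool
    text_shifts: bool  # True when surrounding text moves to fill the gap


# Mirrors the mode table row by row.
MODE_EFFECTS: Dict[str, ModeEffects] = {
    "strip": ModeEffects(True, False, True),
    "redact": ModeEffects(True, True, False),
    "erase": ModeEffects(True, False, False),
}


@dataclass(frozen=True)
class RedactionPlanSketch:
    """Hypothetical stand-in for RedactionPlan, reduced to two fields."""
    mode: str = "redact"  # default when mode is omitted
    fill_color: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # assumed RGB, black

    def __post_init__(self) -> None:
        if self.mode not in MODE_EFFECTS:
            raise ValueError(f"unknown redaction mode: {self.mode!r}")

    @property
    def effects(self) -> ModeEffects:
        return MODE_EFFECTS[self.mode]


default_plan = RedactionPlanSketch()
strip_plan = RedactionPlanSketch(mode="strip")
```

In this model the fill colour only matters when `effects.overlay_painted` is true, which is the case for redact alone.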
Apply semantics
- text glyphs intersecting a target are removed or replaced according to the selected mode
- intersecting path paints are neutralized
- intersecting image draws are removed conservatively at invocation level
- Form XObjects whose bounding quad intersects a target are redacted via per-page copy-on-write: the copy's content stream is rewritten for text, vector paint, and inner Image Do invocations, and nested Forms recurse up to depth 8
- documents with hidden-by-default Optional Content Groups are refused outright so hidden-layer content cannot slip through
- optional annotation removal can strip intersecting annotation objects from touched pages
- when overlayText is set in redact mode, a Helvetica label is stamped inside each overlay rectangle, auto-sized to fit and coloured for contrast against the fill
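The intersection test behind the first rule above might look like the sketch below. It is an assumption about one reasonable approach, not the engine's actual geometry code: each glyph quad is reduced to its axis-aligned bounding box, which over-approximates rotated glyphs and so errs toward removing more. Rectangles that only touch along an edge are treated as not intersecting, which is also a choice made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # x0, y0, x1, y1 in page space


def quad_bounds(quad: Sequence[Point]) -> Rect:
    """Axis-aligned bounding box of a quad."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def rects_intersect(a: Rect, b: Rect) -> bool:
    """True when the rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_glyphs(
    glyph_quads: Sequence[Sequence[Point]],
    targets: Sequence[Rect],
) -> Tuple[List[int], List[int]]:
    """Return (kept, removed) glyph indices for the given targets."""
    kept: List[int] = []
    removed: List[int] = []
    for index, quad in enumerate(glyph_quads):
        bounds = quad_bounds(quad)
        hit = any(rects_intersect(bounds, target) for target in targets)
        (removed if hit else kept).append(index)
    return kept, removed


# Three glyphs, each 10 points wide and 12 points tall, on one baseline.
quads = [((x, 0.0), (x + 10.0, 0.0), (x + 10.0, 12.0), (x, 12.0)) for x in (0.0, 10.0, 20.0)]
kept, removed = split_glyphs(quads, [(12.0, 0.0, 18.0, 5.0)])
```

What happens to the removed glyphs afterwards depends on the selected mode. This sketch only decides which glyphs are affected.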
Save semantics
The writer emits a new PDF with a full save. Rewritten content streams are FlateDecode-compressed. The output does not rely on hidden references back to the original file, and xref streams and object streams from the input are flattened into a single classic xref table with inline indirect objects.
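For readers unfamiliar with the on-disk shapes involved, the following sketch shows a FlateDecode-compressed stream object and a classic xref table built with Python's standard `zlib` module. The function names are hypothetical and this is not the writer's actual code. It only illustrates the PDF syntax: FlateDecode data is a zlib stream, and each classic xref entry is exactly 20 bytes.

```python
import zlib
from typing import List


def flate_stream_object(obj_num: int, content: bytes) -> bytes:
    """Serialize `content` as an indirect stream object using FlateDecode."""
    data = zlib.compress(content)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num,
        len(data),
    )
    return header + data + b"\nendstream\nendobj\n"


def classic_xref(offsets: List[int]) -> bytes:
    """Build a classic xref table; offsets[i] is the byte offset of object i + 1."""
    parts = [
        b"xref\n",
        b"0 %d\n" % (len(offsets) + 1),
        b"0000000000 65535 f \n",  # object 0 heads the free list
    ]
    # Each in-use entry: 10-digit offset, 5-digit generation, 'n', 2-byte EOL.
    parts += [b"%010d 00000 n \n" % offset for offset in offsets]
    return b"".join(parts)


stream_obj = flate_stream_object(4, b"BT /F1 12 Tf (hello) Tj ET")
table = classic_xref([15, 120])
```

Nothing in this sketch depends on the clock or on random values, so the same input yields the same bytes under a given zlib build, which is one ingredient of a deterministic save.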