Redaction Application Pipeline
1. Entry point
apply_redactions(file, pages, plan) in crates/pdf_redact/src/redact.rs.
Accepts the parsed PDF file, the extracted page representations, and a NormalizedRedactionPlan. Returns a modified in-memory PDF ready for serialization by pdf_writer.
2. Phase 1: Global operations
Global operations run once before any per-page work.
reject_hidden_optional_content / sanitize_hidden_optional_content
Before anything else, the catalog's /OCProperties is inspected. If it declares Optional Content Groups that are off in the default configuration, either via a non-empty /OFF array or via a /BaseState of /OFF or /Unchanged, the engine's default behaviour is to refuse the document with PdfError::Unsupported. Hidden-layer content that the user never saw cannot be targeted through the visible glyph list, so silently redacting only the visible portion would be a correctness hole.
Setting sanitize_hidden_ocgs: true on the plan replaces the rejection with an in-place sanitization pass:
- Collect the set of hidden OCG object refs from the catalog (/OFF, or /BaseState /OFF minus /ON).
- For each page, resolve /Resources /Properties to the set of names that map to hidden OCGs.
- Walk each page content stream, tracking marked-content nesting, and strip /OC /<name> BDC ... EMC blocks whose name is in the hidden set.
- Rewrite the catalog's /OCProperties /D to clear /OFF and set /BaseState /ON, so the saved output no longer advertises the hidden state.
Form XObject content is not rewritten yet — a warning is emitted on any sanitized page that also has XObjects so callers can audit.
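The content-stream walk in the third step can be sketched as follows. This is a minimal illustration, not the engine's code: the Op type, the op helper, and strip_hidden_blocks are hypothetical names, operands are modelled as plain strings, and the hidden set is assumed to hold resource names such as /MC0 that were already resolved through /Resources /Properties.

```rust
use std::collections::HashSet;

/// Simplified content-stream operation (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Op {
    operator: String,
    operands: Vec<String>,
}

/// Convenience constructor used by the example.
fn op(operator: &str, operands: &[&str]) -> Op {
    Op {
        operator: operator.to_string(),
        operands: operands.iter().map(|s| s.to_string()).collect(),
    }
}

/// Drops every `/OC /<name> BDC ... EMC` block whose name is in `hidden`,
/// including any marked-content blocks nested inside it.
fn strip_hidden_blocks(ops: &[Op], hidden: &HashSet<String>) -> Vec<Op> {
    let mut out = Vec::new();
    // Marked-content depth counted from the hidden block; 0 means "visible".
    let mut hidden_depth = 0usize;
    for o in ops {
        match o.operator.as_str() {
            "BDC" | "BMC" => {
                if hidden_depth > 0 {
                    hidden_depth += 1; // nested block inside a hidden one
                    continue;
                }
                let is_hidden = o.operator == "BDC"
                    && o.operands.len() == 2
                    && o.operands[0] == "/OC"
                    && hidden.contains(&o.operands[1]);
                if is_hidden {
                    hidden_depth = 1;
                    continue;
                }
                out.push(o.clone());
            }
            "EMC" => {
                if hidden_depth > 0 {
                    hidden_depth -= 1; // closes a hidden or nested block
                    continue;
                }
                out.push(o.clone());
            }
            _ => {
                if hidden_depth == 0 {
                    out.push(o.clone());
                }
            }
        }
    }
    out
}
```

Tracking a depth counter rather than a boolean is what keeps a nested BMC/EMC pair inside a hidden block from closing the hidden region early.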
strip_metadata
Removes the Info dictionary from the trailer and the Metadata stream from the document catalog. Both can contain author names, creation timestamps, software identifiers, and other identifying information.
strip_attachments
Walks the Names/EmbeddedFiles name tree via depth-first search with cycle detection (to handle malformed PDFs with circular references). Removes all reachable embedded file objects. Attached files can contain the original unredacted source document or other sensitive material.
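A depth-first walk with cycle detection might look like the sketch below. NameTreeNode and collect_embedded_files are hypothetical names; the real name-tree representation in the crate is not shown in this document, so nodes are modelled here as a map from object number to a simplified struct.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified name-tree node: intermediate nodes list /Kids,
/// leaf nodes list the embedded-file objects named in /Names.
struct NameTreeNode {
    kids: Vec<u32>,
    embedded_files: Vec<u32>,
}

/// Collects every embedded-file object reachable from `root`.
/// The visited set guards against malformed trees whose /Kids form a cycle.
fn collect_embedded_files(nodes: &HashMap<u32, NameTreeNode>, root: u32) -> Vec<u32> {
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    let mut found = Vec::new();
    while let Some(id) = stack.pop() {
        if !visited.insert(id) {
            continue; // already seen: a cycle or a shared subtree
        }
        // A dangling reference simply has nothing to contribute.
        if let Some(node) = nodes.get(&id) {
            found.extend(node.embedded_files.iter().copied());
            // Push kids in reverse so they are popped in document order.
            stack.extend(node.kids.iter().rev().copied());
        }
    }
    found
}
```

An explicit stack is used instead of recursion so that a deeply nested (or adversarial) tree cannot exhaust the call stack.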
3. Phase 2: Per-page processing
For each page that has at least one redaction target, the engine runs the following steps in order.
Step 1: analyze_page_text
Extracts all glyphs from the page with their page-space positions and dimensions. This is the same extraction path used by the search subsystem, ensuring that what the engine sees during redaction matches what the user saw when they identified the match.
Step 2: parse_page_contents
Parses the page content stream into a structured list of PDF operators and their operands.
Step 3: ensure_supported_operators
Checks every operator in the content stream against a whitelist of operators the engine understands. Any unknown operator causes an explicit error rather than silent pass-through. This prevents an attacker or a malformed PDF from smuggling content through an unrecognized operator.
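As a rough sketch of the fail-closed check: the operator list below is an illustrative subset chosen for the example, not the engine's actual whitelist, and the String error stands in for whatever error type the crate really returns.

```rust
/// Illustrative subset only; the engine's real whitelist is not reproduced here.
const SUPPORTED: &[&str] = &[
    "q", "Q", "cm", "BT", "ET", "Tf", "Td", "Tj", "TJ", "m", "l", "re", "f", "S", "n", "Do",
];

/// Returns an error naming the first operator that is not on the whitelist.
fn ensure_supported(operators: &[&str]) -> Result<(), String> {
    for op in operators {
        if !SUPPORTED.contains(op) {
            return Err(format!("unsupported operator: {}", op));
        }
    }
    Ok(())
}
```

The important property is that the default branch is rejection: an operator has to be listed explicitly before the rest of the pipeline ever sees it.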
Step 4: load_xobjects
Identifies all XObjects referenced by Do operators in the content stream and classifies each as an Image XObject or a Form XObject.
Step 5: collect_glyph_removals
Intersects each glyph's bounding quad against each target's bounding box. Glyphs that intersect are recorded for removal. The result is a set of glyph indices that should not appear in the output stream.
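A minimal sketch of the intersection step, under the simplifying assumption that each glyph quad is reduced to its axis-aligned bounds before testing (the engine may test the quad itself). Rect, Quad, and quad_bounds are names introduced for the example.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl Rect {
    /// True when the rectangles share interior area (touching edges do not count).
    fn intersects(&self, other: &Rect) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.y0 < other.y1 && other.y0 < self.y1
    }
}

/// Four corner points of a glyph in page space.
type Quad = [(f64, f64); 4];

/// Axis-aligned bounds of a quad.
fn quad_bounds(quad: &Quad) -> Rect {
    let mut r = Rect {
        x0: f64::INFINITY,
        y0: f64::INFINITY,
        x1: f64::NEG_INFINITY,
        y1: f64::NEG_INFINITY,
    };
    for &(x, y) in quad.iter() {
        r.x0 = r.x0.min(x);
        r.y0 = r.y0.min(y);
        r.x1 = r.x1.max(x);
        r.y1 = r.y1.max(y);
    }
    r
}

/// Returns the indices of all glyphs whose bounds intersect any target.
fn collect_glyph_removals(glyph_quads: &[Quad], targets: &[Rect]) -> BTreeSet<usize> {
    glyph_quads
        .iter()
        .enumerate()
        .filter(|(_, q)| {
            let b = quad_bounds(q);
            targets.iter().any(|t| b.intersects(t))
        })
        .map(|(i, _)| i)
        .collect()
}
```

Using the quad's bounds over-approximates rotated glyphs, which errs on the side of removing slightly more rather than leaving a glyph behind.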
Step 6: rewrite_text_operations
Walks the content stream operators. For text-painting operators (Tj, TJ, ', "), removes or compensates each glyph that appears in the removal set. The exact behavior depends on the redaction mode (see kern compensation below).
Step 7: neutralize_vector_operations
Scans path construction and painting operators. If the accumulated path's bounding box intersects any target, the painting operator is replaced with n (no-paint), which discards the path without drawing it.
Step 8: neutralize_image_operations
For each Do operator referencing an Image XObject, transforms the unit square [0,0,1,1] by the current CTM to find the image's page-space footprint. The pass distinguishes three cases:
- No overlap: leave the Do intact.
- Full cover: the union of intersecting target AABBs (mapped back to image-unit-square space) contains the entire [0,1] × [0,1]. The Do is replaced with n and the XObject is marked for deferred removal (current behaviour, unchanged).
- Partial overlap: the targeted region maps to a strict sub-rectangle of the image. The pass records a PendingImageMask with the original ObjectRef, the image-space pixel rectangle, and the redaction's fill_color.
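The three-way decision can be sketched as below. This is an illustration under two assumptions: the targets have already been mapped into image-unit-square space, and the union of targets is approximated by the bounding box of the clipped targets (which can over-report coverage). Overlap and classify_overlap are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

#[derive(Debug, PartialEq)]
enum Overlap {
    NoOverlap,
    FullCover,
    /// Region to mask, in image-unit-square coordinates.
    Partial(Rect),
}

/// Classifies how targets (already in unit-square space) cover the image.
fn classify_overlap(targets: &[Rect]) -> Overlap {
    let mut hull: Option<Rect> = None;
    for t in targets {
        // Clip the target to the unit square [0,1] x [0,1].
        let c = Rect {
            x0: t.x0.max(0.0),
            y0: t.y0.max(0.0),
            x1: t.x1.min(1.0),
            y1: t.y1.min(1.0),
        };
        if c.x0 >= c.x1 || c.y0 >= c.y1 {
            continue; // this target misses the image entirely
        }
        hull = Some(match hull {
            None => c,
            Some(h) => Rect {
                x0: h.x0.min(c.x0),
                y0: h.y0.min(c.y0),
                x1: h.x1.max(c.x1),
                y1: h.y1.max(c.y1),
            },
        });
    }
    match hull {
        None => Overlap::NoOverlap,
        Some(h) if h.x0 <= 0.0 && h.y0 <= 0.0 && h.x1 >= 1.0 && h.y1 >= 1.0 => Overlap::FullCover,
        Some(h) => Overlap::Partial(h),
    }
}
```

Scaling the Partial rectangle by the image's pixel width and height would give the pixel rectangle that a pending mask needs.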
After the per-page neutralization completes, each pending mask is applied via apply_partial_mask:
- Clone the original image stream.
- Pass through image_mask::mask_image_region, which detects the format (raw / FlateDecode / DCTDecode), decodes pixels (via decode_stream or jpeg_decoder), paints the rectangular pixel region with the plan's fill_color, and re-encodes (Flate-compressed for raw / Flate input, DCT-encoded at quality 85 for JPEG input).
- Allocate a fresh ObjectRef for the masked stream; insert it into file.objects.
- Repoint the page's Resources.XObject[name] at the new ref via copy-on-write of any indirect XObject dictionary, so other pages that share the same image stream are unaffected.
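The pixel-painting part of that pass, for the simplest case of an already decoded, tightly packed 8-bit RGB buffer, might look like this. mask_rgb_region is a hypothetical helper for illustration; format detection, decoding, and re-encoding are deliberately left out.

```rust
/// Paints `color` over the pixel rectangle [x0, x1) x [y0, y1) of a tightly
/// packed 8-bit RGB buffer. The rectangle is clamped to the image bounds.
fn mask_rgb_region(
    pixels: &mut [u8],
    width: usize,
    height: usize,
    rect: (usize, usize, usize, usize),
    color: [u8; 3],
) -> Result<(), String> {
    if pixels.len() != width * height * 3 {
        return Err("buffer size does not match image dimensions".to_string());
    }
    let (x0, y0, x1, y1) = rect;
    let (x1, y1) = (x1.min(width), y1.min(height));
    for y in y0..y1 {
        for x in x0..x1 {
            let i = (y * width + x) * 3;
            pixels[i..i + 3].copy_from_slice(&color);
        }
    }
    Ok(())
}
```

Returning an error on a size mismatch mirrors the fail-closed idea used elsewhere in the pipeline: a caller can then fall back to dropping the whole image rather than masking the wrong pixels.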
Unsupported image formats (Indexed, ICCBased, JBIG2Decode, JPXDecode, CCITTFaxDecode, non-8-bpc) and any decode error make mask_image_region return PdfError::Unsupported; in that case apply_partial_mask falls back to whole-invocation neutralization (the Do is rewritten to n and the original stream is queued for removal). The ApplyReport.image_draws_masked and image_draws_removed counters distinguish the two outcomes.
Inside Form XObject content streams the same neutralization runs but partial masks always fall back to drop, since per-Form COW would require a deeper rewrite of nested Resources.
Step 9: remove_annotations
If remove_annotations is enabled in the plan (default: true), parses each annotation's Rect, transforms it to page space, and checks for intersection with any target. FileAttachment annotations are always removed regardless of intersection, because attached files bypass the content-stream redaction entirely. Annotations without a Rect entry are conservatively removed unless they are Link annotations.
Step 10: serialize_operations
Converts the modified operator list back to content-stream bytes.
Step 11: Overlay stream (Redact mode only)
If the mode is Redact, overlay_stream_bytes generates a separate content stream that paints filled colored quads over each target. The overlay is appended as an additional content stream after the rewritten page content.
Step 12: Write new content stream
The new content bytes replace the page's content stream. Old content stream object references are queued for deferred removal.
4. Kern compensation (the heart of Redact/Erase modes)
When a glyph is removed in Redact or Erase mode, the surrounding text must not shift. PDF text positioning depends on each glyph's advance width; removing a glyph shortens the line.
build_compensated_array(string, removed_indices, glyph_starts) converts a Tj string into a TJ array:
- Iterates through each character in the string.
- If the character is in the removal set, accumulates its advance width in kern_accum (in text-space units).
- If the character is not removed, and kern_accum > 0, emits a negative kern entry PdfValue::Number(-kern_accum) before emitting the character's byte. In the TJ operator, a negative number moves the text position to the right by that many thousandths of a text-space unit, compensating for the missing glyph width.
- Emits the kept character's byte as a string fragment.
- Changes the operator from Tj to TJ.
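The steps above can be sketched as follows. This is not the crate's build_compensated_array: the signature is simplified, TjItem and compensate are hypothetical names, each byte is assumed to be exactly one glyph, and advances are assumed to be given in the thousandths-of-a-unit scale that TJ adjustments use. Two choices in the sketch go beyond the steps listed: adjacent kept bytes are merged into one string fragment, and a trailing kern is emitted when the string ends with removed glyphs, so that a following show-text operator would start at the unshifted position.

```rust
/// One element of a TJ array (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum TjItem {
    Text(Vec<u8>),
    Kern(f64),
}

/// Builds a TJ array that shows only the kept glyphs while preserving the
/// positions of everything after a removed glyph.
fn compensate(bytes: &[u8], removed: &[bool], advances: &[f64]) -> Vec<TjItem> {
    assert!(bytes.len() == removed.len() && bytes.len() == advances.len());
    let mut out: Vec<TjItem> = Vec::new();
    let mut kern_accum = 0.0;
    for ((&b, &is_removed), &adv) in bytes.iter().zip(removed).zip(advances) {
        if is_removed {
            kern_accum += adv; // remember the width we are about to lose
            continue;
        }
        if kern_accum > 0.0 {
            // Negative TJ numbers move the text position to the right.
            out.push(TjItem::Kern(-kern_accum));
            kern_accum = 0.0;
        }
        // Append to the current fragment, or start a new one after a kern.
        if let Some(TjItem::Text(s)) = out.last_mut() {
            s.push(b);
        } else {
            out.push(TjItem::Text(vec![b]));
        }
    }
    if kern_accum > 0.0 {
        // Assumption of this sketch: flush removed trailing glyphs as well.
        out.push(TjItem::Kern(-kern_accum));
    }
    out
}
```

For example, removing the two middle glyphs of a four-glyph string with advances 500, 600, 700, 500 yields a fragment for the first glyph, a kern of -1300, and a fragment for the last glyph.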
The ' and " operators (move-to-next-line variants of Tj) produce side effects on the text position that are not safely reproducible through kern compensation alone. These fall back to Strip mode with a warning logged.
5. Vector neutralization
The engine simulates the current transformation matrix (CTM) through the content stream:
- q: pushes a copy of the current CTM onto the stack.
- Q: pops the CTM stack.
- cm: concatenates a matrix onto the current CTM.
Path construction operators (m, l, c, h, re) accumulate path segments into a working path bounding box. The v and y bezier curve operators are whitelisted (accepted without error) but their control points are not accumulated into the bounding box — a known gap that could cause a slightly undersized bounds estimate for paths that use them.
On any paint operator (S, s, f, F, f*, B, B*, b, b*), the engine inflates the path bounding box by the current stroke width and tests it against all targets. If any target intersects, the paint operator is replaced with n. The path is discarded; the surrounding non-intersecting paths are unaffected.
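A compact sketch of the simulation, with several simplifications that are assumptions of the example rather than statements about the engine: the operator set is reduced to a small hypothetical Op enum, curves and stroke-width inflation are left out, and the path bounds are tracked in page space as points are added.

```rust
type Matrix = [f64; 6];
const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns m x n: apply m first, then n (PDF row-vector convention).
fn concat(m: &Matrix, n: &Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

fn apply(m: &Matrix, x: f64, y: f64) -> (f64, f64) {
    (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])
}

#[derive(Clone, Copy, Debug)]
struct Bounds { x0: f64, y0: f64, x1: f64, y1: f64 }

fn overlaps(a: &Bounds, b: &Bounds) -> bool {
    a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1
}

/// Extends the working path bounds with one page-space point.
fn grow(path: &mut Option<Bounds>, (x, y): (f64, f64)) {
    *path = Some(match *path {
        None => Bounds { x0: x, y0: y, x1: x, y1: y },
        Some(b) => Bounds { x0: b.x0.min(x), y0: b.y0.min(y), x1: b.x1.max(x), y1: b.y1.max(y) },
    });
}

/// Hypothetical, reduced operator set for the example.
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Save,                    // q
    Restore,                 // Q
    Concat(Matrix),          // cm
    Point(f64, f64),         // m or l
    Rect(f64, f64, f64, f64), // re
    Paint(&'static str),     // S, f, B, ...
    NoPaint,                 // n
}

/// Replaces every paint operator whose path bounds touch a target with NoPaint.
fn neutralize(ops: &[Op], targets: &[Bounds]) -> Vec<Op> {
    let mut ctm = IDENTITY;
    let mut stack: Vec<Matrix> = Vec::new();
    let mut path: Option<Bounds> = None;
    let mut out = Vec::with_capacity(ops.len());
    for op in ops {
        match op {
            Op::Save => stack.push(ctm),
            Op::Restore => {
                if let Some(m) = stack.pop() {
                    ctm = m;
                }
            }
            Op::Concat(m) => ctm = concat(m, &ctm),
            Op::Point(x, y) => grow(&mut path, apply(&ctm, *x, *y)),
            Op::Rect(x, y, w, h) => {
                for (px, py) in [(*x, *y), (x + w, *y), (x + w, y + h), (*x, y + h)] {
                    grow(&mut path, apply(&ctm, px, py));
                }
            }
            Op::Paint(_) => {
                // Painting consumes the current path either way.
                let hit = path.take().map_or(false, |b| targets.iter().any(|t| overlaps(&b, t)));
                if hit {
                    out.push(Op::NoPaint);
                    continue;
                }
            }
            Op::NoPaint => path = None,
        }
        out.push(op.clone());
    }
    out
}
```

Because the construction operators stay in the stream and only the paint operator changes, the rewritten stream remains syntactically valid: n ends the path without drawing it.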
6. Image neutralization
For a Do operator referencing an Image XObject, the engine:
- Takes the unit square corners (0,0), (1,0), (1,1), (0,1).
- Applies the current CTM to transform them to page space (images are placed by setting the CTM before calling Do).
- Computes the axis-aligned bounding box of the transformed corners.
- Tests against all targets.
- If any intersection is found: replaces the Do with n and adds the XObject reference to the deferred-removal set (full cover), or records a partial mask as described under Step 8 (partial overlap).
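The footprint computation in the first three steps is small enough to show directly. The sketch assumes the CTM is stored as the six PDF matrix entries [a, b, c, d, e, f]; image_footprint is a hypothetical name.

```rust
type Matrix = [f64; 6];

/// Page-space axis-aligned bounds (x0, y0, x1, y1) of an image drawn with `ctm`.
fn image_footprint(ctm: &Matrix) -> (f64, f64, f64, f64) {
    let corners: [(f64, f64); 4] = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)];
    let mut b = (f64::INFINITY, f64::INFINITY, f64::NEG_INFINITY, f64::NEG_INFINITY);
    for &(x, y) in corners.iter() {
        // PDF row-vector convention: [x y 1] x [a b 0; c d 0; e f 1].
        let px = ctm[0] * x + ctm[2] * y + ctm[4];
        let py = ctm[1] * x + ctm[3] * y + ctm[5];
        b = (b.0.min(px), b.1.min(py), b.2.max(px), b.3.max(py));
    }
    b
}
```

For an image placed with `200 0 0 100 50 700 cm`, the footprint is the rectangle from (50, 700) to (250, 800); for a rotated placement the bounds are the box around the rotated corners.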
Form XObject handling is intersection-aware. Each Form carries a BBox and an optional Matrix. At neutralization time the Form's rectangle is transformed through Matrix × current CTM × page transform and compared against the redaction targets.
When the resulting quad does not touch any target, the Form is left untouched and the page redacts normally. When it does touch a target, the engine allocates a per-page copy of the Form (a new ObjectRef with a cloned stream dictionary), rewrites the copy's content stream to strip the targeted glyph bytes — using the glyphs that were already tagged with that Form's ref during extraction — and re-emits the bytes with FlateDecode compression. The page's Resources.XObject entry is then rewritten on the page dictionary to point at the per-page copy, so other pages that still use the original Form are unaffected.
If the Form's content itself invokes another Form whose bounding quad also intersects a target, the rewrite recurses: the inner Form is copied too, its content is rewritten, and the outer Form's own Resources.XObject is repointed at the inner copy. Recursion is capped at depth 8 with a warning. After the text rewrite, the same neutralize_vector_operations and neutralize_image_operations passes used on the page are invoked on the Form's operations — with the Form's invocation CTM threaded in as the base — so vector paint and Do of Image XObjects that fall under a target inside the Form are neutralized alongside the text.
Text extraction and search still recurse into Form XObjects — see 06-text-system.md §1. The pipeline above is the redaction side of the story.
7. Deferred cleanup
Old content streams and neutralized Image XObjects are not removed during per-page processing. They are removed in a single pass after all pages have been processed.
This is necessary because PDF objects are shared by reference. A logo image used on every page of a document is stored as one XObject referenced from every page. If that XObject is removed when the first page is processed, all subsequent pages will reference a missing object and the output PDF will be corrupt.
Deferring to a post-loop cleanup also keeps the page loop independent of processing order: each page sees a consistent object graph.
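The shape of the two-phase loop can be illustrated with a toy object table. Page, its fields, and process_pages are hypothetical; the real per-page work is replaced here by a check that every referenced object still resolves.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical page model: the objects it references, and the objects its
/// redaction pass wants discarded (old content streams, neutralized images).
struct Page {
    uses: Vec<u32>,
    discard: Vec<u32>,
}

/// Processes all pages first, then removes the queued objects in one pass.
fn process_pages(objects: &mut BTreeMap<u32, Vec<u8>>, pages: &[Page]) -> Result<(), String> {
    let mut pending: BTreeSet<u32> = BTreeSet::new();
    for (i, page) in pages.iter().enumerate() {
        // During the loop every page must still be able to resolve its refs.
        for id in &page.uses {
            if !objects.contains_key(id) {
                return Err(format!("page {} references missing object {}", i, id));
            }
        }
        // Queue removals instead of applying them immediately.
        pending.extend(page.discard.iter().copied());
    }
    // Single cleanup pass after the loop.
    for id in &pending {
        objects.remove(id);
    }
    Ok(())
}
```

If the removal were moved inside the loop, the second page in a document that shares object 7 with the first would hit the "missing object" branch.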
8. Overlay generation
overlay_stream_bytes(targets, color, page_transform, final_ctm) produces the content stream for the Redact-mode colored rectangle overlay.
The overlay is appended after the page's main content stream. At that point, the CTM may not be the identity matrix — the content stream may have left an active transformation. The overlay must draw in page space regardless of what the prior stream did.
Steps:
- Compute final_page_ctm(operations) by simulating the CTM through the entire rewritten content stream (same logic as vector neutralization).
- Compute the inverse of page_transform (the crop-box and rotation normalization applied to convert PDF coordinates to page space).
- Emit q to save the graphics state.
- Emit cm with the product of the inverse CTM and the inverse page transform, so that subsequent coordinates are interpreted in page space.
- Emit rg with the fill color.
- For each target quad: emit a re or path sequence followed by f to paint a filled rectangle or quadrilateral.
- Emit Q to restore the graphics state.
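A sketch of these steps is below. It is an illustration under stated assumptions, not the crate's overlay_stream_bytes: targets are axis-aligned rectangles given as [x, y, w, h] in page space, matrices use the PDF row-vector convention, and the cm matrix is computed as inverse(page_transform) × inverse(final_ctm) so that concatenating it onto the final CTM yields inverse(page_transform). overlay_matrix, overlay_stream, and num are hypothetical names.

```rust
type Matrix = [f64; 6];

/// Returns m x n: apply m first, then n (PDF row-vector convention).
fn concat(m: &Matrix, n: &Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Inverts an affine matrix; returns None when it is singular.
fn invert(m: &Matrix) -> Option<Matrix> {
    let det = m[0] * m[3] - m[1] * m[2];
    if det.abs() < 1e-12 {
        return None;
    }
    let (a, b, c, d) = (m[3] / det, -m[1] / det, -m[2] / det, m[0] / det);
    // Inverse translation is -(e, f) times the inverse linear part.
    Some([a, b, c, d, -(m[4] * a + m[5] * c), -(m[4] * b + m[5] * d)])
}

/// Matrix to emit with cm so that later coordinates are read in page space.
fn overlay_matrix(page_transform: &Matrix, final_ctm: &Matrix) -> Option<Matrix> {
    Some(concat(&invert(page_transform)?, &invert(final_ctm)?))
}

/// Formats a number, normalizing negative zero to "0".
fn num(v: f64) -> String {
    format!("{}", v + 0.0)
}

/// Builds the overlay content stream for rectangular targets [x, y, w, h].
fn overlay_stream(
    targets: &[[f64; 4]],
    color: [f64; 3],
    page_transform: &Matrix,
    final_ctm: &Matrix,
) -> Option<String> {
    let m = overlay_matrix(page_transform, final_ctm)?;
    let cm: Vec<String> = m.iter().map(|v| num(*v)).collect();
    let mut s = String::from("q\n");
    s.push_str(&format!("{} cm\n", cm.join(" ")));
    s.push_str(&format!("{} {} {} rg\n", num(color[0]), num(color[1]), num(color[2])));
    for [x, y, w, h] in targets {
        s.push_str(&format!("{} {} {} {} re\nf\n", num(*x), num(*y), num(*w), num(*h)));
    }
    s.push_str("Q\n");
    Some(s)
}
```

Returning None for a singular matrix leaves the decision to the caller; a non-invertible CTM means there is no coordinate system in which the overlay could be placed correctly.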
9. Annotation removal
For each annotation on the page:
- Parse the annotation's Rect array (PDF rectangle in the page's user space).
- Transform the rectangle to normalized page space using the page transform.
- Test against all targets using AABB intersection.
- If any intersection: mark the annotation for removal from the page's Annots array.
Additional rules applied regardless of intersection:
- FileAttachment annotations are always removed. They reference embedded files that exist outside the content stream and would survive content-stream redaction intact.
- Annotations without a Rect are conservatively removed, with the exception of Link annotations (which are positional by definition and cannot contain content).
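Taken together, the intersection test and the two additional rules reduce to a small decision function. The sketch assumes the Rect has already been transformed to page space; Subtype, Annotation, and should_remove are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl Rect {
    fn intersects(&self, other: &Rect) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.y0 < other.y1 && other.y0 < self.y1
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Subtype {
    FileAttachment,
    Link,
    Other,
}

/// Hypothetical annotation model; `rect` is None when /Rect is missing.
struct Annotation {
    subtype: Subtype,
    rect: Option<Rect>,
}

/// Decides whether an annotation is dropped from the page's Annots array.
fn should_remove(a: &Annotation, targets: &[Rect]) -> bool {
    match (a.subtype, a.rect) {
        // Attached files bypass content-stream redaction entirely.
        (Subtype::FileAttachment, _) => true,
        // Links without a Rect are the one exception to conservative removal.
        (Subtype::Link, None) => false,
        // Anything else that cannot be located is removed conservatively.
        (_, None) => true,
        // Otherwise remove only when the rectangle hits a target.
        (_, Some(r)) => targets.iter().any(|t| r.intersects(t)),
    }
}
```

Ordering the match arms from most to least specific keeps the "always remove" rule from being shadowed by the intersection test.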
10. Why it was coded this way
| Decision | Reason |
|---|---|
| Deferred cleanup | Prevents removal of shared XObjects before all referencing pages are processed. Removing during the loop corrupts the object graph. |
| Operator whitelist instead of blacklist | Unknown operators are rejected with an explicit error. A blacklist approach would silently pass through operators the engine does not understand, potentially allowing redacted content to survive in an unrecognized form. |
| Kern compensation | Legal and regulatory documents depend on layout stability. Removing glyphs without compensation shifts surrounding text, changes line breaks, and may alter the visible meaning of adjacent content. |
| Final CTM simulation for overlay | The overlay stream is appended after the page content stream. The content stream may have left any CTM active. Without inverting the final CTM, the overlay coordinates would be interpreted in whatever space the content stream ended in, not in page space. |
| FileAttachment always removed | File attachments contain the attached file's bytes directly in the PDF. They do not appear in the content stream and are invisible to content-stream redaction. Always removing them closes the data exfiltration path. |
11. What would break
- Removing objects during the page loop: Shared XObjects (logos, repeated images) disappear before other pages use them. The output PDF references missing objects and is corrupt in any conforming viewer.
- Not inverting the final CTM: The overlay is drawn in the coordinate space left active by the content stream, not in page space. The colored rectangles appear at the wrong position, size, or orientation.
- Not removing FileAttachment annotations: The attached file survives redaction. An adversary with access to the output PDF can extract the original unredacted document from the attachment.
- Using a blacklist for operators: Any operator not in the blacklist passes through silently. A PDF crafted with a non-standard operator carrying redacted text would survive the pipeline unchallenged.