Redaction Application Pipeline

1. Entry point

apply_redactions(file, pages, plan) in crates/pdf_redact/src/redact.rs.

Accepts the parsed PDF file, the extracted page representations, and a NormalizedRedactionPlan. Returns a modified in-memory PDF ready for serialization by pdf_writer.

2. Phase 1: Global operations

Global operations run once before any per-page work.

strip_metadata

Removes the Info dictionary from the trailer and the Metadata stream from the document catalog. Both can contain author names, creation timestamps, software identifiers, and other identifying information.

strip_attachments

Walks the Names/EmbeddedFiles name tree via depth-first search with cycle detection (to handle malformed PDFs with circular references). Removes all reachable embedded file objects. Attached files can contain the original unredacted source document or other sensitive material.
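The traversal can be sketched as an iterative depth-first search with a visited set. The node type, the use of bare object ids, and the function name below are simplifications invented for illustration; they are not the crate's real types.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified name-tree node (not the crate's real type).
#[derive(Debug, Default)]
struct NameTreeNode {
    /// Object ids of child nodes (the Kids array).
    kids: Vec<u32>,
    /// Object ids of embedded file objects held by this node (the Names array).
    files: Vec<u32>,
}

/// Depth-first walk from `root`, returning every reachable file object id.
/// The `visited` set guarantees termination even when Kids forms a cycle.
fn collect_embedded_files(nodes: &HashMap<u32, NameTreeNode>, root: u32) -> HashSet<u32> {
    let mut visited = HashSet::new();
    let mut found = HashSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // `insert` returns false when the id was already seen, so skip it.
        if !visited.insert(id) {
            continue;
        }
        // A dangling reference is silently ignored in this sketch.
        if let Some(node) = nodes.get(&id) {
            found.extend(node.files.iter().copied());
            stack.extend(node.kids.iter().copied());
        }
    }
    found
}
```

The returned ids would then be handed to whatever routine deletes objects from the file; that step is outside this sketch.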

3. Phase 2: Per-page processing

For each page that has at least one redaction target, the engine runs the following steps in order.

Step 1: analyze_page_text

Extracts all glyphs from the page with their page-space positions and dimensions. This is the same extraction path used by the search subsystem, ensuring that what the engine sees during redaction matches what the user saw when they identified the match.

Step 2: parse_page_contents

Parses the page content stream into a structured list of PDF operators and their operands.

Step 3: ensure_supported_operators

Checks every operator in the content stream against a whitelist of operators the engine understands. Any unknown operator causes an explicit error rather than silent pass-through. This prevents an attacker or a malformed PDF from smuggling content through an unrecognized operator.
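A minimal sketch of the fail-closed check, assuming operators are available as plain strings. The SUPPORTED list is a small illustrative subset and the error type is hypothetical; the engine's real whitelist and error type are not shown in this document.

```rust
/// Illustrative subset only; the real whitelist is larger.
const SUPPORTED: &[&str] = &[
    "q", "Q", "cm", "BT", "ET", "Tf", "Td", "Tj", "TJ", "m", "l", "re", "f", "S", "n", "Do",
];

/// Hypothetical error type carrying the offending operator name.
#[derive(Debug, PartialEq)]
struct UnsupportedOperator(String);

/// Returns Ok only if every operator is on the whitelist.
/// The first unknown operator aborts processing instead of passing through.
fn ensure_supported_operators(ops: &[&str]) -> Result<(), UnsupportedOperator> {
    for &op in ops {
        if !SUPPORTED.contains(&op) {
            return Err(UnsupportedOperator(op.to_string()));
        }
    }
    Ok(())
}
```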

Step 4: load_xobjects

Identifies all XObjects referenced by Do operators in the content stream and classifies each as an Image XObject or a Form XObject.

Step 5: collect_glyph_removals

Intersects each glyph's bounding quad against each target's bounding box. Glyphs that intersect are recorded for removal. The result is a set of glyph indices that should not appear in the output stream.
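The intersection pass might look like the following sketch. It assumes each glyph quad has already been reduced to an axis-aligned box and that boxes which only touch along an edge do not count as intersecting; both are simplifications, and the Rect type is hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical axis-aligned box in page space.
#[derive(Debug, Clone, Copy)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl Rect {
    /// Strict overlap test: shared edges alone do not count.
    fn intersects(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
}

/// Returns the indices of all glyphs that overlap at least one target.
fn collect_glyph_removals(glyphs: &[Rect], targets: &[Rect]) -> BTreeSet<usize> {
    glyphs
        .iter()
        .enumerate()
        .filter(|(_, g)| targets.iter().any(|t| g.intersects(t)))
        .map(|(i, _)| i)
        .collect()
}
```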

Step 6: rewrite_text_operations

Walks the content stream operators. For text-painting operators (Tj, TJ, ', "), removes or compensates each glyph that appears in the removal set. The exact behavior depends on the redaction mode (see kern compensation below).

Step 7: neutralize_vector_operations

Scans path construction and painting operators. If the accumulated path's bounding box intersects any target, the painting operator is replaced with n (no-paint), which discards the path without drawing it.

Step 8: neutralize_image_operations

For each Do operator referencing an Image XObject, transforms the unit square [0,0,1,1] by the current CTM to find the image's page-space footprint. If the footprint intersects any target, replaces the Do with n and marks the XObject for deferred removal.

Step 9: remove_annotations

If remove_annotations is enabled in the plan (default: true), parses each annotation's Rect, transforms it to page space, and checks for intersection with any target. FileAttachment annotations are always removed regardless of intersection, because attached files bypass the content-stream redaction entirely. Annotations without a Rect entry are conservatively removed unless they are Link annotations.

Step 10: serialize_operations

Converts the modified operator list back to content-stream bytes.

Step 11: Overlay stream (Redact mode only)

If the mode is Redact, overlay_stream_bytes generates a separate content stream that paints filled colored quads over each target. The overlay is appended as an additional content stream after the rewritten page content.

Step 12: Write new content stream

The new content bytes replace the page's content stream. Old content stream object references are queued for deferred removal.

4. Kern compensation (the heart of Redact/Erase modes)

When a glyph is removed in Redact or Erase mode, the surrounding text must not shift. PDF text positioning depends on each glyph's advance width; removing a glyph shortens the line.

build_compensated_array(string, removed_indices, glyph_starts) converts a Tj string into a TJ array:

  1. Iterates through each character in the string.
  2. If the character is in the removal set, accumulates its advance width in kern_accum (in thousandths of a text-space unit, the unit that TJ adjustments use).
  3. If the character is not removed and kern_accum > 0, emits a negative kern entry PdfValue::Number(-kern_accum) before emitting the character's byte, then resets kern_accum to zero. In the TJ operator, a negative number moves the text position to the right by that many thousandths of a text-space unit, compensating for the missing glyph width.
  4. Emits the kept character's byte as a string fragment.
  5. Changes the operator from Tj to TJ.
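The steps above can be sketched as follows. This version assumes one byte per glyph and takes a per-glyph width slice (in thousandths of a text-space unit) in place of glyph_starts; TjItem is a hypothetical stand-in for PdfValue. Flushing a trailing kern when the string ends with removed glyphs is an assumption of this sketch, not something the steps above state.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the elements of a TJ array.
#[derive(Debug, PartialEq)]
enum TjItem {
    Text(Vec<u8>),
    Kern(f64),
}

/// Builds a TJ array from a Tj string, replacing removed glyphs with kerns.
/// Assumes `widths.len() == bytes.len()` (one byte per glyph).
fn build_compensated_array(
    bytes: &[u8],
    removed: &HashSet<usize>,
    widths: &[f64],
) -> Vec<TjItem> {
    let mut out = Vec::new();
    let mut kern_accum = 0.0;
    for (i, &b) in bytes.iter().enumerate() {
        if removed.contains(&i) {
            // Remember how much horizontal space the removed glyph occupied.
            kern_accum += widths[i];
            continue;
        }
        if kern_accum > 0.0 {
            // Negative TJ numbers move the text position to the right.
            out.push(TjItem::Kern(-kern_accum));
            kern_accum = 0.0;
        }
        out.push(TjItem::Text(vec![b]));
    }
    // Assumption: also compensate for removed glyphs at the end of the string.
    if kern_accum > 0.0 {
        out.push(TjItem::Kern(-kern_accum));
    }
    out
}
```

For the string ABCD with B and C removed and widths 500, 600, 700, 800, the result is the fragment A, a kern of -1300, then the fragment D.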

The ' and " operators (move-to-next-line variants of Tj) produce side effects on the text position that are not safely reproducible through kern compensation alone. These fall back to Strip mode with a warning logged.

5. Vector neutralization

The engine simulates the current transformation matrix (CTM) through the content stream:

  • q: pushes a copy of the current CTM onto the stack.
  • Q: pops the stack and restores the saved CTM.
  • cm: concatenates a matrix onto the current CTM.

Path construction operators (m, l, c, h, re) accumulate path segments into a working path bounding box. The v and y Bézier curve operators are whitelisted (accepted without error), but their control points are not accumulated into the bounding box. This is a known gap: the bounds estimate for a path that uses them can be slightly undersized, so such a path may escape neutralization even though it overlaps a target.

On any paint operator (S, s, f, F, f*, B, B*, b, b*), the engine inflates the path bounding box by the current stroke width and tests it against all targets. If any target intersects, the paint operator is replaced with n. The path is discarded; the surrounding non-intersecting paths are unaffected.
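A reduced sketch of the simulation follows. The Op enum is hypothetical and covers only q, Q, cm, m, l, the paint operators, and n; curves, re, and the stroke-width inflation are omitted to keep it short. Matrices use the six-number PDF form with the row-vector convention.

```rust
type Matrix = [f64; 6]; // [a, b, c, d, e, f] as in the cm operator

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect { x0: f64, y0: f64, x1: f64, y1: f64 }

impl Rect {
    fn intersects(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
    fn include(self, x: f64, y: f64) -> Rect {
        Rect { x0: self.x0.min(x), y0: self.y0.min(y), x1: self.x1.max(x), y1: self.y1.max(y) }
    }
}

/// Hypothetical, reduced operator set.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Save,                // q
    Restore,             // Q
    Concat(Matrix),      // cm
    MoveTo(f64, f64),    // m
    LineTo(f64, f64),    // l
    Paint(&'static str), // S, s, f, F, f*, B, B*, b, b*
    NoPaint,             // n
}

/// Row-vector product m x n: apply m first, then n.
fn multiply(m: &Matrix, n: &Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

fn neutralize_vector_operations(ops: &[Op], targets: &[Rect]) -> Vec<Op> {
    let mut ctm = IDENTITY;
    let mut stack: Vec<Matrix> = Vec::new();
    let mut bbox: Option<Rect> = None;
    let mut out = Vec::with_capacity(ops.len());
    for op in ops {
        let mut emitted = op.clone();
        match op {
            Op::Save => stack.push(ctm),
            Op::Restore => {
                if let Some(saved) = stack.pop() {
                    ctm = saved;
                }
            }
            Op::Concat(m) => ctm = multiply(m, &ctm),
            Op::MoveTo(x, y) | Op::LineTo(x, y) => {
                // Transform the point to page space before accumulating it.
                let px = ctm[0] * x + ctm[2] * y + ctm[4];
                let py = ctm[1] * x + ctm[3] * y + ctm[5];
                bbox = Some(match bbox {
                    Some(r) => r.include(px, py),
                    None => Rect { x0: px, y0: py, x1: px, y1: py },
                });
            }
            Op::Paint(_) => {
                // Painting consumes the current path; test it, then reset.
                let hit = bbox.take().map_or(false, |r| targets.iter().any(|t| r.intersects(t)));
                if hit {
                    emitted = Op::NoPaint;
                }
            }
            Op::NoPaint => bbox = None,
        }
        out.push(emitted);
    }
    out
}
```

Because the bounding box is reset after every paint operator, a neutralized path does not affect the paths that follow it.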

6. Image neutralization

For a Do operator referencing an Image XObject, the engine:

  1. Takes the unit square corners (0,0), (1,0), (1,1), (0,1).
  2. Applies the current CTM to transform them to page space (images are placed by setting the CTM before calling Do).
  3. Computes the axis-aligned bounding box of the transformed corners.
  4. Tests against all targets.
  5. If any intersection is found: replaces the Do with n, adds the XObject reference to the deferred-removal set.
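Steps 1 to 3 can be sketched as a single function. The Rect type and the function name are hypothetical; the matrix is the usual six-number PDF form [a, b, c, d, e, f].

```rust
/// Hypothetical axis-aligned box in page space.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// Transforms the unit square by `ctm` and returns the axis-aligned
/// bounding box of the four transformed corners.
fn image_footprint(ctm: &[f64; 6]) -> Rect {
    let corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)];
    let mut r = Rect {
        x0: f64::INFINITY,
        y0: f64::INFINITY,
        x1: f64::NEG_INFINITY,
        y1: f64::NEG_INFINITY,
    };
    for (x, y) in corners {
        // Row-vector convention: [x y 1] times the CTM.
        let px = ctm[0] * x + ctm[2] * y + ctm[4];
        let py = ctm[1] * x + ctm[3] * y + ctm[5];
        r.x0 = r.x0.min(px);
        r.y0 = r.y0.min(py);
        r.x1 = r.x1.max(px);
        r.y1 = r.y1.max(py);
    }
    r
}
```

Taking the bounding box of all four corners, rather than only two opposite ones, keeps the footprint correct when the CTM rotates or skews the image.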

Form XObjects are not supported. A Do referencing a Form XObject returns a hard error. Form XObjects contain their own content streams with their own resource dictionaries and coordinate spaces; correct redaction would require recursively applying the full pipeline, which is not implemented.

7. Deferred cleanup

Old content streams and neutralized Image XObjects are not removed during per-page processing. They are removed in a single pass after all pages have been processed.

This is necessary because PDF objects are shared by reference. A logo image used on every page of a document is stored as one XObject referenced from every page. If that XObject is removed when the first page is processed, all subsequent pages will reference a missing object and the output PDF will be corrupt.

Deferring to a post-loop cleanup also makes the page loop order-independent: every page sees the same, complete object graph regardless of which pages were processed before it.
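The pattern can be sketched with a plain map standing in for the PDF object table. The Page struct, the function name, and the dangling-lookup counter are invented for illustration. The sketch also assumes that an id listed as neutralized is no longer painted anywhere; how the engine treats an image that is neutralized on one page but still painted on another is not covered here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-page summary.
struct Page {
    /// Object ids of the Image XObjects this page references.
    images: Vec<u32>,
    /// Subset of `images` that intersected a target on this page.
    neutralized: Vec<u32>,
}

/// Processes all pages, then removes queued objects in one pass.
/// Returns how many references failed to resolve during the loop,
/// which stays at zero because nothing is removed until the loop ends.
fn redact_pages(objects: &mut HashMap<u32, Vec<u8>>, pages: &[Page]) -> usize {
    let mut deferred = HashSet::new();
    let mut dangling = 0;
    for page in pages {
        // Every page still sees the full object table here.
        dangling += page.images.iter().filter(|id| !objects.contains_key(*id)).count();
        // Queue removals instead of performing them immediately.
        deferred.extend(page.neutralized.iter().copied());
    }
    // Single cleanup pass after all pages are done.
    for id in &deferred {
        objects.remove(id);
    }
    dangling
}
```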

8. Overlay generation

overlay_stream_bytes(targets, color, page_transform, final_ctm) produces the content stream for the Redact-mode colored rectangle overlay.

The overlay is appended after the page's main content stream. At that point, the CTM may not be the identity matrix — the content stream may have left an active transformation. The overlay must draw in page space regardless of what the prior stream did.

Steps:

  1. Compute final_page_ctm(operations) by simulating the CTM through the entire rewritten content stream (same logic as vector neutralization).
  2. Compute the inverse of page_transform (the crop-box and rotation normalization applied to convert PDF coordinates to page space).
  3. Emit q to save the graphics state.
  4. Emit cm with the product of the inverse of the final CTM (from step 1) and the inverse page transform (from step 2), so that subsequent coordinates are interpreted in page space.
  5. Emit rg with the fill color.
  6. For each target quad: emit a re or path sequence followed by f to paint a filled rectangle or quadrilateral.
  7. Emit Q to restore the graphics state.
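A sketch of these steps, assuming axis-aligned targets given as (x, y, width, height) in page space and an RGB fill color. The helper names multiply, invert, and overlay_matrix are hypothetical. The multiplication order shown follows from the PDF row-vector convention, where cm pre-multiplies the CTM; treat it as this sketch's assumption rather than a statement of the engine's code.

```rust
type Matrix = [f64; 6]; // [a, b, c, d, e, f]

/// Row-vector product m x n: apply m first, then n.
fn multiply(m: &Matrix, n: &Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Inverse of an affine matrix, or None if it is singular.
fn invert(m: &Matrix) -> Option<Matrix> {
    let det = m[0] * m[3] - m[1] * m[2];
    if det.abs() < 1e-12 {
        return None;
    }
    let (a, b, c, d) = (m[3] / det, -m[1] / det, -m[2] / det, m[0] / det);
    Some([a, b, c, d, -(m[4] * a + m[5] * c), -(m[4] * b + m[5] * d)])
}

/// Matrix M such that, after `M cm`, page-space coordinates land where the
/// inverse page transform alone would place them: M x C = inverse(P).
fn overlay_matrix(page_transform: &Matrix, final_ctm: &Matrix) -> Option<Matrix> {
    Some(multiply(&invert(page_transform)?, &invert(final_ctm)?))
}

fn overlay_stream_bytes(
    targets: &[(f64, f64, f64, f64)],
    color: (f64, f64, f64),
    page_transform: &Matrix,
    final_ctm: &Matrix,
) -> Option<Vec<u8>> {
    let m = overlay_matrix(page_transform, final_ctm)?;
    let mut s = String::from("q\n");
    s += &format!("{} {} {} {} {} {} cm\n", m[0], m[1], m[2], m[3], m[4], m[5]);
    s += &format!("{} {} {} rg\n", color.0, color.1, color.2);
    for (x, y, w, h) in targets {
        s += &format!("{} {} {} {} re f\n", x, y, w, h);
    }
    s += "Q\n";
    Some(s.into_bytes())
}
```

Returning None for a singular matrix lets the caller decide how to fail; a page whose final CTM cannot be inverted has no well-defined mapping back to page space.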

9. Annotation removal

For each annotation on the page:

  1. Parse the annotation's Rect array (PDF rectangle in the page's user space).
  2. Transform the rectangle to normalized page space using the page transform.
  3. Test against all targets using AABB intersection.
  4. If any intersection: mark the annotation for removal from the page's Annots array.

Additional rules applied regardless of intersection:

  • FileAttachment annotations are always removed. They reference embedded files that exist outside the content stream and would survive content-stream redaction intact.
  • Annotations without a Rect are conservatively removed, with the exception of Link annotations (which are positional by definition and cannot contain content).
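The combined rules reduce to a small decision function. The types below are hypothetical, and the sketch assumes the Rect has already been transformed to normalized page space.

```rust
/// Hypothetical axis-aligned box in normalized page space.
#[derive(Debug, Clone, Copy)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl Rect {
    fn intersects(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
}

/// Hypothetical, reduced set of annotation subtypes.
#[derive(Debug, PartialEq)]
enum Subtype {
    Link,
    FileAttachment,
    Other,
}

struct Annotation {
    subtype: Subtype,
    /// None when the annotation has no Rect entry.
    rect: Option<Rect>,
}

/// Decides whether an annotation is dropped from the page's Annots array.
fn should_remove(annot: &Annotation, targets: &[Rect]) -> bool {
    // File attachments are removed regardless of position.
    if annot.subtype == Subtype::FileAttachment {
        return true;
    }
    match &annot.rect {
        Some(r) => targets.iter().any(|t| r.intersects(t)),
        // No Rect: remove conservatively, except for Link annotations.
        None => annot.subtype != Subtype::Link,
    }
}
```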

10. Why it was coded this way

| Decision | Reason |
|---|---|
| Deferred cleanup | Prevents removal of shared XObjects before all referencing pages are processed. Removing during the loop corrupts the object graph. |
| Operator whitelist instead of blacklist | Unknown operators are rejected with an explicit error. A blacklist approach would silently pass through operators the engine does not understand, potentially allowing redacted content to survive in an unrecognized form. |
| Kern compensation | Legal and regulatory documents depend on layout stability. Removing glyphs without compensation shifts surrounding text, changes line breaks, and may alter the visible meaning of adjacent content. |
| Final CTM simulation for overlay | The overlay stream is appended after the page content stream. The content stream may have left any CTM active. Without inverting the final CTM, the overlay coordinates would be interpreted in whatever space the content stream ended in, not in page space. |
| FileAttachment always removed | File attachments contain the attached file's bytes directly in the PDF. They do not appear in the content stream and are invisible to content-stream redaction. Always removing them closes the data exfiltration path. |

11. What would break

  • Removing objects during the page loop: Shared XObjects (logos, repeated images) disappear before other pages use them. The output PDF references missing objects and is corrupt in any conforming viewer.
  • Not inverting the final CTM: The overlay is drawn in the coordinate space left active by the content stream, not in page space. The colored rectangles appear at the wrong position, size, or orientation.
  • Not removing FileAttachment annotations: The attached file survives redaction. An adversary with access to the output PDF can extract the original unredacted document from the attachment.
  • Using a blacklist for operators: Any operator not in the blacklist passes through silently. A PDF crafted with a non-standard operator carrying redacted text would survive the pipeline unchallenged.