Redaction Target Model¶
1. Why rectangles alone are not enough¶
PDF text is not restricted to horizontal baselines. The text matrix can rotate, skew, or scale individual characters or entire text blocks. A word placed at 45 degrees occupies a diamond-shaped area in page space, not an axis-aligned rectangle.
An axis-aligned bounding box around a rotated word is significantly larger than the word itself. Using that enlarged box as a redaction target would either catch adjacent content that should not be redacted or require per-case manual correction. A quadrilateral that follows the actual glyph corners avoids both problems.
2. The three target types¶
All coordinates are in normalized page space: crop-box-relative, with page rotation applied. This means callers do not need to know the internal PDF coordinate system; they work in the coordinate space the viewer presents.
Rect¶
An axis-aligned rectangle. The simplest and most common case. Used when the caller knows the approximate region to redact without needing glyph-level precision.
Quad¶
An arbitrary quadrilateral defined by four corner points in page space. Handles rotated or skewed text exactly. Points are conventionally ordered bottom-left, bottom-right, top-right, top-left, matching the PDF specification's quad point convention.
QuadGroup¶
A collection of quads that together form a single logical match — for example, the quads for each word in a multi-word phrase, or the quads spanning multiple lines. Grouped as a unit so the engine treats them as a single redaction target and can apply a single overlay across the group.
3. Normalization (normalize_plan)¶
User-facing RedactionPlan values are converted to NormalizedRedactionPlan before any engine work begins. Normalization:
- Validates page indices against the document page count.
- Validates that
Rectdimensions are positive and finite. - Validates that
QuadGrouplists are non-empty. - Converts each
Rectto a single-quadNormalizedPageTarget(four corners derived from x, y, width, height). - Converts each
QuadGroupto a singleNormalizedPageTargetcarrying the full quad list and a pre-computed axis-aligned union bounding box. - Rejects
overlay_textif present (replacement text is not implemented; returning an explicit error is safer than silently dropping the field). - Applies option defaults:
mode→Redactfill_color→ blackremove_annotations→truestrip_metadata→falsestrip_attachments→false
Normalization is the only place where input validation occurs. All downstream pipeline stages receive already-validated data.
4. Intersection testing¶
NormalizedPageTarget exposes two intersection methods used by the pipeline to decide which content overlaps a target:
intersects_rect(rect)— compares the axis-aligned bounding box of the target's quads against the provided rect. O(1).intersects_quad(quad)— compares the bounding box of the target's quads against the bounding box of the provided quad. O(1).
Both are AABB approximations. For axis-aligned text (the overwhelmingly common case), the approximation is exact. For rotated text, the bounding box is a conservative over-approximation: it may flag content as intersecting when the actual quads do not overlap. The consequence is a false-positive redaction in a rare case, which is the safe direction for a security tool.
Precise polygon intersection (e.g. SAT or Sutherland-Hodgman) would be more accurate for rotated quads but would complicate the implementation without meaningful benefit for the current supported text subset.
5. The three redaction modes¶
| Mode | Text bytes | Glyph positioning | Visual overlay | Use case |
|---|---|---|---|---|
Strip |
Removed from stream | Text reflows | None | Minimal output size; layout preservation not required |
Redact |
Kern-compensated in place | Preserved | Colored filled quad | Default; visually marks the redacted region with a solid rectangle |
Erase |
Kern-compensated in place | Preserved | None | Clean appearance; redacted areas become blank space without a visible marker |
Strip is simpler to implement but changes line lengths and can shift surrounding text. Redact and Erase both use kern compensation (see Redaction Application Pipeline) to hold the positions of surrounding glyphs constant, which is critical for legal and regulatory documents where layout integrity is a requirement.
6. Why it was coded this way¶
| Decision | Reason |
|---|---|
#[serde(tag = "kind")] on target variants |
Produces JSON like { "kind": "rect", ... } that is unambiguous and human-readable. Tag-based discrimination avoids wrapper objects and keeps the TypeScript SDK bindings clean. |
| Three distinct types instead of a generic polygon | Rect covers the dominant case with minimal input from callers. Quad and QuadGroup are progressively more expressive. Keeping Rect as a named type prevents callers from needing to compute quad corners for the common case. |
QuadGroup for multi-glyph matches |
A text search match spans many glyphs. Returning one target per glyph would produce hundreds of independent targets and complicate overlay generation. Grouping at the target level lets the overlay stage draw one visual region per match. |
Pre-computed union bounding box in NormalizedPageTarget |
Intersection tests are called once per glyph per target per page. Storing the bounding box avoids recomputing it on every call. |