Supported PDF Subset

This project intentionally targets a narrow, explicit MVP.

Supported now

  • Unencrypted PDFs, and PDFs secured by the Standard Security Handler in any of these configurations, under either the user password or the owner password (including the empty user password used by "encrypted to prevent editing but openable by anyone" documents):
    • V = 1 or 2 with R = 2 or 3 (RC4 up to 128-bit)
    • V = 4 with R = 4 and the /StdCF crypt filter using /CFM /V2 (RC4-128) or /CFM /AESV2 (AES-128-CBC, PKCS#7 padding, 16-byte IV prepended) for strings and streams; /Identity crypt filters are treated as pass-through
    • V = 5 with R = 5 or R = 6 and the /StdCF crypt filter using /CFM /AESV3 (AES-256-CBC, PKCS#7 padding, 16-byte IV prepended). R = 5 uses plain SHA-256 for the password verifier (the Extension Level 3 form), and R = 6 runs the full ISO 32000-2 iterative Algorithm 2.B hash.
    The Encrypt dictionary is parsed and the file key is derived per revision: Algorithm 2 + 4/5 for V=1/2/4, and Algorithm 2.A + 2.B for V=5 with AES-256 unwrap of /OE / /UE. Algorithm 7 is used to recover the user password from /O when authenticating an owner password under V=1/2/4, and Algorithm 2 step 5 appends 0xFFFFFFFF when /EncryptMetadata is explicitly false under V=4. Each object's strings and stream data are then decrypted with per-object keys for V=1/2/4 (Algorithm 1, with the sAlT suffix for AES-128) or with the file key directly for V=5, and the in-memory document no longer carries the /Encrypt entry. Streams with /Type /Metadata skip decryption when the handler was opened under /EncryptMetadata false.
  • PDFs secured by the public-key handler (/Filter /Adobe.PubSec) with SubFilter /adbe.pkcs7.s4 (V=4 / AES-128) or /adbe.pkcs7.s5 (V=5 / AES-256). Authentication uses a recipient X.509 certificate (DER-encoded) plus its matching RSA private key (DER-encoded PKCS#8). The CMS EnvelopedData recipient list is parsed; the matching KeyTransRecipientInfo is identified by IssuerAndSerialNumber or SubjectKeyIdentifier; its encryptedKey is RSA-decrypted (PKCS1v15 or OAEP) to recover the AES-CBC content-encryption key; the seed (20 bytes) + permissions (4 bytes) are AES-CBC decrypted from the inner content; the file encryption key is derived as SHA-1(seed ‖ all_recipient_blobs ‖ perms)[..16] (s4) or SHA-256(seed ‖ all_recipient_blobs ‖ perms)[..32] (s5); thereafter the per-object decryption pipeline is identical to the Standard handler. Once the document is decrypted in place the /Encrypt entry and the recipient certificate dictionary are stripped from the in-memory state.
  • Classic xref tables, including incremental update chains (multiple xref sections linked via Prev)
  • PDF 1.5+ cross-reference streams (/Type /XRef) and the hybrid form where a legacy trailer carries an XRefStm pointer
  • Object streams (/Type /ObjStm) — compressed objects are materialized into the regular object store during parsing
  • Full-document rewrites on save with input-shape mirroring: classic-input PDFs save as classic xref + trailer; xref-stream-shaped inputs save as a Type /XRef stream with eligible objects (gen=0, non-stream values) packed into freshly-built Type /ObjStm containers. Incremental-update chains are always collapsed into a single revision on output (/Prev and /XRefStm are stripped) so pre-redaction bytes cannot leak via revision walking. The parser drops the Encrypt dictionary after decryption and the original ObjStm containers after materialisation so neither survives in saved bytes.
  • Unfiltered, FlateDecode, ASCII85Decode, ASCIIHexDecode, LZWDecode, and RunLengthDecode stream filters — including filter chains (e.g. [/ASCII85Decode /FlateDecode]) — with the TIFF predictor (/Predictor 2) and PNG predictors 10–15 (via DecodeParms /Predictor) applied to the final stage. LZWDecode honours DecodeParms /EarlyChange (0 or 1, defaulting to 1).
  • Page tree traversal with inherited resources, media boxes, crop boxes, and page rotation
  • Inline images (BI/ID/EI) are safely skipped during content stream parsing
  • Dictionary operands in content streams (e.g., BDC with <</MCID 0>>)
  • Common text operators
  • Common path, paint, and graphics-state operators (q, Q, cm, gs, w, J, j, M, d, ri, i)
  • Clipping path operators (W, W*)
  • Color operators for device and general color spaces (rg/RG, g/G, k/K, cs/CS, sc/SC, scn/SCN)
  • Curve segment operators (c, v, y) — all three are included in path bounds used by vector paint neutralization
  • Marked-content operators as safe pass-through (BMC, BDC, EMC, MP, DP)
  • Compatibility sections (BX / EX) — recognized operators inside the section are processed normally, and unrecognized operators are passed through per PDF § 7.8.2 instead of rejecting the page
  • ExtGState font entries (fonts set via gs operator)
  • Image XObject invocation detection
  • Type1 and TrueType fonts in the current text path, including ToUnicode CMap decoding, /Encoding /WinAnsiEncoding (full Windows-1252 repertoire), /Encoding /MacRomanEncoding (full Mac Roman repertoire), /Encoding /StandardEncoding (Adobe Standard PostScript encoding, including its quoteright / quoteleft treatment of 0x27 and 0x60), and /Encoding dictionaries with a /Differences array resolved through an Adobe Glyph List subset
  • Form XObjects (/Subtype /Form) traversed during text extraction and search, including the Form's Matrix, its own Resources.Font and Resources.ExtGState, and cycle-protected recursion
  • Type0 with Identity-H (two-byte CIDs + ToUnicode maps) and Adobe's predefined Unicode-keyed CJK CMaps UniGB-UCS2-H, UniKS-UCS2-H, UniJIS-UTF16-H, and UniCNS-UTF16-H. For the predefined CMaps the byte stream is decoded directly to Unicode (UCS-2 BE for the *-UCS2-H pair, UTF-16 BE for the *-UTF16-H pair, including surrogate-pair SMP scalars); ToUnicode overrides are still consulted for BMP code units when present, and glyph widths fall back to the descendant font's /DW. To avoid silently mis-positioning glyphs, predefined-CMap fonts whose descendant /W array overrides any CID's width away from /DW are rejected explicitly. ToUnicode entries with 4-byte source codes (UTF-16 surrogate pairs) are silently skipped so that a single SMP entry does not abort the parse.
  • Rectangle, quad, and quad-group redaction targets
  • Three redaction modes: strip (remove bytes), redact (blank space + overlay), erase (blank space, no overlay)
  • overlayText support for redact mode — labels are stamped inside each overlay rectangle using Helvetica sized to fit, with contrast-aware black or white text color against the fill
  • Metadata stripping for supported document layouts
  • Attachment stripping for supported embedded-file layouts
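
The per-object key step mentioned for the Standard handler above (Algorithm 1, with the sAlT suffix for AES-128, and the file key used directly for V=5) can be sketched as follows. This is a minimal illustration in Python, not the engine's own code: the function name object_key and its parameters are hypothetical, and the RC4 / AES decryption that would consume the key is omitted.

```python
import hashlib


def object_key(file_key: bytes, obj_num: int, gen_num: int,
               *, aes: bool = False, v5: bool = False) -> bytes:
    """Derive the key for one indirect object's strings and streams.

    Hypothetical helper for illustration; names are not the engine's API.
    """
    if v5:
        # V = 5 (AES-256): the file key is used directly for every object.
        return file_key

    # Algorithm 1: extend the file key with the low-order 3 bytes of the
    # object number and the low-order 2 bytes of the generation number,
    # both little-endian.
    material = (
        file_key
        + (obj_num & 0xFFFFFF).to_bytes(3, "little")
        + (gen_num & 0xFFFF).to_bytes(2, "little")
    )
    if aes:
        # AES-128 (/CFM /AESV2) appends the fixed 4-byte suffix "sAlT".
        material += b"sAlT"

    # The object key is the first min(n + 5, 16) bytes of the MD5 digest,
    # where n is the file key length in bytes.
    return hashlib.md5(material).digest()[: min(len(file_key) + 5, 16)]
```

With a 5-byte (40-bit) file key the derived key is 10 bytes long; with a 16-byte file key it is capped at 16 bytes, and the RC4 and AES-128 variants differ only by the sAlT suffix fed into the hash.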

Explicitly unsupported or incomplete

  • Adobe.PubSec SubFilter /adbe.pkcs7.s3 (V=1 RC4-40, deprecated since the early 2010s) and key-agreement (KeyAgreeRecipientInfo, ECDH) recipients: both are rejected with Unsupported. By contrast, Standard Security Handler V=1/2/4/5 (RC4 / AES-128 / AES-256) is supported via the open_with_password / openPdfWithPassword entry points, and PubSec /adbe.pkcs7.s4 (V=4 / AES-128) and /adbe.pkcs7.s5 (V=5 / AES-256) with key-transport (KeyTransRecipientInfo, RSA-PKCS1v15 or RSA-OAEP) recipients are supported via the open_with_certificate / openPdfWithCertificate entry points.
  • Incremental update preservation (output is always a flat, single-revision rewrite; classic inputs are saved with a classic xref table and xref-stream inputs with a freshly built xref stream)
  • Documents whose catalog has /OCProperties with any layer off in the default configuration or with /BaseState /OFF / /Unchanged are rejected up front unless the caller opts in via sanitizeHiddenOcgs: true. The opt-in pass strips BDC /OC /<name> ... EMC content gated by hidden OCGs and clears the catalog's hidden-layer state on save, but OCG markers inside nested Form XObjects are not yet rewritten — a warning is emitted when a page with sanitizable content also has XObjects
  • Type3 fonts
  • Composite (Type0) fonts with encodings other than Identity-H and the four supported Unicode-keyed CMaps (UniGB-UCS2-H, UniKS-UCS2-H, UniJIS-UTF16-H, UniCNS-UTF16-H); vertical writing CMaps (-V variants) and registry-keyed CMaps such as 90ms-RKSJ-H are rejected with an explicit error
  • Partial Image XObject rewriting for Indexed, ICCBased, JBIG2Decode, JPXDecode, CCITTFaxDecode, or BitsPerComponent other than 8 — these images fall back to whole-invocation neutralization (the Do is replaced with n). Supported partial-mask formats are raw / FlateDecode / DCTDecode over DeviceGray / DeviceRGB / DeviceCMYK at 8 bpc; the engine masks the affected pixel region with the plan's fill_color and copy-on-writes the image stream so multi-page-shared images are unaffected.
  • Stream filters outside the FlateDecode / ASCII85Decode / ASCIIHexDecode / LZWDecode / RunLengthDecode set (notably DCTDecode, JBIG2Decode, JPXDecode, CCITTFaxDecode)
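
To illustrate how a filter chain interacts with the unsupported-filter rule in the last item, here is a minimal sketch in Python. It is an assumption-laden illustration rather than the engine's decoder: only two of the five supported filters are included (ASCIIHexDecode and FlateDecode), predictors are omitted, and the names Unsupported, DECODERS, and decode_stream are hypothetical.

```python
import zlib


class Unsupported(Exception):
    """Hypothetical stand-in for the engine's Unsupported error."""


def ascii_hex_decode(data: bytes) -> bytes:
    # '>' marks end of data, PDF whitespace is ignored, and an odd trailing
    # digit is treated as if it were followed by '0'.
    digits = data.split(b">", 1)[0].translate(None, b"\x00\t\n\x0c\r ")
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


# Illustrative subset: LZWDecode, ASCII85Decode and RunLengthDecode are left out.
DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "FlateDecode": zlib.decompress,
}


def decode_stream(data: bytes, filters: list[str]) -> bytes:
    """Apply each filter in /Filter array order, failing on unknown names."""
    for name in filters:
        decoder = DECODERS.get(name)
        if decoder is None:
            # Fail explicitly rather than returning undecoded bytes.
            raise Unsupported(f"stream filter /{name} is not supported")
        data = decoder(data)
    return data
```

A chain such as ["ASCIIHexDecode", "FlateDecode"] first undoes the hex armour and then inflates the result, while a name such as "DCTDecode" raises Unsupported before any bytes are returned.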

Failure model

When unsupported content affects correctness, the engine returns an explicit error instead of pretending to succeed.

Typical behavior:

  • unsupported operators on redacted pages return Unsupported (the operator allow-list covers common text, path, color, clipping, graphics-state, and marked-content operators)
  • unimplemented plan options return UnsupportedOption
  • malformed structure returns parse or corruption errors
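
The fail-closed behavior for operators on redacted pages can be pictured with a small sketch. This Python snippet is illustrative only: the exception classes mirror the error names used above, but the function check_redacted_page and the exact contents of ALLOWED_OPERATORS are assumptions (the set lists only operators named in this document, leaves out the text operators, which are not enumerated here, and ignores BX / EX compatibility sections).

```python
class Unsupported(Exception):
    """Content the engine cannot rewrite safely."""


class UnsupportedOption(Exception):
    """A plan option that is not implemented."""


# Illustrative subset of an allow-list, built from operators named in this
# document; the real list also covers the common text operators.
ALLOWED_OPERATORS = frozenset({
    "q", "Q", "cm", "gs", "w", "J", "j", "M", "d", "ri", "i",  # graphics state
    "W", "W*",                                                 # clipping
    "rg", "RG", "g", "G", "k", "K",                            # device color
    "cs", "CS", "sc", "SC", "scn", "SCN",                      # general color
    "c", "v", "y",                                             # curve segments
    "BMC", "BDC", "EMC", "MP", "DP",                           # marked content
})


def check_redacted_page(operators):
    """Raise Unsupported if a redacted page uses an operator off the list."""
    for op in operators:
        if op not in ALLOWED_OPERATORS:
            # Stop at the first offender instead of rewriting around it.
            raise Unsupported(f"operator {op!r} is not on the allow-list")
```

The point of the design is that the check raises on the first operator it does not recognize, so a caller never receives a rewritten page whose content the engine only partly understood.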

Security posture

This subset is intentionally conservative. If the engine cannot safely rewrite the targeted content, it should fail rather than emit a misleading "sanitized" PDF.