Supported PDF Subset
This project intentionally targets a narrow, explicit MVP.
Supported now
- Unencrypted PDFs, and PDFs secured by the Standard Security Handler in any of these configurations, under either the user password or the owner password (including the empty user password used by "encrypted to prevent editing but openable by anyone" documents):
  - `V` = 1 or 2 with `R` = 2 or 3 (RC4 up to 128-bit)
  - `V` = 4 with `R` = 4 and the `/StdCF` crypt filter using `/CFM /V2` (RC4-128) or `/CFM /AESV2` (AES-128-CBC, PKCS#7 padding, 16-byte IV prepended) for strings and streams; `/Identity` crypt filters are treated as pass-through
  - `V` = 5 with `R` = 5 or `R` = 6 and the `/StdCF` crypt filter using `/CFM /AESV3` (AES-256-CBC, PKCS#7 padding, 16-byte IV prepended). `R` = 5 uses plain SHA-256 for the password verifier (the Extension Level 3 form) and `R` = 6 runs the full ISO 32000-2 iterative Algorithm 2.B hash

  The Encrypt dictionary is parsed and the file key is derived per revision: Algorithm 2 + 4/5 for V=1/2/4, and Algorithm 2.A + 2.B for V=5 with AES-256 unwrap of `/OE` / `/UE`. Algorithm 7 is used to recover the user password from `/O` when authenticating an owner password under V=1/2/4, and Algorithm 2 step 5 appends `0xFFFFFFFF` when `/EncryptMetadata` is explicitly `false` under V=4. Each object's strings and stream data are then decrypted with per-object keys for V=1/2/4 (Algorithm 1, with the `sAlT` suffix for AES-128) or with the file key directly for V=5, and the in-memory document no longer carries the `/Encrypt` entry. Streams with `/Type /Metadata` skip decryption when the handler was opened under `/EncryptMetadata false`.
- PDFs secured by the public-key handler (`/Filter /Adobe.PubSec`) with SubFilter `/adbe.pkcs7.s4` (V=4 / AES-128) or `/adbe.pkcs7.s5` (V=5 / AES-256). Authentication uses a recipient X.509 certificate (DER-encoded) plus its matching RSA private key (DER-encoded PKCS#8):
  - The CMS `EnvelopedData` recipient list is parsed.
  - The matching `KeyTransRecipientInfo` is identified by `IssuerAndSerialNumber` or `SubjectKeyIdentifier`.
  - Its `encryptedKey` is RSA-decrypted (PKCS1v15 or OAEP) to recover the AES-CBC content-encryption key.
  - The seed (20 bytes) + permissions (4 bytes) are AES-CBC decrypted from the inner content.
  - The file encryption key is derived as `SHA-1(seed ‖ all_recipient_blobs ‖ perms)[..16]` (s4) or `SHA-256(seed ‖ all_recipient_blobs ‖ perms)[..32]` (s5).
  - Thereafter the per-object decryption pipeline is identical to the Standard handler.

  Once the document is decrypted in place, the `/Encrypt` entry and the recipient certificate dictionary are stripped from the in-memory state.
- Classic xref tables, including incremental update chains (multiple xref sections linked via `Prev`)
- PDF 1.5+ cross-reference streams (`/Type /XRef`) and the hybrid form where a legacy trailer carries an `XRefStm` pointer
- Object streams (`/Type /ObjStm`): compressed objects are materialized into the regular object store during parsing
- Full-document rewrites on save with input-shape mirroring: classic-input PDFs save as classic xref + trailer; xref-stream-shaped inputs save as a `Type /XRef` stream with eligible objects (gen=0, non-stream values) packed into freshly-built `Type /ObjStm` containers. Incremental-update chains are always collapsed into a single revision on output (`/Prev` and `/XRefStm` are stripped) so pre-redaction bytes cannot leak via revision walking. The parser drops the Encrypt dictionary after decryption and the original ObjStm containers after materialisation so neither survives in saved bytes.
- Unfiltered streams, plus the `FlateDecode`, `ASCII85Decode`, `ASCIIHexDecode`, `LZWDecode`, and `RunLengthDecode` stream filters, including filter chains (e.g. `[/ASCII85Decode /FlateDecode]`), with the TIFF predictor (`/Predictor 2`) and PNG predictors 10–15 (via `DecodeParms /Predictor`) applied to the final stage. `LZWDecode` honours `DecodeParms /EarlyChange` (0 or 1, defaulting to 1).
- Page tree traversal with inherited resources, media boxes, crop boxes, and page rotation
- Inline images (`BI` / `ID` / `EI`) are safely skipped during content stream parsing
- Dictionary operands in content streams (e.g., `BDC` with `<</MCID 0>>`)
- Common text operators
- Common path, paint, and graphics-state operators (`q`, `Q`, `cm`, `gs`, `w`, `J`, `j`, `M`, `d`, `ri`, `i`)
- Clipping path operators (`W`, `W*`)
- Color operators for device and general color spaces (`rg`/`RG`, `g`/`G`, `k`/`K`, `cs`/`CS`, `sc`/`SC`, `scn`/`SCN`)
- Curve segment operators (`c`, `v`, `y`): all three are included in the path bounds used by vector paint neutralization
- Marked-content operators as safe pass-through (`BMC`, `BDC`, `EMC`, `MP`, `DP`)
- Compatibility sections (`BX`/`EX`): recognized operators inside the section are processed normally, and unrecognized operators are passed through per PDF § 7.8.2 instead of rejecting the page
- ExtGState font entries (fonts set via the `gs` operator)
- Image XObject invocation detection
- `Type1` and `TrueType` fonts in the current text path, including `ToUnicode` CMap decoding, `/Encoding /WinAnsiEncoding` (full Windows-1252 repertoire), `/Encoding /MacRomanEncoding` (full Mac Roman repertoire), `/Encoding /StandardEncoding` (Adobe Standard PostScript encoding, including its `quoteright`/`quoteleft` treatment of `0x27` and `0x60`), and `/Encoding` dictionaries with a `/Differences` array resolved through an Adobe Glyph List subset
- Form XObjects (`/Subtype /Form`) traversed during text extraction and search, including the Form's `Matrix`, its own `Resources.Font` and `Resources.ExtGState`, and cycle-protected recursion
- `Type0` with `Identity-H` (two-byte CIDs + `ToUnicode` maps) and Adobe's predefined Unicode-keyed CJK CMaps `UniGB-UCS2-H`, `UniKS-UCS2-H`, `UniJIS-UTF16-H`, and `UniCNS-UTF16-H`. For the predefined CMaps the byte stream is decoded directly to Unicode (UCS-2 BE for the `*-UCS2-H` pair, UTF-16 BE for the `*-UTF16-H` pair, including surrogate-pair SMP scalars); `ToUnicode` overrides are still consulted for BMP code units when present, and glyph widths fall back to the descendant font's `/DW`. To avoid silently mis-positioning glyphs, predefined-CMap fonts whose descendant `/W` array overrides any CID's width away from `/DW` are rejected explicitly. `ToUnicode` entries with 4-byte source codes (UTF-16 surrogate pairs) are silently skipped so that a single SMP entry does not fail the whole parse.
- Rectangle, quad, and quad-group redaction targets
- Three redaction modes: `strip` (remove bytes), `redact` (blank space + overlay), `erase` (blank space, no overlay)
- `overlayText` support for `redact` mode: labels are stamped inside each overlay rectangle using Helvetica sized to fit, with contrast-aware black or white text color against the fill
- Metadata stripping for supported document layouts
- Attachment stripping for supported embedded-file layouts
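
The per-object key step mentioned in the Standard Security Handler item above (Algorithm 1, with the `sAlT` suffix for AES-128) can be sketched as follows. This is a minimal illustration of the published algorithm, not the engine's own code: the function name `per_object_key` and its parameters are hypothetical, and it covers only V=1/2/4, since V=5 uses the file key directly.

```python
import hashlib


def per_object_key(file_key: bytes, obj_num: int, gen_num: int, aes: bool) -> bytes:
    """Derive the key for one indirect object (V=1/2/4 only).

    Hypothetical helper for illustration; not part of the project's API.
    """
    # Append the low-order 3 bytes of the object number and the low-order
    # 2 bytes of the generation number, both little-endian.
    material = (
        file_key
        + (obj_num & 0xFFFFFF).to_bytes(3, "little")
        + (gen_num & 0xFFFF).to_bytes(2, "little")
    )
    if aes:
        # AES crypt filters (AESV2) add the fixed 4-byte suffix "sAlT".
        material += b"sAlT"
    digest = hashlib.md5(material).digest()
    # The object key is the first min(n + 5, 16) bytes of the MD5 digest,
    # where n is the file key length in bytes.
    return digest[: min(len(file_key) + 5, 16)]
```

With a 5-byte (40-bit) file key the result is 10 bytes long; with a 16-byte key it is capped at 16 bytes, and the AES and RC4 variants differ only by the suffix.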
Explicitly unsupported or incomplete
- Adobe.PubSec SubFilter `/adbe.pkcs7.s3` (V=1 RC4-40, deprecated since the early 2010s) and key-agreement (`KeyAgreeRecipientInfo`, ECDH) recipients: both are rejected with `Unsupported`. Standard Security Handler V=1/2/4/5 (RC4 / AES-128 / AES-256) is supported via the `open_with_password` / `openPdfWithPassword` entry points; PubSec `/adbe.pkcs7.s4` (V=4 / AES-128) and `/adbe.pkcs7.s5` (V=5 / AES-256) with key-transport (`KeyTransRecipientInfo`, RSA-PKCS1v15 or RSA-OAEP) recipients are supported via the `open_with_certificate` / `openPdfWithCertificate` entry points
- Incremental update preservation (output is always a flat rewrite that collapses update chains into a single revision)
- Documents whose catalog has `/OCProperties` with any layer off in the default configuration, or with `/BaseState` set to `/OFF` or `/Unchanged`, are rejected up front unless the caller opts in via `sanitizeHiddenOcgs: true`. The opt-in pass strips `BDC /OC /<name> ... EMC` content gated by hidden OCGs and clears the catalog's hidden-layer state on save, but OCG markers inside nested Form XObjects are not yet rewritten; a warning is emitted when a page with sanitizable content also has XObjects
- Type3 fonts
- Composite (Type0) fonts with encodings other than `Identity-H` and the four supported Unicode-keyed CMaps (`UniGB-UCS2-H`, `UniKS-UCS2-H`, `UniJIS-UTF16-H`, `UniCNS-UTF16-H`); vertical writing CMaps (`-V` variants) and registry-keyed CMaps such as `90ms-RKSJ-H` are rejected with an explicit error
- Partial Image XObject rewriting for `Indexed`, `ICCBased`, `JBIG2Decode`, `JPXDecode`, `CCITTFaxDecode`, or `BitsPerComponent` other than 8: these images fall back to whole-invocation neutralization (the `Do` is replaced with `n`). Supported partial-mask formats are raw / `FlateDecode` / `DCTDecode` over `DeviceGray` / `DeviceRGB` / `DeviceCMYK` at 8 bpc; the engine masks the affected pixel region with the plan's `fill_color` and copy-on-writes the image stream so multi-page-shared images are unaffected.
- Stream filters outside the `FlateDecode` / `ASCII85Decode` / `ASCIIHexDecode` / `LZWDecode` / `RunLengthDecode` set (notably `DCTDecode`, `JBIG2Decode`, `JPXDecode`, `CCITTFaxDecode`)
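
The composite-font rule in the list above (four Unicode-keyed CMaps decoded directly, everything else rejected) could look roughly like this. It is a sketch under stated assumptions: `decode_predefined`, `PREDEFINED_UNICODE_CMAPS`, and the `Unsupported` exception class are hypothetical names, and `Identity-H` is left out because it needs a `ToUnicode` map rather than a direct decode.

```python
class Unsupported(Exception):
    """Stand-in for the engine's explicit 'unsupported' error (hypothetical)."""


# The four predefined CMaps the document lists, keyed to their byte encoding.
PREDEFINED_UNICODE_CMAPS = {
    "UniGB-UCS2-H": "ucs2",
    "UniKS-UCS2-H": "ucs2",
    "UniJIS-UTF16-H": "utf16",
    "UniCNS-UTF16-H": "utf16",
}


def decode_predefined(cmap_name: str, data: bytes) -> str:
    """Decode a string shown with a predefined Unicode-keyed CMap."""
    kind = PREDEFINED_UNICODE_CMAPS.get(cmap_name)
    if kind is None:
        # Covers -V variants and registry-keyed CMaps such as 90ms-RKSJ-H.
        raise Unsupported(f"CMap {cmap_name!r} is not a supported predefined CMap")
    if len(data) % 2:
        raise ValueError("two-byte encodings need an even number of bytes")
    if kind == "utf16":
        # UTF-16 BE, so surrogate pairs become single SMP scalars.
        return data.decode("utf-16-be")
    # UCS-2 BE: every 16-bit unit is one BMP code point.
    return "".join(
        chr(int.from_bytes(data[i : i + 2], "big")) for i in range(0, len(data), 2)
    )
```

For example, `decode_predefined("UniGB-UCS2-H", b"\x4e\x2d")` yields the single character U+4E2D, while a name such as `UniJIS-UTF16-V` raises `Unsupported` instead of guessing.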
Failure model
When unsupported content affects correctness, the engine returns an explicit error instead of pretending to succeed.
Typical behavior:
- unsupported operators on redacted pages return `Unsupported` (the operator allow-list covers common text, path, color, clipping, graphics-state, and marked-content operators)
- unimplemented plan options return `UnsupportedOption`
- malformed structure returns parse or corruption errors
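
As an illustration of the first behavior, an operator allow-list check might be shaped like the sketch below. Everything here is hypothetical: `check_page`, the `Unsupported` class, and the set names are invented for the example. `DOCUMENTED_OPS` contains only operators this page names; `ASSUMED_OPS` is a guess at what "common text, path, paint" covers. The sketch also assumes that pages without redactions are not checked, which the text implies but does not state.

```python
class Unsupported(Exception):
    """Stand-in for the engine's Unsupported error (hypothetical)."""


# Operators explicitly named on this page.
DOCUMENTED_OPS = {
    "q", "Q", "cm", "gs", "w", "J", "j", "M", "d", "ri", "i",  # graphics state
    "W", "W*",                                                   # clipping
    "rg", "RG", "g", "G", "k", "K",
    "cs", "CS", "sc", "SC", "scn", "SCN",                        # color
    "c", "v", "y",                                               # curve segments
    "BMC", "BDC", "EMC", "MP", "DP",                             # marked content
}
# ASSUMPTION: the page does not enumerate its text, path, or paint operators;
# these standard PDF operators are a guess at that set.
ASSUMED_OPS = {"BT", "ET", "Tf", "Tj", "TJ", "Td", "Tm",
               "m", "l", "re", "h", "S", "f", "f*", "n", "Do"}
ALLOWED = DOCUMENTED_OPS | ASSUMED_OPS


def check_page(operators, redacted: bool) -> None:
    """Raise Unsupported if a redacted page uses an operator off the allow-list."""
    if not redacted:
        return  # ASSUMPTION: only redacted pages are checked
    compat_depth = 0
    for op in operators:
        if op == "BX":
            compat_depth += 1
        elif op == "EX":
            compat_depth = max(compat_depth - 1, 0)
        elif op not in ALLOWED and compat_depth == 0:
            # Inside BX/EX, unrecognized operators pass through instead.
            raise Unsupported(f"operator {op!r} is not on the allow-list")
```

Under these assumptions, `check_page(["q", "sh", "Q"], redacted=True)` raises `Unsupported`, while the same `sh` operator wrapped in `BX` ... `EX` passes through.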
Security posture
This subset is intentionally conservative. If the engine cannot safely rewrite the targeted content, it should fail rather than emit a misleading "sanitized" PDF.