Text System and Text Extraction

This document describes how this engine loads fonts, tracks text state, computes glyph geometry, and assembles the final TextGlyph and TextItem output. Read 05-graphics-state.md first; this document assumes familiarity with the matrix model and coordinate spaces described there.


1. Font loading

Font loading is performed once per page before content stream interpretation begins. The entry point is:

pub fn load_fonts(
    file: &PdfFile,
    resources: &PdfDictionary,
) -> (BTreeMap<String, LoadedFont>, ExtGStateFontMap)

The function returns two maps:

  • BTreeMap<String, LoadedFont>: the primary font map, keyed by the font resource name as it appears in the content stream (e.g., "F1").
  • ExtGStateFontMap: a secondary map for fonts referenced through ExtGState entries, described below.

Simple fonts (Type1, TrueType)

Simple fonts use a single byte per glyph code. The engine reads the FirstChar entry and the Widths array from the font dictionary. Glyph widths are indexed as Widths[code - FirstChar].
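
The indexing rule above can be sketched as a small helper. The function name and the `Option` return are illustrative assumptions; this document does not say what the engine does for codes outside the Widths array.

```rust
/// Looks up a simple-font glyph width in 1/1000 text-space units.
/// `widths` and `first_char` mirror the /Widths and /FirstChar entries.
/// Returns `None` for codes outside the array (an assumption of this sketch).
fn simple_font_width(widths: &[f64], first_char: u8, code: u8) -> Option<f64> {
    // checked_sub rejects codes below FirstChar instead of wrapping around.
    let index = code.checked_sub(first_char)? as usize;
    widths.get(index).copied()
}
```

With FirstChar = 32, code 32 maps to index 0 and code 31 has no entry.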

Byte-to-Unicode decoding checks three sources, in order:

  1. ToUnicode CMap. Many real-world simple fonts ship a ToUnicode stream even though the spec only mandates one for composite fonts. When present, the same CMap parser used for composite fonts produces a BTreeMap<u16, String> keyed by the byte value.
  2. /Differences overrides. When /Encoding is a dictionary with a /Differences array, the array is walked to produce a BTreeMap<u8, String> of byte → PDF glyph name overrides. Matching bytes are looked up through a compact Adobe Glyph List subset (the WinAnsi repertoire plus common Latin-1, Latin Extended-A, and the common Latin ligatures). Names that do not appear in the subset decode to U+FFFD.
  3. Named base encoding. The remaining bytes decode through the named base encoding. WinAnsiEncoding — by far the most common real-world case — covers the full Windows-1252 repertoire, including the Euro sign (0x80), smart quotes (0x91..=0x94), the bullet (0x95), en/em dashes (0x96, 0x97), and ISO-8859-1 accented letters (0xA0..=0xFF). MacRomanEncoding covers the full Mac Roman repertoire (e.g. 0xD2/0xD3 = U+201C/U+201D, 0xD0/0xD1 = en/em dash, 0xCA = non-breaking space, 0xF0 = the Apple logo mapped to U+F8FF). StandardEncoding covers Adobe's original PostScript encoding, including its historical treatment of 0x27 as quoteright (U+2019) and 0x60 as quoteleft (U+2018). MacExpertEncoding still falls back to the identity path (printable ASCII stays ASCII; everything else becomes U+FFFD).

This covers the large majority of Latin-script PDFs generated by standard tools, including Italian bank statements that rely on symbols such as ° and on accented letters, as well as PDFs that use /Encoding dictionaries with /Differences arrays to subset a built-in font.
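
The three-source lookup order might look roughly like the sketch below. `SimpleFontDecoder`, `glyph_name_to_unicode`, and `base_encoding` are hypothetical names, and the two lookup tables are tiny stand-ins for the real glyph-list subset and base encodings.

```rust
use std::collections::BTreeMap;

/// Hypothetical view of the decoding sources for one simple font.
struct SimpleFontDecoder {
    to_unicode: BTreeMap<u16, String>, // from the ToUnicode CMap, if any
    differences: BTreeMap<u8, String>, // byte -> PDF glyph name overrides
}

/// Stand-in for the Adobe Glyph List subset; only three names are known here.
fn glyph_name_to_unicode(name: &str) -> Option<&'static str> {
    match name {
        "Euro" => Some("\u{20AC}"),
        "fi" => Some("\u{FB01}"),
        "degree" => Some("\u{00B0}"),
        _ => None,
    }
}

/// Stand-in for a named base encoding: printable ASCII plus two WinAnsi bytes.
fn base_encoding(byte: u8) -> String {
    match byte {
        0x20..=0x7E => (byte as char).to_string(),
        0x80 => "\u{20AC}".to_string(), // Euro sign
        0x95 => "\u{2022}".to_string(), // bullet
        _ => "\u{FFFD}".to_string(),
    }
}

impl SimpleFontDecoder {
    fn decode(&self, byte: u8) -> String {
        // 1. ToUnicode wins when it has an entry for this byte.
        if let Some(s) = self.to_unicode.get(&u16::from(byte)) {
            return s.clone();
        }
        // 2. /Differences override; unknown glyph names decode to U+FFFD.
        if let Some(name) = self.differences.get(&byte) {
            return glyph_name_to_unicode(name).unwrap_or("\u{FFFD}").to_string();
        }
        // 3. Everything else goes through the named base encoding.
        base_encoding(byte)
    }
}
```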

Composite fonts (Type0)

Composite fonts use a multi-byte character code per glyph. The supported encodings are:

  • Identity-H: every two bytes form a big-endian CID. Unicode comes from the optional ToUnicode CMap on the Type 0 font dict; CIDs without a mapping fall back to ASCII (CID ≤ 0x7F) or U+FFFD.
  • UniGB-UCS2-H, UniKS-UCS2-H: Adobe predefined CMaps where each two-byte code is itself a UCS-2 BE code unit. Decoding bypasses the CID step entirely, and the bytes are interpreted directly as a BMP code point (surrogate halves yield U+FFFD).
  • UniJIS-UTF16-H, UniCNS-UTF16-H: same idea, but the byte stream is UTF-16 BE, so a high-surrogate code unit consumes the next two bytes as the low half and composes a single supplementary-plane scalar (4 bytes per glyph). Orphaned low surrogates yield U+FFFD; truncated high surrogates return a Corrupt error.

For all four predefined CMaps a ToUnicode map, when present, takes precedence for BMP code units (PDF spec §9.10.2). It is not consulted for surrogate-pair glyphs because the in-tree ToUnicode parser only accepts 1- or 2-byte keys; ToUnicode entries with 4-byte source codes (e.g. <D840DC00>) are silently skipped during parsing, and the supplementary-plane scalar is composed directly from the raw UTF-16 bytes instead. Skipping (rather than failing) keeps the BMP entries usable inside the same map.
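
A minimal sketch of the UTF-16 BE path, ignoring ToUnicode precedence. `DecodeError` and `decode_utf16_be` are hypothetical names. Two behaviors are assumptions that the text above does not specify: a high surrogate followed by a non-surrogate unit yields U+FFFD without consuming that unit, and a trailing odd byte is ignored.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Corrupt,
}

/// Decodes a UTF-16 BE show-string into Unicode scalars.
fn decode_utf16_be(bytes: &[u8]) -> Result<Vec<char>, DecodeError> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        let unit = u16::from_be_bytes([bytes[i], bytes[i + 1]]);
        i += 2;
        match unit {
            0xD800..=0xDBFF => {
                // High surrogate: the low half must follow; truncation is an error.
                if i + 1 >= bytes.len() {
                    return Err(DecodeError::Corrupt);
                }
                let low = u16::from_be_bytes([bytes[i], bytes[i + 1]]);
                if (0xDC00..=0xDFFF).contains(&low) {
                    i += 2;
                    let scalar = 0x10000
                        + ((u32::from(unit) - 0xD800) << 10)
                        + (u32::from(low) - 0xDC00);
                    out.push(char::from_u32(scalar).unwrap_or('\u{FFFD}'));
                } else {
                    // Assumption: unpaired high surrogate; the next unit is re-read.
                    out.push('\u{FFFD}');
                }
            }
            // Orphaned low surrogate.
            0xDC00..=0xDFFF => out.push('\u{FFFD}'),
            // Plain BMP code unit.
            _ => out.push(char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}')),
        }
    }
    Ok(out)
}
```

For the bytes `<D840DC00>` mentioned above, the composed scalar is U+20000.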

The ToUnicode CMap is parsed to build a BTreeMap<u16, String> used at decode time. Glyph widths come from the descendant /DW and /W entries; under Identity-H the /W table is keyed by the (two-byte) CID and is consulted as usual. Under the predefined CMaps no real CID is computed, so every glyph reports default_width (the descendant font's /DW, typically 1000 = full-em). To avoid silently mis-positioning glyphs when the descendant /W would override a CID's width away from default_width, the loader rejects predefined-CMap fonts whose /W array contains any non-uniform entry (per AGENTS.md: unsupported features must fail explicitly). Empty or all-DW-equal /W arrays are accepted because they cannot change geometry.
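
The /W acceptance rule could be checked along these lines. `WEntry` is a hypothetical, already-parsed form of the two /W entry shapes, with CID bounds omitted because only the widths matter for this check.

```rust
/// One parsed /W entry, reduced to its widths (hypothetical representation).
enum WEntry {
    /// `c [w1 w2 ...]`: explicit widths for consecutive CIDs.
    List(Vec<f64>),
    /// `c_first c_last w`: one width shared by a CID range.
    Range(f64),
}

/// True when no /W entry differs from the default width, i.e. the array
/// cannot change geometry. An empty array trivially passes.
fn w_matches_default(entries: &[WEntry], default_width: f64) -> bool {
    entries.iter().all(|entry| match entry {
        WEntry::List(widths) => widths.iter().all(|w| *w == default_width),
        WEntry::Range(width) => *width == default_width,
    })
}
```

A loader following the rule above would reject the font when this returns false for a predefined-CMap encoding.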

Vertical-mode CMaps (-V variants) and registry-keyed CMaps such as 90ms-RKSJ-H are rejected with an explicit Unsupported error rather than silently degrading.

ExtGState fonts

The gs operator in a content stream sets the current graphics state from a named ExtGState resource. ExtGState entries can include a Font array that overrides the current font and size. Because gs can appear anywhere in the stream — including before BT — fonts referenced this way must be pre-loaded.

At load time, the engine scans all ExtGState entries in the page resources. For each entry that contains a Font array, it loads the referenced font and stores it under a synthetic key of the form "__gs:GS1" (where GS1 is the ExtGState resource name). The __gs: prefix guarantees no collision with normal font resource names.

At interpretation time, when a gs operator is encountered, the engine looks up the ExtGState font map by the ExtGState name and, if a font entry is found, sets text_state.font to the synthetic key.

Form XObjects

When the interpreter hits a Do operator naming an XObject with /Subtype /Form, it recurses into the Form's own content stream. The Form is treated as an implicit q/Q pair: the current CTM and text state are saved, the Form's Matrix is pre-multiplied into the CTM, the Form's own Resources.Font and Resources.ExtGState are loaded (falling back to the caller's Resources for any names the Form does not declare), and the Form's operations are fed through the same per-operator match used for the page. When the Form returns, the saved CTM and text state are restored.

Cycles — a Form Do-ing itself, or two Forms referencing each other — are broken by a BTreeSet<ObjectRef> of Forms currently being rendered. Recursion is additionally capped at MAX_FORM_XOBJECT_DEPTH = 16 to keep pathological or adversarial documents from driving the interpreter into unbounded recursion.
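
The two guards can be illustrated on a toy graph of Forms. The `ObjectRef` tuple, `FormGraph`, and `walk_form` are assumptions made for this sketch, as is the choice to skip silently when a guard trips; the engine may report an error instead.

```rust
use std::collections::{BTreeMap, BTreeSet};

const MAX_FORM_XOBJECT_DEPTH: usize = 16;

/// Hypothetical object reference: (object number, generation).
type ObjectRef = (u32, u16);

/// Toy document: each Form lists the Forms it invokes with Do.
type FormGraph = BTreeMap<ObjectRef, Vec<ObjectRef>>;

/// Visits `form` and its children; returns how many Forms were entered.
fn walk_form(
    graph: &FormGraph,
    form: ObjectRef,
    active: &mut BTreeSet<ObjectRef>,
    depth: usize,
) -> usize {
    // Guard 1: hard cap on nesting depth.
    if depth >= MAX_FORM_XOBJECT_DEPTH {
        return 0;
    }
    // Guard 2: `insert` returns false when the Form is already being rendered.
    if !active.insert(form) {
        return 0;
    }
    let mut entered = 1;
    for child in graph.get(&form).into_iter().flatten() {
        entered += walk_form(graph, *child, active, depth + 1);
    }
    // Leaving the Form: it may legitimately be invoked again later.
    active.remove(&form);
    entered
}
```

Removing the Form from the set on exit matters: the set tracks Forms on the current call path, not every Form ever seen, so a Form used twice on one page is still rendered twice.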

Image XObjects and XObjects with an unknown subtype are skipped silently at this layer — they carry no text.

Redaction of pages that invoke a Form XObject still fails explicitly. Extracting text from a Form is safe because it is a read-only operation on the Form's bytes; rewriting a Form's content stream in place would change it for every other page that shares it, which is not acceptable for redaction, and copy-on-write rewriting has not been implemented yet.


2. Text state tracking (RuntimeTextState)

The RuntimeTextState struct holds all text-related parameters that affect glyph rendering:

| Field | Type | Description |
|---|---|---|
| text_matrix | Matrix | Current text position and orientation |
| line_matrix | Matrix | Start of current line (used by Td/TD/T*) |
| font_size | f64 | Current font size in user space units |
| character_spacing | f64 | Extra spacing added after each glyph |
| word_spacing | f64 | Extra spacing added after space characters (0x20) |
| text_rise | f64 | Baseline shift for superscripts/subscripts |
| horizontal_scaling | f64 | Horizontal stretch factor, percent (default 100) |
| leading | f64 | Line spacing used by T* and TD |
| font | Option<String> | Current font resource name |
| text_render_mode | i64 | Rendering mode (3 = invisible) |

Operator effects on text state

BT (Begin Text): Resets text_matrix and line_matrix to the identity matrix. Does not reset font, size, spacing, or any other field. This is the behavior specified in ISO 32000. Many implementations incorrectly reset the full text state on BT; this engine does not, because fonts set via gs before BT must survive into the text block.

Tf /FontName size: Sets font to FontName and font_size to size.

Tm a b c d e f: Sets both text_matrix and line_matrix to the given matrix. This is an absolute assignment, not a relative advance. Both matrices become equal.

Td tx ty: Advances the line matrix by the translation (tx, ty), then sets text_matrix = line_matrix. Relative to the current line origin.

TD tx ty: Same as Td but also sets leading = -ty.

T*: Equivalent to Td 0 -leading.

Tc value: Sets character_spacing.

Tw value: Sets word_spacing.

Tz percent: Sets horizontal_scaling.

TL value: Sets leading.

Ts value: Sets text_rise.

Tr mode: Sets text_render_mode.

gs name: Looks up the ExtGState entry. If it contains a font reference, sets font and font_size accordingly.
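
The positioning operators (BT, Tm, Td, TD, T*) can be sketched with a small matrix type. `Matrix` here is a stand-in for the engine's own type, `TextPosition` is a hypothetical subset of RuntimeTextState, and `multiply` is assumed to mean "apply the left operand first", which matches how the cm operator is described in this document.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    fn translate(tx: f64, ty: f64) -> Matrix {
        Matrix { e: tx, f: ty, ..Matrix::IDENTITY }
    }

    /// Row-vector product `self × o`: apply `self` first, then `o`.
    fn multiply(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }
}

struct TextPosition { text_matrix: Matrix, line_matrix: Matrix, leading: f64 }

impl TextPosition {
    /// BT: reset both matrices, leave everything else alone.
    fn begin_text(&mut self) {
        self.text_matrix = Matrix::IDENTITY;
        self.line_matrix = Matrix::IDENTITY;
    }
    /// Tm: absolute assignment of both matrices.
    fn set_matrix(&mut self, m: Matrix) {
        self.text_matrix = m;
        self.line_matrix = m;
    }
    /// Td: translate relative to the current line origin.
    fn move_by(&mut self, tx: f64, ty: f64) {
        self.line_matrix = Matrix::translate(tx, ty).multiply(self.line_matrix);
        self.text_matrix = self.line_matrix;
    }
    /// TD: like Td, but also records leading = -ty.
    fn move_and_set_leading(&mut self, tx: f64, ty: f64) {
        self.leading = -ty;
        self.move_by(tx, ty);
    }
    /// T*: Td 0 -leading.
    fn next_line(&mut self) {
        self.move_by(0.0, -self.leading);
    }
}
```

Because the translation is applied before the existing line matrix, a Td offset is scaled by whatever scale a previous Tm installed.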


3. Glyph geometry computation

For each decoded glyph, the engine computes a quad in normalized page space. The steps are:

1. Compute glyph advance.

For simple fonts:

advance = ((width_units / 1000.0) * font_size + char_spacing + word_spacing_if_space)
          * (horizontal_scaling / 100.0)

width_units is the value from the Widths array (in 1/1000 of a text unit). For composite fonts, widths come from the DW/W entries in the CIDFont dictionary.

word_spacing is added only when the decoded character is the ASCII space (0x20).
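
As a worked instance of the advance formula, the following hypothetical helper takes the relevant text-state values as plain arguments:

```rust
/// Computes one glyph's advance in text-space units.
/// `width_units` is in 1/1000 units; `horizontal_scaling` is a percentage.
fn glyph_advance(
    width_units: f64,
    font_size: f64,
    char_spacing: f64,
    word_spacing: f64,
    horizontal_scaling: f64,
    is_space: bool,
) -> f64 {
    // Word spacing only applies to the ASCII space.
    let word = if is_space { word_spacing } else { 0.0 };
    ((width_units / 1000.0) * font_size + char_spacing + word)
        * (horizontal_scaling / 100.0)
}
```

For a 500-unit glyph at 12 pt with no extra spacing and 100% scaling, the advance is 0.5 × 12 = 6 units.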

2. Build the local rectangle.

local_rect = Rect {
    x:      0.0,
    y:      text_rise - font_size * 0.12,
    width:  advance,
    height: font_size * 0.8,
}

The height covers 80% of the font size, starting 12% below the baseline. This heuristic keeps bounding boxes from overlapping adjacent lines without requiring actual ascender/descender metrics from the font file. Parsing font metric tables (head, hhea, OS/2) is outside MVP scope.

3. Transform to page space.

let text_to_page = text_state.text_matrix.multiply(ctm).multiply(page_transform);
let quad = local_rect.to_quad().transform(text_to_page);

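ctm and page_transform are constant across a single Tj or TJ call, so their product can be computed once per text-showing operation rather than per glyph. Only text_matrix changes from glyph to glyph, because step 4 advances it after each one.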

4. Advance the text matrix.

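// Translate in text space first, then apply the existing text matrix
// (same operand order as the cm handler: new transform on the left).
text_state.text_matrix = Matrix::translate(advance, 0.0).multiply(text_state.text_matrix);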

The text matrix accumulates horizontal advances. For TJ arrays with numeric kern adjustments, the advance is reduced by (kern / 1000.0) * font_size before the translation.
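
The effect of TJ kern adjustments on glyph origins can be sketched in one dimension. `TjElement` and `tj_pen_positions` are hypothetical, and each shown string is reduced to its precomputed per-glyph advances. The sketch follows the kern formula exactly as written above, so it does not apply horizontal scaling to the kern term, although ISO 32000 does scale it.

```rust
enum TjElement {
    /// A shown string, reduced here to its per-glyph advances.
    Advances(Vec<f64>),
    /// A numeric adjustment in thousandths of a text-space unit.
    Kern(f64),
}

/// Returns the x origin of every glyph, relative to the start of the TJ call.
fn tj_pen_positions(elements: &[TjElement], font_size: f64) -> Vec<f64> {
    let mut x = 0.0;
    let mut origins = Vec::new();
    for element in elements {
        match element {
            TjElement::Advances(advances) => {
                for advance in advances {
                    origins.push(x);
                    x += advance;
                }
            }
            // A positive kern moves the pen backwards (tightens the text).
            TjElement::Kern(kern) => x -= (kern / 1000.0) * font_size,
        }
    }
    origins
}
```

With an 8 pt font, a kern of 250 pulls the next glyph back by 0.25 × 8 = 2 units.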


4. Invisible text (Tr=3)

Text render mode 3 (Tr 3) causes glyphs to be drawn with no fill and no stroke — they are invisible on the page. This mode is used extensively in OCR-scanned PDFs, where the original scanned image provides the visual appearance and the invisible text layer provides search and copy/paste capability.

Invisible glyphs are assigned visible: false in the TextGlyph struct. They are:

  • Included in the full glyph array returned by the extraction pass.
  • Excluded from visual display in the demo UI.
  • Excluded from search results shown to the user.
  • Included in redaction processing, because redacting a region of a scanned PDF requires neutralizing the invisible text layer as well as covering the image.

Failing to redact invisible text would leave searchable, selectable text under a black rectangle — a security violation under this engine's threat model.
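
A minimal sketch of the split between display and redaction consumers. The `TextGlyph` here models only two fields, and the helper names are hypothetical. Treating every mode other than 3 as visible is an assumption of this sketch.

```rust
/// Minimal stand-in for the engine's TextGlyph; only two fields are modeled.
#[derive(Debug, Clone, PartialEq)]
struct TextGlyph {
    text: String,
    visible: bool,
}

/// Maps a text render mode to the `visible` flag: mode 3 is invisible.
fn is_visible(text_render_mode: i64) -> bool {
    text_render_mode != 3
}

/// Glyphs that the UI and user-facing search results would show.
fn display_glyphs(glyphs: &[TextGlyph]) -> Vec<&TextGlyph> {
    glyphs.iter().filter(|g| g.visible).collect()
}

/// Glyphs that redaction must consider: all of them, visible or not.
fn redaction_glyphs(glyphs: &[TextGlyph]) -> Vec<&TextGlyph> {
    glyphs.iter().collect()
}
```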


5. Text items

A TextItem represents the output of a single text-showing operation (Tj, one element of a TJ array, ', or "). It contains:

  • text: the coalesced Unicode string for the entire operation.
  • bbox: the bounding box enclosing all glyphs in the operation, in normalized page space.
  • char_start / char_end: byte offsets into the page-level concatenated text string. These are used by the search index to map match offsets back to geometry.

TextItem objects are the primary output of text extraction. The search system operates on the concatenated text of all items on a page, then uses char_start/char_end to retrieve the corresponding quads for highlighting and redaction.
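
The byte-offset bookkeeping can be illustrated as follows. This `TextItem` omits bbox, and both helper functions are hypothetical. The sketch assumes items are concatenated without separators, which this document does not confirm.

```rust
/// Reduced TextItem: text plus byte offsets into the page text.
struct TextItem {
    text: String,
    char_start: usize,
    char_end: usize,
}

/// Concatenates the pieces and records each item's byte range.
fn build_items(pieces: &[&str]) -> (String, Vec<TextItem>) {
    let mut page_text = String::new();
    let mut items = Vec::new();
    for piece in pieces {
        // String::len is a byte length, which is what str::find reports too.
        let char_start = page_text.len();
        page_text.push_str(piece);
        items.push(TextItem {
            text: piece.to_string(),
            char_start,
            char_end: page_text.len(),
        });
    }
    (page_text, items)
}

/// Returns the items whose byte range overlaps the match `[start, end)`.
fn items_for_match(items: &[TextItem], start: usize, end: usize) -> Vec<&TextItem> {
    items
        .iter()
        .filter(|item| item.char_start < end && start < item.char_end)
        .collect()
}
```

For the pieces "Città " and "€100", the match "100" starts at byte 10 but at character 7. An index built on character counts would land one item too early, because the first item ends at byte 7.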


6. Content stream interpretation loop

The main function interpret_page_text iterates over content stream operations in document order. The key operator handlers are:

| Operator | Action |
|---|---|
| q | Push (ctm, text_state.clone()) onto the stack |
| Q | Pop and restore both ctm and text_state |
| cm | Pre-multiply CTM: ctm = matrix.multiply(ctm) |
| BT | Reset text_matrix and line_matrix to identity only |
| ET | No-op (state is not reset; BT handles the reset) |
| Tf | Set font and font_size |
| Tm | Set text_matrix = matrix and line_matrix = matrix |
| Td | Advance line matrix, set text_matrix = line_matrix |
| TD | Same as Td, also set leading = -ty |
| T* | Equivalent to Td 0 -leading |
| Tc | Set character_spacing |
| Tw | Set word_spacing |
| Tz | Set horizontal_scaling |
| TL | Set leading |
| Ts | Set text_rise |
| Tr | Set text_render_mode |
| Tj | Decode and show a literal string |
| TJ | Decode and show an array of strings and kern adjustments |
| ' | Advance one line (T*), then show string |
| " | Set word_spacing and character_spacing, advance one line, then show string |
| gs | Look up ExtGState; if it has a Font entry, update font state |

All other operators are ignored at this layer, with the exception of Do, which recurses into Form XObjects as described in section 1. Unknown operators do not produce errors; content streams routinely contain drawing operators (m, l, S, f, etc.) that are irrelevant to text extraction.
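
A reduced version of the loop, covering only q, Q, BT, Tf, and a show operator, illustrates how the saved state interacts with BT. `Op`, `TextState`, and `states_at_show` are hypothetical, the state is cut down to font and size, and silently ignoring an unbalanced Q is an assumption.

```rust
#[derive(Clone, Debug, PartialEq)]
struct TextState {
    font: Option<String>,
    font_size: f64,
}

enum Op {
    Save,                 // q
    Restore,              // Q
    BeginText,            // BT
    SetFont(String, f64), // Tf
    Show,                 // Tj
}

/// Runs the operators and returns a snapshot of the state at each Show.
fn states_at_show(ops: &[Op]) -> Vec<TextState> {
    let mut state = TextState { font: None, font_size: 0.0 };
    let mut stack: Vec<TextState> = Vec::new();
    let mut seen = Vec::new();
    for op in ops {
        match op {
            Op::Save => stack.push(state.clone()),
            Op::Restore => {
                // Assumption: an unbalanced Q is ignored.
                if let Some(saved) = stack.pop() {
                    state = saved;
                }
            }
            // BT would reset the matrices; font and size are left untouched.
            Op::BeginText => {}
            Op::SetFont(name, size) => {
                state.font = Some(name.clone());
                state.font_size = *size;
            }
            Op::Show => seen.push(state.clone()),
        }
    }
    seen
}
```

For the sequence Tf F1, q, Tf F2, BT, show, Q, BT, show, the first show sees F2 and the second sees F1 again: BT never clears the font, and Q brings back the one saved by q.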


7. Why it was coded this way

BT resets only matrices, not font. The PDF specification (ISO 32000-1 §9.4.1) is explicit: BT establishes the text object and initializes the text matrix and line matrix. It does not reset the text state parameters (font, size, spacing, etc.), which are part of the graphics state and persist across text objects. Many implementations get this wrong. The specific bug this avoids: gs sets a composite font before BT; if BT resets the font, the subsequent Tj has no font and produces no output.

q/Q save and restore text state. The PDF specification (ISO 32000-1 §8.4.2) includes the text state parameters as part of the graphics state saved by q and restored by Q. Not saving them means a font or size change inside a q/Q block leaks out, or a font set before q is lost after Q. Either corruption produces wrong glyphs or missing text in extraction output.

80%/12% heuristic for ascent/descent. The geometrically correct approach would parse the font's hhea, OS/2, or FontBBox entries and use actual ascender/descender values. This is complex, requires handling multiple font formats (Type1, TrueType, CFF), and has a high implementation cost relative to MVP goals. The heuristic produces bounding boxes that are correct enough for search intersection and redaction coverage without false overlaps into adjacent lines for normal body text.

Synthetic __gs: prefix for ExtGState fonts. Font and ExtGState resource names live in separate namespaces within a PDF, so they never collide there, but the engine stores both kinds of font in a single map. A string prefix is the simplest mechanism that rules out a collision with a real font resource that happens to be named GS1 (which a PDF generator could produce).


8. What would break

| Change | Consequence |
|---|---|
| BT resets full text state | Fonts set via gs before BT are lost. Composite-font PDFs with pre-BT gs calls produce no text output. |
| Q does not restore text state | Font state is corrupted across q/Q blocks. Glyphs after Q use whatever font was active inside the block, which may be wrong or absent. |
| Wrong ascent/descent values | Glyph quads extend into adjacent text lines. Search queries match the wrong line; redaction covers text above or below the intended target. |
| Invisible text (Tr=3) excluded from redaction | Searchable OCR text survives under a redaction rectangle. The redacted PDF appears clean visually but the text is extractable, which is a security defect. |
| char_start/char_end computed as character offsets instead of byte offsets | Search index byte positions do not align with text item positions. Matches on non-ASCII text point to the wrong items; the wrong quads are highlighted and redacted. This was a real bug fixed in commit fc85fcf. |