Text System and Text Extraction

This document describes how this engine loads fonts, tracks text state, computes glyph geometry, and assembles the final TextGlyph and TextItem output. Read 05-graphics-state.md first; this document assumes familiarity with the matrix model and coordinate spaces described there.


1. Font loading

Font loading is performed once per page before content stream interpretation begins. The entry point is:

pub fn load_fonts(
    file: &PdfFile,
    resources: &PdfDictionary,
) -> (BTreeMap<String, LoadedFont>, ExtGStateFontMap)

The function returns two maps:

  • BTreeMap<String, LoadedFont>: the primary font map, keyed by the font resource name as it appears in the content stream (e.g., "F1").
  • ExtGStateFontMap: a secondary map for fonts referenced through ExtGState entries, described below.

Simple fonts (Type1, TrueType)

Simple fonts use a single byte per glyph code. The engine reads the FirstChar entry and the Widths array from the font dictionary. Glyph widths are indexed as Widths[code - FirstChar]. Character decoding uses ASCII, so code point 65 maps to 'A'.

This covers the large majority of Latin-script PDFs generated by standard tools.
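The Widths lookup can be sketched as below. The function name and the choice to return None for out-of-range codes are assumptions made for illustration; the source does not say how the engine handles codes outside the Widths array.

```rust
/// Look up a simple-font glyph width as Widths[code - FirstChar].
/// Returns None when the code falls outside the Widths array
/// (assumed behavior; the engine may use a fallback width instead).
fn simple_font_width(first_char: u32, widths: &[f64], code: u8) -> Option<f64> {
    // checked_sub guards against codes below FirstChar.
    let index = u32::from(code).checked_sub(first_char)?;
    widths.get(index as usize).copied()
}
```

With FirstChar = 32 and Widths = [250, 333, 408], code 33 resolves to 333, while codes 31 and 40 yield None.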

Composite fonts (Type0)

Composite fonts use two bytes per glyph code (big-endian CIDs). Support is limited to:

  • Encoding: Identity-H (CID equals the two-byte code directly).
  • Unicode mapping: ToUnicode CMap (required; CIDs without a mapping are skipped).

The ToUnicode CMap is parsed to build a BTreeMap<u16, char> used at decode time.

Composite fonts without Identity-H encoding or without a ToUnicode entry are not supported and produce an explicit error rather than a silent empty decode.
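A minimal sketch of Identity-H decoding against a ToUnicode map, assuming the map has already been parsed into a BTreeMap<u16, char>. The function name, the return shape, and the handling of a trailing odd byte are assumptions, not the engine's actual code.

```rust
use std::collections::BTreeMap;

/// Decode an Identity-H string: each pair of bytes is one big-endian CID.
/// CIDs with no ToUnicode mapping are skipped.
fn decode_identity_h(bytes: &[u8], to_unicode: &BTreeMap<u16, char>) -> Vec<(u16, char)> {
    bytes
        .chunks_exact(2) // a trailing odd byte is ignored (assumption)
        .filter_map(|pair| {
            let cid = u16::from_be_bytes([pair[0], pair[1]]);
            to_unicode.get(&cid).map(|&ch| (cid, ch))
        })
        .collect()
}
```

Returning the CID alongside the character keeps the code available for the width lookup in the DW/W entries.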

ExtGState fonts

The gs operator in a content stream sets the current graphics state from a named ExtGState resource. ExtGState entries can include a Font array that overrides the current font and size. Because gs can appear anywhere in the stream — including before BT — fonts referenced this way must be pre-loaded.

At load time, the engine scans all ExtGState entries in the page resources. For each entry that contains a Font array, it loads the referenced font and stores it under a synthetic key of the form "__gs:GS1" (where GS1 is the ExtGState resource name). The __gs: prefix guarantees no collision with normal font resource names.

At interpretation time, when a gs operator is encountered, the engine looks up the ExtGState font map by the ExtGState name and, if a font entry is found, sets text_state.font to the synthetic key.
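The synthetic-key scheme might look like the sketch below. TextState here is a reduced stand-in for RuntimeTextState, and representing ExtGStateFontMap as a map from ExtGState name to font size is an assumption; the real type is not shown in this document.

```rust
use std::collections::BTreeMap;

/// Build the synthetic font-map key for a font loaded through an ExtGState entry.
fn gs_font_key(ext_gstate_name: &str) -> String {
    format!("__gs:{}", ext_gstate_name)
}

/// Hypothetical stand-in for the text-state fields touched by `gs`.
#[derive(Debug, Default)]
struct TextState {
    font: Option<String>,
    font_size: f64,
}

/// Apply `gs name`: if the ExtGState carries a font, switch to its synthetic key.
/// `gs_fonts` maps ExtGState name -> font size (assumed shape of ExtGStateFontMap).
fn apply_gs(state: &mut TextState, gs_fonts: &BTreeMap<String, f64>, name: &str) {
    if let Some(&size) = gs_fonts.get(name) {
        state.font = Some(gs_font_key(name));
        state.font_size = size;
    }
    // ExtGState entries without a Font array leave the text state untouched.
}
```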


2. Text state tracking (RuntimeTextState)

The RuntimeTextState struct holds all text-related parameters that affect glyph rendering:

| Field | Type | Description |
|---|---|---|
| text_matrix | Matrix | Current text position and orientation |
| line_matrix | Matrix | Start of current line (used by Td/TD/T*) |
| font_size | f64 | Current font size in user space units |
| character_spacing | f64 | Extra spacing added after each glyph |
| word_spacing | f64 | Extra spacing added after space characters (0x20) |
| text_rise | f64 | Baseline shift for superscripts/subscripts |
| horizontal_scaling | f64 | Horizontal stretch factor, percent (default 100) |
| leading | f64 | Line spacing used by T* and TD |
| font | Option<String> | Current font resource name |
| text_render_mode | i64 | Rendering mode (3 = invisible) |

Operator effects on text state

BT (Begin Text): Resets text_matrix and line_matrix to the identity matrix. Does not reset font, size, spacing, or any other field. This is the behavior specified in ISO 32000. Many implementations incorrectly reset the full text state on BT; this engine does not, because fonts set via gs before BT must survive into the text block.

Tf /FontName size: Sets font to FontName and font_size to size.

Tm a b c d e f: Sets both text_matrix and line_matrix to the given matrix. This is an absolute assignment, not a relative advance. Both matrices become equal.

Td tx ty: Advances the line matrix by the translation (tx, ty), then sets text_matrix = line_matrix. Relative to the current line origin.

TD tx ty: Same as Td but also sets leading = -ty.

T*: Equivalent to Td 0 -leading.

Tc value: Sets character_spacing.

Tw value: Sets word_spacing.

Tz percent: Sets horizontal_scaling.

TL value: Sets leading.

Ts value: Sets text_rise.

Tr mode: Sets text_render_mode.

gs name: Looks up the ExtGState entry. If it contains a font reference, sets font and font_size accordingly.
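The positioning operators (BT, Tm, Td, TD, T*) can be sketched as follows. Matrix and TextPosition are simplified stand-ins written for this example, and the method names are hypothetical. The sketch assumes a row-vector convention in which a.multiply(b) means a × b.

```rust
/// Minimal affine matrix [a b c d e f], row-vector convention (stand-in type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    fn translate(tx: f64, ty: f64) -> Matrix {
        Matrix { e: tx, f: ty, ..Matrix::IDENTITY }
    }

    /// Returns self × other.
    fn multiply(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }
}

/// The positioning subset of the text state (hypothetical reduced struct).
struct TextPosition { text_matrix: Matrix, line_matrix: Matrix, leading: f64 }

impl TextPosition {
    fn new() -> Self {
        TextPosition { text_matrix: Matrix::IDENTITY, line_matrix: Matrix::IDENTITY, leading: 0.0 }
    }
    /// BT: reset both matrices, leave everything else alone.
    fn begin_text(&mut self) {
        self.text_matrix = Matrix::IDENTITY;
        self.line_matrix = Matrix::IDENTITY;
    }
    /// Tm: absolute assignment of both matrices.
    fn set_matrix(&mut self, m: Matrix) {
        self.text_matrix = m;
        self.line_matrix = m;
    }
    /// Td: translate the line matrix, then copy it to the text matrix.
    fn move_by(&mut self, tx: f64, ty: f64) {
        self.line_matrix = Matrix::translate(tx, ty).multiply(self.line_matrix);
        self.text_matrix = self.line_matrix;
    }
    /// TD: same as Td, but also sets leading = -ty.
    fn move_by_set_leading(&mut self, tx: f64, ty: f64) {
        self.leading = -ty;
        self.move_by(tx, ty);
    }
    /// T*: equivalent to Td 0 -leading.
    fn next_line(&mut self) {
        let leading = self.leading;
        self.move_by(0.0, -leading);
    }
}
```

For example, after `72 700 Tm` and `0 -14 TD`, leading is 14 and the line origin sits at (72, 686); a following T* moves it to (72, 672).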


3. Glyph geometry computation

For each decoded glyph, the engine computes a quad in normalized page space. The steps are:

1. Compute glyph advance.

For simple fonts:

advance = ((width_units / 1000.0) * font_size + char_spacing + word_spacing_if_space)
          * (horizontal_scaling / 100.0)

width_units is the value from the Widths array (in 1/1000 of a text unit). For composite fonts, widths come from the DW/W entries in the CIDFont dictionary.

word_spacing is added only when the decoded character is the ASCII space (0x20).
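As a worked sketch of the advance formula, the function below takes the text-state values as plain parameters. The function name and the is_space flag are illustrative; the engine presumably reads these values from RuntimeTextState.

```rust
/// Glyph advance in text space, following the formula above.
fn glyph_advance(
    width_units: f64,
    font_size: f64,
    char_spacing: f64,
    word_spacing: f64,
    horizontal_scaling: f64,
    is_space: bool,
) -> f64 {
    // Word spacing only applies to the ASCII space (0x20).
    let word_spacing_if_space = if is_space { word_spacing } else { 0.0 };
    ((width_units / 1000.0) * font_size + char_spacing + word_spacing_if_space)
        * (horizontal_scaling / 100.0)
}
```

A 500-unit glyph at font size 12 with no extra spacing and 100% scaling advances by 6 units. A 250-unit space at font size 12 with character spacing 1, word spacing 2, and 50% scaling advances by (3 + 1 + 2) × 0.5 = 3 units.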

2. Build the local rectangle.

local_rect = Rect {
    x:      0.0,
    y:      text_rise - font_size * 0.12,
    width:  advance,
    height: font_size * 0.8,
}

The height covers 80% of the font size, starting 12% below the baseline. This heuristic keeps bounding boxes from overlapping adjacent lines without requiring actual ascender/descender metrics from the font file. Parsing font metric tables (head, hhea, OS/2) is outside MVP scope.
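The heuristic rectangle can be written as a small function. The Rect struct here is a stand-in with the same four fields as in the snippet above, and the function name is hypothetical.

```rust
/// Stand-in rectangle type with the fields used in the snippet above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect { x: f64, y: f64, width: f64, height: f64 }

/// Heuristic glyph box in text space: starts 12% of the font size below the
/// (rise-adjusted) baseline and spans 80% of the font size in height.
fn glyph_local_rect(advance: f64, font_size: f64, text_rise: f64) -> Rect {
    Rect {
        x: 0.0,
        y: text_rise - font_size * 0.12,
        width: advance,
        height: font_size * 0.8,
    }
}
```

At font size 10 with no text rise, the box runs from about y = -1.2 to y = 6.8, so its top sits at roughly 68% of the font size above the baseline.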

3. Transform to page space.

let text_to_page = text_state.text_matrix.multiply(ctm).multiply(page_transform);
let quad = local_rect.to_quad().transform(text_to_page);

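text_to_page has to be recomputed for every glyph, because step 4 advances text_matrix after each one. Only ctm and page_transform stay constant across a single Tj or TJ call, so their product can be computed once per text-showing operation and reused.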
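The composition order in step 3 can be illustrated with a self-contained sketch. Matrix and Rect are simplified stand-ins, rect_to_page_quad is a hypothetical helper, and the y-flip used as page_transform in the example is an assumed value chosen for illustration.

```rust
/// Minimal affine matrix [a b c d e f], row-vector convention (stand-in type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    /// Returns self × other.
    fn multiply(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }

    /// Transform the point (x, y) as the row vector [x y 1] × self.
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }
}

#[derive(Clone, Copy, Debug)]
struct Rect { x: f64, y: f64, width: f64, height: f64 }

/// Map the four corners of a text-space rectangle to page space.
fn rect_to_page_quad(r: Rect, text_matrix: Matrix, ctm: Matrix, page_transform: Matrix) -> [(f64, f64); 4] {
    // Text space -> user space -> page space, in that order.
    let text_to_page = text_matrix.multiply(ctm).multiply(page_transform);
    [
        text_to_page.apply(r.x, r.y),
        text_to_page.apply(r.x + r.width, r.y),
        text_to_page.apply(r.x + r.width, r.y + r.height),
        text_to_page.apply(r.x, r.y + r.height),
    ]
}
```

With a text matrix translating by (10, 20), a CTM scaling by 2, and a page transform that flips y on a 100-unit-tall page, the text-space origin lands at (20, 60): first translated to (10, 20), then scaled to (20, 40), then flipped to (20, 100 - 40).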

4. Advance the text matrix.

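text_state.text_matrix = Matrix::translate(advance, 0.0).multiply(text_state.text_matrix);

The translation is pre-multiplied, using the same convention as the cm handler, so the advance is applied in text space and follows any scale or rotation already present in text_matrix. The text matrix accumulates horizontal advances. For TJ arrays with numeric kern adjustments, the advance is reduced by (kern / 1000.0) * font_size before the translation.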
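The effect of TJ kern adjustments can be sketched by tracking only the horizontal text-space offset. TjElement and layout_tj are hypothetical names, and character and word spacing are left out to keep the example short. Scaling the kern term by horizontal_scaling follows the displacement formula in ISO 32000-1 and is an assumption about this engine.

```rust
/// One element of a TJ array, reduced to what the layout needs (hypothetical type).
enum TjElement {
    /// A glyph with its width in 1/1000 text units.
    Glyph { width_units: f64 },
    /// A numeric adjustment; positive values pull the next glyph back.
    Kern(f64),
}

/// Returns the text-space x origin of every glyph plus the final x offset.
fn layout_tj(elements: &[TjElement], font_size: f64, horizontal_scaling: f64) -> (Vec<f64>, f64) {
    let scale = horizontal_scaling / 100.0;
    let mut x = 0.0;
    let mut origins = Vec::new();
    for element in elements {
        match element {
            TjElement::Glyph { width_units } => {
                origins.push(x);
                x += (width_units / 1000.0) * font_size * scale;
            }
            TjElement::Kern(kern) => {
                // The adjustment is subtracted from the running advance.
                x -= (kern / 1000.0) * font_size * scale;
            }
        }
    }
    (origins, x)
}
```

For two 500-unit glyphs at font size 8 separated by a kern of -250, the second glyph starts at x = 6 instead of x = 4, and the final offset is 10.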


4. Invisible text (Tr=3)

Text render mode 3 (Tr 3) causes glyphs to be drawn with no fill and no stroke — they are invisible on the page. This mode is used extensively in OCR-scanned PDFs, where the original scanned image provides the visual appearance and the invisible text layer provides search and copy/paste capability.

Invisible glyphs are assigned visible: false in the TextGlyph struct. They are:

  • Included in the full glyph array returned by the extraction pass.
  • Excluded from visual display in the demo UI.
  • Excluded from search results shown to the user.
  • Included in redaction processing, because redacting a region of a scanned PDF requires neutralizing the invisible text layer as well as covering the image.

Failing to redact invisible text would leave searchable, selectable text under a black rectangle — a security violation under this engine's threat model.
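A minimal sketch of how the visible flag separates display from redaction. The TextGlyph struct is reduced to two fields here, and the helper names are hypothetical.

```rust
/// Reduced stand-in for TextGlyph; the real struct also carries geometry.
#[derive(Debug, Clone, PartialEq)]
struct TextGlyph {
    ch: char,
    visible: bool,
}

/// Text render mode 3 marks the glyph as invisible.
fn make_glyph(ch: char, text_render_mode: i64) -> TextGlyph {
    TextGlyph { ch, visible: text_render_mode != 3 }
}

/// Glyphs drawn in the UI: invisible ones are filtered out.
fn glyphs_for_display(glyphs: &[TextGlyph]) -> Vec<&TextGlyph> {
    glyphs.iter().filter(|g| g.visible).collect()
}

/// Glyphs considered by redaction: every glyph, visible or not.
fn glyphs_for_redaction(glyphs: &[TextGlyph]) -> Vec<&TextGlyph> {
    glyphs.iter().collect()
}
```

Keeping the filter at the consumer, rather than dropping invisible glyphs during extraction, is what lets redaction see the OCR layer.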


5. Text items

A TextItem represents the output of a single text-showing operation (Tj, one element of a TJ array, ', or "). It contains:

  • text: the coalesced Unicode string for the entire operation.
  • bbox: the bounding box enclosing all glyphs in the operation, in normalized page space.
  • char_start / char_end: byte offsets into the page-level concatenated text string. These are used by the search index to map match offsets back to geometry.

TextItem objects are the primary output of text extraction. The search system operates on the concatenated text of all items on a page, then uses char_start/char_end to retrieve the corresponding quads for highlighting and redaction.
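The byte-offset bookkeeping can be sketched as follows. TextItem is reduced to its text and offsets, build_page_text and item_at are hypothetical helpers, and concatenating items without a separator is an assumption.

```rust
/// Reduced stand-in for TextItem (no bbox).
#[derive(Debug, PartialEq)]
struct TextItem {
    text: String,
    char_start: usize, // byte offset into the page text, inclusive
    char_end: usize,   // byte offset into the page text, exclusive
}

/// Concatenate item texts and record each item's byte range.
fn build_page_text(item_texts: &[&str]) -> (String, Vec<TextItem>) {
    let mut page_text = String::new();
    let mut items = Vec::with_capacity(item_texts.len());
    for text in item_texts {
        let char_start = page_text.len(); // String::len() counts bytes
        page_text.push_str(text);
        items.push(TextItem {
            text: (*text).to_string(),
            char_start,
            char_end: page_text.len(),
        });
    }
    (page_text, items)
}

/// Find the index of the item whose byte range contains `byte_offset`.
fn item_at(items: &[TextItem], byte_offset: usize) -> Option<usize> {
    items
        .iter()
        .position(|it| it.char_start <= byte_offset && byte_offset < it.char_end)
}
```

For the items "n\u{e9}" and "x", the first item occupies bytes 0..3 because the accented letter takes two bytes in UTF-8. str::find reports the match for 'x' at byte 3, which maps to the second item. A character index of 2 would wrongly resolve to the first item.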


6. Content stream interpretation loop

The main function interpret_page_text iterates over content stream operations in document order. The key operator handlers are:

| Operator | Action |
|---|---|
| q | Push (ctm, text_state.clone()) onto the stack |
| Q | Pop and restore both ctm and text_state |
| cm | Pre-multiply CTM: ctm = matrix.multiply(ctm) |
| BT | Reset text_matrix and line_matrix to identity only |
| ET | No-op (state is not reset; BT handles the reset) |
| Tf | Set font and font_size |
| Tm | Set text_matrix = matrix and line_matrix = matrix |
| Td | Advance line matrix, set text_matrix = line_matrix |
| TD | Same as Td, also set leading = -ty |
| T* | Equivalent to Td 0 -leading |
| Tc | Set character_spacing |
| Tw | Set word_spacing |
| Tz | Set horizontal_scaling |
| TL | Set leading |
| Ts | Set text_rise |
| Tr | Set text_render_mode |
| Tj | Decode and show a literal string |
| TJ | Decode and show an array of strings and kern adjustments |
| ' | Advance one line (T*), then show string |
| " | Set word_spacing and character_spacing, advance one line, then show string |
| gs | Look up ExtGState; if it has a Font entry, update font state |

All other operators are ignored. Unknown operators do not produce errors; content streams routinely contain drawing operators (m, l, S, f, Do, etc.) that are irrelevant to text extraction.
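The save/restore and BT behavior of the loop can be sketched with a reduced operator set. Op, TextState, and interpret are simplified stand-ins for this example rather than the signature of interpret_page_text, and ignoring an unbalanced Q is an assumption.

```rust
const IDENTITY: [f64; 6] = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Reduced text state: matrices plus font selection.
#[derive(Clone, Debug, PartialEq)]
struct TextState {
    text_matrix: [f64; 6],
    line_matrix: [f64; 6],
    font: Option<String>,
    font_size: f64,
}

/// A handful of operators, enough to show the state handling.
enum Op {
    Save,                 // q
    Restore,              // Q
    BeginText,            // BT
    EndText,              // ET
    SetFont(String, f64), // Tf
    Other,                // m, l, S, f, Do, ...
}

/// Run the operators in order and return the final text state.
fn interpret(ops: &[Op]) -> TextState {
    let mut ctm = IDENTITY;
    let mut text_state = TextState {
        text_matrix: IDENTITY,
        line_matrix: IDENTITY,
        font: None,
        font_size: 0.0,
    };
    let mut stack: Vec<([f64; 6], TextState)> = Vec::new();

    for op in ops {
        match op {
            Op::Save => stack.push((ctm, text_state.clone())),
            Op::Restore => {
                // An unbalanced Q is ignored here (assumption).
                if let Some((saved_ctm, saved_text)) = stack.pop() {
                    ctm = saved_ctm;
                    text_state = saved_text;
                }
            }
            Op::BeginText => {
                // Only the matrices are reset; font and size survive.
                text_state.text_matrix = IDENTITY;
                text_state.line_matrix = IDENTITY;
            }
            Op::SetFont(name, size) => {
                text_state.font = Some(name.clone());
                text_state.font_size = *size;
            }
            // ET and unrelated operators change nothing.
            Op::EndText | Op::Other => {}
        }
    }
    text_state
}
```

Running `/F1 12 Tf q /F2 9 Tf Q BT` through this sketch leaves F1 at size 12 selected: Q discards the change made inside the block, and BT does not touch the font.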


7. Why it was coded this way

BT resets only matrices, not font. The PDF specification (ISO 32000-1 §9.4.1) is explicit: BT establishes the text object and initializes the text matrix and line matrix. It does not reset the text state parameters (font, size, spacing, etc.), which are part of the graphics state and persist across text objects. Many implementations get this wrong. The specific bug this avoids: gs sets a composite font before BT; if BT resets the font, the subsequent Tj has no font and produces no output.

q/Q save and restore text state. The PDF specification (ISO 32000-1 §8.4.2) includes the text state parameters as part of the graphics state saved by q and restored by Q. Not saving them means a font or size change inside a q/Q block leaks out, or a font set before q is lost after Q. Either corruption produces wrong glyphs or missing text in extraction output.

80%/12% heuristic for ascent/descent. The geometrically correct approach would parse the font's hhea, OS/2, or FontBBox entries and use actual ascender/descender values. This is complex, requires handling multiple font formats (Type1, TrueType, CFF), and has a high implementation cost relative to MVP goals. The heuristic produces bounding boxes that are correct enough for search intersection and redaction coverage without false overlaps into adjacent lines for normal body text.

Synthetic __gs: prefix for ExtGState fonts. Font resource names and ExtGState resource names live in separate sub-dictionaries of a PDF's resources, so they cannot collide there, but the engine stores every loaded font in a single font map. A string prefix is the simplest mechanism that keeps an ExtGState-derived entry from colliding with a real font resource named GS1 (which a PDF generator could produce).


8. What would break

| Change | Consequence |
|---|---|
| BT resets full text state | Fonts set via gs before BT are lost. Composite-font PDFs with pre-BT gs calls produce no text output. |
| Q does not restore text state | Font state is corrupted across q/Q blocks. Glyphs after Q use whatever font was active inside the block, which may be wrong or absent. |
| Wrong ascent/descent values | Glyph quads extend into adjacent text lines. Search queries match the wrong line; redaction covers text above or below the intended target. |
| Invisible text (Tr=3) excluded from redaction | Searchable OCR text survives under a redaction rectangle. The redacted PDF appears clean visually but the text is extractable, which is a security defect. |
| char_start/char_end computed as character offsets instead of byte offsets | Search index byte positions do not align with text item positions. Matches on non-ASCII text point to the wrong items; the wrong quads are highlighted and redacted. This was a real bug fixed in commit fc85fcf. |