Skip to content

Search Geometry and Match Modeling

1. The search problem

PDF stores text in content-stream order, which may not match visual (reading) order. In a two-column layout, the content stream may interleave characters from the left column with characters from the right column. A simple left-to-right scan of the raw character sequence would produce nonsense strings and miss every real match.

Search must operate on visual text — the text a reader sees — not on raw content-stream text.

2. Visual display construction (build_visual_display)

The pipeline begins by converting the flat glyph list produced by the text extractor into a visually ordered sequence of characters.

build_visual_lines

Groups visible glyphs into lines by Y proximity:

  1. Pre-sort all glyphs by center_y descending, then by x ascending (top-to-bottom, left-to-right within each approximate row).
  2. Walk the sorted list. A glyph joins the current line if either condition holds:
  3. center_delta ≤ line_height * 0.55 (centers are close vertically), or
  4. vertical_overlap ≥ line_height * 0.35 (bounding boxes overlap meaningfully).
  5. If neither condition holds, the glyph starts a new line.

Building the display string

After grouping:

  1. Sort each line's glyphs by x ascending (left-to-right reading order).
  2. Sort lines by their average center_y descending (top-to-bottom in standard PDF coordinates where Y increases upward).
  3. Walk consecutive glyphs within each line. If the horizontal gap between them exceeds min(prev_height, curr_height) * 0.3, insert a synthetic space character.
  4. Insert a newline character between consecutive lines.

Output

build_visual_display returns a pair:

  • display_chars: Vec<char> — the visually ordered character sequence including synthetic spaces and newlines.
  • display_to_glyph: Vec<Option<usize>> — parallel index array mapping each display position to the originating glyph index (or None for synthetic characters).

3. Search normalization (build_search_index)

The normalized index is built from display_chars for case-insensitive, whitespace-normalized matching:

  1. Collapse all runs of whitespace (spaces and newlines) to a single space character.
  2. Lowercase every character using char::to_lowercase(), which handles multi-character Unicode foldings (e.g., the German "ß" lowercases to "ss").
  3. Build normalized_to_display: a mapping from byte positions in the normalized text to character positions in the display string.

4. The byte-vs-char offset bug (and fix)

Rust's str::find() returns byte offsets. str::len() returns byte length. But the original implementation built normalized_to_display indexed by character position — one entry per Unicode scalar value.

Any multi-byte UTF-8 character in the normalized text (e.g. an accented "é" from a European CV or a German umlauted name) caused str::find() to return a byte offset that landed in the middle of the character-indexed array. All match positions after the first multi-byte character were shifted, causing the engine to highlight wrong text.

Fix: normalized_to_display now has one entry per UTF-8 byte of the normalized text. When pushing a character c to the normalized string, we push c.len_utf8() copies of the corresponding display index into normalized_to_display. Byte offset i from str::find() maps directly to normalized_to_display[i].

5. Match finding

Given a query string, the match pipeline is:

  1. Normalize the query the same way (whitespace collapse, to_lowercase).
  2. Call str::find() on the normalized text to get a byte-offset range [start, end).
  3. Map start and end through normalized_to_display to get display character positions.
  4. Map display positions through display_to_glyph to get glyph indices (skipping None entries for synthetic characters).
  5. Retrieve the page-space quad from each matched glyph.
  6. Pass the quad set to coalesce_match_quads.

6. Quad coalescing

Individual per-glyph quads are merged into a smaller set of visually coherent highlight rectangles.

Sort order

Quads are sorted before merging:

  • Different visual lines: by descending Y center (top-to-bottom).
  • Same visual line (Y centers within 1.5 pt): by ascending X (left-to-right).

Merge condition

Two adjacent quads are merged if both:

  • vertical_overlap ≥ min_height * 0.45 — they share enough vertical extent to be on the same line.
  • horizontal_gap ≤ min_height * 0.8 — the gap between them is not too large.

Padding

Each final merged quad is expanded outward:

  • padding_x = max(height * 0.08, 0.6) points on each side horizontally.
  • padding_y = max(height * 0.12, 0.8) points on each side vertically.

This ensures the highlight visually covers ascenders, descenders, and tight sidebearings.

7. Why it was coded this way

Decision Reason
Visual reordering before search Content-stream order does not match reading order in multi-column or complex layouts.
Byte-indexed normalized_to_display Rust's str::find() returns byte offsets, not char indices. Byte indexing makes the mapping direct and correct for all UTF-8 input.
Synthetic spaces for gaps PDF glyph sequences contain no explicit inter-word spaces; gaps must be inferred from geometry.
Quad coalescing Per-character highlight boxes produce hundreds of tiny rectangles for a single match and look incorrect in viewers. Coalesced quads produce clean word or line highlights.
Empirical thresholds (0.3, 0.55, 0.35, 0.45, 0.8) Tuned against real-world PDFs with varying font sizes, leading, and column layouts. Hard boundaries do not exist in the PDF specification.

8. What would break

  • Per-character indexing instead of per-byte: Any UTF-8 multi-byte character in the matched text causes all subsequent match positions to shift, highlighting wrong glyphs.
  • No visual reordering: Search finds text at the wrong positions in multi-column and mixed-direction layouts.
  • Too-aggressive line merging (raising thresholds): Lines that are visually separate get merged; a query that spans the boundary of two paragraphs matches text that does not appear adjacent to the reader.
  • No quad coalescing: Each matched glyph produces an individual highlight rectangle, resulting in hundreds of overlapping boxes per word and incorrect visual output in PDF viewers.