Search Geometry and Match Modeling¶
1. The search problem¶
PDF stores text in content-stream order, which may not match visual (reading) order. In a two-column layout, the content stream may interleave characters from the left column with characters from the right column. A simple left-to-right scan of the raw character sequence would produce nonsense strings and miss every real match.
Search must operate on visual text — the text a reader sees — not on raw content-stream text.
2. Visual display construction (build_visual_display)¶
The pipeline begins by converting the flat glyph list produced by the text extractor into a visually ordered sequence of characters.
build_visual_lines¶
Groups visible glyphs into lines by Y proximity:
- Pre-sort all glyphs by
center_ydescending, then byxascending (top-to-bottom, left-to-right within each approximate row). - Walk the sorted list. A glyph joins the current line if either condition holds:
center_delta ≤ line_height * 0.55(centers are close vertically), orvertical_overlap ≥ line_height * 0.35(bounding boxes overlap meaningfully).- If neither condition holds, the glyph starts a new line.
Building the display string¶
After grouping:
- Sort each line's glyphs by
xascending (left-to-right reading order). - Sort lines by their average
center_ydescending (top-to-bottom in standard PDF coordinates where Y increases upward). - Walk consecutive glyphs within each line. If the horizontal gap between them exceeds
min(prev_height, curr_height) * 0.3, insert a synthetic space character. - Insert a newline character between consecutive lines.
Output¶
build_visual_display returns a pair:
display_chars: Vec<char>— the visually ordered character sequence including synthetic spaces and newlines.display_to_glyph: Vec<Option<usize>>— parallel index array mapping each display position to the originating glyph index (orNonefor synthetic characters).
3. Search normalization (build_search_index)¶
The normalized index is built from display_chars for case-insensitive, whitespace-normalized matching:
- Collapse all runs of whitespace (spaces and newlines) to a single space character.
- Lowercase every character using
char::to_lowercase(), which handles multi-character Unicode foldings (e.g., the German "ß" lowercases to "ss"). - Build
normalized_to_display: a mapping from byte positions in the normalized text to character positions in the display string.
4. The byte-vs-char offset bug (and fix)¶
Rust's str::find() returns byte offsets. str::len() returns byte length. But the original implementation built normalized_to_display indexed by character position — one entry per Unicode scalar value.
Any multi-byte UTF-8 character in the normalized text (e.g. an accented "é" from a European CV or a German umlauted name) caused str::find() to return a byte offset that landed in the middle of the character-indexed array. All match positions after the first multi-byte character were shifted, causing the engine to highlight wrong text.
Fix: normalized_to_display now has one entry per UTF-8 byte of the normalized text. When pushing a character c to the normalized string, we push c.len_utf8() copies of the corresponding display index into normalized_to_display. Byte offset i from str::find() maps directly to normalized_to_display[i].
5. Match finding¶
Given a query string, the match pipeline is:
- Normalize the query the same way (whitespace collapse,
to_lowercase). - Call
str::find()on the normalized text to get a byte-offset range[start, end). - Map
startandendthroughnormalized_to_displayto get display character positions. - Map display positions through
display_to_glyphto get glyph indices (skippingNoneentries for synthetic characters). - Retrieve the page-space quad from each matched glyph.
- Pass the quad set to
coalesce_match_quads.
6. Quad coalescing¶
Individual per-glyph quads are merged into a smaller set of visually coherent highlight rectangles.
Sort order¶
Quads are sorted before merging:
- Different visual lines: by descending Y center (top-to-bottom).
- Same visual line (Y centers within 1.5 pt): by ascending X (left-to-right).
Merge condition¶
Two adjacent quads are merged if both:
vertical_overlap ≥ min_height * 0.45— they share enough vertical extent to be on the same line.horizontal_gap ≤ min_height * 0.8— the gap between them is not too large.
Padding¶
Each final merged quad is expanded outward:
padding_x = max(height * 0.08, 0.6)points on each side horizontally.padding_y = max(height * 0.12, 0.8)points on each side vertically.
This ensures the highlight visually covers ascenders, descenders, and tight sidebearings.
7. Why it was coded this way¶
| Decision | Reason |
|---|---|
| Visual reordering before search | Content-stream order does not match reading order in multi-column or complex layouts. |
Byte-indexed normalized_to_display |
Rust's str::find() returns byte offsets, not char indices. Byte indexing makes the mapping direct and correct for all UTF-8 input. |
| Synthetic spaces for gaps | PDF glyph sequences contain no explicit inter-word spaces; gaps must be inferred from geometry. |
| Quad coalescing | Per-character highlight boxes produce hundreds of tiny rectangles for a single match and look incorrect in viewers. Coalesced quads produce clean word or line highlights. |
| Empirical thresholds (0.3, 0.55, 0.35, 0.45, 0.8) | Tuned against real-world PDFs with varying font sizes, leading, and column layouts. Hard boundaries do not exist in the PDF specification. |
8. What would break¶
- Per-character indexing instead of per-byte: Any UTF-8 multi-byte character in the matched text causes all subsequent match positions to shift, highlighting wrong glyphs.
- No visual reordering: Search finds text at the wrong positions in multi-column and mixed-direction layouts.
- Too-aggressive line merging (raising thresholds): Lines that are visually separate get merged; a query that spans the boundary of two paragraphs matches text that does not appear adjacent to the reader.
- No quad coalescing: Each matched glyph produces an individual highlight rectangle, resulting in hundreds of overlapping boxes per word and incorrect visual output in PDF viewers.