Search Geometry and Match Modeling¶

1. The search problem¶

PDF stores text in content-stream order, which may not match visual (reading) order. In a two-column layout, the content stream may interleave characters from the left column with characters from the right column. A simple left-to-right scan of the raw character sequence would produce nonsense strings and miss every real match.

Search must operate on visual text — the text a reader sees — not on raw content-stream text.

2. Visual display construction (`build_visual_display`)¶

The pipeline begins by converting the flat glyph list produced by the text extractor into a visually ordered sequence of characters.

`build_visual_lines`¶

Groups visible glyphs into lines by Y proximity:

Pre-sort all glyphs by center_y descending, then by x ascending (top-to-bottom, left-to-right within each approximate row).
Walk the sorted list. A glyph joins the current line if either condition holds:
center_delta ≤ line_height * 0.55 (centers are close vertically), or
vertical_overlap ≥ line_height * 0.35 (bounding boxes overlap meaningfully).
If neither condition holds, the glyph starts a new line.

Building the display string¶

After grouping:

Sort each line's glyphs by x ascending (left-to-right reading order).
Sort lines by their average center_y descending (top-to-bottom in standard PDF coordinates where Y increases upward).
Walk consecutive glyphs within each line. If the horizontal gap between them exceeds min(prev_height, curr_height) * 0.3, insert a synthetic space character.
Insert a newline character between consecutive lines.

Output¶

build_visual_display returns a pair:

display_chars: Vec<char> — the visually ordered character sequence including synthetic spaces and newlines.
display_to_glyph: Vec<Option<usize>> — parallel index array mapping each display position to the originating glyph index (or None for synthetic characters).

3. Search normalization (`build_search_index`)¶

The normalized index is built from display_chars for case-insensitive, whitespace-normalized matching:

Collapse all runs of whitespace (spaces and newlines) to a single space character.
Lowercase every character using char::to_lowercase(), which handles multi-character Unicode foldings (e.g., the German "ß" lowercases to "ss").
Build normalized_to_display: a mapping from byte positions in the normalized text to character positions in the display string.

4. The byte-vs-char offset bug (and fix)¶

Rust's str::find() returns byte offsets. str::len() returns byte length. But the original implementation built normalized_to_display indexed by character position — one entry per Unicode scalar value.

Any multi-byte UTF-8 character in the normalized text (e.g. an accented "é" from a European CV or a German umlauted name) caused str::find() to return a byte offset that landed in the middle of the character-indexed array. All match positions after the first multi-byte character were shifted, causing the engine to highlight wrong text.

Fix: normalized_to_display now has one entry per UTF-8 byte of the normalized text. When pushing a character c to the normalized string, we push c.len_utf8() copies of the corresponding display index into normalized_to_display. Byte offset i from str::find() maps directly to normalized_to_display[i].

5. Match finding¶

Given a query string, the match pipeline is:

Normalize the query the same way (whitespace collapse, to_lowercase).
Call str::find() on the normalized text to get a byte-offset range [start, end).
Map start and end through normalized_to_display to get display character positions.
Map display positions through display_to_glyph to get glyph indices (skipping None entries for synthetic characters).
Retrieve the page-space quad from each matched glyph.
Pass the quad set to coalesce_match_quads.

6. Quad coalescing¶

Individual per-glyph quads are merged into a smaller set of visually coherent highlight rectangles.

Sort order¶

Quads are sorted before merging:

Different visual lines: by descending Y center (top-to-bottom).
Same visual line (Y centers within 1.5 pt): by ascending X (left-to-right).

Merge condition¶

Two adjacent quads are merged if both:

vertical_overlap ≥ min_height * 0.45 — they share enough vertical extent to be on the same line.
horizontal_gap ≤ min_height * 0.8 — the gap between them is not too large.

Padding¶

Each final merged quad is expanded outward:

padding_x = max(height * 0.08, 0.6) points on each side horizontally.
padding_y = max(height * 0.12, 0.8) points on each side vertically.

This ensures the highlight visually covers ascenders, descenders, and tight sidebearings.

7. Why it was coded this way¶

Decision	Reason
Visual reordering before search	Content-stream order does not match reading order in multi-column or complex layouts.
Byte-indexed `normalized_to_display`	Rust's `str::find()` returns byte offsets, not char indices. Byte indexing makes the mapping direct and correct for all UTF-8 input.
Synthetic spaces for gaps	PDF glyph sequences contain no explicit inter-word spaces; gaps must be inferred from geometry.
Quad coalescing	Per-character highlight boxes produce hundreds of tiny rectangles for a single match and look incorrect in viewers. Coalesced quads produce clean word or line highlights.
Empirical thresholds (0.3, 0.55, 0.35, 0.45, 0.8)	Tuned against real-world PDFs with varying font sizes, leading, and column layouts. Hard boundaries do not exist in the PDF specification.

8. What would break¶

Per-character indexing instead of per-byte: Any UTF-8 multi-byte character in the matched text causes all subsequent match positions to shift, highlighting wrong glyphs.
No visual reordering: Search finds text at the wrong positions in multi-column and mixed-direction layouts.
Too-aggressive line merging (raising thresholds): Lines that are visually separate get merged; a query that spans the boundary of two paragraphs matches text that does not appear adjacent to the reader.
No quad coalescing: Each matched glyph produces an individual highlight rectangle, resulting in hundreds of overlapping boxes per word and incorrect visual output in PDF viewers.