width_calculator.py
width_calculator.py provides precision text-width measurement for candidate name matching.
Functions
get_text_widths(texts, font_name, font_size, force_uppercase, scale_factor, kerning, ligatures)
Calculates pixel widths for a list of text strings.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| texts | list[str] | (none) | Strings to measure |
| font_name | str | "times.ttf" | Font filename |
| font_size | int/float | 12 | Font size in points |
| force_uppercase | bool | False | Convert text to uppercase before measuring |
| scale_factor | float | 1.35 | Multiplier applied to the raw advance width |
| kerning | bool | True | Enable OpenType kern feature |
| ligatures | bool | True | Enable liga and clig features |
Output:
[{"text": "Jeffrey Epstein", "width": 89.472}, ...]
Font Resolution
The font is searched in this order:
- Direct path (font_name as-is)
- assets/fonts/{font_name}
- assets/fonts/{font_name}.ttf
System font directories are intentionally excluded to ensure consistent results across environments.
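The lookup order above can be sketched roughly as follows. This is an illustration only, not the module's actual code: the helper name resolve_font_path and its fonts_dir parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_font_path(font_name: str, fonts_dir: str = "assets/fonts") -> Optional[Path]:
    """Return the first candidate path that exists, or None.

    Hypothetical helper: the name and signature are assumptions.
    """
    candidates = [
        Path(font_name),                       # 1. direct path, as given
        Path(fonts_dir) / font_name,           # 2. assets/fonts/{font_name}
        Path(fonts_dir) / f"{font_name}.ttf",  # 3. assets/fonts/{font_name}.ttf
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # System font directories are deliberately not searched.
    return None
```

Returning None rather than raising is a choice made for this sketch; the real module may report a missing font differently.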
HarfBuzz Engine (Primary)
When uharfbuzz is available:
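import uharfbuzz as hb
from pathlib import Path
font_data = Path(font_path).read_bytes()  # font_path (assumed name): the file found during font resolution
face = hb.Face(font_data)
font = hb.Font(face)
upem = face.upem  # units per em
features = {"kern": kerning, "liga": ligatures, "clig": ligatures}  # assumed mapping from the kerning/ligatures arguments
buf = hb.Buffer()
buf.add_str(text)
buf.guess_segment_properties()  # infer script, language and direction
hb.shape(font, buf, features)
total_advance = sum(pos.x_advance for pos in buf.glyph_positions)  # in font units
pixel_width = (total_advance / upem) * font_size * scale_factor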
Features controlled:
| Feature | Enabled | Disabled |
|---|---|---|
| kern | Default | kerning=False |
| liga | Default | ligatures=False |
| clig | Default | ligatures=False |
| dlig | Never | ligatures=False |
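The feature table can be read as a mapping from the two boolean arguments to an OpenType feature dictionary. Below is a minimal sketch, assuming a hypothetical helper named build_features; the module may construct its features differently. uharfbuzz's hb.shape() accepts a dictionary of this shape as its features argument.

```python
def build_features(kerning: bool = True, ligatures: bool = True) -> dict:
    """Map the kerning/ligatures flags to OpenType feature toggles.

    Hypothetical helper: the name is an assumption for illustration.
    """
    return {
        "kern": kerning,    # pair kerning
        "liga": ligatures,  # standard ligatures
        "clig": ligatures,  # contextual ligatures
        "dlig": False,      # discretionary ligatures are never enabled
    }
```

For example, build_features(ligatures=False) leaves kern on and turns liga and clig off.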
Pillow Fallback
If HarfBuzz shaping fails or uharfbuzz is not installed, get_text_widths() falls back to Pillow: it loads the font with ImageFont.truetype() and measures each string with font.getlength(). This path does not support fine-grained kerning/ligature control.
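One way to picture the fallback is as a list of measuring engines tried in order. The sketch below is an assumption about structure, not the module's code: measure_with_fallback and the two stand-in engines are hypothetical, and the stand-ins do not call HarfBuzz or Pillow.

```python
from typing import Callable, Sequence


def measure_with_fallback(text: str, engines: Sequence[Callable[[str], float]]) -> float:
    """Return the width from the first engine that succeeds."""
    last_error = None
    for engine in engines:
        try:
            return engine(text)
        except Exception as exc:  # e.g. ImportError when uharfbuzz is missing
            last_error = exc
    raise RuntimeError("no measuring engine succeeded") from last_error


def broken_harfbuzz(text: str) -> float:
    # Stand-in for the primary engine being unavailable.
    raise ImportError("uharfbuzz not installed")


def fake_pillow(text: str) -> float:
    # Stand-in for the fallback engine: a fixed 6 px per character.
    return 6.0 * len(text)


width = measure_with_fallback("abc", [broken_harfbuzz, fake_pillow])
```

Here the first engine raises, so the width comes from the second one.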
get_available_fonts()
Scans the assets/fonts/ directory and returns a list of .ttf filenames.
Output: ["times.ttf", "arial.ttf", ...]
Used by the /fonts-list API endpoint to populate the frontend font dropdown.
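A directory scan of this kind might look like the sketch below. The function name list_ttf_fonts is hypothetical, and the sorting, case-insensitive extension match, and empty-list result for a missing directory are assumptions rather than documented behaviour of get_available_fonts().

```python
from pathlib import Path
from typing import List


def list_ttf_fonts(fonts_dir: str = "assets/fonts") -> List[str]:
    """Return the .ttf filenames found directly inside fonts_dir."""
    directory = Path(fonts_dir)
    if not directory.is_dir():
        return []  # assumption: a missing directory yields no fonts
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix.lower() == ".ttf"
    )
```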
Scale Factor
scale_factor is the multiplier that converts a text width in typographic points (the raw advance divided by units per em, times the font size) into the image pixel width used by the redaction overlay coordinates.
Formula
pixel_width = (advance / upem) × font_size_pt × scale_factor
For the width to match a redaction box measured in the 816 × 1056 px embedded page images:
scale_factor = img_width_px / page_width_pt
= 816 / 612
= 4/3
≈ 1.3333
This is equivalent to converting from 72 dpi (PDF points) to 96 dpi (screen pixels): 96 / 72 = 4/3.
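The arithmetic above can be checked with a few lines of Python. This is a sketch: the function name pixel_width and the one-em input (advance 2048 in a font with 2048 units per em) are hypothetical values chosen for illustration.

```python
import math

IMG_WIDTH_PX = 816   # width of the embedded page image
PAGE_WIDTH_PT = 612  # page width in PDF points (8.5 in at 72 pt per inch)


def pixel_width(advance: float, upem: int, font_size_pt: float, scale_factor: float) -> float:
    """Apply the formula: (advance / upem) * font_size_pt * scale_factor."""
    return (advance / upem) * font_size_pt * scale_factor


scale_factor = IMG_WIDTH_PX / PAGE_WIDTH_PT                   # 4/3
same_as_dpi_ratio = math.isclose(scale_factor, 96 / 72)       # True

# Hypothetical input: a run exactly one em wide, set at 12 pt.
one_em_px = pixel_width(2048, 2048, 12, scale_factor)         # about 16 px
```

A one-em run at 12 pt is 12 pt wide, so it occupies about 16 px in the 816 px wide page image.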
How the frontend sets scale_factor
The /analyze-pdf response includes suggested_scale (an integer percentage). views.py divides it by 100 before passing it to get_text_widths():
scale_factor = scale / 100.0 # e.g. 133 / 100 = 1.33
The auto-detected value suggested_scale = 133 yields scale_factor = 1.33, a close approximation of 4/3 ≈ 1.3333, which maps 12 pt Times New Roman to its pixel width in the embedded page images.
Note: The function signature's default scale_factor=1.35 is a legacy approximation of 4/3. In normal operation the frontend always supplies an explicit scale from the suggested_scale auto-detection, so the default is rarely used.
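To compare the values mentioned here, the sketch below computes how far the auto-detected scale and the legacy default are from the exact 4/3. The helper names scale_from_percentage and relative_error are hypothetical; only the division by 100 comes from the description of views.py.

```python
EXACT_SCALE = 4 / 3  # 816 px / 612 pt


def scale_from_percentage(suggested_scale: int) -> float:
    """Convert an integer percentage (e.g. 133) to a scale factor."""
    return suggested_scale / 100.0


def relative_error(scale_factor: float) -> float:
    """Relative deviation of a scale factor from the exact 4/3."""
    return abs(scale_factor - EXACT_SCALE) / EXACT_SCALE


auto_detected = scale_from_percentage(133)  # 1.33
error_auto = relative_error(auto_detected)  # about 0.25 %
error_legacy = relative_error(1.35)         # about 1.25 %
```

For a redaction box 100 px wide, that is a discrepancy of roughly 0.25 px with the auto-detected scale and roughly 1.25 px with the legacy default.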
For a full derivation of the correct scale value and why the old formula ((median_size / 12) × (816/612)² × 100 ≈ 178) was incorrect, see Scale & Size Detection.