width_calculator.py

width_calculator.py provides precise text-width measurement for candidate name matching.

Functions

`get_text_widths(texts, font_name, font_size, force_uppercase, scale_factor, kerning, ligatures)`

Calculates pixel widths for a list of text strings.

Parameters:

Parameter	Type	Default	Description
`texts`	list[str]	—	Strings to measure
`font_name`	str	`"times.ttf"`	Font filename
`font_size`	int/float	`12`	Font size in points
`force_uppercase`	bool	`False`	Convert text to uppercase before measuring
`scale_factor`	float	`1.35`	Multiplier applied to the raw advance width
`kerning`	bool	`True`	Enable OpenType `kern` feature
`ligatures`	bool	`True`	Enable `liga` and `clig` features

Output:

[{"text": "Jeffrey Epstein", "width": 89.472}, ...]

Font Resolution

The font is searched in this order:

Direct path (font_name as-is)
assets/fonts/{font_name}
assets/fonts/{font_name}.ttf

System font directories are intentionally excluded to ensure consistent results across environments.
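The search order above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `resolve_font_path` and its `font_dir` parameter are hypothetical, and the real implementation may signal a missing font differently.

```python
import os
from typing import Optional

# Assumed font directory, taken from the search order listed above.
FONT_DIR = os.path.join("assets", "fonts")


def resolve_font_path(font_name: str, font_dir: str = FONT_DIR) -> Optional[str]:
    """Return the first candidate path that exists, or None if none does."""
    candidates = [
        font_name,                                   # 1. direct path, as given
        os.path.join(font_dir, font_name),           # 2. assets/fonts/{font_name}
        os.path.join(font_dir, font_name + ".ttf"),  # 3. assets/fonts/{font_name}.ttf
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    # System font directories are deliberately not searched.
    return None
```

Returning `None` rather than silently picking a system font keeps a missing font visible to the caller, which fits the goal of consistent results across environments.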

HarfBuzz Engine (Primary)

When uharfbuzz is available:

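import uharfbuzz as hb

# font_data holds the raw bytes of the resolved font file
face = hb.Face(font_data)
font = hb.Font(face)
upem = face.upem   # units per em

buf = hb.Buffer()
buf.add_str(text)
buf.guess_segment_properties()   # infer script, language, and direction

# features maps OpenType feature tags to booleans, e.g. {"kern": True}
hb.shape(font, buf, features)

# Advances are in font units; normalize by upem, then scale to pixels
total_advance = sum(pos.x_advance for pos in buf.glyph_positions)
pixel_width = (total_advance / upem) * font_size * scale_factor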

Features controlled:

Feature	Enabled	Disabled by
`kern`	By default	`kerning=False`
`liga`	By default	`ligatures=False`
`clig`	By default	`ligatures=False`
`dlig`	Never	`ligatures=False`
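One way to turn the two boolean parameters into the feature mapping passed to `hb.shape()` is sketched below. The helper name `build_features` is hypothetical, and pinning `dlig` to `False` in every case is an assumption based on the table, not a confirmed detail of the implementation.

```python
from typing import Dict


def build_features(kerning: bool = True, ligatures: bool = True) -> Dict[str, bool]:
    """Map the two user-facing switches to OpenType feature tags."""
    return {
        "kern": kerning,    # pair kerning
        "liga": ligatures,  # standard ligatures
        "clig": ligatures,  # contextual ligatures
        "dlig": False,      # discretionary ligatures: never enabled (assumption)
    }
```

The resulting dictionary can be passed as the third argument of `hb.shape(font, buf, features)`, which accepts a mapping from feature tag to boolean.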

Pillow Fallback

If HarfBuzz fails or is not installed, the function falls back to Pillow, loading the font with ImageFont.truetype() and measuring with font.getlength(). The fallback does not support fine-grained kerning or ligature control.
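The fallback behaviour can be sketched as an ordered list of measuring functions that are tried in turn. The names `pillow_width` and `measure_with_fallback` are hypothetical, and applying `scale_factor` to the Pillow result is an assumption made for consistency with the HarfBuzz path. Some Pillow versions may require an integer font size.

```python
from typing import Callable, Iterable


def pillow_width(text: str, font_path: str, font_size: float, scale_factor: float) -> float:
    """Measure text with Pillow; imported lazily so the module loads without it."""
    from PIL import ImageFont

    font = ImageFont.truetype(font_path, font_size)
    # Assumption: the same scale_factor is applied as in the HarfBuzz path.
    return font.getlength(text) * scale_factor


def measure_with_fallback(text: str, engines: Iterable[Callable[[str], float]]) -> float:
    """Return the result of the first engine that succeeds."""
    last_error = None
    for engine in engines:
        try:
            return engine(text)
        except Exception as exc:  # missing library, unreadable font, shaping error
            last_error = exc
    raise RuntimeError(f"no engine could measure {text!r}") from last_error
```

With this shape, the HarfBuzz measurer would come first in `engines` and a `pillow_width` wrapper second, so Pillow is only reached when the primary engine raises.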

`get_available_fonts()`

Scans the assets/fonts/ directory and returns a list of .ttf filenames.

Output: ["times.ttf", "arial.ttf", ...]

Used by the /fonts-list API endpoint to populate the frontend font dropdown.
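A directory scan of this kind might look like the sketch below. The function name `list_ttf_fonts` is hypothetical. Sorting the result and matching the extension case-insensitively are assumptions made here for determinism; the actual function may return names in a different order.

```python
import os
from typing import List


def list_ttf_fonts(font_dir: str = os.path.join("assets", "fonts")) -> List[str]:
    """Return the .ttf filenames found directly inside font_dir."""
    if not os.path.isdir(font_dir):
        return []  # a missing directory yields an empty dropdown, not an error
    return sorted(
        name
        for name in os.listdir(font_dir)
        if name.lower().endswith(".ttf")
        and os.path.isfile(os.path.join(font_dir, name))
    )
```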

Scale Factor

scale_factor is the multiplier that converts a text width in points (the HarfBuzz advance in font units, divided by upem and multiplied by the font size) into the pixel width used by the redaction overlay coordinates.

Formula

pixel_width = (advance / upem) × font_size_pt × scale_factor
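As a worked instance of the formula, the sketch below uses illustrative numbers only: a total advance of 10240 font units and upem = 2048 (a common value for TrueType fonts) are assumptions, not measurements of any particular string. The helper name `advance_to_pixels` is hypothetical.

```python
def advance_to_pixels(total_advance: float, upem: int,
                      font_size_pt: float, scale_factor: float) -> float:
    """Apply pixel_width = (advance / upem) * font_size_pt * scale_factor."""
    width_em = total_advance / upem      # font units -> em
    width_pt = width_em * font_size_pt   # em -> points at this font size
    return width_pt * scale_factor       # points -> image pixels


# 10240 / 2048 = 5 em; 5 em * 12 pt = 60 pt; 60 pt * 4/3 is about 80 px
width = advance_to_pixels(10240, 2048, 12, 4 / 3)
print(round(width, 3))
```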

For the width to match a redaction box measured in the 816 × 1056 px embedded page images of a 612 pt wide page:

scale_factor = img_width_px / page_width_pt
             = 816 / 612
             = 4/3
             ≈ 1.3333

This is equivalent to converting from 72 dpi (PDF points) to 96 dpi (screen pixels): 96 / 72 = 4/3.
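The derivation can be checked numerically. The page height of 792 pt is an assumption here (US Letter, consistent with the 612 pt width and the 1056 px image height), and the helper name `scale_factor_from_page` is hypothetical.

```python
import math


def scale_factor_from_page(img_px: float, page_pt: float) -> float:
    """Pixels per point along one axis of the embedded page image."""
    return img_px / page_pt


horizontal = scale_factor_from_page(816, 612)   # image width / page width
vertical = scale_factor_from_page(1056, 792)    # image height / assumed page height

# Both axes agree with the 72 dpi -> 96 dpi conversion.
assert math.isclose(horizontal, 96 / 72)
assert math.isclose(vertical, 96 / 72)
```

Agreement between the two axes indicates that the image is a uniform 96 dpi rendering of the page, so a single scale_factor is sufficient.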

How the frontend sets scale_factor

The /analyze-pdf response includes suggested_scale (an integer percentage). views.py divides it by 100 before passing it to get_text_widths():

scale_factor = scale / 100.0   # e.g. 133 / 100 = 1.33

The auto-detected value suggested_scale = 133 corresponds to scale_factor = 1.33, a close approximation of 4/3 ≈ 1.3333 that maps 12 pt Times New Roman to its pixel width in the embedded page images.
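Because the scale travels as an integer percentage, a small rounding error is introduced. The sketch below quantifies it, assuming the percentage is obtained by rounding 4/3 × 100; how suggested_scale is actually computed is not specified here.

```python
import math

exact = 4 / 3                           # ideal points-to-pixels factor
suggested_scale = round(exact * 100)    # integer percentage (assumed rounding)
scale_factor = suggested_scale / 100.0  # value passed to get_text_widths()

# Relative shortfall caused by dropping the fractional part of 133.33...
relative_error = (exact - scale_factor) / exact

assert suggested_scale == 133
assert math.isclose(relative_error, 0.0025, rel_tol=1e-9)
```

A 0.25% shortfall amounts to about a quarter of a pixel on a 100 px wide string, which is worth keeping in mind when comparing measured widths against redaction boxes with tight tolerances.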

Note: The function signature's default scale_factor=1.35 is a legacy approximation of 4/3. In normal operation the frontend always supplies an explicit scale from the suggested_scale auto-detection, so the default is rarely used.

For a full derivation of the correct scale value and why the old formula ((median_size / 12) × (816/612)² × 100 ≈ 178) was incorrect, see Scale & Size Detection.