Redaction Processing — Backend Logic

This folder documents the core logic modules distributed across the guesser_core, webgl_mask, and text_tool Django apps.

Module Pipeline

```mermaid
flowchart TD
    subgraph guesser_core
        PR["ProcessRedactions"]
        BD["BoxDetector"]
        SW["SurroundingWordWidth"]
    end

    subgraph webgl_mask
        AV["artifact_visualizer"]
    end

    subgraph text_tool
        WC["width_calculator"]
        EF["extract_fonts"]
    end

    PDF["PDF Bytes"] --> PR
    PR --> BD
    PR --> SW

    PDF --> AV
    AV -.->|"depends on core logic"| BD

    EF -.- PR

    style PR fill:#2d333b,stroke:#81c995
    style BD fill:#2d333b,stroke:#8ab4f8
    style SW fill:#2d333b,stroke:#f28b82
    style AV fill:#2d333b,stroke:#fdd663
    style WC fill:#2d333b,stroke:#c58af9
    style EF fill:#2d333b,stroke:#c58af9
```

Module Reference

| App | Module | Description |
| --- | --- | --- |
| guesser_core | BoxDetector | Row-scan detection of black rectangular boxes |
| guesser_core | SurroundingWordWidth | Refine box edges using positions of nearby words |
| guesser_core | ProcessRedactions | Orchestrator: coordinates detection + refinement |
| webgl_mask | artifact_visualizer | Async generation of grayscale mask PNGs |
| text_tool | width_calculator | HarfBuzz text shaping for width measurement |
| text_tool | extract_fonts | Dominant font detection and mapping |
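
To make "row-scan detection" concrete, here is a minimal sketch of one way such a detector could work. It is not the actual BoxDetector code: the function names (`dark_runs`, `detect_boxes`), the thresholds, and the input format (a grayscale image as a list of pixel rows) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def dark_runs(row: List[int], threshold: int = 40, min_width: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) column spans of consecutive dark pixels in one row."""
    runs = []
    start = None
    for x, value in enumerate(row):
        if value <= threshold:
            if start is None:
                start = x  # a new dark run begins here
        elif start is not None:
            if x - start >= min_width:
                runs.append((start, x))
            start = None
    # Close a run that reaches the right edge of the image.
    if start is not None and len(row) - start >= min_width:
        runs.append((start, len(row)))
    return runs


def detect_boxes(image: List[List[int]], threshold: int = 40,
                 min_width: int = 3, min_height: int = 2) -> List[Box]:
    """Merge identical dark runs on consecutive rows into rectangles."""
    open_boxes: Dict[Tuple[int, int], int] = {}  # (x0, x1) -> first row seen
    boxes: List[Box] = []
    # The trailing empty row is a sentinel that flushes boxes touching the bottom edge.
    for y, row in enumerate(image + [[]]):
        runs = set(dark_runs(row, threshold, min_width))
        for span in list(open_boxes):
            if span not in runs:
                y0 = open_boxes.pop(span)
                if y - y0 >= min_height:
                    boxes.append((span[0], y0, span[1], y))
        for span in runs:
            open_boxes.setdefault(span, y)
    return sorted(boxes)


# Hypothetical 10x6 white page with one black box covering columns 2-6, rows 1-3.
page = [[255] * 10 for _ in range(6)]
for y in range(1, 4):
    for x in range(2, 7):
        page[y][x] = 0
found = detect_boxes(page)
```

This sketch only merges rows whose dark spans match exactly, which suits clean rectangular redactions. Scanned pages with noisy edges would likely need a tolerance when comparing spans.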

Processing Order

1. Receive PDF or image bytes from the Django view
2. Extract embedded page images from PDF using PyMuPDF (extract_page_image_bytes)
3. Detect black rectangular boxes in each image (BoxDetector)
4. Refine box edges by measuring gaps to surrounding text words (SurroundingWordWidth)
5. Return structured JSON with redaction coordinates, text spans, and base64 page images
6. On demand: Generate grayscale mask PNGs for individual pages (artifact_visualizer)
7. On demand: Measure pixel widths of candidate names using HarfBuzz (width_calculator)
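
The synchronous part of this order (receiving bytes through returning JSON) can be sketched as a small orchestrator. This is an illustration only, not the ProcessRedactions implementation: the function `process_document`, its callable parameters, and the JSON key names are hypothetical, and text spans are left out for brevity. The three stages are passed in as callables, so the sketch does not assume any real signatures for the PyMuPDF extraction step, BoxDetector, or SurroundingWordWidth.

```python
import base64
import json
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page-image pixels


def process_document(
    data: bytes,
    extract_pages: Callable[[bytes], Sequence[bytes]],  # stand-in for page image extraction
    detect_boxes: Callable[[bytes], List[Box]],         # stand-in for box detection
    refine_box: Callable[[bytes, Box], Box],            # stand-in for edge refinement
) -> str:
    """Run extract -> detect -> refine per page and serialize the result as JSON."""
    pages = []
    for index, image_bytes in enumerate(extract_pages(data)):
        boxes = [refine_box(image_bytes, box) for box in detect_boxes(image_bytes)]
        pages.append({
            "page": index,
            "redactions": [
                {"x0": x0, "y0": y0, "x1": x1, "y1": y1} for x0, y0, x1, y1 in boxes
            ],
            # Raw bytes are not JSON-safe, so page images are base64-encoded.
            "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        })
    return json.dumps({"pages": pages})


# Usage with trivial stand-in stages.
payload = process_document(
    b"fake-pdf-bytes",
    extract_pages=lambda data: [b"page-0"],
    detect_boxes=lambda image: [(2, 1, 7, 4)],
    refine_box=lambda image, box: (box[0] + 1, box[1], box[2] - 1, box[3]),
)
```

Injecting the stages keeps the orchestrator testable without PDF fixtures. The on-demand steps (mask generation and width measurement) sit outside this path because they are triggered by separate requests.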