Epstein Unredactor — Architecture Overview
A Django web application that analyzes scanned PDF documents to detect black redaction bars, measures their pixel widths, and helps users identify which names could fit under each redaction by matching text widths. The project uses a multi-app "Plugin" architecture to isolate different features.
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Web framework | Django 6.0 | URL routing, template rendering, API views |
| PDF parsing | PyMuPDF (fitz) | Extract embedded images and text spans from PDFs |
| Image analysis | OpenCV + NumPy | Detect black rectangular redaction boxes in page images |
| Text shaping | uHarfBuzz (+ Pillow fallback) | Measure precise pixel widths of candidate names accounting for kerning and ligatures |
| Mask generation | Pillow + NumPy | Create grayscale mask PNGs marking redacted regions |
| Frontend rendering | Vanilla JS, Fabric.js, WebGL | PDF page display, text overlays, GPU-accelerated mask tinting |
| Production server | Gunicorn + Nginx | WSGI app server behind a reverse proxy with SSL |
Directory Structure
EpsteinTool/
├── manage.py # Django entry point
├── requirements.txt # Python dependencies
├── setup.sh # Production server setup (Linux)
├── run_app.sh / run_app.bat # Local dev launchers
│
├── epstein_project/ # Django project config
│ ├── settings.py # INSTALLED_APPS (registers the 3 apps below)
│ ├── urls.py # Root URL conf
│ ├── wsgi.py / asgi.py
│
├── guesser_core/ # Core App (Base Viewer & Redaction Processing)
│ ├── views.py # Root /, /analyze-pdf
│ ├── urls.py
│ ├── logic/
│ │ ├── BoxDetector.py # Row-scan black box detection
│ │ ├── SurroundingWordWidth.py # Refine box edges using nearby text positions
│ │ └── ProcessRedactions.py # Orchestrator: PDF → boxes → refined redactions
│ ├── templates/ # Base index.html (dynamic hooks for plugins)
│ └── static/guesser_core/ # Base UI JS (pdf-viewer.js, app.js, api.js)
│
├── text_tool/ # Plugin App (Font logic & Typography)
│ ├── views.py # /widths, /fonts-list
│ ├── urls.py
│ ├── logic/
│ │ ├── width_calculator.py # HarfBuzz width measurement
│ │ └── extract_fonts.py # Dominant font detection
│ ├── templates/ # Toolbars injected into guesser_core UI
│ └── static/text_tool/ # text-tool.js (Fabric.js canvas wrapper)
│
├── webgl_mask/ # Plugin App (Visual GPU Masks)
│ ├── views.py # /webgl/masks
│ ├── urls.py
│ ├── logic/
│ │ └── artifact_visualizer.py # OpenCV -> grayscale mask PNG generator
│ ├── templates/ # Toolbars injected into guesser_core UI
│ └── static/webgl_mask/ # webgl-mask.js (WebGL renderer)
│
├── embedded_text_viewer/ # Plugin App (Standalone Inline Text Overlay)
│ ├── views.py # /embedded-text-viewer/, /embedded-text-viewer/api/analyze
│ ├── urls.py
│ ├── logic/
│ │ ├── dependency/ # PyMuPDF span text extraction
│ │ └── data/ # Formatting and Text overlay visualization
│ ├── templates/ # Toolbar link and Standalone index preview
│ └── static/
│ └── embedded_text_viewer/ # UI app.js and CSS
│
├── assets/
│ ├── fonts/ # .ttf font files for width calculation
│ ├── names/ # Pre-built candidate name lists
│ └── pdfs/ # Sample PDF documents
│
├── guide/ # Documentation (you are here)
└── tests/ # Test scripts
Data Flow
flowchart TD
A["User uploads PDF"] --> B["POST /analyze-pdf (guesser_core)"]
B --> C{"Is image?"}
C -->|Yes| D["process_image()"]
C -->|No| E["process_pdf()"]
E --> F["Extract embedded page images\n(PyMuPDF)"]
F --> G["BoxDetector\nfind_redaction_boxes_in_image()"]
G --> H["SurroundingWordWidth\nestimate_widths_for_boxes()"]
H --> I["Return JSON:\nredactions + page images"]
D --> G2["BoxDetector\nfind_redaction_boxes_in_image()"]
G2 --> I2["Return JSON:\nredactions + page image"]
I --> J["Frontend (pdf-viewer.js) renders pages"]
I2 --> J
J --> Y["Frontend calls async fetchMasksAsync()"]
Y --> O["POST /webgl/masks (webgl_mask)"]
O --> P["artifact_visualizer\ngenerate_all_masks()"]
P --> Q["webgl-mask.js renders mask tint on canvas"]
J --> K["User adds candidate names"]
K --> L["POST /widths (text_tool)\n(HarfBuzz text shaping)"]
L --> M["Compare widths vs\nredaction box widths"]
M --> N["Highlight matching names"]
Module Dependencies
graph TD
subgraph "Django Project"
urls["epstein_project/urls.py"]
end
subgraph "guesser_core (Core App)"
PR["ProcessRedactions.py"]
BD["BoxDetector.py"]
SW["SurroundingWordWidth.py"]
core_views["views.py"]
HTML["index.html"]
APP["app.js / pdf-viewer.js / api.js"]
end
subgraph "webgl_mask (Plugin)"
WGL_V["views.py"]
AV["artifact_visualizer.py"]
WGL_JS["webgl-mask.js"]
WGL_T["templates"]
end
subgraph "text_tool (Plugin)"
TXT_V["views.py"]
WC["width_calculator.py"]
TXT_JS["text-tool.js"]
TXT_T["templates"]
end
subgraph "embedded_text_viewer (Plugin)"
ETV_V["views.py"]
ETV_L["PyMuPDF logic"]
ETV_JS["app.js"]
ETV_T["templates"]
end
urls --> core_views
urls --> WGL_V
urls --> TXT_V
urls --> ETV_V
core_views --> PR
PR --> BD
PR --> SW
WGL_V --> AV
AV -.->|"reads from core"| BD
TXT_V --> WC
ETV_V --> ETV_L
HTML -.->|"dynamically includes"| WGL_T
HTML -.->|"dynamically includes"| TXT_T
HTML -.->|"dynamically includes"| ETV_T
APP -.->|"depends on"| WGL_JS
APP -.->|"depends on"| TXT_JS