
Introduce OCR Handler for secret detection in images and videos#4863

Draft
amanfcp wants to merge 6 commits into main from hackathon/ocr-handler

Conversation

amanfcp (Contributor) commented Apr 3, 2026

Problem Statement

Secret Leakage Through Visual Media is a Blind Spot in Secret Scanning

Secret scanning tools today operate exclusively on text-based content — source code, config files, logs, and documents. But credentials and secrets increasingly appear in visual media: screenshots of terminal sessions, screen recordings of deployments, documentation images showing API keys, and video tutorials where dashboards with tokens are briefly visible.

These secrets are completely invisible to current scanning pipelines because image and video files are treated as opaque binaries and skipped entirely. An AWS key pasted in a screenshot committed to a repo is just as dangerous as one in a .env file, but no scanner will catch it.

Our Solution

We extend TruffleHog's scanning pipeline with an OCR-powered handler that extracts text from images (PNG, JPG, JPEG) and video frames (MP4, MKV, WEBM), then feeds it through the existing secret detection engine. Same decoders, same detectors, same verification.

Team:

@mustansir14 @MuneebUllahKhan222 @amanfcp

Key design decisions:

  • Handler-level integration: Works for any source (filesystem, Git, S3, GCS), not coupled to a single source
  • Zero cgo, fully static binary: Runs OCR engines (e.g., Tesseract) and FFmpeg via os/exec, preserving compatibility with TruffleHog’s CGO_ENABLED=0 static-binary model while staying flexible across OCR providers
  • Opt-in via feature flag (--enable-ocr): No performance impact or dependency burden when disabled
  • Video intelligence: Extracts frames at 1fps and OCRs each, catching secrets that appear even briefly
  • Pluggable, config-driven OCR backends (opt-in): The OCR provider is selected via the YAML config; besides Tesseract (the default), remote providers (OpenAI, Google Vision, or a custom endpoint) can be enabled, letting users trade off accuracy, cost, and performance as needed
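As a concrete illustration of the config-driven selection, an `ocr:` block in the YAML config might look like the following. All key names here are illustrative assumptions, not necessarily the PR's exact schema:

```yaml
# Illustrative only: key names are assumptions, not the PR's exact schema.
ocr:
  provider: tesseract        # tesseract | google_vision | openai | custom
  # Remote providers take credentials/endpoints; env vars are expanded at load:
  # api_key: ${OPENAI_API_KEY}
  # endpoint: https://ocr.internal.example.com/v1/recognize
```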

Accuracy Improvements

Out of the box, Tesseract struggles with monospaced IDE/terminal fonts. We've tuned the pipeline in several ways:

  • Image preprocessing: Images are converted to grayscale and upscaled 2x before OCR, improving accuracy on small or low-contrast text (common in screenshots)
  • PSM 6 (uniform text block): Tesseract's page segmentation is set to "single uniform block of text" mode, better suited for screenshots of terminals, config files, and dashboards than the default auto-layout analysis
  • DPI hint (300): Signals tesseract to treat input at print-quality resolution, improving character recognition
  • Monospace-aware spacing: preserve_interword_spaces=1 and textord_space_size_is_variable=0 tell tesseract that spacing is uniform — reduces spurious space insertion that breaks secret patterns

Usage

Scan a directory for secrets in images and videos:

```shell
trufflehog filesystem --enable-ocr /path/to/scan
```

Requirements:

tesseract and ffmpeg must be installed and available in PATH when --enable-ocr is set. Images work with tesseract alone; video requires both.

Challenges / Constraints

  1. Character confusion: Tesseract can misread visually similar characters (0/O/Q, I/l/1, @/Q). This is inherent to OCR on rasterized text. Some secrets will be partially garbled, potentially causing missed detections
  2. Unintended spacing: OCR may insert extra spaces within tokens (e.g., AKIA IOSF instead of AKIAIOSF), which can break regex-based detector patterns
  3. Font sensitivity: Accuracy varies significantly by font. Monospaced IDE fonts (JetBrains Mono, Fira Code) generally OCR better than proportional or decorative fonts
  4. External tool dependency: Requires tesseract and ffmpeg as system-installed binaries. Not embedded in the Go binary

Future Improvements

  1. Prevent frame duplication: Deduplicate identical or near-identical video frames before OCR to avoid redundant processing and duplicate findings
  2. CI test coverage: Add tesseract and ffmpeg to CI environment so OCR tests run in the pipeline instead of being skipped
  3. Archive support: OCR images found inside archives (e.g., screenshots in a zip file)
  4. Additional format support: TIFF, BMP, GIF, WEBP for images; AVI, MOV for videos
  5. Custom tesseract models: Fine-tuned model trained on monospaced/IDE fonts for higher accuracy on code screenshots
  6. OCR text post-processing: Collapse whitespace and normalize common character confusions before feeding to detectors

Making It Production-Ready

  1. Standalone Dockerfile: Bundle tesseract-ocr and ffmpeg in the Docker image so --enable-ocr works out of the box without extra install steps
  2. Graceful degradation: Optionally warn instead of error when tools are missing, allowing image-only OCR when ffmpeg is absent
  3. Performance tuning: Parallel frame OCR for videos, configurable frame rate, memory-bounded processing for large media files

This closes a real gap in the secret scanning landscape: secrets don't stop being secrets just because they're in a screenshot.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; requires golangci-lint)?

Note

Medium Risk
Adds a new file-handling path that executes external binaries (Tesseract/FFmpeg) and can send media to remote OCR APIs, which may affect performance, reliability, and data egress depending on configuration.

Overview
Adds OCR-based secret scanning for image (PNG/JPEG) and video (MP4/MKV/WebM) files by extracting text and feeding it into the existing detection pipeline.

Introduces new CLI/config plumbing to enable OCR (--enable-ocr) or configure a remote provider via --ocr-config or an ocr: block in --config, plus a new pkg/ocr provider layer supporting Tesseract (default), Google Vision, OpenAI, and custom HTTP backends with env var expansion.

Updates file handler routing to dispatch supported media MIME types to a new ocrHandler (including image preprocessing and ffmpeg frame extraction), extends protobuf/YAML config schema to include OCR configuration (with generated validation), adds OCR-focused tests, and documents setup/usage in README.md.

Reviewed by Cursor Bugbot for commit 7b710ff.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


```go
Err: fmt.Errorf("%w: OCR processing error: %v", ErrProcessingWarning, err),
}
h.measureLatencyAndHandleErrors(ctx, start, err, dataOrErrChan)
return
```

OCR error handler sends duplicate errors to channel

Medium Severity

When OCR processing fails, the error is sent to dataOrErrChan twice — once explicitly on line 80–82, and again inside measureLatencyAndHandleErrors on line 83, which also writes the error to the same channel. Every other handler (defaultHandler, arHandler, archiveHandler, apkHandler) relies solely on measureLatencyAndHandleErrors for error reporting. This causes duplicate error events for consumers of the channel. Worse, if the error is context.DeadlineExceeded, the second write wraps it differently and isFatal returns true, potentially causing unexpected early termination.


```go
const (
	maxOCRImageSize      = 50 * 1024 * 1024  // 50 MB
	maxOCRVideoSize      = 500 * 1024 * 1024 // 500 MB
	frameIntervalSeconds = 1                 // Extract 1 frame per second.
```

Interval constant incorrectly used as frame rate

Low Severity

The constant frameIntervalSeconds (named as a time interval) is passed directly to ffmpeg's fps filter, which expects a frame rate (frames per second). This works by coincidence because the value is 1 (1 fps = 1 second interval), but the semantics are inverted. If someone changes the value to 2 (intending a frame every 2 seconds), it would instead extract 2 frames per second — the exact opposite of the intent.

Additional Locations (1)

