ppu-pdf

Easily extract text from digital and scanned PDF files, with coordinates and font sizes included, and optionally group text into lines or render scanned PDFs to canvas/PNG.

There are two reader classes: PdfReader (uses mupdfjs) and PdfReaderLegacy (uses pdfjs-dist).

Live Demo →

Features

- Text Extraction: Retrieve all text content from a PDF.
- LLM-Friendly Text Extraction: Retrieve all text content while keeping bounding boxes, encoded in Token-Oriented Object Notation (TOON).
- Coordinate Data: Get precise bounding box and dimension information for each text element.
- Line Grouping: Merge individual text tokens into coherent lines.
- Scanned PDF Detection: Determine whether a PDF or an individual page appears to be scanned or digitally generated.
- Scanned PDF Canvas Rendering: Convert each page of a scanned PDF into a canvas that is ready for processing.
- Scanned PDF to PNG Images: Convert and write all PDF pages to PNG images.
- Scanned PDF Text Extraction: Retrieve all text content from a scanned PDF using ppu-paddle-ocr.

Differences

| Indicator | PdfReader | PdfReaderLegacy |
| --- | --- | --- |
| Library | mupdfjs | pdfjs-dist |
| Pages index start | 0 | 1 |
| open() | ✅ | ✅ |
| getTexts() | ✅ | ✅ |
| getTextsScanned() | ✅ | ✅ |
| isScanned() | ✅ | ✅ |
| isPageScanned() | ✅ | ✅ |
| getLinesFromTexts() | ✅ | ✅ |
| getCompactLinesFromTexts() | ✅ | ✅ |
| destroy() | ✅ | ✅ |
| destroyPage() | ✅ | ❌ |
| renderAll() | ✅ | ✅ |
| saveCanvasToPng() | ✅ | ✅ |
| dumpCanvasMap() | ✅ | ✅ |
| Resize viewport/Custom DPI | ✅ | ✅ |
| getLinesFromTextsInToon() | ✅ | ✅ |
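
Because the two readers number pages differently (0-based for PdfReader, 1-based for PdfReaderLegacy), code that switches between them may want to normalize the page keys. The following is a minimal sketch, assuming the result is a plain `Map` keyed by page number; `toZeroBased` is a hypothetical helper and not part of ppu-pdf.

```typescript
// Hypothetical helper (not part of ppu-pdf): re-key a page map whose first
// page is `firstPageIndex` so that the first page becomes 0.
function toZeroBased<T>(pages: Map<number, T>, firstPageIndex: number): Map<number, T> {
  const normalized = new Map<number, T>();
  pages.forEach((value, pageNumber) => {
    normalized.set(pageNumber - firstPageIndex, value);
  });
  return normalized;
}

// Stand-in for a PdfReaderLegacy result, whose pages start at 1.
const legacyPages = new Map<number, string>([
  [1, "first page"],
  [2, "second page"],
]);

const zeroBased = toZeroBased(legacyPages, 1);
console.log(zeroBased.get(0)); // "first page"
```

Passing `0` as `firstPageIndex` leaves a PdfReader-style map unchanged, so the same call can sit behind either reader.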

Benchmark

Benchmarked on both a digital PDF and a scanned PDF, with a total of 28 pages.

clk: ~4.02 GHz
cpu: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz
runtime: bun 1.2.5 (x64-linux)

benchmark                                 avg (min … max) p75 / p99    (min … top 1%)
--------------------------------------------------------- -------------------------------
pdfReader.getTexts()                       461.74 ms/iter 459.53 ms    █
                                  (436.19 ms … 524.12 ms) 513.69 ms   ███
                                  (  2.29 mb … 107.16 mb)  27.86 mb █▁███▁█▁█▁▁▁▁▁▁▁▁▁▁▁█

pdfReaderLegacy.getTexts()                 217.58 ms/iter 233.21 ms   █
                                  (196.63 ms … 238.71 ms) 236.38 ms   █               ▅
                                  (  1.58 mb … 118.27 mb)  40.84 mb ▇▁█▁▁▁▇▁▁▁▁▁▁▁▁▁▇▁█▇▇

summary
  pdfReaderLegacy.getTexts()
   2.12x faster than pdfReader.getTexts()

--------------------------------------------------------- -------------------------------
pdfReader.getLinesFromTexts()                3.82 ms/iter   3.91 ms   ▆    █
                                      (3.52 ms … 4.95 ms)   4.53 ms  ▇█ ▂ ▄█▆
                                  (  0.00  b …   1.29 mb) 590.91 kb ▅██▇█▂███▇▅▃▂▂▂▁▂▂▂▂▂

pdfReaderLegacy.getLinesFromTexts()          4.77 ms/iter   4.87 ms  █▄  ▄▅▅
                                      (4.41 ms … 5.66 ms)   5.52 ms  ██▂ ███▅
                                  (  0.00  b …   1.29 mb) 647.72 kb ████▇████▆▇▆█▂▂▄▄▅▂▁▄

summary
  pdfReader.getLinesFromTexts()
   1.25x faster than pdfReaderLegacy.getLinesFromTexts()

--------------------------------------------------------- -------------------------------
pdfReader.getCompactLinesFromTexts()         3.87 ms/iter   3.98 ms  ▂█     ▄▃▂
                                      (3.57 ms … 4.60 ms)   4.40 ms  ██▃▂   ███▄▂
                                  (  0.00  b …   2.06 mb) 920.63 kb ▇█████▃▇█████▄▄▄▃▄▃▁▂

pdfReaderLegacy.getCompactLinesFromTexts()   4.62 ms/iter   4.71 ms  ▃▂   █▂
                                      (4.27 ms … 5.59 ms)   5.52 ms  ██  ▆██
                                  (  0.00  b …   1.55 mb) 876.22 kb ▆██▇▆███▇█▅▂▅▄▂▂▁▂▁▁▂

summary
  pdfReader.getCompactLinesFromTexts()
   1.19x faster than pdfReaderLegacy.getCompactLinesFromTexts()

--------------------------------------------------------- -------------------------------
pdfReader.open()                            14.30 ms/iter  19.97 ms █▄
                                     (8.56 ms … 31.51 ms)  27.49 ms ██
                                  (  0.00  b …  58.52 mb)  19.89 mb ██▅▁▃▃▁▃▇▁▃▁▅▁▇█▁▅▁▁▃

pdfReaderLegacy.open()                       6.19 ms/iter   6.51 ms  █
                                     (5.11 ms … 13.27 ms)  10.40 ms  ██▂ ▂
                                  (  0.00  b …  29.13 mb) 682.14 kb ██████▅▄▃▁▃▁▄▃▁▂▁▁▁▁▂

summary
  pdfReaderLegacy.open()
   2.31x faster than pdfReader.open()

--------------------------------------------------------- -------------------------------
pdfReader.renderAll()                         1.10 s/iter    1.12 s                     █
                                        (1.05 s … 1.13 s)    1.12 s █                 █ █
                                  ( 66.35 mb … 248.17 mb) 191.68 mb █▁▁▁▁▁▁▁▁██▁▁▁█▁▁▁███

pdfReaderLegacy.renderAll()                   1.68 s/iter    1.70 s             █
                                        (1.56 s … 1.84 s)    1.77 s ▅ ▅▅    ▅▅ ▅█▅ ▅    ▅
                                  (231.91 mb … 384.77 mb) 352.18 mb █▁██▁▁▁▁██▁███▁█▁▁▁▁█

summary
  pdfReader.renderAll()
   1.52x faster than pdfReaderLegacy.renderAll()
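
The "x faster" figures in each summary are the ratio of the two averages. As a small worked check, the sketch below recomputes two of them from the rounded averages printed above. The benchmark tool presumably divides unrounded timings, so the last digit can differ: for renderAll(), 1.68 / 1.10 rounds to 1.53, while the tool reports 1.52x.

```typescript
// Average time per iteration in milliseconds, copied from the output above.
const avgMs = {
  getTexts: { pdfReader: 461.74, pdfReaderLegacy: 217.58 },
  open: { pdfReader: 14.3, pdfReaderLegacy: 6.19 },
};

// Speedup of the faster implementation = slower average / faster average.
function speedup(slowerMs: number, fasterMs: number): string {
  return (slowerMs / fasterMs).toFixed(2) + "x";
}

const getTextsSpeedup = speedup(avgMs.getTexts.pdfReader, avgMs.getTexts.pdfReaderLegacy);
const openSpeedup = speedup(avgMs.open.pdfReader, avgMs.open.pdfReaderLegacy);

console.log(getTextsSpeedup); // "2.12x"
console.log(openSpeedup); // "2.31x"
```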

Installation

Using Bun is recommended

Install the package via npm:

npm install ppu-pdf ppu-paddle-ocr

Or using Yarn:

yarn add ppu-pdf ppu-paddle-ocr

Bun:

bun add ppu-pdf ppu-paddle-ocr

You can omit ppu-paddle-ocr if you do not plan to extract text from scanned PDFs.

Usage

Below is an example of how to use the library with Bun.

Digital PDF Example:

import { PdfReader } from "ppu-pdf";

const pdfReader = new PdfReader({ verbose: false });
const file = Bun.file("./assets/opposite-expectation.pdf");

const buffer = await file.arrayBuffer();
const pdf = pdfReader.open(buffer);

// getTexts() returns a Map keyed by page number (0-based for PdfReader)
const texts = await pdfReader.getTexts(pdf);
const page0texts = texts.get(0);
console.log("texts: ", page0texts);

pdfReader.destroy(pdf);

const isScanned = pdfReader.isScanned(texts);
console.log("is pdf scanned: ", isScanned);
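
Since the result is a Map rather than an array, collecting the text of a whole document takes one extra step. The sketch below is only an illustration: it assumes each page entry exposes a `fullText` string, as in the getTexts() sample return in the Method Documentation section, and `joinPages` and `PageTextLike` are hypothetical names that are not exported by ppu-pdf.

```typescript
// Minimal stand-in for the per-page shape shown in the getTexts() sample.
interface PageTextLike {
  fullText: string;
}

// Concatenate the text of every page in ascending page order.
function joinPages(pages: Map<number, PageTextLike>, separator: string): string {
  const pageNumbers: number[] = [];
  pages.forEach((_value, pageNumber) => {
    pageNumbers.push(pageNumber);
  });
  pageNumbers.sort((a, b) => a - b);
  return pageNumbers
    .map((pageNumber) => (pages.get(pageNumber) as PageTextLike).fullText)
    .join(separator);
}

// Pages inserted out of order on purpose to show that sorting matters.
const samplePages = new Map<number, PageTextLike>([
  [1, { fullText: "Second page" }],
  [0, { fullText: "First page" }],
]);

const joined = joinPages(samplePages, "\n");
console.log(joined);
```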

Scanned PDF Example:

import { join } from "path";
import { PdfReader } from "ppu-pdf";
import { PaddleOcrService } from "ppu-paddle-ocr";

const fonts = [
  {
    path: join(__dirname, "..", "fonts", "Arial.ttf"),
    name: "Arial",
  },
];

const pdfReader = new PdfReader({ verbose: false, fonts: fonts });
const ocr = new PaddleOcrService();

const fileScan = Bun.file("./assets/opposite-expectation-scan.pdf");
const bufferScan = await fileScan.arrayBuffer();

const pdfScan = pdfReader.open(bufferScan);
const canvasMap = await pdfReader.renderAll(pdfScan);
pdfReader.destroy(pdfScan); // Destroy early since we already hold the canvases

pdfReader.dumpCanvasMap(canvasMap, "my-dumped-pdf");
const texts = await pdfReader.getTextsScanned(ocr, canvasMap);
console.log("texts: ", texts.get(0));

Rebuilding a Scanned PDF into a Searchable PDF

You can rebuild a scanned (flattened) PDF, turning it into a fully searchable document by overlaying the invisible recognized text on top of the original images. For a complete script, refer to examples/rebuild-pdf.example.ts.

import { PdfReader } from "ppu-pdf";
import { PaddleOcrService } from "ppu-paddle-ocr";

const pdfReader = new PdfReader({ verbose: false });
const ocr = new PaddleOcrService();
await ocr.initialize();

const fileScan = Bun.file("./assets/test_japanese.pdf");
const bufferScan = await fileScan.arrayBuffer();

// 1. Open and Render the PDF into Canvases
const pdfScan = pdfReader.open(bufferScan);
const canvasMap = await pdfReader.renderAll(pdfScan);
pdfReader.destroy(pdfScan);

// 2. Extract OCR Texts
const texts = await pdfReader.getTextsScanned(ocr, canvasMap);

// 3. Rebuild Searchable PDF
// Note: We open a fresh instance of the PDF for rebuilding
const pdfForRebuild = pdfReader.open(bufferScan);
const rebuiltPdfBuffer = await pdfReader.rebuild(pdfForRebuild, texts);
pdfReader.destroy(pdfForRebuild);

// 4. Save onto disk
await Bun.write("./test_japanese_searchable.pdf", rebuiltPdfBuffer);

`PdfReaderOptions`

Configuration options for PdfReader, allowing customization of PDF text extraction behavior.

| Option | Type | Default Value | Description |
| --- | --- | --- | --- |
| `verbose` | `boolean` | `false` | Enables logging for debugging purposes. |
| `excludeFooter` | `boolean` | `true` | Excludes detected footer text from the extracted content. |
| `excludeHeader` | `boolean` | `true` | Excludes detected header text from the extracted content. |
| `raw` | `boolean` | `false` | If `true`, returns raw text without additional processing. |
| `headerFromHeightPercentage` | `number` | `0.02` | Defines the height percentage from the top used to identify header text. |
| `footerFromHeightPercentage` | `number` | `0.95` | Defines the height percentage from the bottom used to identify footer text. |
| `mergeCloseTextNeighbor` | `boolean` | `true` | Merges text elements that are close to each other into a single entity. |
| `simpleSortAlgorithm` | `boolean` | `false` | Uses a simplified sorting algorithm for text positioning. |
| `scale` | `number` | `1` | The PDF document scale. |
| `enableToon` | `boolean` | `false` | Enables PDF word extraction in TOON format. |

Usage Example:

const reader = new PdfReader({ verbose: true, excludeFooter: false });

These options allow fine-tuned control over how text is extracted and processed from PDFs.
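
To make the two height-percentage options more concrete, here is one plausible reading of them, written as a standalone sketch. It is an assumption, not the library's verified logic: y is taken to grow downward from the top of the page, the header band is everything above `headerFrom * pageHeight`, and the footer band is everything below `footerFrom * pageHeight`. The names `dropHeaderAndFooter` and `WordLike` are hypothetical.

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface WordLike {
  text: string;
  bbox: BBox;
}

// Assumed interpretation: keep a word only if it extends below the header
// band and starts above the footer band.
function dropHeaderAndFooter(
  words: WordLike[],
  pageHeight: number,
  headerFrom: number,
  footerFrom: number,
): WordLike[] {
  const headerLimit = pageHeight * headerFrom;
  const footerLimit = pageHeight * footerFrom;
  return words.filter((word) => word.bbox.y1 > headerLimit && word.bbox.y0 < footerLimit);
}

// An A4 page is 842 points tall, so the limits are 16.84 and 799.9.
const pageWords: WordLike[] = [
  { text: "header", bbox: { x0: 72, y0: 5, x1: 120, y1: 15 } },
  { text: "body", bbox: { x0: 72, y0: 84, x1: 464, y1: 99 } },
  { text: "footer", bbox: { x0: 72, y0: 810, x1: 120, y1: 820 } },
];

const keptTexts = dropHeaderAndFooter(pageWords, 842, 0.02, 0.95).map((word) => word.text);
console.log(keptTexts); // [ "body" ]
```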

Compiling

You can compile your project into a single binary, but remember to also copy node_modules/mupdf/dist/mupdf-wasm.wasm and place it alongside your binary. See package.json for an example.

Method Documentation

`PdfReader` Class

Constructor: `constructor(options?: Partial<PdfReaderOptions>)`

Creates an instance of PdfReader.

Parameters:
- options (optional): Partial options to override the defaults. Refer to the PdfReaderOptions interface for available options.

Method: `open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>`

Opens a PDF document.

Parameters:
- filename: The path to the PDF file or an ArrayBuffer containing the PDF data.
Returns: A promise that resolves with the PDFDocumentProxy.

Method: `getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>`

Extracts the text content from the PDF document.

Parameters:
- pdf: The PDFDocumentProxy instance.
Returns: A promise that resolves with a Map of page numbers to their corresponding PdfTexts.

Sample return:

// Map (1) starting index from 0
{
  "0": {
    "fullText": "Opposite Expectation: How to See the World as Two-Sided Lorem ipsum",
    "words": [
      {
        "text": "Opposite Expectation: How to See the World as Two-Sided",
        "bbox": {
          "x0": 72,
          "y0": 84,
          "x1": 464,
          "y1": 99
        },
        "dimension": {
          "width": 392,
          "height": 15
        },
        "metadata": {
          "writing": "horizontal",
          "direction": "",
          "font": {
            "name": "AAAAAA+Arial-BoldItalicMT",
            "family": "sans-serif",
            "weight": "bold",
            "style": "italic",
            "size": 14
          },
          "hasEOL": undefined,
          "pageNum": 0
        },
        "id": 0
      }
    ]
  }
}
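
In the sample above, `dimension` agrees with `bbox`: the width is `x1 - x0` and the height is `y1 - y0`. The sketch below spells out that relationship with locally defined types. Whether the library computes `dimension` exactly this way is an assumption, and `dimensionOf` is a hypothetical helper.

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface Dimension {
  width: number;
  height: number;
}

// Derive the width and height of a box from its corner coordinates.
function dimensionOf(bbox: BBox): Dimension {
  return { width: bbox.x1 - bbox.x0, height: bbox.y1 - bbox.y0 };
}

// Bounding box of the title word from the sample return.
const titleBox: BBox = { x0: 72, y0: 84, x1: 464, y1: 99 };
const titleDimension = dimensionOf(titleBox);
console.log(titleDimension); // { width: 392, height: 15 }
```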

Method: `getLinesFromTexts(pageTexts: PageTexts): PageLines`

Retrieves line information from the page texts.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
Returns: A Map of page numbers to an array of PdfLine objects.

Sample return:

// Map (1) starting index from 0
{
  "0": [
    {
      "bbox": {
        "x0": 72,
        "y0": 84,
        "x1": 464,
        "y1": 99
      },
      "averageFontSize": 14,
      "dimension": {
        "width": 392,
        "height": 15
      },
      "words": [
        {
          "text": "Opposite Expectation: How to See the World as Two-Sided",
          "bbox": {
            "x0": 72,
            "y0": 84,
            "x1": 464,
            "y1": 99
          },
          "dimension": {
            "width": 392,
            "height": 15
          },
          "metadata": {
            "writing": "horizontal",
            "direction": "",
            "font": {
              "name": "AAAAAA+Arial-BoldItalicMT",
              "family": "sans-serif",
              "weight": "bold",
              "style": "italic",
              "size": 14
            },
            "hasEOL": false,
            "pageNum": 0
          },
          "id": 0
        }
      ],
      "text": "Opposite Expectation: How to See the World as Two-Sided"
    }
  ]
}
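
A line in this sample carries a bounding box, an `averageFontSize`, and the joined `text` of its words. The following sketch shows one way such a line could be assembled from its words: the bounding box as the union of the word boxes and the font size as a plain mean. This is an illustrative assumption rather than the library's implementation. `buildLine` is a hypothetical name, and `fontSize` is flattened here for brevity (the sample stores it under `metadata.font.size`).

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface WordLike {
  text: string;
  bbox: BBox;
  fontSize: number; // flattened stand-in for metadata.font.size
}

interface LineLike {
  text: string;
  bbox: BBox;
  averageFontSize: number;
}

// Merge words (already in reading order) into a single line record.
function buildLine(words: WordLike[]): LineLike {
  if (words.length === 0) {
    throw new Error("a line needs at least one word");
  }
  let x0 = Infinity;
  let y0 = Infinity;
  let x1 = -Infinity;
  let y1 = -Infinity;
  let fontSizeSum = 0;
  for (const word of words) {
    x0 = Math.min(x0, word.bbox.x0);
    y0 = Math.min(y0, word.bbox.y0);
    x1 = Math.max(x1, word.bbox.x1);
    y1 = Math.max(y1, word.bbox.y1);
    fontSizeSum += word.fontSize;
  }
  return {
    text: words.map((word) => word.text).join(" "),
    bbox: { x0: x0, y0: y0, x1: x1, y1: y1 },
    averageFontSize: fontSizeSum / words.length,
  };
}

const builtLine = buildLine([
  { text: "Opposite", bbox: { x0: 72, y0: 84, x1: 140, y1: 99 }, fontSize: 14 },
  { text: "Expectation", bbox: { x0: 145, y0: 85, x1: 230, y1: 98 }, fontSize: 12 },
]);
console.log(builtLine.text, builtLine.bbox, builtLine.averageFontSize);
```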

Method: `getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines`

Retrieves a compact representation of line information from the page texts using the specified algorithm.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
- algorithm: An optional PdfCompactLineAlgorithm specifying the method for grouping lines. Defaults to middleY.
Returns: A Map of page numbers to an array of CompactPdfLine objects, where the line extraction method depends on the chosen algorithm.

Sample return:

// Map (1) starting index from 0
{
  "0": [
    {
      "bbox": {
        "x0": 72,
        "y0": 83.13183584999996,
        "x1": 461.4900053795799,
        "y1": 97.13183534999996
      },
      "words": [
        {
          "text": "Opposite Expectation: How to See the World as Two-Sided",
          "bbox": {
            "x0": 72,
            "y0": 83.13183584999996,
            "x1": 461.4900053795799,
            "y1": 97.13183534999996
          }
        }
      ],
      "text": "Opposite Expectation: How to See the World as Two-Sided"
    }
  ]
}
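
The default algorithm is named middleY, which suggests that words are grouped by the vertical midpoint of their bounding boxes. The sketch below illustrates that general idea only. The tolerance parameter, the grouping rule, and the names `groupByMiddleY` and `WordLike` are assumptions made for this example, and the library's actual algorithm may differ.

```typescript
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface WordLike {
  text: string;
  bbox: BBox;
}

// Vertical midpoint of a bounding box.
function middleY(bbox: BBox): number {
  return (bbox.y0 + bbox.y1) / 2;
}

// Group words whose vertical midpoints lie within `tolerance` of the first
// word of the current line, then order each line from left to right.
function groupByMiddleY(words: WordLike[], tolerance: number): WordLike[][] {
  const sorted = words.slice().sort((a, b) => middleY(a.bbox) - middleY(b.bbox));
  const lines: WordLike[][] = [];
  let current: WordLike[] = [];
  let referenceY = 0;
  for (const word of sorted) {
    const y = middleY(word.bbox);
    if (current.length > 0 && Math.abs(y - referenceY) <= tolerance) {
      current.push(word);
    } else {
      current = [word];
      referenceY = y;
      lines.push(current);
    }
  }
  lines.forEach((line) => {
    line.sort((a, b) => a.bbox.x0 - b.bbox.x0);
  });
  return lines;
}

const grouped = groupByMiddleY(
  [
    { text: "World", bbox: { x0: 150, y0: 84, x1: 200, y1: 99 } },
    { text: "Hello", bbox: { x0: 72, y0: 85, x1: 140, y1: 98 } },
    { text: "Next", bbox: { x0: 72, y0: 110, x1: 120, y1: 125 } },
  ],
  3,
);
const lineTexts = grouped.map((line) => line.map((word) => word.text).join(" "));
console.log(lineTexts); // [ "Hello World", "Next" ]
```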

Method: `isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean`

Determines whether the PDF appears to be a scanned document.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
- options (optional): Thresholds for scanned detection. Defaults to CONSTANT.WORDS_PER_PAGE_THRESHOLD and CONSTANT.TEXT_LENGTH_THRESHOLD.
Returns: true if the PDF is considered scanned; otherwise, false.
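
The two thresholds suggest a heuristic based on how little text a page yields. The following is a rough sketch of such a heuristic, not the library's implementation: the rule (both averages must fall below their thresholds), the threshold field names, and the example values are all assumptions, and `looksScanned` is a hypothetical function.

```typescript
interface PageTextLike {
  fullText: string;
  words: { text: string }[];
}

// Hypothetical threshold shape; the real PdfScannedThreshold may differ.
interface ThresholdLike {
  wordsPerPage: number;
  textLength: number;
}

// Treat a document as scanned when its pages contain, on average, fewer
// words and fewer characters than the given thresholds.
function looksScanned(pages: Map<number, PageTextLike>, threshold: ThresholdLike): boolean {
  if (pages.size === 0) {
    return false; // nothing to judge
  }
  let totalWords = 0;
  let totalLength = 0;
  pages.forEach((page) => {
    totalWords += page.words.length;
    totalLength += page.fullText.length;
  });
  const averageWords = totalWords / pages.size;
  const averageLength = totalLength / pages.size;
  return averageWords < threshold.wordsPerPage && averageLength < threshold.textLength;
}

// Example values only; they are not the library's constants.
const exampleThreshold: ThresholdLike = { wordsPerPage: 5, textLength: 50 };

const digitalText = "Opposite Expectation: How to See the World as Two-Sided Lorem ipsum";
const digitalPages = new Map<number, PageTextLike>([
  [0, { fullText: digitalText, words: digitalText.split(" ").map((text) => ({ text: text })) }],
]);
const imageOnlyPages = new Map<number, PageTextLike>([[0, { fullText: "", words: [] }]]);

const digitalResult = looksScanned(digitalPages, exampleThreshold);
const imageOnlyResult = looksScanned(imageOnlyPages, exampleThreshold);
console.log(digitalResult, imageOnlyResult); // false true
```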

For the other methods, I encourage you to try them out yourself.

Web / Browser Support

Starting from v6.0.0, ppu-pdf supports running directly in the browser! Import from ppu-pdf/web to use PdfReaderLegacyWeb — a browser-native PDF reader powered by pdfjs-dist.

Web Features

| Feature | Node (`PdfReaderLegacy`) | Web (`PdfReaderLegacyWeb`) |
| --- | --- | --- |
| `open()` | ✅ | ✅ |
| `getTexts()` | ✅ | ✅ |
| `getTextsScanned()` | ✅ | ✅ (ppu-paddle-ocr/web) |
| `getLinesFromTexts()` | ✅ | ✅ |
| `getCompactLinesFromTexts()` | ✅ | ✅ |
| `getLinesFromTextsInToon()` | ✅ | ✅ |
| `isScanned()` | ✅ | ✅ |
| `isPageScanned()` | ✅ | ✅ |
| `renderAll()` | ✅ (Node Canvas) | ✅ (HTMLCanvasElement) |
| `destroy()` | ✅ | ✅ |
| `dumpCanvasMap()` | ✅ | ❌ (no filesystem) |
| `saveCanvasToPng()` | ✅ | ❌ (no filesystem) |

Note: PdfReader (mupdf-based) is not available in the browser — it requires native bindings. Only PdfReaderLegacyWeb (pdfjs-dist) is supported on the web.

Web Installation

npm install ppu-pdf
# Optional for scanned PDF OCR in the browser:
npm install ppu-paddle-ocr

Web Usage (Bundler)

import { PdfReaderLegacyWeb } from "ppu-pdf/web";

const reader = new PdfReaderLegacyWeb({ verbose: false });

// From a file input or fetch
const response = await fetch("document.pdf");
const buffer = await response.arrayBuffer();

const pdf = await reader.open(buffer);
const texts = await reader.getTexts(pdf);
console.log(texts.get(1)?.fullText);

const lines = reader.getLinesFromTexts(texts);
const compactLines = reader.getCompactLinesFromTexts(texts);
const isScanned = reader.isScanned(texts);

await reader.destroy(pdf);

Web Usage (CDN / No Bundler)

<script type="importmap">
{
  "imports": {
    "pdfjs-dist": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
    "pdfjs-dist/types/src/display/api": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
    "pdfjs-dist/types/web/interfaces": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
    "@toon-format/toon": "https://cdn.jsdelivr.net/npm/@toon-format/toon@2.1.0/+esm",
    "onnxruntime-web": "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.24.2/dist/ort.all.bundle.min.mjs",
    "ppu-paddle-ocr/web": "https://cdn.jsdelivr.net/npm/ppu-paddle-ocr@4/web/index.js",
    "ppu-ocv/web": "https://cdn.jsdelivr.net/npm/ppu-ocv@2/index.web.js"
  }
}
</script>
<script type="module">
  import { PdfReaderLegacyWeb } from "https://cdn.jsdelivr.net/npm/ppu-pdf@6/web/index.js";

  const reader = new PdfReaderLegacyWeb();
  // ... use as shown above
</script>

Scanned PDF OCR (Web)

For scanned PDF text extraction in the browser, initialize the OCR engine explicitly with top-level await inside a <script type="module"> instead of waiting for DOMContentLoaded.

<script type="module">
  import { PdfReaderLegacyWeb } from "ppu-pdf/web";
  import { PaddleOcrService } from "ppu-paddle-ocr/web";

  // Eagerly pre-warm the WebAssembly engine
  const ocrService = new PaddleOcrService();
  await ocrService.initialize();

  // Create an adapter matching the exact interface `getTextsScanned` requires
  const ocrAdapter = {
    initialize: async () => {}, // Pre-initialized above to avoid stutter
    recognize: async (canvas) => {
      return await ocrService.recognize(canvas);
    }
  };

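  // Load the PDF bytes (same fetch pattern as the bundler example)
  const response = await fetch("document.pdf");
  const buffer = await response.arrayBuffer();

  const reader = new PdfReaderLegacyWeb();
  const pdf = await reader.open(buffer);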
  
  // Render PDF explicitly into mapped canvas elements first
  const canvasMap = await reader.renderAll(pdf, 72);
  
  // Inject the mapped canvas and the OCR engine to extract words and bounding boxes
  const texts = await reader.getTextsScanned(ocrAdapter, canvasMap);
  console.log(texts.get(1)?.fullText);

  await reader.destroy(pdf);
</script>

Contributing

Contributions are welcome! If you would like to contribute, please follow these steps:

1. Fork the Repository: Create your own fork of the project.
2. Create a Feature Branch: Use a descriptive branch name for your changes.
3. Implement Changes: Make your modifications, add tests, and ensure everything passes.
4. Submit a Pull Request: Open a pull request to discuss your changes and get feedback.

Running Tests

This project uses Bun for testing. To run the tests locally, execute:

bun test

Ensure that all tests pass before submitting your pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

If you encounter any issues or have suggestions, please open an issue in the repository.

Happy coding!

Scripts

A Linux-based development environment is recommended. Library template: https://github.com/aquapi/lib-template

All script sources and usage.

bun task bench index # Run bench/index.bench.ts

To run the benchmark in node, add a --node parameter

bun task bench --node

bun task bench --node index # Run bench/index.bench.ts with node
