OCR PDF

The PDF OCR tool fills the gap our Extract Text tool honestly admits to: scanned PDFs have no selectable text, so a normal text extractor returns nothing. This one renders each page to a canvas with pdfjs-dist, then runs Tesseract.js — a WASM port of the open-source Tesseract engine — to recognize characters from pixels. You get back a .txt file with the recognized text. It's not Acrobat's hybrid OCR — Acrobat layers ML on top of Tesseract and does better on messy scans — but for clean machine-printed scans at 200-300 DPI, Tesseract does well. Handwriting almost always fails; we say so.

Built by Bob · Article by Lace · QA by Ben · Shipped

🔒 The PDF stays in your browser — it never uploads. The OCR model is fetched once from Tesseract's CDN, then cached. Close the tab and your file is gone.



What the OCR PDF tool actually does

A scanned PDF looks like text but isn't. It's a stack of pictures of text — bitmaps your camera or scanner captured, stuffed inside a PDF wrapper. When you hit Cmd-F to search, nothing happens. When you select a paragraph, the cursor sweeps over the pixels and grabs nothing. That's because there's no text in the file, only the appearance of text. The OCR PDF tool fixes that. It renders each page to a canvas with pdfjs-dist, then runs Tesseract.js — a WebAssembly port of the open-source Tesseract engine — to read the characters back out of the pixels. You get a real .txt file with the recognized words. Everything happens in your browser. Your scan never leaves the machine.
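The render-then-recognize loop described above can be sketched in a few lines. This is an illustrative outline, not the tool's actual source: it assumes pdfjs-dist and Tesseract.js are already loaded as page globals (`pdfjsLib` and `Tesseract`), and the function names are ours.

```javascript
// PDF user space is 72 units per inch, so the render scale for a
// target DPI is simply dpi / 72.
function dpiToScale(dpi) {
  return dpi / 72;
}

// Hypothetical pipeline: render each page to a canvas with pdfjs-dist,
// then recognize the pixels with a Tesseract.js worker.
async function ocrPdf(arrayBuffer, { lang = "eng", dpi = 200 } = {}) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const worker = await Tesseract.createWorker(lang); // downloads/caches the model
  const pages = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: dpiToScale(dpi) });
    const canvas = document.createElement("canvas");
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    await page.render({ canvasContext: canvas.getContext("2d"), viewport }).promise;
    const { data } = await worker.recognize(canvas);
    pages.push(data.text);
  }
  await worker.terminate();
  return pages.join("\n\n");
}
```

Everything in the loop runs on your machine; the only network access is Tesseract.js fetching its language model the first time.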

That last part is the whole point. OCR has been online for two decades; pick any one of OnlineOCR.net, SmallPDF, iLovePDF, Adobe Acrobat's web version — they'll all do it. Every one of them uploads your document first. That's fine for a flyer. It's not fine for a signed loan agreement, a medical record, a passport scan, an immigration form, a tax return, or any of the other things people actually OCR. Tesseract runs the same recognition locally; we just made the wrapper that lets a browser run it.

How to use the OCR PDF tool

The tool is one screen. Drop a scan, pick a language, click run.

  1. Drop or pick a scanned PDF. Up to 50 MB and 100 pages.
  2. Pick the document language. English, Spanish, German, French, Portuguese, Italian, and Russian are supported today. The language model (~5-10 MB) downloads once per language and the browser caches it.
  3. Pick the render DPI. 200 DPI is recommended. 300 DPI gives the best accuracy but takes longer; 150 DPI is faster but only works well on clean scans.
  4. Click Run OCR. The model loads (5-15 seconds the first time), then each page is rendered and recognized in order. Progress shows per page; you can cancel mid-run.
  5. Copy the recognized text or download it as a .txt named after your source (e.g., contract-scan.pdf → contract-scan.txt). The output is UTF-8.
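Step 5's naming rule is simple enough to sketch. The helper names here are illustrative, not the tool's source:

```javascript
// Derive the .txt filename from the source PDF's name
// (contract-scan.pdf -> contract-scan.txt).
function txtNameFor(pdfName) {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}

// One common way to hand a UTF-8 .txt to the browser's downloader
// (browser-only: uses Blob, URL, and document).
function downloadText(text, name) {
  const blob = new Blob([text], { type: "text/plain;charset=utf-8" });
  const a = document.createElement("a");
  a.href = URL.createObjectURL(blob);
  a.download = name;
  a.click();
  URL.revokeObjectURL(a.href);
}
```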

The only network traffic during a run is the first language-model download from Tesseract's CDN. That request is a public static file fetch — the same one everyone gets — and contains no PDF data going in either direction. After the model lands in the browser's cache, the network tab is silent for the rest of the run, and silent forever for subsequent runs in the same language.

DPI and language — the two knobs that matter

Tesseract is a pattern-matcher trained on character shapes. Two things wreck pattern matching: not enough pixels, and the wrong alphabet. DPI controls the first. Language controls the second.

Render DPI | Pages per minute | Accuracy on clean scans | Best for
150 DPI | ~30-50 | 92-96% | Already-sharp scans, single-column body text, quick drafts
200 DPI (default) | ~20-30 | 96-99% | Most documents — the sweet spot for speed and accuracy
300 DPI | ~10-15 | 97-99%+ | Small fonts, fine print, footnotes, anything you'll trust without proofreading

Going below 150 DPI is a bad trade. Tesseract starts confusing similar glyphs once the character height drops below roughly 30 pixels — "rn" reads as "m," "cl" reads as "d," lowercase L looks like the number 1. Going above 300 DPI rarely helps and roughly doubles the runtime — by then you're rendering the page in more detail than the recognizer can use.
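The arithmetic behind that ~30-pixel rule of thumb is straightforward: font sizes are in points (1/72 inch), so a glyph's rendered height is pointSize × DPI / 72. A quick sketch:

```javascript
// Nominal rendered glyph height in pixels for a font size in points.
// 1 pt = 1/72 inch, so height = points * dpi / 72.
function glyphHeightPx(points, dpi) {
  return (points * dpi) / 72;
}

// 12 pt body text renders at 25 px at 150 DPI (below the ~30 px
// comfort zone), ~33 px at 200 DPI, and 50 px at 300 DPI.
```

That's why 150 DPI only holds up on clean scans: ordinary body text sits right at the edge of what the recognizer can lock onto.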

Language matters more than people expect. Tesseract loaded with the English model will try to recognize a Spanish page, but it doesn't know that ñ, í, or á are letters — it'll guess at them and miss. Pick the dominant language of the document. For a Spanish contract with a few English brand names sprinkled in, pick Spanish; Tesseract will get the brand names approximately right and the body text exactly right. For mixed-script documents (English next to Chinese, Arabic, Hindi), Tesseract isn't great at switching scripts on the fly — let us know if you need one of those and we'll add the model to the picker.

A worked example with real numbers

Take a real case: a 12-page scanned PDF of a 1980s university transcript — typewriter font, 200 DPI scan, no obvious damage, single column, English. The file is 4.2 MB.

At 200 DPI / English, the run takes 28 seconds end-to-end (after the model is already cached from a prior session). The output is a 6.8 KB .txt file. Spot-checking the output against the original: 1,247 words checked, 14 OCR errors total. Most were "I" vs "l" vs "1" confusion in the student-ID column. Course names came out clean. Grades came out clean. The transcript header — the school crest area, where the scan caught some of the seal — was the only zone with real garbage. Manual cleanup took two minutes in a text editor.

Same document at 300 DPI: 51 seconds, 4 OCR errors. Same document at 150 DPI: 18 seconds, 38 errors — visibly worse, the digits got hit hard. 200 DPI was the right pick.
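For reference, those error counts translate to word-level accuracy like this (the numbers are from the runs above; the helper is ours):

```javascript
// Word-level accuracy from an error count and a total word count.
function wordAccuracy(errors, words) {
  return 100 * (1 - errors / words);
}

// Against the 1,247-word check: 14 errors ≈ 98.9% (200 DPI),
// 4 errors ≈ 99.7% (300 DPI), 38 errors ≈ 97.0% (150 DPI).
```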

Now flip the input: a phone photo of a handwritten meeting note, exported as a PDF. Tesseract returned a mostly-blank file with a few stray characters. That's the honest answer — Tesseract was trained on machine-printed text, and handwriting is a different problem. We don't pretend.

How this compares to Adobe Acrobat, SmallPDF, iLovePDF

The honest comparison: Adobe Acrobat's OCR is better than ours on hard inputs, and we tell you that openly.

Acrobat runs Tesseract under the hood (or did historically — Adobe layered their own ML on top years ago) plus a stack of pre-processing: de-skew, contrast correction, line-detection, and a post-OCR language model that catches common recognition mistakes. On a crumpled receipt photographed under fluorescent light, Acrobat will produce usable text where Tesseract returns mush. That's worth ~$20/month if your job involves OCRing messy real-world inputs all day. On a clean 200-300 DPI machine-printed scan in a common language, the gap closes — Tesseract is often 97%+, Acrobat is often 99%+, and for most users the difference doesn't justify uploading the document.

SmallPDF and iLovePDF wrap server-side OCR (the same family of engines) behind a daily free-file quota and a recurring subscription nudge. They work. They also keep your PDF on their servers for at least a few hours, usually longer depending on the retention policy you didn't read. For sensitive scans, that's the wrong default. For a marketing flyer, it doesn't matter.

OnlineOCR.net and PDF24's OCR are the cheap-feeling end of this market — they work, but they're ad-laden, file-size-capped, and watermark the output unless you sign up. They're what we mean when we point at Big Software's bottom shelf. We're the opposite trade: slower — Tesseract is single-threaded WASM, so your CPU does the work — but with no upload, no account, and no quota.

What clean OCR depends on

OCR is pattern-matching on pixels. The pattern has to be visible. Five things drive accuracy more than anything else:

  • Scan resolution. 200-300 DPI is the floor for reliable OCR. 100 DPI photos from a phone often look fine to a human and confuse Tesseract — the letter strokes are too few pixels wide for the recognizer to lock onto a shape.
  • Contrast. Black ink on white paper is best. Faded photocopies of photocopies, where letters blur into the page background, drop accuracy fast. Adjust contrast in your scanner software before saving the PDF if you can.
  • Skew. A page scanned at a 5° angle reads worse than a straight one. Acrobat de-skews automatically; Tesseract doesn't. Re-scan with a straighter page or rotate in a viewer first.
  • Font. Modern body fonts at 10-12pt OCR cleanly. Decorative fonts, blackletter, very thin or very bold faces, all-caps display type, and handwriting are all harder.
  • Background. Solid white or light-cream pages are easy. Heavily watermarked pages, security paper (the lined background on a check, the patterned background on a diploma), and aged paper with browning are all noise for the recognizer.

The single highest-leverage fix on a bad OCR run is rescanning at 300 DPI with the contrast bumped up. It's also free.

Render to text is not the same as extract embedded text

This is the one thing people get wrong consistently. There are two PDF-to-text jobs that sound identical and aren't.

OCR the page (what this tool does): each page is rendered as a high-resolution bitmap, then Tesseract recognizes characters from those pixels. Works on scanned PDFs, photo-of-page PDFs, anything where the text is part of an image. Slow, slightly inaccurate, the only option for image-only PDFs.

Extract the embedded text (different tool): reads the text objects stored inside the PDF directly. Works on any PDF born from a word processor, a browser's "Save as PDF," LaTeX, InDesign — anything that produced the PDF from real text. Instant, perfectly accurate, but returns nothing on scans because there's no embedded text to extract.

If your PDF already has selectable text — try to highlight a sentence in any PDF viewer — use Extract Text from PDF instead. It's faster and exact. OCR is for the case where there's no text to extract, only pixels to recognize.
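The highlight-a-sentence test can be automated, too. Here's a hedged sketch of how you might probe for an embedded text layer with pdfjs-dist (assumed loaded as the `pdfjsLib` global); the 10-characters-per-page threshold is an illustrative heuristic, not the tool's actual rule:

```javascript
// Illustrative heuristic: a PDF averaging fewer than 10 embedded
// characters per sampled page is almost certainly a scan.
function needsOcr(embeddedChars, pageCount) {
  return embeddedChars / pageCount < 10;
}

// Sample the embedded text layer of the first few pages.
async function probePdf(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  let chars = 0;
  const sample = Math.min(pdf.numPages, 3); // a few pages are enough
  for (let i = 1; i <= sample; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    chars += content.items.reduce((n, item) => n + item.str.length, 0);
  }
  return needsOcr(chars, sample);
}
```

If the probe finds real text, extraction is the right tool; if it comes back empty, you're looking at pixels and OCR is the only option.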

What Tesseract is and isn't

Tesseract started as a research project at HP Labs in the 1980s, got open-sourced in 2005, and became Google's preferred OCR engine for the next decade. The community version we run is the same engine that ships inside countless desktop tools, Linux distributions, and document-processing pipelines. It's not new and it's not magic — it's mature, predictable, and free.

What it's good at: machine-printed text, modern Latin and Cyrillic alphabets, body fonts at normal sizes, books, contracts, articles, scanned receipts, typewritten documents. What it's mediocre at: small fonts under 8pt, mathematical notation, multi-column layouts (it tries, but sometimes interleaves columns), tables (rendered as text in reading order, not as a table). What it's bad at: handwriting (essentially undecipherable to it; very tidy hand-printing occasionally works but expect heavy errors), CAPTCHAs (intentionally), heavily stylized fonts, anything where the characters are deliberately hard to read.

The newer commercial OCRs — Google Cloud Vision, Microsoft Azure Document Intelligence, AWS Textract — beat Tesseract on hard inputs because they use modern transformer models trained on enormous datasets. They also charge per page and require you to ship your document to a cloud. For most documents, most days, Tesseract in the browser is the right trade.

Related PDF tools

The OCR PDF tool is one tile in a larger PDF toolset. A few neighbors that often come up:

  • Extract Text from PDF — use this first if your PDF already has selectable text. Instant and exact, no OCR needed.
  • Extract PDF Images — pull out the embedded photos from a PDF as individual files. Different job from OCR.
  • PDF to PNG — render each page as a lossless image. Useful when you want the page pictures alongside the recognized text.
  • Split PDF — break a long PDF into chunks before OCRing. The OCR tool caps at 100 pages per run; for longer documents, split first.
  • Compress PDF — if your scan is over the 50 MB cap, compressing it first may bring the file size back under the limit, and a modest compression pass won't hurt OCR accuracy at 200 DPI.

Microapp ships every PDF tool browser-side, with the same trade-offs spelled out on each page. 10% of every dollar Microapp earns goes to charity, off the top, audited quarterly — so the tool you're using has to actually work without ads in the way.

Frequently asked questions

How does this compare to Adobe Acrobat's OCR?

Honestly: Acrobat is better on hard inputs. Acrobat uses Tesseract plus a layer of Adobe's own ML for de-skewing, contrast correction, and language model post-processing — it handles messy scans (bad lighting, rotated pages, low contrast, unusual fonts) more reliably. This tool is plain Tesseract. On clean machine-printed scans at 200-300 DPI in a supported language, the gap is small and you get the upside of not uploading your document. On hard scans, Acrobat wins. We'd rather tell you that than oversell.

Which languages does it support?

Today: English, Spanish, German, French, Portuguese, Italian, and Russian. Each language has its own ~5-10 MB Tesseract model that downloads on first use and is cached after. Tesseract itself supports 100+ languages — if you need one that isn't in the picker (Chinese, Japanese, Arabic, Hindi, etc.), let us know and we'll add it. Mixed-language documents work best if you pick the dominant language; Tesseract isn't great at switching scripts on the fly.

Does it work on handwriting?

Almost never. Tesseract was trained on machine-printed text — fonts, books, scanned typewritten documents, signage. Cursive handwriting is essentially undecipherable to it; very tidy hand-printed text occasionally works but you should expect heavy errors. For handwriting OCR you need a different model (Google Cloud Vision and Microsoft Azure Document Intelligence both ship handwriting-trained models). We won't pretend Tesseract can do it.

How fast is it?

Two phases. (1) Model load: 5-15 seconds the first time you pick a language (the ~5-10 MB .traineddata downloads). Cached after, so subsequent runs in the same browser are instant. (2) Recognition: ~1-3 seconds per page at 200 DPI on a modern laptop, ~3-8 seconds at 300 DPI. A 20-page scanned report at 200 DPI typically finishes in 30-60 seconds end-to-end after the first run. There's a Cancel button if you change your mind partway.
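The two phases fold into a rough ETA. A sketch using midpoints of the figures above — illustrative only, since real timing depends on your CPU:

```javascript
// Rough end-to-end estimate in seconds, from the quoted ranges:
// ~10 s first-time model load, ~2 s/page at 200 DPI,
// ~5.5 s/page at 300 DPI.
function estimateSeconds(pages, dpi, firstRun) {
  const perPage = dpi >= 300 ? 5.5 : 2;
  return (firstRun ? 10 : 0) + pages * perPage;
}

// A 20-page scan at 200 DPI on a warm cache: ~40 s, inside the
// 30-60 second range quoted above.
```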

Is my PDF really private?

The PDF itself never leaves the browser. pdfjs-dist renders pages locally; Tesseract.js runs the OCR locally via WebAssembly. The only network request during a run is fetching the language model from Tesseract's CDN (jsdelivr) the first time — and that's just a public static file download, the same one everyone gets, with no PDF data in it. Check your browser's Network tab during recognition: after the model loads, zero outbound requests until you reload the page.

Why does the recognized text have mistakes?

OCR is inherently imperfect — it's pattern matching on pixels. Accuracy depends heavily on input quality: 300 DPI clean scan of a standard book typeface in good contrast = often 98%+. 150 DPI photo of a crumpled receipt under fluorescent light = much worse. Common issues: 'l' vs 'I' vs '1' confusion, 'O' vs '0', joined letters in old fonts, columns interleaving, footnotes mixed with body text. Always proofread OCR output before trusting it for legal, medical, or financial use.

Can I OCR a PDF that already has text?

You can, but it's the wrong tool. PDFs with embedded text (anything exported from Word, Google Docs, LaTeX, or 'Save as PDF' from a browser) already have selectable text — running OCR on them re-recognizes the rendered glyphs from scratch, which is slower and less accurate than just reading the text that's already there. For those, use the Extract Text from PDF tool — it's instant and exact.

Will you add a searchable-PDF output?

Yes, that's the v2 plan. The current output is a plain .txt file of the recognized words. A 'searchable PDF' would keep the original page images but add an invisible text layer on top, so you can highlight, select, and Ctrl-F inside the PDF like a normal text document. It's a more complex build (positioning each recognized word at the right x/y on the page) and we wanted to ship the honest .txt version first. The output picker shows 'Searchable PDF — coming soon' so you know it's planned.

What's the file size limit?

50 MB and 100 pages per run. OCR is much heavier than text extraction — every page gets rendered to a high-resolution canvas and processed through a WASM model — so the limits are tighter than our other PDF tools. For larger documents, split the PDF with our PDF Splitter and OCR the chunks separately. On a low-memory device (a phone, a Chromebook with 4 GB RAM), even 100 pages at 300 DPI may run out of memory; drop to 200 DPI or split smaller.
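A pre-flight check matching those stated limits might look like this (the function and its messages are ours, not the tool's source):

```javascript
// Validate a file against the documented per-run limits:
// 50 MB and 100 pages.
function checkLimits(fileBytes, pageCount) {
  const errors = [];
  if (fileBytes > 50 * 1024 * 1024) errors.push("File exceeds 50 MB");
  if (pageCount > 100) errors.push("PDF exceeds 100 pages");
  return errors;
}
```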