What the PDF Text Extractor actually does
It pulls the selectable text out of a PDF and hands it back to you as plain UTF-8 — copy to clipboard, or download as a .txt file named after the source. That's it. No conversion to Word, no AI summary, no formatting tricks. Just the words.
If you've ever tried to grab a few paragraphs out of a PDF with copy-paste, you know how badly that goes. PDF viewers paste with mid-paragraph line breaks every 80 characters. Selections jump across columns. Footnotes land in the middle of body text. You end up cleaning up the result by hand for ten minutes before you can actually use it. This tool does the whole document in one pass, with cleaner line breaks than copy-paste gives you, and writes the result to a file you can grep, diff, paste into ChatGPT, or feed to a script.
One thing up front, because it matters: this is not an OCR tool. If your PDF is a scan of a paper document — pictures of text rather than text itself — this tool will give you back nothing, and we say so honestly in the result panel. We don't pretend to extract characters we can't see. For scans, you need a real OCR step first (Adobe Acrobat, macOS Preview's built-in OCR since Ventura, Tesseract on the command line). Run the OCR, save the text-bearing PDF, then run that through here.
What "extract" means under the hood
The tool uses pdfjs-dist — Mozilla's PDF.js, the same library that renders PDFs inside Firefox. PDF.js parses each page and returns a list of text items, each with the actual characters plus an x/y coordinate on the page. We walk the list in document order and reconstruct line breaks by watching for significant y-coordinate jumps: when the next text item is several pixels lower than the previous one, that's a new line.
This is the same approach a PDF viewer uses when you select text with your cursor. The difference is we do the whole document at once and write a sensible output: lines that are actual lines, paragraphs separated by single newlines, pages separated by double newlines so you can tell them apart if you need to.
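The reconstruction step can be sketched in a few lines. This is a simplified model, not the tool's actual source: real pdfjs-dist text items carry a transform matrix rather than bare x/y fields, so the item shape here is an assumption:

```javascript
// Reconstruct line breaks from positioned text items, in document order.
// Each item is assumed to look like { str, x, y }, with y measured from the
// top of the page; a jump larger than the tolerance starts a new line.
function itemsToText(items, lineTolerance = 2) {
  let out = "";
  let prevY = null;
  for (const item of items) {
    if (prevY !== null && Math.abs(item.y - prevY) > lineTolerance) {
      out += "\n";            // significant vertical jump: new line
    } else if (out && !out.endsWith("\n")) {
      out += " ";             // same line: join items with a space
    }
    out += item.str;
    prevY = item.y;
  }
  return out;
}
```

The tolerance matters: text on a single visual line can wobble by a pixel or two, so an exact y-comparison would splinter lines apart.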
The whole pipeline runs in your browser. Your PDF goes from your file system to the browser's memory to the rendered text. It never touches our server, because there's no server step. Open your browser's network tab while you extract — you'll see zero outbound requests. That's not a privacy promise we're asking you to trust; it's a fact you can verify in 5 seconds.
How to use it
- Drop the PDF onto the page, or pick it from a file dialog. Up to 100 MB and 500 pages.
- Click "Extract text." PDF.js loads the file and reads every page in order. Progress shows for longer documents.
- The extracted text appears in a panel. Copy it to your clipboard with one button, or download as a .txt file — the file is named after your source, so contract.pdf downloads as contract.txt.
- Open the .txt in anything: Notepad, TextEdit, VS Code, Sublime, an editor on a remote server. UTF-8 means accented characters, smart quotes, em-dashes, CJK characters, emoji, and math symbols all round-trip cleanly.
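The naming rule is simple enough to state as code — a hypothetical one-liner, not the site's actual implementation:

```javascript
// Derive the download name from the source file: contract.pdf -> contract.txt.
// Case-insensitive on the extension; files without .pdf just gain .txt.
function txtName(pdfName) {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}
```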
There's no account, no watermark on the output, no daily quota, and no "upgrade to extract more than 3 pages" wall. Free is a fact here, not a slogan.
A worked example: a 14-page board memo
Concrete case. Imagine a 14-page board memo exported from Google Docs as PDF. Body text, two headings per page, a footer on every page with the page number and document title, no images, no columns.
Drop it in. The extract takes about 1.2 seconds on a modern laptop. The output is roughly 18,000 characters of plain text — every paragraph from the memo, the heading text inline above each section, and the page footer repeating 14 times. That last part is normal: PDF.js sees the footer as text on every page and so do we. If you want to strip repeated footer lines, that's a five-second find-and-replace in your editor (search the footer string, replace with nothing) and not something we want to do automatically because we'd guess wrong on documents where the "repeating" line is actually meaningful content.
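If you'd rather script the footer cleanup than do it in an editor, here's a hedged sketch. It assumes the tool's output convention (pages separated by double newlines) and drops any line that appears on every page:

```javascript
// Remove lines that repeat on every page (e.g. a running footer).
// Assumes pages are separated by "\n\n", as the extractor emits them.
function stripRepeatedLines(text) {
  const pages = text.split("\n\n").map(p => p.split("\n"));
  if (pages.length < 2) return text;   // nothing to compare against
  const counts = new Map();
  for (const page of pages) {
    for (const line of new Set(page)) {
      counts.set(line, (counts.get(line) || 0) + 1);
    }
  }
  return pages
    .map(page => page.filter(l => counts.get(l) < pages.length).join("\n"))
    .join("\n\n");
}
```

Note this has the same blind spot we avoid in the tool itself: a line that legitimately repeats on every page (a refrain, a repeated warning) would be stripped too, which is exactly why we leave this step to you.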
From there: paste into Claude or ChatGPT for a summary, paste into a Notion page, grep for specific terms across the whole document with the kind of speed you can't get inside a PDF viewer, or diff against a previous draft if you have one as text.
Where it works well, where it doesn't
| PDF type | Result | What you'll see |
|---|---|---|
| Word / Google Docs export | Excellent | Clean paragraphs, accurate line breaks, full Unicode |
| LaTeX (academic paper, one column) | Excellent | Body text, headings, references all extract |
| LaTeX (two-column journal article) | Mediocre | Columns interleave — line 1 of col 1, line 1 of col 2, line 2 of col 1... |
| Browser "Save as PDF" | Good | Text comes through; some sites embed CSS pseudo-elements that don't extract |
| EPUB-style eBook PDF | Good | Chapter text and headings, generally clean |
| Scanned document (no OCR) | Empty | We tell you the page contains no extractable text. Run OCR first. |
| Scanned + OCR'd (e.g., via Acrobat) | Good | Quality depends entirely on the OCR pass that preceded it |
| Form-filled PDF (filled fields) | Good for body, mixed for fields | Body text fine; form field contents extract if they're real text and not annotations |
| Print-to-PDF from old apps | Variable | Some old PDF generators flag text as image-only — can't help that case |
The single biggest limitation is column handling. PDF.js gives us text in document order, which on a two-column page means it walks down column 1 then column 2 in a single linear stream. We could try to detect columns from the x-coordinate distribution and re-order, but column detection is unreliable on real-world documents — sidebars, callout boxes, and tables all look like columns to a naive detector. We chose not to guess. If you have a heavily multi-column document, the cleanest workflow is: extract here, then paste the result into an LLM and ask it to reflow the columns.
Why use this over iLovePDF, SmallPDF, or Adobe Acrobat
The big-name PDF sites all do text extraction. Here's the honest comparison.
iLovePDF and SmallPDF upload your PDF to their servers. For a board memo, a contract draft, an internal report, a medical document — that's a no. Even if the company isn't doing anything sketchy with the file, the bytes briefly live on someone else's infrastructure. Both also impose page caps and ask you to sign up or pay to do more than 1–2 documents an hour. We have no upload step at all and no cap below 100 MB / 500 pages.
Convertio is the same story plus a more aggressive paywall — five free conversions, then upgrade. It works fine when it works; the cost is the upload and the queue.
Adobe Acrobat does extract text well, locally, on your machine. If you already pay for Acrobat Pro, use it. If you don't — Acrobat Pro is $20/month for the privilege of also getting features you don't need. This tool is one of those features unbundled.
Copy-paste from your PDF reader is fine for one paragraph. For a whole document it's tedious and produces worse output than this tool because the reader has no concept of "give me the document as cleanly broken text"; it gives you whatever you happened to select, with the line breaks the layout happened to have.
Related PDF tools
If you came here for text, you might also want:
- PDF Page Counter — drop a PDF and see how many pages, plus all the embedded metadata (title, author, creator app, creation date). Two-second lookup, no transform.
- Extract Images from PDF — pulls embedded photos, logos, and screenshots out as separate PNG files at their original resolution. Different from PDF to PNG, which renders whole pages.
- Split PDF — if your document is too large to extract in one pass, split it into chunks first.
- PDF Merger — the other direction: combine multiple PDFs into one before extracting text from the whole.
- PDF to JPG — when you want page-level images, not the source content.
Frequently asked questions
Does this work on scanned PDFs?
No, and we tell you clearly when it doesn't. Scanned PDFs are images of text — there are no characters in the file, only pixels. Text extraction can only return characters that exist. For scans, run OCR first: Adobe Acrobat, macOS Preview (File → Export → check "Embed text" in Ventura and later), or Tesseract on the command line. Save the resulting text-bearing PDF, then come back here.
How does it handle columns, headers, and footnotes?
Document order. PDF.js reads top to bottom in the order the items appear in the PDF, and so do we. Two-column layouts interleave (line 1 of col 1, then line 1 of col 2, etc.). Headers and footers repeat on every page. For a clean multi-column extraction, the workaround is to paste the result into an LLM and ask it to reflow the columns.
Is my PDF really not uploaded?
Correct. PDF.js runs in your browser — it's the same library Firefox uses to display PDFs internally. The bytes go from your disk to memory to the rendered text without crossing the internet. Check your browser's network tab during the extract: zero outbound requests. The static page itself loads from our CDN on first visit; after that, the extract is fully local.
What's the character encoding?
UTF-8. Accented characters, smart quotes, em-dashes, CJK characters, emoji, math symbols — all round-trip cleanly. If you open the .txt in an editor that defaults to a different encoding (some older Windows editors default to Windows-1252), set it to UTF-8 to see the characters correctly.
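The round-trip claim is easy to verify yourself with the standard TextEncoder/TextDecoder pair, available in every modern browser and in Node:

```javascript
// UTF-8 round-trip: mixed scripts, symbols, and emoji survive encode/decode.
const sample = "café — “smart quotes” — 東京 — 🙂 — ∑ x²";
const bytes = new TextEncoder().encode(sample);          // string -> UTF-8 bytes
const restored = new TextDecoder("utf-8").decode(bytes); // bytes -> string
// restored === sample: every character comes back intact
```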
Why is my output missing some text?
Three usual reasons. One, the PDF marks some text as "image-only" — a few old PDF generators do this to block copying. Two, the PDF uses a custom font with no Unicode mapping, so PDF.js can render glyphs but can't map them back to characters. Three, the apparent text is actually an image (a scanned page or a screenshot of text). The diagnostic: open the PDF in any viewer and try to select the missing text with your cursor. If you can't select it there, no extraction tool can pull it.
Can I extract text from a password-protected PDF?
Not directly — PDF.js refuses to open encrypted PDFs. Unlock it first in a desktop reader: Adobe Acrobat (File → Properties → Security, then Save As an unprotected copy), or macOS Preview (File → Export, uncheck "Encrypt"). Then run the unlocked copy through this tool.
What's the size limit?
100 MB and 500 pages per PDF. Text extraction is much lighter than rendering, so the page-count limit is generous. For documents larger than that — multi-thousand-page legal discovery, large manuscripts — split into chunks first with Split PDF and extract each chunk.
Why a .txt file instead of Word or Markdown?
Plain text is the format that travels best. It opens in everything, has no formatting that needs translating, and is what you'd want for the most common next steps: searching, diffing, feeding into an LLM, piping through a script, version-controlling. If you need Markdown, paste the .txt into an LLM and ask for Markdown formatting — much better result than any heuristic conversion would produce.