What the PDF Text Extractor actually does
It pulls the selectable text out of a PDF and hands it back to you as plain UTF-8 — copy to clipboard, or download as a .txt file named after the source. That's it. No conversion to Word, no AI summary, no formatting tricks. Just the words.
If you've ever tried to grab a few paragraphs out of a PDF with copy-paste, you know how badly that goes. PDF viewers paste with mid-paragraph line breaks every 80 characters. Selections jump across columns. Footnotes land in the middle of body text. You end up cleaning up the result by hand for ten minutes before you can actually use it. This tool does the whole document in one pass, with cleaner line breaks than copy-paste gives you, and writes the result to a file you can grep, diff, paste into ChatGPT, or feed to a script.
One thing up front, because it matters: this is not an OCR tool. If your PDF is a scan of a paper document — pictures of text rather than text itself — this tool will give you back nothing, and we say so honestly in the result panel. We don't pretend to extract characters we can't see. For scans, you need a real OCR step first (Adobe Acrobat, macOS Preview's built-in OCR since Ventura, Tesseract on the command line). Run the OCR, save the text-bearing PDF, then run that through here.
What "extract" means under the hood
The tool uses pdfjs-dist — Mozilla's PDF.js, the same library that renders PDFs inside Firefox. PDF.js parses each page and returns a list of text items, each with the actual characters plus an x/y coordinate on the page. We walk the list in document order and reconstruct line breaks by watching for significant y-coordinate jumps: when the next text item is several pixels lower than the previous one, that's a new line.
This is the same approach a PDF viewer uses when you select text with your cursor. The difference is we do the whole document at once and write a sensible output: lines that are actual lines, paragraphs separated by single newlines, pages separated by double newlines so you can tell them apart if you need to.
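The reconstruction step can be sketched in a few lines. This is a simplified model, not the tool's actual source: real pdfjs-dist text items carry a transform matrix rather than bare x/y fields, so the item shape here is an assumption:

```javascript
// Reconstruct line breaks from positioned text items, in document order.
// Each item is assumed to look like { str, x, y }, with y measured from the
// top of the page; a jump larger than the tolerance starts a new line.
function itemsToText(items, lineTolerance = 2) {
  let out = "";
  let prevY = null;
  for (const item of items) {
    if (prevY !== null && Math.abs(item.y - prevY) > lineTolerance) {
      out += "\n";            // significant vertical jump: new line
    } else if (out && !out.endsWith("\n")) {
      out += " ";             // same line: join items with a space
    }
    out += item.str;
    prevY = item.y;
  }
  return out;
}
```

The tolerance matters: text on a single visual line can wobble by a pixel or two, so an exact y-comparison would splinter lines apart.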
The whole pipeline runs in your browser. Your PDF goes from your file system to the browser's memory to the rendered text. It never touches our server, because there's no server step. Open your browser's network tab while you extract — you'll see zero outbound requests. That's not a privacy promise we're asking you to trust; it's a fact you can verify in 5 seconds.
How to use it
- Drop the PDF onto the page, or pick it from a file dialog. Up to 100 MB and 500 pages.
- Click "Extract text." PDF.js loads the file and reads every page in order. Progress shows for longer documents.
- The extracted text appears in a panel. Copy it to your clipboard with one button, or download as a .txt file — the file is named after your source, so contract.pdf downloads as contract.txt.
- Open the .txt in anything: Notepad, TextEdit, VS Code, Sublime, an editor on a remote server. UTF-8 means accented characters, smart quotes, em-dashes, CJK characters, emoji, and math symbols all round-trip cleanly.
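The naming rule is simple enough to state as code — a hypothetical one-liner, not the site's actual implementation:

```javascript
// Derive the download name from the source file: contract.pdf -> contract.txt.
// Case-insensitive on the extension; files without .pdf just gain .txt.
function txtName(pdfName) {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}
```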
There's no account, no watermark on the output, no daily quota, and no "upgrade to extract more than 3 pages" wall. Free is a fact here, not a slogan.
A worked example: a 14-page board memo
Concrete case. Imagine a 14-page board memo exported from Google Docs as PDF. Body text, two headings per page, a footer on every page with the page number and document title, no images, no columns.
Drop it in. The extract takes about 1.2 seconds on a modern laptop. The output is roughly 18,000 characters of plain text — every paragraph from the memo, the heading text inline above each section, and the page footer repeating 14 times. That last part is normal: PDF.js sees the footer as text on every page and so do we. If you want to strip repeated footer lines, that's a five-second find-and-replace in your editor (search the footer string, replace with nothing) and not something we want to do automatically because we'd guess wrong on documents where the "repeating" line is actually meaningful content.
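If you'd rather script the footer cleanup than do it in an editor, here's a hedged sketch. It assumes the tool's output convention (pages separated by double newlines) and drops any line that appears on every page:

```javascript
// Remove lines that repeat on every page (e.g. a running footer).
// Assumes pages are separated by "\n\n", as the extractor emits them.
function stripRepeatedLines(text) {
  const pages = text.split("\n\n").map(p => p.split("\n"));
  if (pages.length < 2) return text;   // nothing to compare against
  const counts = new Map();
  for (const page of pages) {
    for (const line of new Set(page)) {
      counts.set(line, (counts.get(line) || 0) + 1);
    }
  }
  return pages
    .map(page => page.filter(l => counts.get(l) < pages.length).join("\n"))
    .join("\n\n");
}
```

Note this has the same blind spot we avoid in the tool itself: a line that legitimately repeats on every page (a refrain, a repeated warning) would be stripped too, which is exactly why we leave this step to you.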
From there: paste into Claude or ChatGPT for a summary, paste into a Notion page, grep for specific terms across the whole document with the kind of speed you can't get inside a PDF viewer, or diff against a previous draft if you have one as text.
Where it works well, where it doesn't
| PDF type | Result | What you'll see |
|---|---|---|
| Word / Google Docs export | Excellent | Clean paragraphs, accurate line breaks, full Unicode |
| LaTeX (academic paper, one column) | Excellent | Body text, headings, references all extract |
| LaTeX (two-column journal article) | Mediocre | Columns interleave — line 1 of col 1, line 1 of col 2, line 2 of col 1... |
| Browser "Save as PDF" | Good | Text comes through; some sites embed CSS pseudo-elements that don't extract |
| EPUB-style eBook PDF | Good | Chapter text and headings, generally clean |
| Scanned document (no OCR) | Empty | We tell you the page contains no extractable text. Run OCR first. |
| Scanned + OCR'd (e.g., via Acrobat) | Good | Quality depends entirely on the OCR pass that preceded it |
| Form-filled PDF (filled fields) | Good for body, mixed for fields | Body text fine; form field contents extract if they're real text and not annotations |
| Print-to-PDF from old apps | Variable | Some old PDF generators flag text as image-only — can't help that case |
The single biggest limitation is column handling. PDF.js gives us text in document order, which on a two-column page means it walks down column 1 then column 2 in a single linear stream. We could try to detect columns from the x-coordinate distribution and re-order, but column detection is unreliable on real-world documents — sidebars, callout boxes, and tables all look like columns to a naive detector. We chose not to guess. If you have a heavily multi-column document, the cleanest workflow is: extract here, then paste the result into an LLM and ask it to reflow the columns.
Why use this over iLovePDF, SmallPDF, or Adobe Acrobat
The big-name PDF sites all do text extraction. Here's the honest comparison.
iLovePDF and SmallPDF upload your PDF to their servers. For a board memo, a contract draft, an internal report, a medical document — that's a no. Even if the company isn't doing anything sketchy with the file, the bytes briefly live on someone else's infrastructure. Both also impose page caps and ask you to sign up or pay to do more than 1–2 documents an hour. We have no upload step at all and no cap below 100 MB / 500 pages.
Convertio is the same story plus a more aggressive paywall — five free conversions, then upgrade. It works fine when it works; the cost is the upload and the queue.
Adobe Acrobat does extract text well, locally, on your machine. If you already pay for Acrobat Pro, use it. If you don't — Acrobat Pro is $20/month for the privilege of also getting features you don't need. This tool is one of those features unbundled.
Copy-paste from your PDF reader is fine for one paragraph. For a whole document it's tedious and produces worse output than this tool because the reader has no concept of "give me the document as cleanly broken text"; it gives you whatever you happened to select, with the line breaks the layout happened to have.
Related PDF tools
If you came here for text, you might also want:
- PDF Page Counter — drop a PDF and see how many pages, plus all the embedded metadata (title, author, creator app, creation date). Two-second lookup, no transform.
- Extract Images from PDF — pulls embedded photos, logos, and screenshots out as separate PNG files at their original resolution. Different from PDF to PNG, which renders whole pages.
- Split PDF — if your document is too large to extract in one pass, split it into chunks first.
- PDF Merger — the other direction: combine multiple PDFs into one before extracting text from the whole.
- PDF to JPG — when you want page-level images, not the source content.
Frequently asked questions
Does this work on scanned PDFs?
No, and we tell you clearly when it doesn't. Scanned PDFs are images of text — there are no characters in the file, only pixels. Text extraction can only return characters that exist. For scans, run OCR first: Adobe Acrobat, macOS Preview (File → Export → check "Embed text" in Ventura and later), or Tesseract on the command line. Save the resulting text-bearing PDF, then come back here.
How does it handle columns, headers, and footnotes?
Document order. PDF.js reads top to bottom in the order the items appear in the PDF, and so do we. Two-column layouts interleave (line 1 of col 1, then line 1 of col 2, etc.). Headers and footers repeat on every page. For a clean multi-column extraction, the workaround is to paste the result into an LLM and ask it to reflow the columns.
Is my PDF really not uploaded?
Correct. PDF.js runs in your browser — it's the same library Firefox uses to display PDFs internally. The bytes go from your disk to memory to the rendered text without crossing the internet. Check your browser's network tab during the extract: zero outbound requests. The static page itself loads from our CDN on first visit; after that, the extract is fully local.
What's the character encoding?
UTF-8. Accented characters, smart quotes, em-dashes, CJK characters, emoji, math symbols — all round-trip cleanly. If you open the .txt in an editor that defaults to a different encoding (some older Windows editors default to Windows-1252), set it to UTF-8 to see the characters correctly.
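The round-trip claim is easy to verify yourself with the standard TextEncoder/TextDecoder pair, available in every modern browser and in Node:

```javascript
// UTF-8 round-trip: mixed scripts, symbols, and emoji survive encode/decode.
const sample = "café — “smart quotes” — 東京 — 🙂 — ∑ x²";
const bytes = new TextEncoder().encode(sample);          // string -> UTF-8 bytes
const restored = new TextDecoder("utf-8").decode(bytes); // bytes -> string
// restored === sample: every character comes back intact
```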
Why is my output missing some text?
Three usual reasons. One, the PDF marks some text as "image-only" — a few old PDF generators do this to block copying. Two, the PDF uses a custom font with no Unicode mapping, so PDF.js can render glyphs but can't map them back to characters. Three, the apparent text is actually an image (a scanned page or a screenshot of text). The diagnostic: open the PDF in any viewer and try to select the missing text with your cursor. If you can't select it there, no extraction tool can pull it.
Can I extract text from a password-protected PDF?
Not directly — PDF.js refuses to open encrypted PDFs. Unlock it first in a desktop reader: Adobe Acrobat (File → Properties → Security, then Save As an unprotected copy), or macOS Preview (File → Export, uncheck "Encrypt"). Then run the unlocked copy through this tool.
What's the size limit?
100 MB and 500 pages per PDF. Text extraction is much lighter than rendering, so the page-count limit is generous. For documents larger than that — multi-thousand-page legal discovery, large manuscripts — split into chunks first with Split PDF and extract each chunk.
Why a .txt file instead of Word or Markdown?
Plain text is the format that travels best. It opens in everything, has no formatting that needs translating, and is what you'd want for the most common next steps: searching, diffing, feeding into an LLM, piping through a script, version-controlling. If you need Markdown, paste the .txt into an LLM and ask for Markdown formatting — much better result than any heuristic conversion would produce.