How to OCR PDF Files on Linux with Tesseract

Tesseract is one of the most capable open-source OCR engines available on Linux, and it's entirely free. Originally developed at HP and later sponsored by Google for many years, it is now maintained as a community open-source project and recognizes text in more than 100 languages. Combined with other Linux command-line tools, it forms the backbone of a complete PDF OCR pipeline.

But OCR on Linux isn't a single-command operation. Tesseract reads images, not PDFs, so you need to convert PDF pages to images first, run Tesseract on those images, and then optionally recombine the results into a searchable PDF. This pipeline involves pdftoppm or Ghostscript for the PDF-to-image step, Tesseract for OCR, and optionally tools like img2pdf or Tesseract's PDF output mode to create the final searchable PDF.

This guide walks you through the complete OCR pipeline on Linux step by step, covers Tesseract's most useful options, shows how to process multi-page PDFs and entire directories, and explains the difference between text extraction and searchable PDF creation. For quick, one-off OCR tasks where you don't want to set up the pipeline, LazyPDF's browser-based OCR tool (which also uses Tesseract under the hood) in Firefox or Chromium offers a fast alternative with a simple interface.

Setting Up Tesseract OCR on Linux

Tesseract installation is straightforward on all major Linux distributions. Installing language packs is equally important — without the right language data, Tesseract's accuracy drops significantly.

1. Install Tesseract on Ubuntu/Debian: `sudo apt install tesseract-ocr`
2. Install Tesseract on Fedora/RHEL: `sudo dnf install tesseract`
3. Install English language data (usually included): `sudo apt install tesseract-ocr-eng`
4. Install additional languages (e.g., French, German, Spanish): `sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa`
5. Verify installation and list available languages: `tesseract --version && tesseract --list-langs`
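
Before running OCR on a multilingual document, it can help to check that every language you plan to use is actually installed. The following is a minimal sketch, not part of Tesseract itself: `missing_langs` is a hypothetical helper name, and it assumes that `tesseract --list-langs` prints one language code per line after a header line.

```bash
# Print each requested Tesseract language code that is NOT installed.
# Usage: missing_langs eng fra deu
missing_langs() {
  local installed lang
  # Older Tesseract releases print the list to stderr, so capture both streams.
  installed=$(tesseract --list-langs 2>&1)
  for lang in "$@"; do
    # -x matches whole lines only, -F treats the code as a literal string,
    # so the header line and any warnings never count as a language.
    if ! printf '%s\n' "$installed" | grep -qxF -- "$lang"; then
      echo "$lang"
    fi
  done
}
```

For example, `missing_langs eng fra deu` prints nothing when all three packs are present. Any code it does print can be installed on Ubuntu/Debian with the matching `tesseract-ocr-<code>` package.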

The Complete PDF OCR Pipeline on Linux

OCR on PDFs requires two main steps: converting PDF pages to images, then running Tesseract on those images. Here's the complete pipeline using pdftoppm and Tesseract.

First, convert the PDF pages to high-resolution images:

`pdftoppm -r 300 scanned.pdf /tmp/page`

This creates /tmp/page-1.ppm, /tmp/page-2.ppm, etc. at 300 DPI, a high enough resolution for accurate OCR. For documents with ten or more pages, pdftoppm zero-pads the page numbers (page-01.ppm, page-02.ppm, ...), so the files still sort in page order.

Then run Tesseract on each page to extract text:

```bash
for img in /tmp/page-*.ppm; do
  tesseract "$img" stdout >> extracted_text.txt
done
```

Alternatively, create a searchable PDF directly from a single image:

`tesseract /tmp/page-1.ppm output_page pdf`

This creates output_page.pdf, which contains both the original scanned image and an invisible text layer, making it searchable while preserving the visual appearance.

For a complete multi-page searchable PDF, the workflow uses Tesseract's built-in PDF output and then merges the pages:

```bash
for img in /tmp/page-*.ppm; do
  name=$(basename "$img" .ppm)
  tesseract "$img" "/tmp/$name" pdf
done
pdftk /tmp/page-*.pdf cat output searchable.pdf
```
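
The steps above can be wrapped in a single function so that intermediate files do not pile up in /tmp. This is only a sketch under a few assumptions: `make_searchable` is a hypothetical name, the language defaults to `eng`, and pdftoppm, tesseract, and pdftk are all on the PATH.

```bash
# Turn a scanned PDF into a searchable PDF with pdftoppm, tesseract and pdftk.
# Usage: make_searchable scanned.pdf searchable.pdf [lang]
make_searchable() {
  local input="$1" output="$2" lang="${3:-eng}"
  local workdir img status
  workdir=$(mktemp -d) || return 1

  # 1. Rasterize every page at 300 DPI into a private working directory.
  pdftoppm -r 300 "$input" "$workdir/page" || { rm -rf "$workdir"; return 1; }

  # 2. OCR each page image into a one-page PDF with an invisible text layer.
  for img in "$workdir"/page-*.ppm; do
    tesseract "$img" "${img%.ppm}" -l "$lang" pdf || { rm -rf "$workdir"; return 1; }
  done

  # 3. Merge the per-page PDFs (the glob expands in page order).
  pdftk "$workdir"/page-*.pdf cat output "$output"
  status=$?

  # Remove the working directory whether or not the merge succeeded.
  rm -rf "$workdir"
  return "$status"
}
```

A call such as `make_searchable scanned.pdf searchable.pdf fra` would run the whole pipeline. Using a fresh `mktemp -d` directory per run also keeps leftover page files from an earlier document from leaking into the merge.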

Optimizing Tesseract OCR Accuracy on Linux

Tesseract's accuracy depends heavily on image quality and configuration. Here are the most important techniques for improving OCR results on Linux.

Use 300 DPI minimum when converting PDF pages to images with pdftoppm. Lower resolutions degrade OCR accuracy significantly: at 72 DPI, Tesseract struggles with many fonts, while at 300 DPI, accuracy on clean documents is typically 97–99%.

Specify the correct language with the `-l` flag:

`tesseract image.ppm output -l fra+eng`

This uses French as the primary language with English as secondary, which is useful for bilingual documents.

Select the engine with the OEM (OCR Engine Mode) flag. Mode 1 uses LSTM (neural network), which is Tesseract's most accurate engine:

`tesseract image.ppm output -l eng --oem 1`

On Tesseract 4 and later with the standard language data, LSTM is normally already what the default mode selects, so `--oem 1` mainly makes the choice explicit.

Preprocess images for better results using ImageMagick before OCR:

```bash
convert page-1.ppm -colorspace Gray -threshold 50% preprocessed.ppm
tesseract preprocessed.ppm output
```

Grayscale conversion and thresholding can significantly improve accuracy on low-contrast or colored backgrounds.
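
These options can be combined into one small helper. The sketch below assumes ImageMagick 6's `convert` command (ImageMagick 7 provides the same operation as `magick`); `ocr_page` is a hypothetical name, and the fixed 50% threshold is simply the value used above, not a tuned setting.

```bash
# Preprocess one page image, then OCR it with an explicit language and the LSTM engine.
# Usage: ocr_page page-1.ppm output [langs]    (Tesseract writes output.txt)
ocr_page() {
  local img="$1" outbase="$2" langs="${3:-eng}"
  local tmpdir status
  tmpdir=$(mktemp -d) || return 1

  # Grayscale plus a fixed 50% threshold; tesseract only runs if convert succeeds.
  convert "$img" -colorspace Gray -threshold 50% "$tmpdir/clean.ppm" &&
    tesseract "$tmpdir/clean.ppm" "$outbase" -l "$langs" --oem 1
  status=$?

  rm -rf "$tmpdir"
  return "$status"
}
```

For a bilingual scan, `ocr_page /tmp/page-1.ppm page1 fra+eng` would write the recognized text to page1.txt. If thresholding makes a particular scan worse, for example by erasing faint text, dropping the `-threshold` step and keeping only the grayscale conversion is a reasonable variation to try.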

Using OCRmyPDF for a Simpler Pipeline on Linux

If the manual pipeline feels complex, OCRmyPDF is a high-level tool that wraps Tesseract and handles the entire PDF OCR workflow in a single command. It's the most user-friendly OCR solution for Linux PDF files.

Install OCRmyPDF:

`sudo apt install ocrmypdf`

Basic usage:

`ocrmypdf input.pdf output.pdf`

This writes a new file, output.pdf, that contains the original scanned images plus an added text layer, making it searchable. The input file is left untouched. The output is a standard PDF that opens in any viewer.

OCRmyPDF handles multi-page PDFs automatically, includes image preprocessing (deskew, cleanup), and even repairs some malformed PDFs. It uses Tesseract internally but abstracts away all the intermediate steps.

For more control:

`ocrmypdf -l fra+eng --deskew --clean input.pdf output.pdf`

The `--deskew` flag automatically corrects slightly tilted scans, and `--clean` removes background noise before OCR (it requires the unpaper package to be installed). Both can significantly improve OCR accuracy on real-world scanned documents.
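
One way to confirm that the OCR step produced a usable text layer is to extract text from the result with `pdftotext`, which ships with the same Poppler utilities as pdftoppm. The following is a rough sketch: `has_text_layer` and `ocr_and_check` are hypothetical helper names, and "contains at least one non-whitespace character" is a deliberately crude test of searchability.

```bash
# Return 0 if pdftotext finds any non-whitespace text in the given PDF.
has_text_layer() {
  pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Run OCRmyPDF, then verify that the output contains extractable text.
# Usage: ocr_and_check input.pdf output.pdf [langs]
ocr_and_check() {
  local input="$1" output="$2" langs="${3:-eng}"
  ocrmypdf -l "$langs" --deskew "$input" "$output" || return 1
  if has_text_layer "$output"; then
    echo "OK: $output is searchable"
  else
    echo "WARNING: no text found in $output" >&2
    return 2
  fi
}
```

Running `ocr_and_check input.pdf output.pdf fra+eng` prints a confirmation on success. The same `has_text_layer` check can be pointed at an input file first to find out whether it needs OCR at all.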

Frequently Asked Questions

How accurate is Tesseract OCR on Linux?

On clean, high-resolution (300+ DPI) scans of printed text, Tesseract accuracy is typically 97–99% with the LSTM engine (`--oem 1`). Accuracy drops for low-quality scans, handwriting, unusual fonts, or complex layouts. For standard office documents scanned at proper resolution, Tesseract produces excellent results.

Can I OCR a PDF directly without converting to images first on Linux?

OCRmyPDF handles this automatically — you just provide the PDF directly. Tesseract itself works on images, not PDFs, which is why the manual pipeline requires a PDF-to-image conversion step. OCRmyPDF automates this and is the recommended approach for most users.

Can Tesseract OCR handwriting on Linux?

Tesseract is optimized for printed text and struggles with handwriting. The LSTM engine has some limited handwriting recognition capability, but results are unreliable for typical handwriting. For handwriting, specialized models or cloud-based handwriting recognition services produce much better results.

How do I batch OCR an entire directory of scanned PDFs on Linux?

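With OCRmyPDF in a bash loop: `for f in *.pdf; do ocrmypdf "$f" "ocr_${f}"; done`. This processes each PDF in turn and saves the searchable version with an 'ocr_' prefix. Note that `--jobs` does not run several files at once: it sets how many worker processes OCRmyPDF uses for the pages of a single file (by default it uses the available CPU cores), so `--jobs 4` is mainly useful for capping CPU usage. The loop itself still handles one file at a time.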
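
For larger batches, a slightly more defensive version of this loop can be sketched as follows. `batch_ocr` is a hypothetical name, and writing results to a separate output directory is a design choice made here so that a second run does not pick up the `ocr_` files produced by the first.

```bash
# OCR every PDF in input_dir into output_dir, skipping files already done.
# Usage: batch_ocr <input_dir> <output_dir>
batch_ocr() {
  local indir="$1" outdir="$2" f out failed=0
  mkdir -p "$outdir" || return 1
  for f in "$indir"/*.pdf; do
    [ -e "$f" ] || continue              # the glob matched nothing
    out="$outdir/ocr_$(basename "$f")"
    if [ -e "$out" ]; then
      echo "skip: $out already exists"
      continue
    fi
    if ! ocrmypdf "$f" "$out"; then
      echo "failed: $f" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every attempted file was processed without error.
  [ "$failed" -eq 0 ]
}
```

A call like `batch_ocr ~/scans ~/scans/searchable` can be interrupted and re-run safely, since finished files are skipped and failures are reported without stopping the rest of the batch.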

