How to Batch Process PDF Files on Linux with Shell Scripts
One of Linux's greatest advantages for PDF work is automation. While Windows and Mac users click through GUI dialogs one file at a time, Linux users can write a bash script and process hundreds or thousands of PDFs in a single unattended run. This capability is invaluable for document-heavy workflows in data analysis, legal document processing, academic research, publishing, and system administration.

Batch processing PDFs on Linux means combining several tools in scripts: Ghostscript for compression and conversion, pdftk for merging and splitting, Tesseract (or OCRmyPDF) for OCR, and pdftoppm for image extraction. These tools follow Unix conventions — they read from files, write to files, and compose naturally in shell scripts.

This guide provides practical, working bash scripts for the most common PDF batch processing tasks on Linux. Each script is ready to run and includes error handling, progress reporting, and configuration options; use them directly or adapt them to your needs. All scripts are tested on Ubuntu 22.04 LTS and should work on any modern Debian-based distribution. Fedora, Arch, and other distributions may need minor adjustments, mainly to package names and paths.
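Before running any of these scripts, it helps to confirm the tools are actually installed. A quick sanity check (Debian/Ubuntu package names are assumed; on Ubuntu 22.04 the `pdftk` command is provided by the `pdftk-java` package):

```bash
# Install on Debian/Ubuntu (adjust package names for Fedora/Arch):
#   sudo apt install ghostscript pdftk-java ocrmypdf poppler-utils
# Then verify everything is on PATH:
for tool in gs pdftk ocrmypdf pdftoppm; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
    else
        echo "MISSING: $tool" >&2
    fi
done
```

Anything reported as MISSING will cause the corresponding script below to fail, so install it before continuing.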
Batch Compress All PDFs in a Directory on Linux
This is one of the most common PDF batch tasks on Linux. The script below compresses every PDF in a directory using Ghostscript and reports the size savings for each file.
1. Create the script file: `nano batch_compress.sh`
2. Paste the script from the next section, then save with Ctrl+X, Y, Enter
3. Make it executable: `chmod +x batch_compress.sh`
4. Run it on a directory containing PDFs: `./batch_compress.sh /path/to/pdf/folder`
5. Review the output showing original and compressed sizes for each file
Batch Compress Script Content and Explanation
Here is the complete batch compression script for Linux:

```bash
#!/bin/bash
# batch_compress.sh — Compress all PDFs in a directory

INPUT_DIR="${1:-.}"
OUTPUT_DIR="${INPUT_DIR}/compressed"
QUALITY="${2:-/ebook}"

mkdir -p "$OUTPUT_DIR"

total=0
saved=0

for pdf in "$INPUT_DIR"/*.pdf; do
    [ -f "$pdf" ] || continue
    filename=$(basename "$pdf")
    output="$OUTPUT_DIR/$filename"
    echo -n "Processing: $filename ... "

    gs -dBATCH -dNOPAUSE -q \
       -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="$QUALITY" \
       -sColorConversionStrategy=RGB \
       -sOutputFile="$output" "$pdf" 2>/dev/null

    orig=$(stat -c%s "$pdf")
    comp=$(stat -c%s "$output")
    pct=$(( (orig - comp) * 100 / orig ))
    echo "${orig}B → ${comp}B (${pct}% reduction)"

    total=$((total + orig))
    saved=$((saved + orig - comp))
done

echo ""
echo "Total: $((total / 1024 / 1024))MB → $(((total - saved) / 1024 / 1024))MB"
echo "Total saved: $((saved / 1024 / 1024))MB"
```

This script processes every PDF in the input directory, saves compressed versions to a `compressed/` subdirectory, and reports size savings per file and in total. Pass a quality level as the second argument to override the `/ebook` default: `./batch_compress.sh ./reports /screen`
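If you adapt the compression script, the size-reporting arithmetic can be factored into a small helper. `report_savings` is a name invented for this sketch, and `stat -c%s` is the GNU coreutils syntax used throughout this guide:

```bash
# report_savings ORIGINAL COMPRESSED — print the byte sizes of two
# files and the percentage saved (integer math, GNU stat).
report_savings() {
    local orig comp pct
    orig=$(stat -c%s "$1")
    comp=$(stat -c%s "$2")
    pct=$(( (orig - comp) * 100 / orig ))
    echo "${orig}B -> ${comp}B (${pct}% reduction)"
}
```

For example, `report_savings report.pdf compressed/report.pdf` prints one line in the same format the batch script uses.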
Batch OCR All Scanned PDFs on Linux
For batch OCR of scanned PDFs, OCRmyPDF is the best tool — it handles the entire pipeline automatically and supports parallel processing:

```bash
#!/bin/bash
# batch_ocr.sh — Add searchable text layer to all scanned PDFs

INPUT_DIR="${1:-.}"
OUTPUT_DIR="${INPUT_DIR}/ocr_output"
OCR_LANG="${2:-eng}"
JOBS="${3:-4}"   # parallel worker processes per file

mkdir -p "$OUTPUT_DIR"

for pdf in "$INPUT_DIR"/*.pdf; do
    [ -f "$pdf" ] || continue
    filename=$(basename "$pdf")
    output="$OUTPUT_DIR/$filename"
    echo "OCR processing: $filename"

    ocrmypdf -l "$OCR_LANG" \
        --jobs "$JOBS" \
        --deskew \
        --clean \
        --output-type pdfa \
        "$pdf" "$output" 2>&1 | tail -1
done

echo "OCR complete. Searchable PDFs in: $OUTPUT_DIR"
```

Run: `./batch_ocr.sh ./scans fra+eng`. The language variable is named `OCR_LANG` rather than `LANG` so the script does not clobber the shell's locale variable. The `--deskew` flag corrects tilted scans automatically, `--clean` removes background noise before OCR (it requires the `unpaper` package), and `--output-type pdfa` produces PDF/A files suited to long-term archival. Together these flags significantly improve both OCR accuracy and output quality.
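OCR is the slowest of these batch jobs, so interrupted runs are common. A re-run-friendly sketch of the same loop: the existence check skips files already produced, and OCRmyPDF's `--skip-text` flag leaves pages that already have a text layer alone instead of exiting with an error.

```bash
#!/bin/bash
# Resumable variant of the OCR loop: outputs that already exist are
# skipped, and --skip-text tolerates PDFs with an existing text layer.
INPUT_DIR="${1:-.}"
OUTPUT_DIR="${INPUT_DIR}/ocr_output"
mkdir -p "$OUTPUT_DIR"

for pdf in "$INPUT_DIR"/*.pdf; do
    [ -f "$pdf" ] || continue
    output="$OUTPUT_DIR/$(basename "$pdf")"
    if [ -f "$output" ]; then
        echo "Skipping $(basename "$pdf") (already done)"
        continue
    fi
    ocrmypdf --skip-text -l eng "$pdf" "$output"
done
echo "Resumable OCR pass complete."
```

Interrupt it at any point and re-run with the same arguments; it picks up where it left off.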
Batch Merge PDFs by Group on Linux
A common need is merging PDFs that belong together — for example, multiple chapters of a book, or monthly expense receipts that should be combined by month. This script merges files by filename prefix:

```bash
#!/bin/bash
# batch_merge_by_prefix.sh — Merge PDFs that share a filename prefix
# Files named 2026-01-*.pdf merge into 2026-01_merged.pdf, etc.

INPUT_DIR="${1:-.}"
OUTPUT_DIR="${INPUT_DIR}/merged"

mkdir -p "$OUTPUT_DIR"

# Find unique prefixes (everything before the first underscore)
declare -A groups
for pdf in "$INPUT_DIR"/*.pdf; do
    [ -f "$pdf" ] || continue
    filename=$(basename "$pdf")
    prefix=$(echo "$filename" | cut -d'_' -f1)
    groups["$prefix"]=1
done

for prefix in "${!groups[@]}"; do
    files=$(ls "$INPUT_DIR"/${prefix}_*.pdf 2>/dev/null | sort)
    count=$(echo "$files" | wc -l)
    echo "Merging $count files with prefix '$prefix'..."
    pdftk $files cat output "$OUTPUT_DIR/${prefix}_merged.pdf"
done

echo "Merge complete. Results in: $OUTPUT_DIR"
```

This pattern scales from merging a dozen files to processing entire archives of thousands of PDFs automatically.
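One caveat: the unquoted `$files` expansion in the merge loop breaks on filenames containing spaces. A bash-array variant of that second loop is safer; it is a drop-in replacement and reuses the `groups`, `INPUT_DIR`, and `OUTPUT_DIR` variables from the script above.

```bash
# Space-safe merge loop: bash arrays keep each filename intact as a
# single pdftk argument, even when it contains spaces.
for prefix in "${!groups[@]}"; do
    files=( "$INPUT_DIR/${prefix}"_*.pdf )
    [ -e "${files[0]}" ] || continue   # no matches for this prefix
    echo "Merging ${#files[@]} files with prefix '$prefix'..."
    pdftk "${files[@]}" cat output "$OUTPUT_DIR/${prefix}_merged.pdf"
done
```

The glob expands directly into the array, so no `ls` parsing is needed; glob results are already sorted lexically, which preserves the chapter/date ordering the script relies on.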
Frequently Asked Questions
How do I process PDFs in parallel on Linux to use multiple CPU cores?
Use GNU Parallel: `sudo apt install parallel`. Create the output directory first with `mkdir -p compressed`, then run: `parallel gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -sColorConversionStrategy=RGB -sOutputFile=compressed/{} {} ::: *.pdf`. GNU Parallel runs one Ghostscript job per CPU core by default, keeping all cores busy until every PDF is processed.
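If GNU Parallel is not available, plain `xargs -P` (part of findutils, installed everywhere) achieves the same fan-out. This sketch assumes the same `compressed/` output layout as the answer above:

```bash
# Parallel compression with xargs: -P sets the number of concurrent
# gs processes, -print0/-0 keep unusual filenames intact, and -r
# skips the run entirely when no PDFs are found.
mkdir -p compressed
find . -maxdepth 1 -name '*.pdf' -print0 |
    xargs -0 -r -P "$(nproc)" -I{} \
        gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
           -dPDFSETTINGS=/ebook -sOutputFile=compressed/{} {}
```

`nproc` reports the number of available cores, so the pipeline scales automatically to the machine it runs on.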
How do I track which PDFs have already been processed in a batch script on Linux?
Check if the output file already exists before processing: `if [ -f "$output" ]; then echo "Skipping $filename (already done)"; continue; fi`. Add this check at the start of your loop to make scripts resumable — if they're interrupted, they pick up where they left off.
Can I run PDF batch scripts as a cron job on Linux?
Yes. Add to crontab with `crontab -e`. Example to run batch compression every night at 2am: `0 2 * * * /home/user/scripts/batch_compress.sh /home/user/documents/incoming >> /home/user/logs/compress.log 2>&1`. Use absolute paths in cron jobs since cron has a minimal PATH.
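If a nightly run can occasionally outlast the gap before the next one (large batches sometimes do), wrap the script in `flock` so overlapping runs are impossible. The lock-file path here is just an example:

```bash
# -n makes the second invocation exit immediately instead of queueing
# behind the lock held by the still-running first one.
0 2 * * * /usr/bin/flock -n /tmp/batch_compress.lock /home/user/scripts/batch_compress.sh /home/user/documents/incoming >> /home/user/logs/compress.log 2>&1
```

Combined with the resumability check from the previous answer, a skipped night simply catches up on the next run.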
How do I handle errors gracefully in PDF batch scripts on Linux?
Check exit codes after each command and log failures: `if ! gs ... ; then echo "ERROR: Failed to compress $filename" >> error.log; continue; fi`. This skips failed files and logs them without aborting the entire batch. Review the error log after the run to investigate failures.
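Putting that together, a fault-tolerant batch loop looks like this sketch. The `gs` invocation stands in for the per-file work — swap in pdftk or ocrmypdf as needed — and the variable defaults are assumptions for illustration:

```bash
# Fault-tolerant batch loop: failures are counted and logged to
# error.log, and the loop moves on instead of aborting the batch.
INPUT_DIR="${INPUT_DIR:-.}"
OUTPUT_DIR="${OUTPUT_DIR:-./compressed}"
mkdir -p "$OUTPUT_DIR"
failed=0

for pdf in "$INPUT_DIR"/*.pdf; do
    [ -f "$pdf" ] || continue
    name=$(basename "$pdf")
    if ! gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
            -sOutputFile="$OUTPUT_DIR/$name" "$pdf" 2>/dev/null; then
        echo "ERROR: failed to compress $name" >> error.log
        failed=$((failed + 1))
        continue
    fi
done

echo "Batch finished with $failed failure(s)"
```

A nonzero count at the end tells you to open `error.log`; the rest of the batch is unaffected.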