[Staffslug] OCR Script for PDF Pages

Saturday, 7 December 2013

Hi: Using Debian, I found I had several OCR packages. They're probably still on
DVD #2. I tried them on a big PDF file with illustrations. "Tesseract" was
best, but the result would still need manual editing with a spelling checker.
"gocr" and "ocrad" were available too. Note that with all of these you only get
the text. The illustrations were nowhere to be seen.
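
For scans that are already .jpg files, something like the following might be enough to try Tesseract. This is only a sketch: the function name "ocr_jpgs" is made up for illustration, and it assumes your tesseract build can read JPEG directly (otherwise convert the images to TIFF or PNG first).

```shell
#!/bin/bash
# Sketch: run tesseract over every .jpg scan in the current directory.
# "ocr_jpgs" is a hypothetical name, not part of any package.
ocr_jpgs() {
    local img
    for img in *.jpg
    do
        [ -e "$img" ] || continue      # nothing matched the glob
        # tesseract takes an input image and an output base name;
        # it appends ".txt" itself, so scan01.jpg gives scan01.txt.
        tesseract "$img" "${img%.jpg}"
    done
}

# Usage: cd into the directory holding the scans, then run
#   ocr_jpgs
```

The text files land next to the images, ready for the spelling checker.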

And looking again, I remembered I'd written a little bash script for OCR. The
reason for this is that you probably need to burst the PDF into separate pages
using "pdftk", then convert them into another format (using "convert") to get
the type of input file required by your OCR package.
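
The bursting step could look roughly like this. It is a sketch rather than the exact command I used: the function name "burst_pdf" and the "page_%04d.pdf" naming pattern are illustrative choices, and "big.pdf" is a placeholder file name.

```shell
#!/bin/bash
# Sketch: split one PDF into single-page PDFs with zero-padded names.
# "burst_pdf" and the output pattern are assumptions for illustration.
burst_pdf() {
    local input=$1
    # pdftk's "burst" operation writes one PDF per page. The %04d in the
    # output pattern gives page_0001.pdf, page_0002.pdf, ... so that a
    # shell glob lists the pages in reading order.
    pdftk "$input" burst output page_%04d.pdf
}

# Usage (placeholder file name):
#   burst_pdf big.pdf
```

Run it in an empty working directory so the page files don't get mixed up with anything else.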

======================================
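#!/bin/bash
# OCR every single-page PDF in the current directory with ocrad.
for page in *.pdf
do
    # Render the page as a 300 dpi bitmap, the input format ocrad wants.
    convert -density 300x300 "$page" "${page%pdf}pbm"
    # Write the recognised text to a .txt file named after the page.
    ocrad -o "${page%pdf}txt" "${page%pdf}pbm"
    # Remove only this page's intermediate bitmap.
    rm -- "${page%pdf}pbm"
done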
======================================

My script got edited several times, ending up with the "ocrad" version as you 
see, using PBM files as input. You need to run this within a directory full of 
PDF pages, numbered with leading zeros so the processing runs in the correct 
order. If you try to process a big PDF as one unit, you may get nowhere, 
whereas with separate pages you can see how your OCR is progressing.
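
If your pages came out numbered without leading zeros (so that page_10.pdf sorts ahead of page_2.pdf), a rename along these lines might help. Again just a sketch: the function name "pad_pages", the "page_" prefix and the four-digit width are assumptions, so adjust them to match your own file names.

```shell
#!/bin/bash
# Sketch: rename page_1.pdf, page_2.pdf, ... to page_0001.pdf, page_0002.pdf, ...
# "pad_pages", the "page_" prefix and the 4-digit width are illustrative.
pad_pages() {
    local f n new
    for f in page_*.pdf
    do
        [ -e "$f" ] || continue          # nothing matched the glob
        n=${f#page_}                     # strip the prefix
        n=${n%.pdf}                      # strip the extension
        case $n in
            ''|*[!0-9]*) continue ;;     # skip names that are not purely numeric
        esac
        # Drop existing leading zeros so printf does not read "08" as octal.
        while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]
        do
            n=${n#0}
        done
        new=$(printf 'page_%04d.pdf' "$n")
        [ "$f" = "$new" ] && continue    # already padded
        [ -e "$new" ] && continue        # never overwrite an existing file
        mv -- "$f" "$new"
    done
}

# Usage: run inside the directory of burst pages
#   pad_pages
```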

Yours truly: Frank Mitchell

On Friday 06 Dec 2013 14:25:26 Jonothon Nihill wrote:
...
 Does anyone have any recommendations for OCR-type software for Linux? It's
 something I've never really had to do under Linux before, but I have a few
 scanned documents in .jpg format I could do with editing.
