Hi: Using Debian, I found I had several OCR packages. They're probably still
on DVD #2. I tried them on a big PDF file with illustrations. "Tesseract" was
best, but the result would still need manual editing with a spelling checker.
"gocr" and "ocrad" were available too. Note that with all of these you only
get the text. The illustrations were nowhere to be seen.
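
For scanned images such as JPEGs, Tesseract can usually be run on the image
directly. Here is a minimal sketch, assuming your Tesseract build can read JPEG
and has its default language data installed. The function name "ocr_image" and
the file names are made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: OCR one image with Tesseract.
# The text ends up in a .txt file named after the image.
ocr_image() {
    local image=$1
    local base=${image%.*}       # strip the extension: scan01.jpg -> scan01
    # tesseract takes an output base name and appends ".txt" itself
    tesseract "$image" "$base"
}

# Usage (from a directory of scans):
#   for f in *.jpg; do ocr_image "$f"; done
```

As with the other packages, this only gives you the text, so expect to tidy
the result by hand afterwards.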
And looking again, I remembered I'd written a little bash script for OCR. The
reason for this is that you probably need to burst the PDF into separate pages
using "pdftk", then convert them into another format (using ImageMagick's
"convert") to get the type of input file required by your OCR package.
======================================
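#!/bin/bash
# OCR every single-page PDF in the current directory with ocrad.
for page in *.pdf
do
    base=${page%.pdf}
    # Render the page at 300 dpi as a PBM bitmap, which ocrad can read.
    convert -density 300x300 "$page" "$base.pbm"
    ocrad -o "$base.txt" "$base.pbm"
    # Remove the intermediate bitmap once the text has been extracted.
    rm -- "$base.pbm"
done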
======================================
My script got edited several times, ending up with the "ocrad" version as you
see, using PBM files as input. You need to run this within a directory full of
PDF pages, numbered with leading zeros so the processing runs in the correct
order. If you try to process a big PDF as one unit, you may get nowhere,
whereas with separate pages you can see how your OCR is progressing.
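
For the bursting step, something like the following should produce pages with
leading zeros. It is only a sketch: "big.pdf", the "page_" prefix and the
function name "burst_pdf" are placeholders, not anything from my original
script.

```bash
#!/bin/bash
# Hypothetical helper: split a multi-page PDF into one PDF per page.
burst_pdf() {
    local input=$1
    # %04d zero-pads the page number (page_0001.pdf, page_0002.pdf, ...),
    # so that a glob such as *.pdf lists the pages in reading order.
    pdftk "$input" burst output page_%04d.pdf
}

# Usage:
#   burst_pdf big.pdf
```

Without the padding, page_10.pdf would sort before page_2.pdf. Note that
pdftk's burst also drops a small report file (doc_data.txt) in the current
directory, which you can delete.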
Yours truly: Frank Mitchell
On Friday 06 Dec 2013 14:25:26 Jonothon Nihill wrote:
Does anyone have any recommendations for OCR-type software for Linux? It's
something I've never really had to do under Linux before, but I have a few
scanned documents in .jpg format I could do with editing.
_______________________________________________
Staffslug mailing list
Staffslug(a)staffslug.org.uk
http://server.bytestream.co.uk/mailman/listinfo/staffslug