Batch OCR from iPhoto with Tesseract


Tesseract is an open source, idiot-proof command line OCR engine (if there ever was one). Unfortunately it has no useful front ends (brrrrr...that word is like people referring to any kind of dress-up as "vanity") on the Mac, which makes it less useful in practice. So I made not one but two, though they share the same code base. They are a pretty convenient way to process scanned, photographed or downloaded images of text.

This package consists of two AppleScripts. One is designed for iPhoto: you select it in the scripts menu and it runs OCR on the selected photos. The other is a straightforward droplet that processes any image file dropped on it and otherwise works the same way. Since you can scan into iPhoto using Image Capture, and your photos end up there anyway, this makes a pretty comprehensive OCR workflow. The code is BSD licensed, so go ahead and adapt it to other apps if you like. Download at the end of the page.

Usage

  1. Select images in iPhoto containing text to be recognized
  2. Run script from scripts menu
  3. Choose a language (or "unknown", although this will yield suboptimal results)
  4. In the background, the script will convert images and pass them to tesseract
  5. After a brief wait, a text file for each image will appear in the Desktop folder
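
For readers who want to adapt the workflow, here is a minimal sketch of what steps 4 and 5 amount to, written in Python rather than the AppleScript the package actually uses. It assumes Tesseract's usual `tesseract <image> <output base> -l <language>` command line form. The function names, the small language table and the handling of "unknown" (simply leaving out `-l`) are illustrative assumptions, not taken from the scripts.

```python
from pathlib import Path
import subprocess

# Hypothetical menu of languages mapped to Tesseract language codes.
LANGUAGES = {"English": "eng", "Swedish": "swe"}


def tesseract_command(image, out_dir, language=None):
    """Build the argument list for running Tesseract on one image.

    Tesseract appends ".txt" to the output base name by itself,
    so the base is passed without an extension.
    """
    image = Path(image)
    out_base = Path(out_dir) / image.stem
    cmd = ["tesseract", str(image), str(out_base)]
    if language is not None:
        # Assumption: "unknown" is modelled as language=None, i.e. no -l flag.
        cmd += ["-l", LANGUAGES[language]]
    return cmd


def ocr_batch(images, out_dir=None, language=None):
    """Run Tesseract once per image; one text file per image lands in out_dir."""
    out_dir = Path(out_dir) if out_dir is not None else Path.home() / "Desktop"
    for image in images:
        subprocess.run(tesseract_command(image, out_dir, language), check=True)
```

Calling `ocr_batch(["page1.tif", "page2.tif"], language="Swedish")` would, under these assumptions, leave `page1.txt` and `page2.txt` in the Desktop folder.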

The droplet works the same way: just drag and drop any image file(s) on it.

Installation

  1. Download and build Tesseract (requires minor command line savvy; easy-to-follow instructions here). Note that you can skip the part about extra image libraries: the scripts convert any QuickTime-supported image, including PDF, into the .tif required by Tesseract.
  2. Place the droplet in the Applications folder and, if you want, drag an alias to the Dock or to your Finder toolbar
  3. Place the script in Library/Scripts/iPhoto
  4. Open Script Editor and use its Preferences to enable the script menu (if necessary)
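
The conversion to .tif mentioned in step 1 is done inside the scripts themselves. If you want to reproduce that step outside AppleScript, one option is the `sips` tool that ships with Mac OS X. The sketch below is a swapped-in technique, not the method the scripts use, and the function names and work folder are hypothetical.

```python
from pathlib import Path
import subprocess


def sips_to_tiff_command(image, work_dir):
    """Return (argv, target) for converting one image to TIFF with sips."""
    image = Path(image)
    # Keep the original base name and only change the extension.
    target = Path(work_dir) / (image.stem + ".tif")
    cmd = ["sips", "-s", "format", "tiff", str(image), "--out", str(target)]
    return cmd, target


def convert_to_tiff(image, work_dir):
    """Run the conversion and return the path of the TIFF that was written."""
    cmd, target = sips_to_tiff_command(image, work_dir)
    subprocess.run(cmd, check=True)
    return target
```

The returned path can then be handed to Tesseract in place of the original image.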

Please comment on bugs or if I omitted anything road-blocking, so I can make an update!

Attachment                  Size
Tesseract scripts r3.zip    94.49 KB

Online OCR Services

Though not as good as commercial software, some online OCR services are not bad, for example the site free ocr.

Installation went just fine, but…

…but I've had really poor results from Tesseract. It seems like photos have to be straightened and contrast-adjusted beforehand in order to OCR properly (which is not the case with most people's iPhoto libraries). I was looking at this as a better version of Evernote. I'll post again if I can find a better OCR engine to integrate into your script. Also, I've adjusted the script to save the OCR text into the image comments instead of saving a text file, for Spotlight purposes…

That's interesting, in my

That's interesting; in my tests I found that adjusting contrast beforehand did nothing. Straightening might be worthwhile: I have some bent (photographed) pages, and they sometimes come out garbled. Another thing is that, not surprisingly, choosing the right language makes a significant difference. I've had really useful results from the Swedish texts I've processed. They are not perfect, but I can't imagine English would be less well supported. I'm sure some commercial OCRs are better, just like in every other category…

The comment thing is quite a clever improvement. Do you post your mods somewhere?

Something good

GoodOCR.com is an online optical character recognition application for English. The only limitation is that the uploaded file cannot exceed 2 MB, which is more than enough for family use!