Tesseract is an open-source, idiot-proof command line OCR engine (if there ever was one). Unfortunately it has no useful front ends on the Mac (brrrrr... that word is like people referring to any kind of dress-up as "vanity"), which makes it less useful in practice. So I made not one but two, though they share the same code base. They are a pretty convenient way to process scanned, photographed or downloaded images of text.
This package consists of two AppleScripts. One is designed for iPhoto: you select it in the scripts menu and it runs OCR on the selected photos. The other is a straightforward droplet which will process any image file dropped on it. Since you can scan into iPhoto using Image Capture, and your photos end up there anyway, this makes a pretty comprehensive OCR workflow. The code is BSD licensed, so go ahead and adapt it to other apps if you like. The download is at the end of the page.
- Select images in iPhoto containing text to be recognized
- Run script from scripts menu
- Choose a language (or "unknown", although this will yield suboptimal results)
- In the background, the script will convert images and pass them to tesseract
- After a brief wait, a text file for each image will appear in the Desktop folder
The droplet works the same way: you just drag and drop one or more image files on it.
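The steps above can be sketched in Python. This is not the AppleScript shipped in the package, only an illustration under a few assumptions: `sips` (the image tool that ships with macOS) stands in for the scripts' QuickTime-based conversion, the `LANGUAGES` table and all function names are hypothetical, and a language code only works if the matching Tesseract language data is installed.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

# Hypothetical menu: display name -> Tesseract language code.
# Which codes actually work depends on the language data you installed.
LANGUAGES = {"English": "eng", "German": "deu", "French": "fra"}


def tiff_command(image: Path, tiff: Path) -> List[str]:
    """Build a sips call that converts an image to TIFF (assumed stand-in)."""
    return ["sips", "-s", "format", "tiff", str(image), "--out", str(tiff)]


def ocr_command(tiff: Path, output_base: Path, language: Optional[str]) -> List[str]:
    """Build the tesseract call; tesseract itself appends .txt to output_base."""
    command = ["tesseract", str(tiff), str(output_base)]
    if language is not None:
        # Leaving out -l keeps Tesseract on its default language.
        command += ["-l", language]
    return command


def ocr_image(image: Path, out_dir: Path, language: Optional[str] = None) -> Path:
    """Convert one image, run OCR on it, and return the path of the text file."""
    tiff = out_dir / (image.stem + ".tif")
    output_base = out_dir / image.stem
    subprocess.run(tiff_command(image, tiff), check=True)
    subprocess.run(ocr_command(tiff, output_base, language), check=True)
    tiff.unlink()  # remove the temporary TIFF once the text exists
    return out_dir / (image.stem + ".txt")
```

With both tools installed, `ocr_image(Path("scan.jpg"), Path.home() / "Desktop", LANGUAGES["English"])` should leave a `scan.txt` on the Desktop, much like the scripts do.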
- Download and build Tesseract (requires minor command line savvy, easy-to-follow instructions here). Note that you can skip the part about extra image libraries, since the scripts convert any QuickTime-supported image, including PDF, into the .tif required by Tesseract.
- Place the droplet in the Applications folder and, if you want, drag an alias to the Dock or to your Finder toolbar
- Place the script in Library/Scripts/iPhoto
- Open Script Editor and go to its Preferences to enable the script menu (if necessary)
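If something does not work after installing, a quick check like the following may help narrow it down. It is only a sketch: the function name is hypothetical, and the example call assumes that "Library/Scripts/iPhoto" means the Library folder in your home directory.

```python
import shutil
from pathlib import Path
from typing import List


def missing_prerequisites(script_dir: Path, tool: str = "tesseract") -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # shutil.which looks the command up on PATH, like the shell would.
    if shutil.which(tool) is None:
        problems.append(f"{tool} not found on PATH")
    if not script_dir.is_dir():
        problems.append(f"script folder missing: {script_dir}")
    return problems


if __name__ == "__main__":
    # Assumed location of the iPhoto scripts folder (per-user Library).
    folder = Path.home() / "Library" / "Scripts" / "iPhoto"
    for problem in missing_prerequisites(folder):
        print(problem)
```

Note that a command which works in Terminal may still be missing from the PATH that a script sees, so an empty result here does not guarantee the scripts will find Tesseract.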
Please leave a comment if you find bugs or if I omitted anything that blocks you, so I can make an update!
Tesseract scripts r3.zip (94.49 KB)