Installing tesseract

Tesseract is used for OCR. In addition to installing the main package, you need to install one or more language packs – possibly even including English, which may not come by default.

Heroku

On the heroku-22 stack, we can successfully install tesseract 4.x (4.1.3) via the Aptfile with the heroku apt buildpack; tesseract 4 is included in the apt repo for Ubuntu 22. While there should always be some way to do a custom compilation, we were not able to figure out how to get tesseract 5 installed on heroku-22. (A custom apt repo did not work.)

The Aptfile has to include tesseract-ocr, the relevant language pack packages, and the dependency libarchive13, which for some reason wasn’t being picked up as a dependency by the heroku apt installation (this happens sometimes with the heroku apt buildpack).

tesseract-ocr
tesseract-ocr-eng
tesseract-ocr-deu
tesseract-ocr-fra
tesseract-ocr-spa
libarchive13

Note these are tesseract FAST models, not BEST models. The BEST models aren’t available as apt packages, so if we wanted them we’d have to copy the files in ourselves. That should be doable if we decide we want them.

Setting TESSDATA_PREFIX

The heroku config var TESSDATA_PREFIX needs to be set, so that the ENV variable in our heroku processes points at the correct tesseract data directory, which apparently is not where tesseract expects it to be after install.

On a heroku build that includes tesseract, you can find the data directory by logging into a dyno with heroku run bash and running find ~+ -iname tessdata. The result is probably the path used in this command:

heroku config:set TESSDATA_PREFIX=/app/.apt/usr/share/tesseract-ocr/4.00/tessdata
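
The lookup and the config var can be combined in one small script. This is only a sketch, assuming it is run inside a dyno (heroku run bash) from the app root; the helper name find_tessdata is made up for illustration, and the command it prints still has to be run from your local machine.

```shell
#!/usr/bin/env bash
# Sketch: find the tessdata directory and print the matching config:set command.
# find_tessdata is an illustrative helper name, not part of any tool.

find_tessdata() {
  # Print the first directory named "tessdata" (case-insensitive) under $1.
  find "$1" -type d -iname tessdata 2>/dev/null | head -n 1
}

tessdata_dir="$(find_tessdata "$PWD")"
if [ -n "$tessdata_dir" ]; then
  # Run the printed command from your local machine, not from the dyno.
  echo "heroku config:set TESSDATA_PREFIX=$tessdata_dir"
else
  echo "no tessdata directory found under $PWD" >&2
fi
```

If more than one tessdata directory exists, only the first match is printed, so check the result against the language files you expect to find there.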

This solution comes from: https://towardsdatascience.com/deploy-python-tesseract-ocr-on-heroku-bbcc39391a8d

Verify tesseract install

(These checks should also be covered by our system_env_spec)

In a heroku dyno with heroku run bash:

~ $ tesseract --version
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
~ $ tesseract --list-langs
List of available languages (5):
deu
eng
fra
osd
spa
~ $
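
The language check can also be scripted rather than read by eye. This is a rough sketch, not a replacement for system_env_spec; the helper name missing_langs is made up for illustration, and the language list is the one from the Aptfile above.

```shell
#!/usr/bin/env bash
# Sketch: report required languages missing from `tesseract --list-langs` output.
# missing_langs is an illustrative helper name, not part of tesseract.

missing_langs() {
  # $1: output of `tesseract --list-langs`; remaining args: required language codes.
  # Prints each required code that does not appear as a whole line, one per line.
  listing="$1"
  shift
  for lang in "$@"; do
    printf '%s\n' "$listing" | grep -qx "$lang" || echo "$lang"
  done
}

if command -v tesseract >/dev/null 2>&1; then
  missing="$(missing_langs "$(tesseract --list-langs 2>&1)" eng deu fra spa)"
  if [ -z "$missing" ]; then
    echo "all required languages present"
  else
    echo "missing languages: $missing" >&2
  fi
fi
```

Matching whole lines means the "List of available languages" header is never mistaken for a language code.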

On MacOS Development machine

The easiest way to get tesseract on your MacOS development machine is brew, but you wind up with a different version than production.

brew doesn’t keep old versions, so it will give you tesseract 5 while we have tesseract 4 on production. In addition, brew packages all the language packs as one brew package, so you get ALL tesseract languages.

It’s not ideal that dev tesseract doesn’t match production tesseract, but that’s what we have for now.
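
To make the mismatch visible on a development machine, a small check can compare the local major version against production. This is a sketch only; the helper name tesseract_major is made up for illustration, and the expected value 4 is taken from the Heroku section above.

```shell
#!/usr/bin/env bash
# Sketch: warn when the local tesseract major version differs from production's.
# tesseract_major is an illustrative helper name, not part of tesseract.

tesseract_major() {
  # $1: output of `tesseract --version`; prints the major version from its first line.
  printf '%s\n' "$1" | sed -n '1s/^tesseract \([0-9][0-9]*\).*/\1/p'
}

expected_major=4  # production version, per the Heroku section above

if command -v tesseract >/dev/null 2>&1; then
  local_major="$(tesseract_major "$(tesseract --version 2>&1)")"
  if [ "$local_major" != "$expected_major" ]; then
    echo "warning: local tesseract is $local_major.x, production is $expected_major.x" >&2
  fi
fi
```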