Installing tesseract
Tesseract is used for OCR. In addition to installing the main package, you need to install one or more language packs – possibly even including English, which may not come by default.
Heroku
On the heroku-22 stack, we can successfully install tesseract 4.x (4.1.3) via the Aptfile with the heroku apt buildpack. tesseract 4 is included in the apt repo for Ubuntu 22.04. While there should always be some way to do a custom compilation, we were not able to figure out how to get a custom install of tesseract 5 onto heroku-22. (A custom apt repo did not work.)
The Aptfile has to include tesseract-ocr, plus the relevant language pack packages, plus the dependency libarchive13, which for some reason wasn’t being picked up as a dependency by the heroku apt installation (this happens sometimes with the heroku apt buildpack).
tesseract-ocr
tesseract-ocr-eng
tesseract-ocr-deu
tesseract-ocr-fra
tesseract-ocr-spa
libarchive13
Note these are tesseract FAST models, not BEST models. If you wanted the BEST models, you’d have to copy the files yourself; they aren’t available as apt packages. That should be doable, though, if we decide we want them.
Additionally we need to set TESSDATA_PREFIX
The heroku config var TESSDATA_PREFIX needs to be set, so that the ENV variable is present in our heroku processes and points at the correct tesseract data directory, which apparently is not where tesseract looks by default after this install.
On a heroku build that includes tesseract, you can find the data directory by logging into a dyno with heroku run bash and running: find ~+ -iname tessdata. The config var can then be set with that path, which is probably:
heroku config:set TESSDATA_PREFIX=/app/.apt/usr/share/tesseract-ocr/4.00/tessdata
This solution is from: https://towardsdatascience.com/deploy-python-tesseract-ocr-on-heroku-bbcc39391a8d
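The lookup described above can be wrapped in a small helper. This is only a sketch: the function name find_tessdata is hypothetical, and it restricts the search to directories, which the find command quoted earlier does not do.

```shell
# Print the first directory named "tessdata" (case-insensitive) under the given root.
# Defaults to the current directory when no root is passed.
find_tessdata() {
  find "${1:-.}" -type d -iname tessdata 2>/dev/null | head -n 1
}

# Inside a dyno (heroku run bash):
#   find_tessdata "$HOME"
# Then, from your own machine, set the config var to the printed path, e.g.:
#   heroku config:set TESSDATA_PREFIX=/app/.apt/usr/share/tesseract-ocr/4.00/tessdata
```

Note that there must be no space after the = sign, or the config var ends up empty.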
Verify tesseract install
(These checks should also be covered by our system_env_spec)
In a heroku dyno with heroku run bash:
~ $ tesseract --version
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
~ $ tesseract --list-langs
List of available languages (5):
deu
eng
fra
osd
spa
~ $
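If you want to script the language check rather than read the list by eye, something like the following might do. It is a sketch separate from our system_env_spec; the helper name missing_langs is hypothetical, and the required languages are simply the ones listed in the Aptfile.

```shell
# Read `tesseract --list-langs` output on stdin and print each required
# language code (given as arguments) that does not appear as its own line.
missing_langs() {
  installed=$(cat)
  for lang in "$@"; do
    printf '%s\n' "$installed" | grep -qx "$lang" || printf '%s\n' "$lang"
  done
}

# Usage in a dyno; prints nothing when every language pack is installed:
#   tesseract --list-langs 2>&1 | missing_langs eng deu fra spa
```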
On MacOS Development machine
The easiest way to get tesseract on your MacOS development machine is with brew, but you wind up with a different version of Tesseract than production. brew doesn’t keep old versions, so brew will give you tesseract 5 while we have tesseract 4 on production. In addition, brew packages all the language packs as one brew package, so you get ALL tesseract languages.
It’s not ideal that dev tesseract doesn’t match production tesseract, but that’s what we have for now.
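To make the mismatch visible on a dev machine, a check along these lines could help. It is a sketch: the helper name tesseract_major is hypothetical, it assumes the first line of the version output has the form "tesseract X.Y.Z" as in the dyno output above, and the expected major version 4 is taken from the production setup described here.

```shell
# Extract the major version number from `tesseract --version` output read on stdin.
# Prints nothing if the first line does not look like "tesseract X.Y...".
tesseract_major() {
  sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

# Usage on a dev machine:
#   major=$(tesseract --version 2>&1 | tesseract_major)
#   [ "$major" = "4" ] || echo "warning: local tesseract major version is '$major', production is 4"
```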