WW2CP.org update – OCR!?!

I had some time to get back to this project, so I decided to make some updates. The updates were not to the website itself, but to the data underlying it. I needed to show that I could take data available at the National Archives (pages scanned as GIFs, really) and read that data into a database. I pulled a page from here (https://www.archives.gov/research/military/ww2/army-casualties). It’s not really a random page; it’s a page of U.S. Army casualties from Los Angeles County.

I discovered I could not open the file using Python 3 and OpenCV because OpenCV could not read GIF files. (???) I was able to open the file using the Pillow module instead, convert it to PNG, and save the PNG into a different folder. Once the PNG was saved, I could read it with OpenCV and get some OCR output. The results were very bad, but it’s a first try.
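A minimal sketch of the GIF-to-PNG step, assuming Pillow is installed. The function name `gif_to_png` and the folder layout are made up for illustration, and converting to grayscale is my assumption, not something the workflow above requires.

```python
from pathlib import Path

from PIL import Image


def gif_to_png(gif_path, out_dir):
    """Convert one GIF scan to a PNG in out_dir and return the new path."""
    gif_path = Path(gif_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    png_path = out_dir / (gif_path.stem + ".png")
    with Image.open(gif_path) as scan:
        # GIFs are palette-based; grayscale ("L") is an assumed, OCR-friendly choice.
        scan.convert("L").save(png_path, format="PNG")
    return png_path
```

The returned path can then be handed to OpenCV (for example `cv2.imread(str(png_path))`) for the OCR step.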

The page is laid out in three columns of casualties. I need to crop the page into three separate columns and read each column on its own. I also want to boost the contrast to see if that makes the text in the image easier to read. I’ll need to review the Pillow documentation to see how to change contrast, but it does not look hard.
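Here is a rough sketch of the cropping and contrast steps using Pillow. It assumes the three columns are equal width, which may not hold for the real scans. The names `split_columns` and `boost_contrast` and the default factor of 2.0 are placeholders I picked for illustration.

```python
from PIL import Image, ImageEnhance


def split_columns(page, n_columns=3):
    """Crop a page image into n vertical strips of (nearly) equal width."""
    width, height = page.size
    step = width // n_columns
    columns = []
    for i in range(n_columns):
        left = i * step
        # The last strip absorbs any leftover pixels from the integer division.
        right = width if i == n_columns - 1 else left + step
        columns.append(page.crop((left, 0, right, height)))
    return columns


def boost_contrast(image, factor=2.0):
    """Return a grayscale copy with contrast scaled by factor (1.0 = unchanged)."""
    # Convert first: palette-mode images (as loaded from GIFs) are not a safe input here.
    return ImageEnhance.Contrast(image.convert("L")).enhance(factor)
```

Each strip from `split_columns` could be passed through `boost_contrast` and then to the OCR step. The right factor would have to be found by trial and error.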

Progress!!!

Python updates and OpenCV

I recently updated my laptop to run Python 3.7.0. I forgot that the update breaks all the Python virtual environments that expect Python 3.6. I’ve worked my way through the virtual environments, removing, recreating, and reactivating each one.

> rm -rf venv

> python3 -m venv venv

> source venv/bin/activate

I noticed that I don’t have a requirements.txt file to make it simple to reinstall Python modules. I’ve been creating that file as needed. I also noticed I was having trouble loading the modules needed for an OpenCV project.
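For the requirements file, the usual route is `pip freeze > requirements.txt` inside the activated environment. As a sketch of what that produces, the snippet below builds similar pinned lines from Python itself. It assumes Python 3.8 or later (where `importlib.metadata` is in the standard library), and `pinned_requirements` is a made-up helper name.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Print to stdout; redirect into requirements.txt if the output looks right.
    print("\n".join(pinned_requirements()))
```

Modules that were symlinked in by hand rather than installed with pip (such as a Homebrew cv2 build) may not show up in this list.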

I installed OpenCV with Homebrew and I knew it was outdated, so I upgraded it (and its dependencies) with brew. I reinstalled all the related Python modules, but still could not get the sample file to run.

> brew outdated

> brew upgrade opencv

Instructions here show full steps for installing OpenCV. Note step 6, where a symlink is created inside the virtual environment’s site-packages that points to the OpenCV library installed by Homebrew.

> cd venv/lib/python3.7/site-packages/

> ln -s /usr/local/opt/opencv3/lib/python3.7/site-packages/cv2.cpython-37m-darwin.so cv2.so

Note that the file paths now show 3.7. If and when Python 3.8 or Python 4 shows up, the paths will probably need to be updated to match the new Python version.
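To avoid guessing the version-specific parts of those paths, the running interpreter can report them. This is only a sketch using the standard library. The helper names are invented, and the Homebrew side of the path (under /usr/local/opt/...) still has to be checked by hand.

```python
import sys
import sysconfig
from importlib.machinery import EXTENSION_SUFFIXES


def version_tag():
    """Directory-style tag for the running interpreter, e.g. 'python3.7'."""
    return f"python{sys.version_info.major}.{sys.version_info.minor}"


def site_packages_dir():
    """Where this interpreter (or the active venv) installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


def cv2_binary_name():
    """File name a compiled cv2 module would typically have for this interpreter."""
    return "cv2" + EXTENSION_SUFFIXES[0]


if __name__ == "__main__":
    print(version_tag())
    print(site_packages_dir())
    print(cv2_binary_name())
```

Run inside the activated venv, the second line shows where the symlink should go and the third shows the kind of file name to look for in the Homebrew OpenCV folder.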

installation of OpenCV, finally

I mentioned in another post about my plans for WorldWarIICasualtyProject.org. In short, the U.S. National Archives has scanned pages out of books that list American casualties that took place in World War II. I was curious to find out more, but discovered that those records did not exist in searchable form. I thought it would be interesting to figure out how to take the scans (GIF files, really!) and read the data with OCR.

I decided to go ahead with my initial plan: use a Python OCR module to read the scan. However … the Python module I tracked down (pytesseract) also required PIL (another module, one of its dependencies) and strongly suggested I install the Python science packages. I figured I would need them at some point, so I installed numpy, scipy, matplotlib, scikit-image, scikit-learn, ipython, and pandas. (https://www.learnopencv.com/install-opencv3-on-macos/)

At this point, I paused. I found several pages that suggested installing OpenCV with Homebrew. That’s not a big deal because I use Homebrew for Python 2/3. It gets confusing here. At one time, OpenCV was kept in a specialized tap named “homebrew/science”, but it was moved to “homebrew/core”. I’m told “homebrew/science” is empty, so there should be no reason to tap it. We’ll see.

Note: use ‘> brew tap’ to list all the taps Homebrew is connected to

Also note: opencv3 does not exist anymore. I think it has been renamed to opencv. Opencv2 has been renamed ‘opencv@2’. … So confusing …

Then there’s the question of linking OpenCV to “… Homebrew Python’s site-packages directory”. What? See https://www.learnopencv.com/install-opencv3-on-macos/

I’m sticking with these instructions: https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/ except for the parts where they tap into homebrew/science (it doesn’t exist any more) and install opencv3 (it’s been renamed to opencv).

2018-02-21 update

I installed OpenCV through Homebrew. Lots of dependencies were installed. Interestingly enough, I can see OpenCV through the default Homebrew python3 install, but not in the virtual environment I created for custom work. In other words:

> python3
>>> import cv2
>>> cv2.__version__
'3.4.0'

However, when I go to the virtual environment set up for ww2cp, I don’t see it.

> source ww2venv/bin/activate
(ww2venv) > python3
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cv2'

So, following the instructions here (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/), I set up a symbolic link in the site-packages folder of the ww2 venv that points to Homebrew’s OpenCV install.

Now it works!
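A small check like the one below can show which interpreter sees a module and where it is loaded from. It is a sketch using only the standard library, and the helper names are my own. Note that `sys.base_prefix` reflects environments created with `python3 -m venv`; older virtualenv releases may behave differently.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when running inside an environment created by the venv module."""
    return sys.prefix != sys.base_prefix


def module_location(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("in venv:", in_virtualenv())
    print("cv2 found at:", module_location("cv2"))
```

Running it once with the Homebrew python3 and once inside ww2venv should make the difference visible: None before the symlink exists, and a path into the venv’s site-packages afterwards.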