Odd results with OpenCV

Continuing on the WW2CP project, I found some test gifs that look like casualty lists by county and state. I pulled down one sample page to see if I could get OpenCV to read the text on the page. No luck at first. The page was black text on gray paper in a gif, so I thought that fixing the brightness and contrast and cropping the page might help, but no.

The pages are organized by sets of columns that list the last name, first name, middle initial, serial number, rank and fate (KIA, DNB, etc). These sets of columns are then organized as three sets of columns on a page. With review of the Pillow documentation, I figured out how to crop the columns to make them one column only. I also figured out how to change the brightness of the page (to make the gray background turn whiter) and the contrast (to make the letters look more like letters). I ran OpenCV on the modified pages and I still could not get OCR to work properly. 

I probably need to look at Natural Language Processing to make this work the way I want. It sounds interesting, but it could also be a rabbit hole of never-ending tasks. Apparently, I may need a corpus (?) of names, I think. I should probably read a book first. (Natural Language Processing with Python).

Advertisements

WW2CP.org update – OCR!?!

I had some time to get back to this project, so I decided to make some updates. My updates were not to the website, but to the data underscoring the website. I needed to show that I could take data (pages scanned as gifs – really) available at the National Archives and read that data into a database. I pulled a page from here (https://www.archives.gov/research/military/ww2/army-casualties). It’s not really a random page — it’s a page from U. S. Army casualties from Los Angeles County.

I discovered I could not open the file using Python 3 and OpenCV because it can’t open GIF files. (???) I was able to open the file using the Pillow module from Python 3. I can also convert the file to a PNG file and resave the file into a different folder. Once the file was saved, I could read the file using OpenCV and could get some OCR. The results were very bad, but it’s the first try.

The page is set up to show three columns of casualties. I need to crop the page to show three separate columns and read each column. Additionally, I need to boost the contrast to see if that makes the text in the image easier to read. I’ll need to review  Pillow to see how I can change contrast, but it does not look hard.

Progress!!!

Python updates and OpenCV

I recently updated my laptop to run Python 3.7.0. I forgot that updates knocks out all the Python virtual environments that expect Python 3.6. I’ve worked my way through the virtual environments, removed them, recreated them and restarted them.

> rm -rf venv

> python3 -m venv venv

> source venv/bin/activate

I noticed that I don’t have a requirements.txt file to make it simple to reload Python modules. I’ve been creating that file as needed. I also noticed I was having trouble with loading the modules needed for an OpenCV project.

i installed OpenCV with homebrew and I knew it was outdated, so I upgraded it (and other dependent files) with brew. I reloaded all the related Python modules, but still could not get the sample file to run.

> brew outdated

> brew upgrade opencv

Instructions here show full steps for installing OpenCV. Note step 6 where a symlink is set up to point from the openCV libraries installed by home-brew and pointing back to the virtual environment

> cd venv/lib/python3.7/site-packages/

> ln -s /usr/local/opt/opencv3/lib/python3.7/site-packages/cv2.cpython-37m-darwin.so cv2.so

Note that file paths now show 3.7. If and when Python 3.8 or Python 4 show up, the paths will probably need to be updated to show the Python version.

installation of OpenCV, finally

I mentioned in another post about my plans for WorldWarIICasualtyProject.org. In short, the U.S. National Archives has scanned pages out of books that list American casualties that took place in World War II. I was curious to find out more, but discovered that those records did not exist in searchable form. I thought it would be interesting to figure out how to scan them (gif files, really!) and read the data as OCR.

I decided to go ahead on my initial plan: use a Python OCR module to read the scan. However … the Python module I tracked down (pytesseract) also required PIL (another module, part of the dependencies) and strongly suggested I install the python science packages. I figured I would need them at some point, so I installed numpy, scipy, matplotlib, scikit-image, scikit-learn, ipython, and pandas. ( https://www.learnopencv.com/install-opencv3-on-macos/)

At this point, I paused. I found several pages that suggested OpenCV be installed with Homebrew. That’s not a big deal because I use Homebrew for python 2/3. It gets confusing here. At one time, OpenCV was kept in a specialized area named “homebrew/science” but was moved to “homebrew/core”. I’m told “homebrew/science” is empty, so there should be no reason to link to it. We’ll see.

Note: use ‘> brew tap’ to list all taps connected for homebrew

Also note: opencv3 does not exist anymore. I think it has been renamed to opencv. Opencv2 has been renamed ‘opencv@2’. … So confusing …

Then there’s the question of linking OpenCV to “… Homebrew Python’s site-packages directory”. What? See https://www.learnopencv.com/install-opencv3-on-macos/

I’m sticking with these instructions: https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/ except for the part where I tap into homebrew/science (it doesn’t exist any more) and I install opencv3. (It’s been renamed to opencv).

2018-02-21 update

I installed OpenCV through homebrew. Lots of dependencies were installed. Interestingly enough, I can see opencv through the default homebrew python3 install, but not in virtual environment I created for custom work. In other words:

> python3
>>> import cv2
>>> cv2.__version__
‘3.4.0’

However, when I go to the virtual environment set up for ww2cp, I don’t see it.

> source ww2venv/bin/activate
(ww2venv) > python3
>>> import cv2
Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
ModuleNotFoundError: No module named ‘cv2’

So, following the instructions here: (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/), I set up a symbolic link between homebrew’s openCV install and the site-packages inside the ww2 venv folder.

Now it works!

Flask validation OR that would have been nice to know!

I’m building some sample sites to display my python skills. In one site, I’m building a app using Flask as the framework to show the web pages. In this page, I need to enter a username that will be used to collect some data from another site. Before I start writing the function to collect and massage the data, I want to be sure I could capture the username entered. I think “No problem”. The Flask module has several examples and tutorials on how to build a form and validate the data entered. I set it up … and it does not work.

I want to set up the same form in two places: as a form field in a navigation bar and as a separate form page, in case there was a problem using the original form field. In the navigation bar, I set up a simple form with one field, the username. The form would send the HTTP request to a submit page, where the form would be validated and sent back to the original form page with an additional  message OR stop and display the full form again in the same separate page (in case someone clicked the submit button in the nav bar without filling in the form field). In both forms, I was sent to the submit page, even after filling out the form field. The form data was never validated in the submit page, so that submit page was redrawn. Why?

I thought there was something wrong with the HTML generated by the flask templates, so I reviewed that HTML. It was fine. I figured out how to include bootstrap classes into flask forms, so that review was helpful. However, I was still stuck on the submit page. I thought there was something wrong with the HTML names and ids of the forms and form fields, but no. Everything was named properly. I reviewed my flask tutorial and the FlaskForm and flask-wtf module documentation pages. My forms.py and views.py pages were set up properly. Wait … What’s this bit about CSRF (cross site request forgery) tokens included in flask forms? That’s nice to use eventually, but I’m doing local testing. Why should I care about CSRF checking?

It turns out that the hidden CSRF field generated by a flask form is what is used to validate the form itself. If the CSRF field is not included in the original form, the form will never validate. That would have been nice to know.

I found one line in the (old) tutorial I used from Miguel Grinberg that says “The SECRET_KEY setting is only needed when CSRF is enabled, and is used to create a cryptographic token that is used to validate a form.” That line does not appear in sections describing form templates or form views, but at the top section describing configuration. I read nothing that confirms this in the flask docs or the flask-wtf docs. However, it appears to be true. The flask CSRF token is required to validate the form. No CSRF token:  no validation. A day and a half wasted.

nice to know: python imports

I am writing a Python command line app that cleans up data scraped from a web page and imports that data into associated tables inside a MySQL database. The top level of the directory holding the python files is getting crowded. I wanted to move the dependent modules into a directory and import those modules into the main file. I find out that I need to add that child directory into a sys.path list before I can import the modules. That’s good to know, but it’s too much for a command line script.

See here for more info:

error on part 0 of Ansible install

I’ve mentioned that I decided to use Ansible for my server configuration management. This installation has finally bubbled up as the first item on my to-do list. I looked around to see how to install it. Since I already have Homebrew installed on my mac, I saw two options:

  1. > brew install ansible
  2. > pip3 install ansible

I did not know any reason why these would be different, so I went ahead and ran ‘> brew install ansible’. That was a mistake. Ansible lists Python 2 as a dependency, which is not included in my Homebrew installation. Homebrew installs Python 2.7.13. I also have Python 3 running (3.6.2). I also found out there’s a third version of Python (2.7.10) which is part of the default installation in my laptop. 3 versions of Python on one machine. Wonderful!

I created a /etc/ansible/hosts file and did a test ping, which returns UNREACHABLE. That makes no sense, but I think I have to use some command line options to use the correct account.

Also, I don’t know which version of ansible is running. I may want to remove the Homebrew version and figure out how to use the python3 version. What a mess.

After a day of thought, I decided to check a few things:

> ansible —version

> ansible 2.3.2.0 (good)

… python version = 3.6.2 (what? Why not 2.7.13 or 2.7.10? Interesting.)

I checked the documents page at ansible.com. They say that ansible can run with python3 in one of two ways:

  1. > python3 <path/to/ansible> localhost -m ping
  2. > ansible localhost -m ping -e ‘ansible_python_interpreter=<path/to/python3>

To see what happens, I also tried

  1. > ansible localhost -m ping

It turns out all return good pings from localhost, so … I guess my installation is OK. Even so, next time, avoid using Homebrew for ansible installations.