WW2CP.org update – OCR!?!

I had some time to get back to this project, so I decided to make some updates. The updates were not to the website itself, but to the data underlying it. I needed to show that I could take data (pages scanned as GIFs, really) available at the National Archives and read that data into a database. I pulled a page from here (https://www.archives.gov/research/military/ww2/army-casualties). It’s not really a random page; it’s a page of U.S. Army casualties from Los Angeles County.

I discovered I could not open the file using Python 3 and OpenCV because OpenCV could not read GIF files. (???) I was able to open the file using the Pillow module in Python 3. Pillow can also convert the file to PNG and save the copy in a different folder. Once the PNG was saved, I could read it with OpenCV and get some OCR output. The results were very bad, but it was a first try.
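Here is a minimal sketch of the convert-then-read step, assuming Pillow is installed. The function name gif_to_png and the folder names are placeholders for illustration, not names from the project, and the grayscale conversion is an assumption that seemed reasonable for OCR.

```python
from pathlib import Path

from PIL import Image


def gif_to_png(gif_path, out_dir):
    """Save the first frame of a GIF as a PNG in out_dir and return the new path."""
    gif_path = Path(gif_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    png_path = out_dir / (gif_path.stem + ".png")
    with Image.open(gif_path) as img:
        # GIFs are palette-based; convert to 8-bit grayscale before saving.
        img.convert("L").save(png_path, format="PNG")
    return png_path
```

The returned path can then be handed to cv2.imread, which reads PNG files without complaint.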

The page is laid out in three columns of casualties. I need to crop the page into three separate columns and read each column on its own. I also need to boost the contrast to see if that makes the text in the image easier to read. I’ll need to review Pillow to see how to change contrast, but it does not look hard.
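The crop and contrast steps could look something like the sketch below. It assumes the three columns are equal in width, which a real scan probably will not honor exactly, and the function names and the contrast factor of 2.0 are made up for illustration.

```python
from PIL import Image, ImageEnhance


def split_columns(img, n_columns=3):
    """Cut an image into n_columns vertical strips of (nearly) equal width."""
    width, height = img.size
    col_width = width // n_columns
    columns = []
    for i in range(n_columns):
        left = i * col_width
        # The last strip absorbs any leftover pixels from the integer division.
        right = width if i == n_columns - 1 else left + col_width
        columns.append(img.crop((left, 0, right, height)))
    return columns


def boost_contrast(img, factor=2.0):
    """Return a copy with contrast scaled by factor (1.0 leaves the image unchanged)."""
    return ImageEnhance.Contrast(img).enhance(factor)
```

Each strip returned by split_columns can be passed through boost_contrast before it goes to OCR. The column boundaries will likely need to be tuned by hand for each page.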

Progress!!!

Python updates and OpenCV

I recently updated my laptop to run Python 3.7.0. I forgot that the update knocks out all the Python virtual environments that expect Python 3.6. I’ve worked my way through the virtual environments, removing, recreating, and reactivating each one.

> rm -rf venv

> python3 -m venv venv

> source venv/bin/activate

I noticed that I don’t have a requirements.txt file to make it simple to reload Python modules. I’ve been creating that file as needed. I also noticed I was having trouble with loading the modules needed for an OpenCV project.

I had installed OpenCV with Homebrew and knew it was outdated, so I upgraded it (and its dependencies) with brew. I reloaded all the related Python modules, but still could not get the sample file to run.

> brew outdated

> brew upgrade opencv

Instructions here show full steps for installing OpenCV. Note step 6, where a symlink is created inside the virtual environment’s site-packages folder, pointing to the OpenCV library installed by Homebrew.

> cd venv/lib/python3.7/site-packages/

> ln -s /usr/local/opt/opencv3/lib/python3.7/site-packages/cv2.cpython-37m-darwin.so cv2.so

Note that file paths now show 3.7. If and when Python 3.8 or Python 4 show up, the paths will probably need to be updated to show the Python version.
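Rather than hard-coding 3.7, the version-specific pieces of those paths can be read from the running interpreter. This is only a sketch using the standard library; the variable names are mine, and the values printed depend on which interpreter runs it.

```python
import sys
import sysconfig

# Directory name used in lib/pythonX.Y paths, for example "python3.7".
version_dir = f"python{sys.version_info.major}.{sys.version_info.minor}"

# The site-packages folder of the interpreter (or virtual environment) in use.
site_packages = sysconfig.get_paths()["purelib"]

# Suffix for compiled extension modules, the part after "cv2" in the .so name.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

print(version_dir)
print(site_packages)
print(ext_suffix)
```

Run inside the activated virtual environment, this shows where the symlink has to live and what the compiled file is expected to be called.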

installation of OpenCV, finally

I mentioned in another post about my plans for WorldWarIICasualtyProject.org. In short, the U.S. National Archives has scanned pages out of books that list American casualties that took place in World War II. I was curious to find out more, but discovered that those records did not exist in searchable form. I thought it would be interesting to figure out how to scan them (gif files, really!) and read the data as OCR.

I decided to go ahead on my initial plan: use a Python OCR module to read the scan. However … the Python module I tracked down (pytesseract) also required PIL (another module, part of the dependencies) and strongly suggested I install the python science packages. I figured I would need them at some point, so I installed numpy, scipy, matplotlib, scikit-image, scikit-learn, ipython, and pandas. ( https://www.learnopencv.com/install-opencv3-on-macos/)

At this point, I paused. I found several pages that suggested OpenCV be installed with Homebrew. That’s not a big deal because I use Homebrew for python 2/3. It gets confusing here. At one time, OpenCV was kept in a specialized area named “homebrew/science” but was moved to “homebrew/core”. I’m told “homebrew/science” is empty, so there should be no reason to link to it. We’ll see.

Note: use ‘> brew tap’ to list all taps connected for homebrew

Also note: opencv3 does not exist anymore. I think it has been renamed to opencv. Opencv2 has been renamed ‘opencv@2’. … So confusing …

Then there’s the question of linking OpenCV to “… Homebrew Python’s site-packages directory”. What? See https://www.learnopencv.com/install-opencv3-on-macos/

I’m sticking with these instructions: https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/ except for the part where I tap into homebrew/science (it doesn’t exist any more) and I install opencv3. (It’s been renamed to opencv).

2018-02-21 update

I installed OpenCV through Homebrew. Lots of dependencies were installed. Interestingly enough, I can see OpenCV from the default Homebrew python3 install, but not from the virtual environment I created for custom work. In other words:

> python3
>>> import cv2
>>> cv2.__version__
'3.4.0'

However, when I go to the virtual environment set up for ww2cp, I don’t see it.

> source ww2venv/bin/activate
(ww2venv) > python3
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cv2'

So, following the instructions here: (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/), I set up a symbolic link between homebrew’s openCV install and the site-packages inside the ww2 venv folder.

Now it works!
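A quick way to check which interpreter can see cv2, and where it would be loaded from, is to ask the import system without actually importing the module. This is a sketch using only the standard library; where_is is a name I made up.

```python
import importlib.util
import sys


def where_is(module_name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# True when running inside a venv created with "python3 -m venv".
in_venv = sys.prefix != sys.base_prefix

print("inside a virtual environment:", in_venv)
print("cv2 would load from:", where_is("cv2"))
```

If the second line prints None inside the virtual environment but a path under Homebrew outside it, the symlink is missing or points to the wrong place.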

Flask validation OR that would have been nice to know!

I’m building some sample sites to display my Python skills. In one site, I’m building an app using Flask as the framework to serve the web pages. On one page, I need to enter a username that will be used to collect some data from another site. Before I started writing the function to collect and massage the data, I wanted to be sure I could capture the username entered. I thought, “No problem”. The Flask module has several examples and tutorials on how to build a form and validate the data entered. I set it up … and it did not work.

I wanted to set up the same form in two places: as a form field in the navigation bar and as a separate form page, in case there was a problem using the nav bar field. In the navigation bar, I set up a simple form with one field, the username. The form would send the HTTP request to a submit page, where the form would be validated. If validation passed, the user would be sent back to the original page with an additional message; if not (for example, someone clicked the submit button in the nav bar without filling in the field), the submit page would display the full form again. With both forms, I landed on the submit page even after filling out the field. The form data never validated on the submit page, so the submit page was simply redrawn. Why?

I thought there was something wrong with the HTML generated by the flask templates, so I reviewed that HTML. It was fine. I figured out how to include bootstrap classes into flask forms, so that review was helpful. However, I was still stuck on the submit page. I thought there was something wrong with the HTML names and ids of the forms and form fields, but no. Everything was named properly. I reviewed my flask tutorial and the FlaskForm and flask-wtf module documentation pages. My forms.py and views.py pages were set up properly. Wait … What’s this bit about CSRF (cross site request forgery) tokens included in flask forms? That’s nice to use eventually, but I’m doing local testing. Why should I care about CSRF checking?

It turns out that the hidden CSRF field generated by a flask form is what is used to validate the form itself. If the CSRF field is not included in the original form, the form will never validate. That would have been nice to know.

I found one line in the (old) tutorial I used from Miguel Grinberg that says “The SECRET_KEY setting is only needed when CSRF is enabled, and is used to create a cryptographic token that is used to validate a form.” That line does not appear in the sections describing form templates or form views, but in the top section describing configuration. I read nothing that confirms this in the Flask docs or the Flask-WTF docs. However, it appears to be true. The Flask-WTF CSRF token is required to validate the form. No CSRF token: no validation. A day and a half wasted.

nice to know: python imports

I am writing a Python command line app that cleans up data scraped from a web page and imports that data into associated tables inside a MySQL database. The top level of the directory holding the python files is getting crowded. I wanted to move the dependent modules into a directory and import those modules into the main file. I find out that I need to add that child directory into a sys.path list before I can import the modules. That’s good to know, but it’s too much for a command line script.
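For reference, the sys.path step is only a few lines. This is a sketch; add_import_dir and the folder name "helpers" are hypothetical, not the names used in my script.

```python
import sys
from pathlib import Path


def add_import_dir(directory):
    """Put a directory at the front of sys.path so modules inside it can be imported."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# In the main script, before importing the dependent modules:
#   add_import_dir(Path(__file__).parent / "helpers")
#   import some_module_in_helpers
```

Inserting at position 0 means the child directory wins over other locations if two modules share a name, which may or may not be what you want.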

See here for more info:

error on part 0 of Ansible install

I’ve mentioned that I decided to use Ansible for my server configuration management. This installation has finally bubbled up as the first item on my to-do list. I looked around to see how to install it. Since I already have Homebrew installed on my mac, I saw two options:

  1. > brew install ansible
  2. > pip3 install ansible

I did not know any reason why these would be different, so I went ahead and ran ‘> brew install ansible’. That was a mistake. Ansible lists Python 2 as a dependency, which was not included in my Homebrew installation, so Homebrew installed Python 2.7.13. I also have Python 3 running (3.6.2). I also found out there’s a third version of Python (2.7.10), which is part of the default installation on my laptop. Three versions of Python on one machine. Wonderful!

I created a /etc/ansible/hosts file and did a test ping, which returns UNREACHABLE. That makes no sense, but I think I have to use some command line options to use the correct account.

Also, I don’t know which version of ansible is running. I may want to remove the Homebrew version and figure out how to use the python3 version. What a mess.
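To sort out which interpreters are on the PATH, a few lines of standard-library Python will do. This is only a sketch; locate_interpreters and the list of command names are my own choices.

```python
import shutil
import sys


def locate_interpreters(names=("python", "python2", "python3")):
    """Map each command name to the first match on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


print("running under:", sys.executable)
print("version:", sys.version.split()[0])
for name, path in locate_interpreters().items():
    print(f"{name}: {path}")
```

This reports only the first match for each name on the PATH, so a shadowed install (such as the system Python hiding behind the Homebrew one) will not show up.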

After a day of thought, I decided to check a few things:

> ansible --version

> ansible 2.3.2.0 (good)

… python version = 3.6.2 (what? Why not 2.7.13 or 2.7.10? Interesting.)

I checked the documents page at ansible.com. They say that ansible can run with python3 in one of two ways:

  1. > python3 <path/to/ansible> localhost -m ping
  2. > ansible localhost -m ping -e 'ansible_python_interpreter=<path/to/python3>'

To see what happens, I also tried

  1. > ansible localhost -m ping

It turns out all three return good pings from localhost, so … I guess my installation is OK. Even so, next time, avoid using Homebrew for Ansible installations.

Server automation, part 0

My infrastructure is slowly getting bigger, in spite of everything I do. I decided to research the current configuration management tools to decide what would be useful to me and what a potential employer would find useful. I’m trying to avoid learning another language. Puppet and Chef were ideas I was considering, but the need to become familiar with another language (Ruby) made me look at other options. On the other hand, I need to relearn Ruby anyway if I’m going to use Capistrano for deployment to remote servers. One thing at a time.

So far, I’m going round and round with Ansible and Salt, both in the Python universe. I’ve also run across something called StackStorm, which could be a possibility. (IFTTT for servers OR “event-driven automation”.) I’ll need to look at that another time to make valid decisions about that.

So far, it looks like Ansible would be easier to use, except for one issue. I’m still unclear if Ansible is useful with Python 3. I know that Ansible 2.2+ does run with Python 3, but it’s unclear if any related Ansible modules I might use are also compatible with Python 3.

I should probably try out at least two applications. For now, I’m going to try out Ansible.