Odd results with OpenCV

Continuing on the WW2CP project, I found some test gifs that look like casualty lists by county and state. I pulled down one sample page to see if I could get OpenCV to read the text on the page. No luck at first. The page was black text on gray paper in a gif, so I thought that fixing the brightness and contrast and cropping the page might help, but no.

Each page is organized into three sets of columns; each set lists last name, first name, middle initial, serial number, rank, and fate (KIA, DNB, etc.). After reviewing the Pillow documentation, I figured out how to crop the sets apart so I could work with one column group at a time. I also figured out how to change the brightness of the page (to make the gray background turn whiter) and the contrast (to make the letters look more like letters). I ran OpenCV on the modified pages, and I still could not get OCR to work properly.
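For what it’s worth, here is a minimal sketch of the cropping step with Pillow; the equal-thirds split and the file names are my assumptions, not measurements from the real scans:

from PIL import Image

# Hypothetical input: one scanned page, loaded as grayscale.
page = Image.open("page.gif").convert("L")
width, height = page.size

# Split the page into its three column groups. Equal thirds is an
# approximation; the real boundaries would be measured from the scans.
for i in range(3):
    box = (i * width // 3, 0, (i + 1) * width // 3, height)
    page.crop(box).save(f"column_{i}.png")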

I probably need to look at Natural Language Processing to make this work the way I want. It sounds interesting, but it could also be a rabbit hole of never-ending tasks. Apparently, I may need a corpus of names, I think. I should probably read a book first (Natural Language Processing with Python).

WW2CP.org update – OCR!?!

I had some time to get back to this project, so I decided to make some updates. My updates were not to the website, but to the data underpinning the website. I needed to show that I could take data (pages scanned as gifs – really) available at the National Archives and read that data into a database. I pulled a page from here (https://www.archives.gov/research/military/ww2/army-casualties). It’s not really a random page; it’s a page of U.S. Army casualties from Los Angeles County.

I discovered I could not open the file using Python 3 and OpenCV, because OpenCV can’t open GIF files. (???) I was able to open the file using the Pillow module from Python 3, convert it to a PNG, and resave it into a different folder. Once the file was saved, I could read it with OpenCV and get some OCR out of it. The results were very bad, but it’s a first try.
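A sketch of that conversion step (the file and folder names are made up, and the converted folder is assumed to exist):

from PIL import Image
import cv2

# OpenCV's imread can't decode GIFs, so convert with Pillow first.
Image.open("page.gif").convert("RGB").save("converted/page.png")

# Now OpenCV can read the page.
img = cv2.imread("converted/page.png")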

The page is laid out as three columns of casualties. I need to crop the page into three separate columns and read each column on its own. Additionally, I need to boost the contrast to see if that makes the text in the image easier to read. I’ll need to review Pillow to see how I can change contrast, but it does not look hard.
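From a first look at the docs, Pillow’s ImageEnhance module should cover it. A sketch, with enhancement factors that are pure guesses to be tuned by eye:

from PIL import Image, ImageEnhance

img = Image.open("column.png")  # hypothetical cropped column

# Factors above 1.0 brighten the image (washing out the gray
# background) and raise the contrast (darkening the letters).
img = ImageEnhance.Brightness(img).enhance(1.4)
img = ImageEnhance.Contrast(img).enhance(2.0)
img.save("column_clean.png")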

Progress!!!

Python updates and OpenCV

I recently updated my laptop to run Python 3.7.0. I forgot that the update knocks out all the Python virtual environments that expect Python 3.6. I’ve worked my way through the virtual environments: removed them, recreated them, and restarted them.

> rm -rf venv

> python3 -m venv venv

> source venv/bin/activate

I noticed that I don’t have a requirements.txt file to make it simple to reload Python modules, so I’ve been creating that file as needed. I also noticed I was having trouble loading the modules needed for an OpenCV project.
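Generating the file from inside an activated virtual environment is a one-liner, and reloading is its mirror image:

> pip freeze > requirements.txt

> pip install -r requirements.txt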

I installed OpenCV with Homebrew and I knew it was outdated, so I upgraded it (and other dependencies) with brew. I reloaded all the related Python modules, but still could not get the sample file to run.

> brew outdated

> brew upgrade opencv

The instructions here (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/) show the full steps for installing OpenCV. Note step 6, where a symlink is created inside the virtual environment pointing back to the OpenCV libraries installed by Homebrew:

> cd venv/lib/python3.7/site-packages/

> ln -s /usr/local/opt/opencv3/lib/python3.7/site-packages/cv2.cpython-37m-darwin.so cv2.so

Note that the file paths now show 3.7. If and when Python 3.8 or Python 4 shows up, the paths will need to be updated to match the new Python version.

Installation of OpenCV, finally

I mentioned in another post my plans for WorldWarIICasualtyProject.org. In short, the U.S. National Archives has scanned pages out of books that list American casualties of World War II. I was curious to find out more, but discovered that those records did not exist in searchable form. I thought it would be interesting to figure out how to take the scans (gif files, really!) and read the data via OCR.

I decided to go ahead with my initial plan: use a Python OCR module to read the scan. However … the Python module I tracked down (pytesseract) also required PIL (another module, part of the dependencies), and the guide I was following strongly suggested installing the Python science packages. I figured I would need them at some point, so I installed numpy, scipy, matplotlib, scikit-image, scikit-learn, ipython, and pandas. (https://www.learnopencv.com/install-opencv3-on-macos/)
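The basic pytesseract call is short. A sketch, assuming a readable page image (pytesseract also needs the Tesseract engine itself installed, e.g. via Homebrew):

from PIL import Image
import pytesseract

# image_to_string hands the image to the Tesseract engine and
# returns whatever text it manages to recognize.
text = pytesseract.image_to_string(Image.open("page.png"))
print(text)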

At this point, I paused. I found several pages that suggested OpenCV be installed with Homebrew. That’s not a big deal, because I already use Homebrew for Python 2/3. It gets confusing here, though. At one time, OpenCV was kept in a specialized tap named “homebrew/science” but was later moved to “homebrew/core”. I’m told “homebrew/science” is empty now, so there should be no reason to tap it. We’ll see.

Note: use “> brew tap” to list all the taps connected to Homebrew.

Also note: opencv3 does not exist anymore. I think it has been renamed to opencv. Opencv2 has been renamed ‘opencv@2’. … So confusing …

Then there’s the question of linking OpenCV to “… Homebrew Python’s site-packages directory”. What? See https://www.learnopencv.com/install-opencv3-on-macos/

I’m sticking with these instructions: https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/ except for the part where I tap into homebrew/science (it doesn’t exist anymore) and install opencv3 (it’s been renamed to opencv).

2018-02-21 update

I installed OpenCV through Homebrew. Lots of dependencies were installed. Interestingly enough, I can see opencv through the default Homebrew python3 install, but not in the virtual environment I created for custom work. In other words:

> python3
>>> import cv2
>>> cv2.__version__
'3.4.0'

However, when I go to the virtual environment set up for ww2cp, I don’t see it.

> source ww2venv/bin/activate
(ww2venv) > python3
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cv2'

So, following the instructions here (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/), I set up a symbolic link between Homebrew’s OpenCV install and the site-packages folder inside the ww2 venv.

Now it works!

Message digests OR I should have known that

I have lots of old SQL dumps stored in backups. I wanted a way to check whether I was storing the same files over and over again. I did not want to compare them line by line, because that would take too long. I remembered that message digests are a way to check whether a file has been tampered with. So, if I create message digests of two files that I think are the same, matching digests should (ideally) prove that they are the same.

By the way, what is a message digest? It’s “… a cryptographic hash function containing a string of digits created by a one-way hashing formula” (https://www.techopedia.com/definition/4024/message-digest). In other words, it is the result of sending a file or string through a one-way function and outputting the result. Ideally, it can be used to check whether a file has been modified. If two files are related but slightly different, they will generate two different message digests.

Back to digests. I like the idea of taking the SQL dumps and generating a message digest for each. However, I noticed that the dumps usually carry a timestamp, inside SQL comments, showing when the dump was created. That alone will produce a different digest for otherwise-identical dumps. Can I strip out the SQL comments and create a digest from what’s left?

It turns out that I can. It works nicely.

> grep --regexp="^--.*" <path-to-sql-dump>

shows all the SQL comments in the file

> grep --regexp="^[^--.*]" <path-to-sql-dump>

shows everything but the SQL comments. Pipe that result into a digest function

> grep --regexp="^[^--.*]" <path-to-sql-dump> | md5

shows the resulting digest using md5. Similarly, “openssl sha1”, “shasum”, “shasum -a 512”, “shasum -a 512224”, and “shasum -a 512256” will each generate a different digest, any of which can be used to compare the SQL statements in a dump file.
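The same comparison can also be scripted in Python with hashlib; a rough sketch (the file names and the choice of MD5 are mine):

import hashlib

def digest_without_comments(path):
    """MD5 digest of a SQL dump, skipping the '--' comment lines."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for line in f:
            if not line.startswith(b"--"):
                h.update(line)
    return h.hexdigest()

# Dumps that differ only in their timestamp comments should now
# produce matching digests.
print(digest_without_comments("dump1.sql") == digest_without_comments("dump2.sql"))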

I’m a little sad that “shasum” did not work completely. Normally it prints the file name after the digest, which would make it easy to store digests for later comparison. However, since the file is piped into the command, there is no file name; shasum prints a hyphen instead. I’m sure there’s a way to tack the name back on, though. Maybe something like this?

> grep --regexp="^[^--.*]" <path-to-sql-dump> | shasum; echo !!:2

then search for ‘- line-break’ and replace with ‘- ‘. … Maybe, maybe …

My adventure with AWS, part 1

I decided to try out Amazon AWS to find out what it could do for me. As it turns out, I also had a project (WorldWarIICasualtyProject.org, or ww2cp for short) that I wanted to host cheaply. Amazon AWS promises to do that.

First things first: I had to create an account. That’s not hard, provided you have a credit card handy. Amazon tests the card to see if it has money (in case it’s a debit card). I chose to use a debit card first because there are no default brakes on AWS spending. (If the debit card runs out of money, I assume AWS stops services, but am I going to test that?) You have to protect your accounts to prevent others from spending your money. You also have to watch what you decide to activate, because there are no auto-shutdowns if you spend too much. Live dangerously? Not really. Just figure out how to set billing alarms and stay on top of them.

AWS also recommends not using your root (or first) account for daily use, in case it gets compromised. I detoured over to IAM (Identity and Access Management) and created a separate account that I would use every day. Oddly, it’s possible to assign almost all root powers to any child account, so, once again, be careful.

Once I was satisfied with the child accounts, I started testing S3 (Simple Storage Service). S3 operates on the concept of “buckets” that hold pretty much everything. Amazon has built a pseudo-folder structure on top to allow some organization, but really, everything goes in one big bucket.

One cool thing about S3 buckets is that they get mirrored to other nodes within a region. The idea is that this should make it easier to pull the data from the bucket regardless of where a browser is within a region. This becomes important when using Route53.

Route53 is Amazon’s version of DNS. I bought ww2cp from NameCheap.com and used their DNS to point the domain at a placeholder while I figured out what I was going to do with the website. I discovered that I could have the S3 buckets I created earlier serve the website, provided I let Route53 handle the DNS. Coolness!

Weird fact: Route53 assigned four name servers to resolve ww2cp. When I used nslookup to check for the correct IP address for the website, I would get a revolving set of four “web” servers instead of the one (parking) IP address I used before. I bet that has to do with the S3 mirrors I mentioned above.

Setting up Route53 to handle DNS is not hard. (There once was a time when AWS documentation was cryptic and undecipherable. If you read the same docs often enough, they make sense.) Anyway, I set up Route53 to handle the DNS services required to make the S3 bucket host the files for the website. I updated the name server information over at NameCheap … and nothing happened. For some odd reason, my changes to the name servers at NameCheap kept reverting to their original settings. Eventually, I had to get NameCheap tech support involved, but the changes finally stuck.

Adventure with Laravel Homestead, part 1

I thought it would be interesting to set up a Laravel instance to try it out. I did have a new project I wanted to develop, so I thought, “How hard can it be to set up Laravel?” Well …

I have started using Vagrant to run some VMs (instead of VirtualBox by itself). I found the homestead box in the Vagrant Cloud, so I set that up to download and install.

> mkdir ~/Vagrant/laravel

> cd !!:$

> vagrant box add laravel/homestead

The box takes a long while to download, but it finally arrives. I start it and look around, but I don’t see any of the programs I’m told come with Homestead (mysql, nginx, etc.). That’s odd.

The Laravel Homestead documents also suggest cloning a repository. That did not make sense to me, since I had a running Vagrant box, but since nothing was happening, I thought, “Why not?” The clone downloads into the same folder as the Laravel Vagrantfile. I follow the instructions to set up a Homestead.yaml file and look inside the Homestead folder to see if it’s there. It is, along with lots of other stuff, including … another Vagrantfile … That’s weird.

I make some simple changes to Homestead.yaml and reload the original Vagrant box.

> vagrant reload --provision

I don’t see my changes. After some time, I wonder … What if I went inside the cloned folder and ran that Vagrantfile? It turns out that second Vagrantfile is the one that runs the homestead install. It seems weird to have a homestead box available in the Vagrant cloud and NOT have it be the one to use, but that’s what happened.

I make some changes in Homestead.yaml to set up a simple test site using the classic PHP test file.

<?php phpinfo(); ?>
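For reference, the Homestead.yaml mapping for a test site ends up looking roughly like this (a sketch; the exact paths are assumptions pieced together from the error message further down):

folders:
    - map: ~/code
      to: /home/vagrant/code

sites:
    - map: homestead.test
      to: /home/vagrant/code/test01/public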

Nothing displays. My first error message says “No input file specified”. I fixed that by pointing the map section inside Homestead.yaml to the correct folder. Next error: “403 Forbidden”. I thought it might be an nginx misconfiguration, but I did not want to mess around too much with that just yet. However, I did check /var/log/nginx/homestead.test-error.log and noticed something interesting:

“Unable to open primary script: /home/vagrant/code/test01/public/index.php (No such file or directory)”

OK. I guess I need to wrap the simple PHP test page inside a Laravel template to make it display. Not a big deal, but one more thing I need to figure out. That will be for next time.