I updated to Mojave and remembered (again) that /etc/apache2/httpd.conf gets reset. This means Apache may still serve local web pages, but local PHP sites can stop working. As always, I follow the instructions here to review how to get PHP working again. It’s an easy fix, and PHP runs again. What’s new? PHP 7 is now the default on the local machine. Hmm … I know I’m behind. It’s been a busy year.
Continuing on the WW2CP project, I found some test GIFs that look like casualty lists by county and state. I pulled down one sample page to see if I could get OpenCV to read its text. No luck at first. The page was black text on gray paper in a GIF, so I thought adjusting the brightness and contrast and cropping the page might help, but no.
The pages are organized as sets of columns listing last name, first name, middle initial, serial number, rank, and fate (KIA, DNB, etc.). Three of these column sets are laid out side by side on each page. After reviewing the Pillow documentation, I figured out how to crop the page down to one column set at a time. I also figured out how to change the brightness of the page (to make the gray background whiter) and the contrast (to make the letters look more like letters). I ran OpenCV on the modified pages and still could not get OCR to work properly.
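The crop arithmetic itself needs no imaging library: given the page size, compute the three (left, upper, right, lower) boxes and hand each one to Pillow’s Image.crop. A minimal sketch; the equal-width split and the sample page size are my assumptions, not measurements from the actual scans:

```python
# Sketch: compute crop boxes for a page laid out as three equal-width
# column sets, suitable for passing to Pillow's Image.crop((l, u, r, b)).
# The equal-width assumption is mine; real scans may need hand-tuned margins.

def column_boxes(width, height, columns=3):
    """Return one (left, upper, right, lower) box per column set."""
    col_width = width // columns
    boxes = []
    for i in range(columns):
        left = i * col_width
        # Let the last column absorb any rounding remainder.
        right = width if i == columns - 1 else (i + 1) * col_width
        boxes.append((left, 0, right, height))
    return boxes

if __name__ == "__main__":
    # A 1275x1650 page (a US-letter scan at 150 dpi) split three ways.
    for box in column_boxes(1275, 1650):
        print(box)
    # With Pillow, each box would then be cropped out via:
    #   column_image = page_image.crop(box)
```

Pillow counts pixels from the top-left corner, so upper is 0 and lower is the full page height for every column.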
I probably need to look at Natural Language Processing to make this work the way I want. It sounds interesting, but it could also be a rabbit hole of never-ending tasks. Apparently I may need a corpus (?) of names. I should probably read a book first (Natural Language Processing with Python).
I had some time to get back to this project, so I decided to make some updates. My updates were not to the website, but to the data underlying the website. I needed to show that I could take data (pages scanned as gifs – really) available at the National Archives and read that data into a database. I pulled a page from here (https://www.archives.gov/research/military/ww2/army-casualties). It’s not really a random page — it’s a page of U.S. Army casualties from Los Angeles County.
I discovered I could not open the file using Python 3 and OpenCV, because OpenCV can’t open GIF files. (???) I was able to open the file using the Pillow module in Python 3. I could also convert the file to a PNG and save it into a different folder. Once the file was saved, I could read it with OpenCV and get some OCR. The results were very bad, but it’s a first try.
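The conversion step might look something like the sketch below, assuming Pillow is installed. The helper name and the commented file path are made up for illustration; only the GIF-to-PNG re-encoding reflects what the post describes.

```python
# Sketch: re-encode a GIF scan as PNG with Pillow so OpenCV can read it.
# OpenCV's imread() does not decode GIFs, but Pillow does.
import io

from PIL import Image


def gif_to_png_bytes(gif_bytes):
    """Decode GIF bytes with Pillow and re-encode them as PNG bytes."""
    img = Image.open(io.BytesIO(gif_bytes)).convert("RGB")
    out = io.BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()


if __name__ == "__main__":
    # Build a tiny in-memory GIF so the sketch is self-contained.
    buf = io.BytesIO()
    Image.new("RGB", (2, 2), "white").save(buf, format="GIF")
    png = gif_to_png_bytes(buf.getvalue())
    # PNG files begin with the fixed 8-byte signature b"\x89PNG\r\n\x1a\n".
    print(png[:8])
    # Once the PNG is written to a folder, OpenCV can pick it up, e.g.:
    #   import cv2
    #   page = cv2.imread("converted/page.png")
```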
The page is set up to show three columns of casualties. I need to crop the page into three separate columns and read each column on its own. Additionally, I need to boost the contrast to see if that makes the text in the image easier to read. I’ll need to review Pillow to see how to change contrast, but it does not look hard.
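For what it’s worth, Pillow’s ImageEnhance module covers both adjustments. A rough sketch, with enhancement factors that are pure guesses to be tuned against the real scans:

```python
# Sketch: brighten a gray-background scan and then boost its contrast
# with Pillow. A factor of 1.0 leaves the image unchanged; the defaults
# below are guesses, not values taken from the actual casualty pages.
from PIL import Image, ImageEnhance


def clean_up(img, brightness=1.4, contrast=1.6):
    """Whiten the gray background, then sharpen letters against it."""
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    return img


if __name__ == "__main__":
    # A flat mid-gray image stands in for the scanned page.
    page = Image.new("L", (10, 10), color=128)
    brighter = clean_up(page, brightness=1.5, contrast=1.0)
    print(page.getpixel((0, 0)), brighter.getpixel((0, 0)))
```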
I recently updated my laptop to run Python 3.7.0. I forgot that the update knocks out all the Python virtual environments that expect Python 3.6. I worked my way through the virtual environments: removed them, recreated them, and reactivated them.
> rm -rf venv
> python3 -m venv venv
> source venv/bin/activate
I noticed that I don’t have a requirements.txt file to make it simple to reload Python modules (a 'pip freeze > requirements.txt' before the upgrade would have captured one). I’ve been creating that file as needed. I also noticed I was having trouble loading the modules needed for an OpenCV project.
I installed OpenCV with Homebrew and I knew it was outdated, so I upgraded it (and other dependent formulae) with brew. I reloaded all the related Python modules, but still could not get the sample file to run.
> brew outdated
> brew upgrade opencv
Instructions here show the full steps for installing OpenCV. Note step 6, where a symlink is created inside the virtual environment’s site-packages folder pointing back to the OpenCV libraries installed by Homebrew:
> cd venv/lib/python3.7/site-packages/
> ln -s /usr/local/opt/opencv3/lib/python3.7/site-packages/cv2.cpython-37m-darwin.so cv2.so
Note that the file paths now show 3.7. If and when Python 3.8 or Python 4 shows up, the paths will probably need to be updated to match the new Python version.
I mentioned in another post my plans for WorldWarIICasualtyProject.org. In short, the U.S. National Archives has scanned pages out of books that list American casualties from World War II. I was curious to find out more, but discovered that those records did not exist in searchable form. I thought it would be interesting to figure out how to take the scans (gif files, really!) and read the data via OCR.
I decided to go ahead with my initial plan: use a Python OCR module to read the scan. However … the Python module I tracked down (pytesseract) also required PIL (another module, pulled in as a dependency), and the instructions strongly suggested installing the Python science packages. I figured I would need them at some point, so I installed numpy, scipy, matplotlib, scikit-image, scikit-learn, ipython, and pandas. (https://www.learnopencv.com/install-opencv3-on-macos/)
At this point, I paused. I found several pages suggesting OpenCV be installed with Homebrew. That’s not a big deal because I use Homebrew for Python 2/3. It gets confusing here. At one time, OpenCV was kept in a specialized tap named “homebrew/science” but was later moved to “homebrew/core”. I’m told “homebrew/science” is now empty, so there should be no reason to tap it. We’ll see.
Note: use '> brew tap' to list all the taps connected to Homebrew
Also note: opencv3 does not exist anymore. I think it has been renamed to opencv. opencv2 has been renamed 'opencv@2'. … So confusing …
Then there’s the question of linking OpenCV to “… Homebrew Python’s site-packages directory”. What? See https://www.learnopencv.com/install-opencv3-on-macos/
I’m sticking with these instructions: https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/ except for the part where I tap into homebrew/science (it doesn’t exist anymore) and the part where I install opencv3 (it’s been renamed to opencv).
I installed OpenCV through Homebrew. Lots of dependencies were installed. Interestingly enough, I can see OpenCV through the default Homebrew python3 install, but not in the virtual environment I created for custom work. In other words, from the default python3, this works:
>>> import cv2
However, when I go to the virtual environment set up for ww2cp, I don’t see it.
> source ww2venv/bin/activate
(ww2venv) > python3
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cv2'
So, following the instructions here (https://robferguson.org/blog/2017/10/06/how-to-install-opencv-and-python-using-homebrew-on-macos-sierra/), I set up a symbolic link in the site-packages folder inside the ww2cp venv pointing to Homebrew’s OpenCV install.
Now it works!
I have lots of old SQL dumps stored in backups. I wanted a way to check whether I was storing the same files over and over again. I did not want to compare them line by line, because that would take too long. I remembered that message digests are a way to check whether a file has been tampered with. So, if I create message digests of two files that I think are the same, matching digests should (ideally) prove that they are identical.
By the way, what is a message digest? It’s “… a cryptographic hash function containing a string of digits created by a one-way hashing formula” (https://www.techopedia.com/definition/4024/message-digest). In other words, it is the result of sending a file or string through a one-way function and outputting the result. Ideally, it can be used to check whether a file has been modified. If two files are related but slightly different, they will generate two different message digests.
Back to digests. I like the idea of taking the SQL dumps and generating a message digest for each. However, I noticed that the dumps usually carry a timestamp, inside SQL comments, showing when the dump was created. That alone will produce a different digest every time. Can I remove the SQL comments and create a digest from what’s left?
It turns out that I can. It works nicely.
> grep --regexp="^--.*" <path-to-sql-dump>
shows all the SQL comments in the file
> grep --regexp="^[^--.*]" <path-to-sql-dump>
shows everything but the SQL comments. Pipe that result into a digest function
> grep --regexp="^[^--.*]" <path-to-sql-dump> | md5
shows the resulting digest using md5. Similarly, "openssl sha1", "shasum", "shasum -a 512", "shasum -a 512224" and "shasum -a 512256" will generate different digests, which can all be used to compare the SQL commands in a SQL dump file.
I’m a little sad that "shasum" did not work completely. When given a file, it adds the file name after the digest, which would make it easy to store the digests. However, since the file is piped into the command, there is no file name to print, just a hyphen at the end of the line. I’m sure there’s a way to tack the name on, though. Maybe something like?
> grep --regexp="^[^--.*]" <path-to-sql-dump> | shasum; echo !!:2
then search for '- line-break' and replace with '- '. … Maybe, maybe …
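For the record, the strip-then-digest idea can also live in one small Python script using the standard library’s hashlib, which sidesteps the missing-filename problem entirely since the script prints whatever label it likes. This is my sketch, not something from the original workflow; it keeps the same assumption as the grep above, namely that comment lines start with "--":

```python
# Sketch: digest a SQL dump while skipping "--" comment lines (which
# carry the dump's creation timestamp), so two dumps of identical data
# produce identical digests. Standard library only.
import hashlib


def dump_digest(text, algorithm="sha1"):
    """Hash every line that is not a SQL '--' comment."""
    h = hashlib.new(algorithm)
    for line in text.splitlines():
        if not line.startswith("--"):
            h.update(line.encode("utf-8") + b"\n")
    return h.hexdigest()


if __name__ == "__main__":
    dump_a = "-- Dump completed on 2018-01-01\nCREATE TABLE t (id INT);\n"
    dump_b = "-- Dump completed on 2019-06-15\nCREATE TABLE t (id INT);\n"
    # Same SQL, different timestamps: the digests match.
    print(dump_digest(dump_a) == dump_digest(dump_b))  # True
    # Unlike piping into shasum, the label at the end is ours to choose.
    print(dump_digest(dump_a), "backup-a.sql")
```

hashlib.new accepts the other algorithm names from the post as well ("md5", "sha512", and friends), so comparing across digest functions still works.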
I decided to try out Amazon AWS to find out what it could do for me. As it turns out, I also had a project (WorldWarIICasualtyProject.org, or ww2cp for short) that I wanted to host cheaply. Amazon AWS promises to do that.
First things first: I had to create an account. It’s not hard, provided you have a credit card handy. Amazon tests the card to see if it has money behind it (in case it’s a debit card). I chose to use a debit card first because there are no default brakes on spending. (If the debit card runs out of money, I assume AWS stops services, but am I going to test that?) You have to protect your accounts to prevent others from spending your money. You also have to watch what you decide to activate, because there are no auto-shutdowns if you spend too much money. Live dangerously? Not really. Just figure out how to set billing alarms and stay on top of them.
AWS also recommends not using your root (or first) account for daily use, in case it gets compromised. I detoured over to IAM (Identity and Access Management) and created a separate account for everyday use. Oddly, it’s possible to assign almost all root powers to any child account, so, once again, be careful.
Once I was satisfied with the child accounts, I started testing S3 (Simple Storage Service). S3 operates on the concept of “buckets” that hold pretty much everything. Amazon has built a pseudo-folder structure on top to allow some organization, but really everything goes into one big bucket.
One cool thing about S3 buckets is that they get mirrored to other nodes within a region. The idea is that this should make it easier to pull the data from the bucket regardless of where a browser is within a region. This becomes important when using Route53.
Route53 is Amazon’s version of DNS. I bought ww2cp from NameCheap.com and used their DNS to point at a placeholder while I figured out what I was going to do with the website. I discovered that the S3 buckets I created earlier could serve the website itself, provided I let Route53 handle the DNS. Coolness!
Weird fact: Route53 assigned four name servers to resolve ww2cp. When I used nslookup to check for the correct IP address for the website, I would get a revolving set of four “web” servers instead of the one (parking) IP address I used before. I bet that has to do with the S3 mirrors I mentioned above.
Setting up Route53 to handle DNS is not hard. (There was a time when AWS documentation was cryptic and undecipherable. If you read the same docs often enough, they make sense.) Anyway, I set up Route53 to provide the DNS records required to make the S3 bucket host the files for the website. I updated the name server information over at NameCheap … and nothing happened. For some odd reason, my changes to the name servers at NameCheap kept reverting to their original settings. Eventually, I had to get NameCheap tech support involved, but the name server changes finally stuck.