Confidential OCR

Posted on 20 November 2024 by John

A client emailed me a screenshot of a table rather than pasting the table as text into an email.

I thought about using an LLM to convert it to text, but the table is confidential client information and so I shouldn’t upload it anywhere.

I searched for a command line utility to do OCR and found tesseract. I installed it with

    sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

and ran it with the default settings

    tesseract screenshot.png textfile

It worked remarkably well. I had to change a C to a U and delete a few extraneous parentheses generated by the software, but otherwise I didn’t have to add or change any text.

I work locally in part out of habit; it was the only way to work when I started using a computer. It has numerous advantages, such as being able to keep working when a hurricane knocks out my internet connection, but above all it is private.

I pay more attention to privacy than is convenient because I work in data privacy. And aside from my privacy, I have to protect our clients’ privacy.

Update: According to the comments, ChatGPT uses tesseract. Assuming that’s true, using tesseract directly is better than ChatGPT because it does exactly what you want. No ambiguity as far as what’s expected. No potential for tinkering with your results before you see them.

Resolving a mysterious problem with find

Posted on 12 November 2024 by John

Suppose you want to write a shell script that searches the current directory for files that have a keyword in the name of the file or in its contents. Here’s a first attempt.

    find . -name '*.py' -type f -print0 | grep -i "$1"
    find . -name '*.py' -type f -print0 | xargs -0 grep -il "$1"

This works well for searching file contents but behaves unexpectedly when searching for file names.

If I have a file named frodo.py in the directory, the script will return

    grep: (standard input): binary file matches

Binary file matches?! I wasn’t searching binary files. I was searching files with names consisting entirely of ASCII characters. Where is a binary file coming from?

If we cut off the pipe at the end of the first line of the script and run

    find . -name '*.py' -type f -print0

we get something like

    ./elwing.py./frodo.py./gandalf.py

with no apparent non-ASCII characters. But if we pipe the output through xxd to see a hex dump, we see that there are invisible null characters after each file name.
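
If xxd isn’t installed, od can show the same thing. Here is a minimal sketch that fakes the output of find with printf (the two file names are hard-coded for illustration) and dumps it one character at a time.

```shell
# Simulate what find -print0 emits: each name followed by a null byte.
# The names are hard-coded here purely for illustration.
dump=$(printf './frodo.py\0./gandalf.py\0' | od -An -c)

# od -c prints one character per column and shows a null byte as \0.
printf '%s\n' "$dump"
```

The \0 entries in the dump are the invisible separators that make grep think it is reading binary data.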

One way to fix our script would be to add a -a option to the call to grep, telling it to treat the input as ASCII. But then grep prints the same run-together output shown above, because the output of find is treated as one long (ASCII) string, and that whole string matches the regular expression.

Another possibility would be to add a -o flag to direct grep to return just the match. But this is less than ideal as well. If you were looking for file names containing a Q, for example, you’d get Q as your output, which doesn’t tell you the full file name.

There may be better solutions [1], but my solution was to insert a call to strings in the pipeline:

    find . -name '*.py' -type f -print0 | strings | grep -i "$1"

This will extract the ASCII strings out of the input it receives, which has the effect of splitting the string of file names into individual names.

By default the strings command defines an ASCII string to be a string of 4 or more consecutive ASCII characters. A file with anything before the .py extension will necessarily have at least four characters, but the analogous script to search C source files would overlook a file named x.c. You could fix this by using strings -n 3 to find sequences of three or more ASCII characters.

If you don’t have the strings command installed, you could use sed to replace the null characters with newlines.

    find . -name '*.py' -type f -print0 | sed 's/\x0/\n/g' | grep -i "$1"

Note that the null character is denoted \x0 rather than simply \0.
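
Another option, offered as a sketch rather than as the fix suggested in the comments, is to let tr translate the null bytes into newlines. The helper name search_names and the temporary directory are made up for the demo.

```shell
# Throwaway directory with made-up file names, including a short one.
demo=$(mktemp -d)
touch "$demo/elwing.py" "$demo/frodo.py" "$demo/x.py"

# search_names DIR KEYWORD: list .py files under DIR whose path matches
# KEYWORD, ignoring case. tr turns each null byte into a newline, so grep
# sees one file name per line.
search_names() {
    (cd "$1" && find . -name '*.py' -type f -print0 | tr '\0' '\n' | grep -i "$2")
}

search_names "$demo" FRODO
```

Unlike strings, tr has no minimum length, so a short name such as x.py is not overlooked.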

[1] See the comments for better solutions. I really appreciate your feedback. I’ve learned a lot over the years from reader comments.

Data file character frequencies

Posted on 30 May 2023 by John

I have a little script that will print the frequency of the most common characters in a file and the number of lines. All numbers are displayed along with their factorizations. It also prints the number of non-ASCII characters.
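
The script itself isn’t shown here, but a rough stand-in is easy to build from standard tools. The sketch below is my guess at the general shape, not the actual script: char_freq is a hypothetical name, it counts bytes rather than Unicode characters, and it leaves out the factorizations (GNU coreutils has a factor command that could supply them).

```shell
# char_freq FILE: print the five most common bytes in FILE with their
# counts, then the number of lines. LC_ALL=C makes grep and sort work on
# bytes, so a multi-byte character is counted as several bytes.
char_freq() {
    LC_ALL=C grep -o . "$1" | LC_ALL=C sort | uniq -c | sort -rn | head -n 5
    printf 'Num lines: %s\n' "$(wc -l < "$1" | tr -d ' ')"
}

# Made-up sample data: three rows with two commas each.
sample=$(mktemp)
printf '1,2,3\n4,5,6\n7,8,9\n' > "$sample"
char_freq "$sample"
```

On the sample the comma comes out on top with a count of 6, which is twice the number of lines.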

CSV files

These simple statistics are surprisingly useful. For example, when I ran it on a CSV file that I downloaded recently I got the following.

    ,   397907424  =  2^5 3^3 19 24239 
    0   58200944  =  2^4 1699 2141 
    2   52955465  =  5 467 22679 
    1   46413310  =  2 5 23 201797 
    3   34811225  =  5^2 1392449 

    Num lines:  1745208  =  2^3 3^2 24239 

    All ASCII characters

This strongly implies that the CSV file really is a CSV (comma-separated value) file. Sometimes you’ll get a file with a .csv extension but the separator is a tab, a pipe, or some other character.

The number of commas is a multiple of the number of lines. That’s a good sign. Apparently this is a CSV file with 12×19 columns and 1,745,208 rows [1]. If the number of separators is not a multiple of the number of lines, maybe some lines are incomplete. Or maybe your file separator appears inside a quoted string. This is not necessarily a problem, but it means the most naive parsing won’t work.
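
When the comma count is not a multiple of the line count, it helps to see how the commas are spread over the lines. Here is a minimal sketch, assuming a naive split on every comma (so commas inside quoted strings are counted too). The helper name seps_per_line and the sample data are made up.

```shell
# Made-up sample: the last row is missing a field.
sample=$(mktemp)
printf 'a,b,c\n1,2,3\n4,5\n' > "$sample"

# seps_per_line FILE: for each number of commas, print how many lines
# have that many. A clean file produces a single row of output.
seps_per_line() {
    awk -F, '{ count[NF-1]++ } END { for (n in count) print n, count[n] }' "$1" | sort -n
}

seps_per_line "$sample"
```

Each output row is a comma count followed by the number of lines with that count, so more than one row means the lines disagree.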

In the file above, the most common characters, other than commas, are digits, so the file probably contains mostly numeric data.

If your file contains quotation marks, better hope it contains an even number. Even better, an even multiple of the number of lines. If not, you have some troubleshooting to do.

Incidentally, whenever the subject of CSV files comes up, someone will say “Why are you using CSV files?! Don’t you know there are better formats?” My reply is that I’m a consultant and I take data in whatever format I can get it, and that most often means a delimiter-separated text file. That works fine, except, of course, when it doesn’t.

Unicode characters

A file with lots of non-ASCII characters is not a problem. A file with one non-ASCII character very often is a problem.

A single non-ASCII character could be an invisible character that will gum up parsing. This can be maddening to find if you’re relying on visual inspection. But if you know there’s a non-ASCII character where it shouldn’t be, such as in a file of digits and commas, then you can simply delete it.
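
Here is one way to locate and delete stray bytes, sketched under the assumption that everything outside ASCII is unwanted, as in a file of digits and commas. The helper names and the sample file are hypothetical.

```shell
# Made-up sample: a zero-width space (UTF-8 bytes E2 80 8B) hides in row 2.
sample=$(mktemp)
printf '1,2\n3,\342\200\2134\n' > "$sample"

# find_non_ascii FILE: print the numbers of lines containing any byte
# outside printable ASCII and whitespace. LC_ALL=C makes grep work on bytes.
find_non_ascii() {
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1" | cut -d: -f1
}

# strip_non_ascii FILE: write FILE to standard output with every byte
# in the range 0x80 to 0xFF deleted.
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377' < "$1"
}

find_non_ascii "$sample"
strip_non_ascii "$sample"
```

Don’t run the second helper on a file that legitimately contains non-ASCII text; it would mangle every accented letter.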

JSON

If you’re inspecting a JSON file, you’d expect to see lots of braces. Hopefully you have an equal number of open and close braces. But if not, you know where to begin troubleshooting. You should also expect a lot of colons. Knowing the number of braces and colons gives you a clue to the structure of the file.
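
A quick brace count doesn’t need a JSON parser. The following is a minimal sketch with a made-up helper, count_char, that works for ordinary characters such as braces and colons. It counts characters inside string values as well, so treat the numbers as a hint rather than a verdict.

```shell
# count_char CHAR FILE: count occurrences of a single character.
# tr -cd deletes everything except CHAR, and wc -c counts what is left.
count_char() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Made-up sample with a missing closing brace.
sample=$(mktemp)
printf '{"a": {"b": 1}, "c": {"d": 2}\n' > "$sample"

printf 'open: %s  close: %s  colons: %s\n' \
    "$(count_char '{' "$sample")" "$(count_char '}' "$sample")" "$(count_char ':' "$sample")"
```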

Troubleshooting and guessing

When a file has no complications, the stats above tell you things you’d know from looking at the first line or two of the file. However, when there are complications, the stats can be useful.

The stats could also be useful in a context where it’s OK to make guesses. For example, you might have a script that guesses the structure of a file and proceeds accordingly. That’s fine when wrong guesses lead to obviously wrong output. It’s hard to imagine, for example, that mistaking an XML file for a CSV file would produce a subtle error.

[1] The number of fields is one more than the number of separators. Or maybe not: there may be a trailing separator after the last field. So in this case there may be 228 or 229 columns.

Arbitrary precision math in gawk

Posted on 6 December 2022 by John

The idea of using awk for any math beyond basic arithmetic is kinda strange, and yet it has some nice features.

Awk was designed for file munging, a task it does well with compact syntax. GNU awk (gawk) supports the original minimalist version of awk and adds more features. It supports arbitrary precision arithmetic by wrapping the GMP and MPFR libraries. (When I refer to “awk” in this post I mean the features common to awk and gawk.)

If you just want extended precision, it would be easier to use bc. But if you’ve written an awk script and want to do some extended precision calculation in place rather than in a different tool, you could take advantage of gawk’s extension.

Here’s a one-liner to compute π to 100 decimal places using gawk.

    gawk -M -v PREC=333 'BEGIN {printf("%0.100f\n", 4*atan2(1,1))}'

The -M flag tells gawk to use arbitrary precision arithmetic.

NB: Your version of gawk may not support arbitrary precision. If so, running the code above will give the following warning:

    gawk: warning: -M ignored: MPFR/GMP support not compiled in

The -v flag tells gawk you’re about to set a variable, namely the PREC variable for precision. You could set PREC in the body of your code instead.

Precision is measured in bits, not digits, and so for 100 digits we require 333 bits [1].

Awk is line-oriented. There’s an implicit loop around an awk program that loops over every line of input files. But you can specify code to run before the loop in a BEGIN block, and code to run after the loop in an END block. Here we just need the BEGIN block, no loop and no END.

We compute π as 4 times the arctangent of 1. Awk has an atan2 function rather than an atan function. The function atan(z) returns an angle between -π/2 and π/2 whose tangent is z. The function atan2(y, x) returns an angle between -π and π, in the same quadrant as x and y, whose tangent is y/x. The atan2 function is more general than atan since atan(z) equals atan2(z, 1). This is one advantage gawk has over bc, since bc has only an atan function (which it calls a).

***

[1] Where did 333 bits come from? It’s log₂ 10¹⁰⁰, rounded up to the next integer. You could compute the logarithm in awk using

    awk 'BEGIN {print 100*log(10)/log(2)}'

Note that this uses awk; you don’t need gawk for this calculation. Awk doesn’t have a function to compute log base 2, only natural logs. So you use the fact that

log₂(x) = log(x)/log(2).
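
The rounding up can be done in awk as well. Here is a small sketch, wrapped in a hypothetical shell function called bits_for_digits, that turns a number of decimal digits into a number of bits.

```shell
# bits_for_digits N: smallest whole number of bits that is at least
# N * log2(10). Plain awk has no ceiling function, so round up by hand.
bits_for_digits() {
    awk -v d="$1" 'BEGIN { b = d * log(10) / log(2); n = int(b); if (n < b) n++; print n }'
}

bits_for_digits 100
```

For 100 digits this prints 333, the value passed to PREC above.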

Exploring bad passwords

Posted on 16 November 2022 by John

If your password is in the file rockyou.txt then it’s a bad password. Password cracking software will find it instantly. (Use long, randomly generated passwords; staying off the list of worst passwords is necessary but not sufficient for security.)

The rockyou.txt file currently contains 14,344,394 bad passwords. I poked around in the file and this post reports some things I found.

To make things more interesting, I made myself a rule that I could only use command line utilities.

Pure numeric passwords

I was curious how many of these passwords consisted only of digits so I ran the following.

    grep -P '^\d+$' rockyou.txt | wc -l

This says 2,346,744 of the passwords only contain digits, about 1 in 6.

Digit distribution

I made a file of digits appearing in the passwords

    grep -o -P '\d' rockyou.txt > digits

and looked at the frequency of digits.

    for i in 0 1 2 3 4 5 6 7 8 9
    do 
        grep -c $i digits
    done

This is what I got:

The digits are distributed more evenly than I would have expected. 1’s are more common than other digits, but only about twice as common as the least common digits.
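
The loop above could also be replaced with sort and uniq, which count all ten digits in one pass. Below is a sketch on a tiny made-up password list standing in for rockyou.txt; digit_counts is a hypothetical helper, and it uses a bracket expression rather than -P so that it doesn’t depend on Perl regex support.

```shell
# Made-up stand-in for rockyou.txt.
pw=$(mktemp)
printf '123456\npassword1\nabc123\n111\n' > "$pw"

# digit_counts FILE: print each digit that occurs in FILE followed by
# the number of times it occurs.
digit_counts() {
    grep -o '[0-9]' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

digit_counts "$pw"
```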

Longest bad passwords

How long is the longest bad password? The command

    wc -L rockyou.txt

shows that one line in the file is 285 characters long. What is this password? The command

    grep -P '.{285}' rockyou.txt

shows that it’s some HTML code. Nice try whoever thought of that, but you’ve been pwned.

A similar search for all-digit passwords shows that the longest numeric passwords are 255 digits long. One of these is a string of 255 zeros.

Dictionary words

A common bit of advice is to not choose passwords that can be found in a dictionary. That’s good advice as far as it goes, but it doesn’t go very far.

I used the comm utility to see how many bad passwords are not in the dictionary by running

    comm -23 sorted dict | wc -l

and the answer was 14,310,684. Nearly all the bad passwords are not in a dictionary!

(Here sorted is a sorted version of the rockyou.txt file; I believe the file is initially sorted by popularity, worst passwords first. The comm utility complained that my system dictionary isn’t sorted, which I found odd, but I sorted it to make comm happy and dict is the sorted file.)

Curiously, the command

    comm -13 sorted dict | wc -l

shows there are 70,624 words in the dictionary (specifically, the american-english file on my Linux box) that are not on the bad password list.
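
Here is how the comm columns work on a toy example. The two word lists are made up. Sorting both files and running comm under the same locale avoids complaints about unsorted input; whether a locale mismatch is what happened with my dictionary is only a guess.

```shell
tmp=$(mktemp -d)
# Made-up stand-ins for rockyou.txt and the system dictionary.
printf 'password\n123456\ndragon\nqwerty\n' > "$tmp/bad"
printf 'apple\ndragon\npassword\nzebra\n'   > "$tmp/words"

# comm expects both inputs sorted in the collation order it uses itself,
# so sort and compare under the same locale (C here).
LC_ALL=C sort "$tmp/bad"   > "$tmp/sorted"
LC_ALL=C sort "$tmp/words" > "$tmp/dict"

only_bad=$(LC_ALL=C comm -23 "$tmp/sorted" "$tmp/dict")    # passwords not in the dictionary
only_dict=$(LC_ALL=C comm -13 "$tmp/sorted" "$tmp/dict")   # words not on the password list
both=$(LC_ALL=C comm -12 "$tmp/sorted" "$tmp/dict")        # lines in both files

printf 'only in passwords:\n%s\n' "$only_bad"
printf 'only in dictionary:\n%s\n' "$only_dict"
printf 'in both:\n%s\n' "$both"
```

The digits after the dash name the columns to suppress: column 1 is lines only in the first file, column 2 is lines only in the second, and column 3 is lines in both.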

Smallest ‘good’ numeric password

What is the smallest number not in the list of pure numeric passwords? The following command strips leading zeros from purely numeric passwords, sorts the results as numbers, removes duplicates, and stores the results in a file called nums.

    grep -P '^\d+$' rockyou.txt | sed 's/^0\+//' | sort -n | uniq > nums

The file nums begins with a blank line. I removed this with sed.

    sed -i 1d nums

Next I used awk to print instances where the line number does not match the line in the file nums.

    awk '{if (NR-$0 < 0) print $0 }' nums | less

The first number this prints is 61. This means that the first line is 1, the second line is 2, and so on, but the 60th line is 61. That means 60 is missing. The file rockyou.txt does not contain 60. You can verify this: the command

    grep '^60$' rockyou.txt

returns nothing. 60 is the smallest number not in the bad password file. There are passwords that contain ‘60’ as a substring, but just 60 as a complete password is not in the file.
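
The steps above can be rolled into one pipeline that prints the answer directly. This is a sketch on a made-up list, with a hypothetical helper called smallest_missing. It uses 0* in sed rather than 0\+ to stay portable, and it drops empty lines instead of deleting the first line afterward.

```shell
# Made-up stand-in for rockyou.txt.
pw=$(mktemp)
printf '3\n1\n02\nhello\n5\n2\n000\n' > "$pw"

# smallest_missing FILE: print the smallest positive integer that does
# not appear, after stripping leading zeros, among the all-digit lines.
smallest_missing() {
    grep '^[0-9][0-9]*$' "$1" | sed 's/^0*//' | grep -v '^$' | sort -n | uniq |
        awk '$0 != NR { print NR; found = 1; exit }
             END { if (!found) print NR + 1 }'
}

smallest_missing "$pw"
```

The sorted, de-duplicated numbers here are 1, 2, 3, 5, so the first line whose number disagrees with its content is line 4.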

Permutations at the command line

Posted on 25 January 2022 by John

Yesterday I wrote about how you could use the tr utility to find the dual of a modal logic expression. Today I’ll show how you could use tr to work with permutations.

This post is a hack, and that’s the fun of it.

In the logic post we were using tr to swap characters. More generally, tr can permute characters. As mentioned before, it’s key that replacements are conceptually simultaneous. Otherwise none of this would work.

Problem 1

Here’s a homework problem from Abstract Algebra by Dummit and Foote.

Let σ be the permutation

1 → 3, 2 → 4, 3 → 5, 4 → 2, 5 → 1

and let τ be the permutation

1 → 5, 2 → 3, 3 → 2, 4 → 4, 5 → 1.

Find the cycle decompositions of each of the following permutations: σ, τ, σ², στ, τσ, and τ²σ.

We won’t do all the parts of this problem, because that would be tedious. But we’ll do σ and στ.

A decomposition of a permutation factors the permutation into smaller permutations if possible. The algorithm to decompose a permutation starts by picking any element and finding how it cycles. Then it picks another element not in that cycle and finds its cycle. This repeats until we run out of elements. The original permutation is then equal to the product of these cyclic permutations.

Permutation σ

We might as well start with 1. Let’s see where it goes. In the problem above, σ sends the digits 12345 to the digits 34521.

    echo 1 | tr 12345 34521

This prints 3. No surprise: 1 goes to 3.

Now where does 3 go?

    echo 1 | tr 12345 34521 | tr 12345 34521

This says 5. And where does 5 go?

    echo 1 | tr 12345 34521 | tr 12345 34521 | tr 12345 34521

It goes back to 1. So our first cycle is

(1 3 5)

Now let’s pick a number not in this cycle, say 2.

    echo 2 | tr 12345 34521

So 2 goes to 4, and there’s no place left for 4 to go except back to 2. This means σ has the decomposition

(1 3 5)(2 4)

Permutation στ

Now let’s do στ. This means do τ then do σ. At least that’s how I’m reading it, like function composition; some folks use the opposite convention.

Composing permutations corresponds to piping one tr command into another. So we could see where 1 goes under στ as follows.

    echo 1 | tr 1235 5321 | tr 12345 34521

This returns 1, so 1 is a fixed point. So our decomposition starts with

(1).

OK, so what about 2? We could find out with

    echo 2 | tr 1235 5321 | tr 12345 34521 | tr 1235 5321 | tr 12345 34521

but our commands are getting kinda long, and could get longer still.

Another approach would be to compute στ by seeing what it does to all numbers, not just one at a time.

    echo 12345 | tr 1235 5321 | tr 12345 34521

returns 15423. So we could see where 2 goes using

    echo 2 | tr 12345 15423

which tells us 2 goes to 5. We could see where 5 goes by adding another copy of our pipe to tr on the end, or by changing echo 2 to echo 5. In any case 5 goes to 3, and 3 goes to 4.

So στ has the decomposition

(1)(2 5 3 4).

Problem 2

The next homework problem in Dummit and Foote asks for the decomposition of a permutation of size 15:

1 → 13, 2 → 2, 3 → 15, …

Our trick for using tr won’t work without some modification because

    tr 123… 13215…

would send 1 to 1, 2 to 3, 3 to 2, etc. However, we could do the problem by representing our numbers in hexadecimal.

    tr 123… D2F…

In fact, the full permutation sends the numbers 1 through F to D2FEA6C341795B8.

Let’s define a shell alias to make things easier:

    alias t='tr 123456789ABCDEF D2FEA6C341795B8'

Now we can find the cycle of 1 with the following commands.

    echo 1 | t
    echo 1 | t | t
    echo 1 | t | t | t
    echo 1 | t | t | t | t

which tells us 1 is part of the cycle

(1 D 5 A)

Similarly we can find that 2 is a fixed point and 3 belongs to the cycle

(3 F 8)

and the final decomposition is

(1 D 5 A)(2)(3 F 8)(4 E B 7 C 9)(6)
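
Typing ever longer pipelines gets old, so here is a sketch of a loop that follows one element around its cycle. The function name cycle is made up, and it assumes the two strings passed to tr really do define a permutation; otherwise the loop may never return to its starting point.

```shell
# cycle START FROM TO: apply tr FROM TO repeatedly, beginning at START,
# until START comes around again, and print the cycle in the usual notation.
cycle() {
    start=$1 from=$2 to=$3
    x=$start
    out="($x"
    while :; do
        x=$(printf '%s' "$x" | tr "$from" "$to")
        [ "$x" = "$start" ] && break
        out="$out $x"
    done
    printf '%s)\n' "$out"
}

cycle 1 12345 34521
cycle 2 12345 34521
cycle 4 123456789ABCDEF D2FEA6C341795B8
```

The first two calls reproduce the cycles of σ from Problem 1, and the third traces the cycle containing 4 in Problem 2.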

Recursive grep

Posted on 29 March 2021 by John

The regular expression search utility grep has a recursive switch -R, but it may not work like you’d expect.

Suppose you want to find the names of all .org files in your current directory and below that contain the text “cheese.”

You have four files, two in the working directory and two below, that all contain the same string: “I like cheese.”

    $ ls -R
    .:
    rootfile.org  rootfile.txt  sub
 
    ./sub:
    subfile.org  subfile.txt

It seems that grep -R can either search all files of the form *.org in the current directory, ignoring the -R switch, or search all files recursively if you don’t give it a file glob, but it can’t do both.

    $ grep -R -l cheese *.org
    rootfile.org
 
    $ grep -R -l cheese .
    ./rootfile.org
    ./rootfile.txt
    ./sub/subfile.org
    ./sub/subfile.txt

One way to solve this is with find and xargs:

    $ find . -name '*.org' | xargs grep -l cheese 
    ./rootfile.org                                                           
    ./sub/subfile.org

I was discussing this with Chris Toomey and he suggested an alternative using a subshell that seems more natural:

    grep -l cheese $(find . -name '*.org')

Now the code reads more like an ordinary call to grep. From left to right, it essentially says “Search for ‘cheese’ in files ending in .org” whereas the version with find reads like “Find files whose names end in .org and search them for ‘cheese.’” It’s good to understand how both approaches work.
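
Both versions above split file names on whitespace, so a file called my notes.org would trip them up. A third variation, sketched here on a throwaway copy of the directory layout above, has find run grep itself. (GNU grep also has an --include option for restricting a recursive search to a glob, which may be what you want if you know you have GNU grep.)

```shell
# Rebuild the four example files in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/sub"
for f in rootfile.org rootfile.txt sub/subfile.org sub/subfile.txt; do
    echo 'I like cheese.' > "$demo/$f"
done

# -exec ... {} + hands the file names to grep directly, many at a time,
# without passing them through the shell's word splitting.
found=$(cd "$demo" && find . -name '*.org' -type f -exec grep -l cheese {} + | sort)
printf '%s\n' "$found"
```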

A shell one-liner to search directories

Posted on 19 April 2020 by John

I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people might find the code to find the frequencies more interesting than the frequencies themselves.

So I’m splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.

I want to explore LaTeX files, so I’ll start by using find to find such files.

    find . -name "*.tex"

This searches for files ending in .tex, starting with the current directory (hence .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that’s not what you want.

Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep it will search the file names, but I want it to search the file contents. The xargs command takes care of this, receiving the file names and passing them along as file names, i.e. not as text input.

    find . -name "*.tex" | xargs grep ...

LaTeX commands have the form of a backslash followed by letters, so the regular expression I’ll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.

I’ll give grep four option flags. I’ll use -i to ask it to use case-insensitive matching, because LaTeX commands can contain capital letters. I’ll use -E to tell it I want to use extended regular expressions [1].

I’m after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that’s not enough. It would be enough if we were only searching one file, but since we’re searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to only return the matches, no file names.

So now we’re up to this:

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'

Next I want to count how many times each command occurs, and I need to sort the output first so that uniq will count correctly.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c

And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending order rather than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less

That’s my one-liner. In the next post I’ll look at the results.
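
To try the pipeline without a pile of LaTeX files at hand, here is a sketch that runs it on two tiny made-up files. It replaces less with head -n 1 so that only the most frequent command is kept.

```shell
# Two throwaway LaTeX fragments; the contents are invented for the demo.
demo=$(mktemp -d)
printf '%s\n' '\alpha + \beta = \alpha\gamma' '\alpha' > "$demo/a.tex"
printf '%s\n' '\beta' > "$demo/b.tex"

# Same pipeline as above, keeping only the first line of the result.
top=$(cd "$demo" && find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' |
      sort | uniq -c | sort -rn | head -n 1)

# printf rather than echo, since some shells' echo would interpret \a.
printf '%s\n' "$top"
```

One thing the demo makes easy to check: -i affects matching only, so \Alpha and \alpha would still be counted as two different commands by uniq.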

Doing a database join with CSV files

Posted on 31 December 2019 by John


It’s easy to manipulate CSV files with basic command line tools until you need to do a join. When your data is spread over two different files, like two tables in a normalized database, joining the files is more difficult unless the two files have the same keys in the same order. Fortunately, the xsv utility is just the tool for the job. Among other useful features, xsv supports database-like joins.

Suppose you want to look at weights broken down by sex, but weights are in one file and sex is in another. The weight file alone doesn’t tell you whether the weights belong to men or women.

Suppose a file weight.csv has the following rows:

    ID,weight
    123,200
    789,155
    999,160

and a file person.csv has the following:

    ID,sex
    123,M
    456,F
    789,F

Note that the two files have different ID values: 123 and 789 are in both files, 999 is only in weight.csv and 456 is only in person.csv. We want to join the two tables together, analogous to the JOIN command in SQL.

The command

    xsv join ID person.csv ID weight.csv

does just this, producing

    ID,sex,ID,weight
    123,M,123,200
    789,F,789,155

by joining the two tables on their ID columns.

The command includes ID twice, once for the field called ID in person.csv and once for the field called ID in weight.csv. The fields could have different names. For example, if the first column of person.csv were renamed Key, then the command

    xsv join Key person.csv ID weight.csv

would produce

    Key,sex,ID,weight
    123,M,123,200
    789,F,789,155

We’re not interested in the ID columns per se; we only want to use them to join the two files. We could suppress them in the output by asking xsv to select the second and fourth columns of the output

    xsv join Key person.csv ID weight.csv | xsv select 2,4

which would return

    sex,weight
    M,200
    F,155

We can do other kinds of joins by passing a modifier to join. For example, if we do a left join, we will include all rows in the left file, person.csv, even if there isn’t a match in the right file, weight.csv. The weight will be missing for such records, and so

    xsv join --left Key person.csv ID weight.csv

produces

    Key,sex,ID,weight
    123,M,123,200
    456,F,,
    789,F,789,155

Right joins are analogous, including every record from the second file, and so

    xsv join --right Key person.csv ID weight.csv

produces

    Key,sex,ID,weight
    123,M,123,200
    789,F,789,155
    ,,999,160

You can also do a full join, with

    xsv join --full Key person.csv ID weight.csv

producing

    Key,sex,ID,weight
    123,M,123,200
    456,F,,
    789,F,789,155
    ,,999,160
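
For comparison, here is roughly what the same job looks like with the standard join utility instead of xsv. This is a sketch under strong assumptions: no quoted fields, no embedded commas, and header rows that can simply be thrown away. The intermediate file names are made up.

```shell
tmp=$(mktemp -d)
printf 'ID,weight\n123,200\n789,155\n999,160\n' > "$tmp/weight.csv"
printf 'ID,sex\n123,M\n456,F\n789,F\n'         > "$tmp/person.csv"

# join wants input sorted on the key, and a header row would get in the
# way, so drop the first line and sort on the first comma-separated field.
tail -n +2 "$tmp/person.csv" | LC_ALL=C sort -t, -k1,1 > "$tmp/person.sorted"
tail -n +2 "$tmp/weight.csv" | LC_ALL=C sort -t, -k1,1 > "$tmp/weight.sorted"

# Inner join on the first field of each file; the key is printed once.
inner=$(LC_ALL=C join -t, "$tmp/person.sorted" "$tmp/weight.sorted")

# -a 1 also prints unpaired lines from the first file, like a left join,
# though without padding the missing fields.
left=$(LC_ALL=C join -t, -a 1 "$tmp/person.sorted" "$tmp/weight.sorted")

printf '%s\n' "$inner"
printf '%s\n' "$left"
```

The extra preparation, and the lack of any CSV quoting rules, is the sort of friction that makes a CSV-aware tool attractive.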

Exporting Excel files to CSV with in2csv

Posted on 30 December 2019 by John

This post shows how to export an Excel file to a CSV file using in2csv from the csvkit package.

You could always use Excel itself to export an Excel file to CSV but there are several reasons you might not want to. First and foremost, you might not have Excel. Another reason is that you might want to work from the command line in order to automate the process. Finally, you might not want the kind of CSV format that Excel exports.

For illustration I made a tiny Excel file. In order to show how commas are handled, populations contain commas but areas do not.


When I ask Excel to export the file I get

    State,Population,Area
    CA,"39,500,000",163695
    TX,"28,300,000",268596
    FL,"31,000,000",65758

Note that areas are exported as plain integers, but populations are exported as quoted strings containing commas.

Using csvkit

Now install csvkit and run in2csv.

    $ in2csv states.xlsx
    State,Population,Area
    CA,39500000,163695
    TX,28300000,268596
    FL,31000000,65758

The output goes to standard out, though of course you could redirect the output to a file. All numbers are exported as numbers, with no thousands separators. This makes the output easier to use from a program that does crude parsing [1]. For example, suppose we save states.xlsx to states.csv using Excel then ask cut for the second column. Then we don’t get what we want.

    $ cut -d, -f2 states.csv
    Population
    "39
    "28
    "31

But if we use in2csv to create states.csv then we get what we’d expect.

    $ cut -d, -f2 states.csv
    Population
    39500000
    28300000
    31000000

Multiple sheets

So far we’ve assumed our Excel file has a single sheet. I added a second sheet with data on US territories. The sheet doesn’t have a header row just to show that the header row isn’t required.

    PR      3,300,000   5325
    Guam    161,700     571

I named the two sheets “States” and “Territories” respectively.


Now if we ask in2csv to export our Excel file as before, it only exports the first sheet. But if we specify the Territories sheet, it will export that.

    $ in2csv --sheet Territories states.xlsx
    PR,3300000,5325
    Guam,161700,571
