Minimizing context switching between shell and Python

Posted on 30 December 2019 by John

Sometimes you’re in the flow using the command line and you’d like to briefly switch over to Python without too much interruption. Or it could be the other way around: you’re in the Python REPL and need to issue a quick shell command.

One solution would be to run your shell and Python session in different terminals, but let’s assume that for whatever reason you’re working in just one terminal. For example, maybe you want the output of a shell command to be visually close when you run Python, or vice versa.

Calling Python from shell

You can run a Python one-liner from the shell by calling Python with the -c option. For example,

    $ python -c "print(3*7)"
    21

I hardly ever do this because I want to run more than a one-liner. What I find more useful is to launch Python with the -q option to suppress all the start-up verbiage and simply bring up a prompt.

    $ python -q
    >>>>

Calling shell from Python

If you run Python with the ipython command rather than the default python you get much better shell integration. IPython let’s you type a shell command at any point simply by preceding it with a !. For example, the following command tells us this is the 364th day of the year.

    In [1]: ! date +%j
    364

You can run some of the most common shell commands, such as cd and ls without even a bang prefix. These are “magic” commands that do what you’d expect if you forgot for a moment that you’re in a Python REPL rather than a command shell.

    In [2]: cd ..
    Out[2]: '/mnt/c/Users'

IPython also supports other forms of shell integration such as capturing the output of a shell command as a Python variable, or using a Python variable as an argument to a shell command.

Why can’t grep find negative numbers?

Posted on 7 December 2019 by John

Suppose you’re looking for instances of -42 in a file foo.txt. The command

    grep -42 foo.txt

won’t work. Instead you’ll get a warning message like the following.

    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

Putting single or double quotes around -42 won’t help. The problem is that grep interprets 42 as a command line option, and <codegrep doesn’t have such an option. This is a problem if you’re searching for negative numbers, or any pattern that beings with a dash, such as -able or --version.

The solution is to put -e in front of a regular expression containing a dash. That tells grep that the next token at the command line is a regular expression, not a command line option. So

    grep -e -42 foo.txt

will work.

You can also use -e several times to give grep several regular expressions to search for. For example,

    grep -e cat -e dog foo.txt

will search for “cat” or “dog.”

See the previous post for another example of where grep doesn’t seem to work. By default grep supports a restricted regular expression syntax and may need to be told to use “extended” regular expressions.

Why doesn’t grep work?

Posted on 5 December 2019 by John

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one of two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.

    [0-9]{3}\.?[0-9]{0,2}

This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

   echo 123 | grep "[0-9]\{3\}"

would return 123, but

   echo 123 | grep "[0-9]{3}"

would return nothing.

Similarly,

    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be use on files with multiple lines.)

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

   echo 123 | grep -E "[0-9]{3}"

and

    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with Gnu implementations.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually

    \d{3}\.?\d{0,2}

rather than

    [0-9]{3}\.?[0-9]{0,2}

I used the shortcut \d to denote a digit. Python, Perl, and Awk will understand this, but grep will not, even with the -E option.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The Gnu version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.

Set theory at the command line

Posted on 24 November 2019 by John

Often you have two lists, and you want to know what items belong to both lists, or which things belong to one list but not the other. The command line utility comm was written for this task.

Given two files, A and B, comm returns its output in three columns:

A – B
B – A
A ∩ B.

Here the minus sign means set difference, i.e. A – B is the set of things in A but not in B.

Venn diagram of comm parameters
The numbering above will be correspond to command line options for comm. More on that shortly.

Difference and intersection

Here’s an example. Suppose we have a file states.txt containing a list of US states

    Alabama
    California
    Georgia
    Idaho
    Virginia

and a file names.txt containing female names.

    Charlotte
    Della
    Frances
    Georgia
    Margaret
    Marilyn
    Virginia

Then the command

    comm states.txt names.txt

returns the following.

    Alabama
    California
            Charlotte
            Della
            Frances
                    Georgia
    Idaho
            Margaret
            Marilyn
                    Virginia

The first column is states which are not female names. The second column is female names which are not states. The third column is states which are also female names.

Filtering output

The output of comm is tab-separated. So you could pull out one of the columns by piping the output through cut, but you probably don’t want to do that for a couple reasons. First, you might get unwanted spaces. For example,

    comm states.txt names.txt | cut -f 1

returns

Alabama
California




Idaho

Worse, if you ask cut to return the second column, you won’t get what you expect.

Alabama
California
Charlotte
Della
Frances

Idaho
Margaret
Marilyn

This is because although comm uses tabs, it doesn’t produce a typical tab-separated file.

The way to filter comm output is to tell comm which columns to suppress.

If you only want the first column, states that are not names, use the option -23, i.e. to select column 1, tell comm not to print columns 2 or 3. So

    comm -23 states.txt names.txt

returns

    Alabama
    California
    Idaho

with no blank lines. Similarly, if you just want column 2, use comm -13. And if you want the intersection, column 3, use comm -12.

Sorting

The comm utility assumes its input files are sorted. If they’re not, it will warn you.

If your files are not already sorted, you could sort them and send them to comm in a one-liner [1].

    comm <(sort states.txt) <(sort names.txt)

Multisets

If your input files are truly sets, i.e. no item appears twice, comm will act as you expect. If you actually have a multiset, i.e. items may appear more than once, the output may surprise you at first, though on further reflection you may agree it does what you’d hope.

Suppose you have a file places.txt of places

    Cincinnati
    Cleveland
    Washington
    Washington

and a file of US Presidents.

    Adams
    Adams
    Cleveland
    Cleveland
    Jefferson
    Washington

Then

    comm places.txt presidents.txt

produces the following output.

            Adams
            Adams
    Cincinnati
                    Cleveland
            Cleveland
            Jefferson
                    Washington
    Washington

Cleveland was on the list of presidents twice. (Grover Cleveland was the 22nd and the 24th president, the only US president to serve non-consecutive terms.) The output of comm lists Cleveland once in the intersection of places and presidents (say 22nd president and a city in Ohio), but also lists him once as a president not corresponding to a place. That is because there are two Clevelands on our list of presidents, there is only one in our list of places.

Suppose we had included five places named Cleveland (There are cities named Cleveland in several states). Then comm would list Cleveland 3 times as a place not a president, and 2 times as a place and a president.

In general, comm utility follows the mathematical conventions for multisets. Suppose an item x appears m times in multiset A, and n times in a multiset B. Then x appears max(m–n, 0) times in A–B, max(n–m, 0) times in B–A, and min(m, n) times in A ∩ B.

Union

To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”).

    cat file1 file2 > multiset_union_file

To find the union of two files as sets, first find the union as multisets, then remove duplicates.

    cat file1 file2 | sort | uniq > set_union_file

Note that uniq, like comm, assumes files are sorted.

[1] The commutility is native to Unix-like systems, but has been ported to Windows. The examples in this post will work on Windows with ports of the necessary utilities, except for where we sort a file before sending it on to comm with <(sort file). That’s not a feature of sort or comm but of the shell, bash in my case. The Windows command line doesn’t support this syntax. (But bash ported to Windows would.)

Formatting numbers at the command line

Posted on 30 October 2019 by John

The utility numfmt, part of Gnu Coreutils, formats numbers. The main uses are grouping digits and converting to and from unit suffixes like k for kilo and M for mega. This is somewhat useful for individual invocations, but like most command line utilities, the real value is using it as part of pipeline.

The --grouping option will separate digits according to the rules of your locale. So on my computer

    numfmt --grouping 123456789

returns 123,456,789. On a French computer, it would return 123.456.789 because the French use commas as decimal separators and use periods to group digits [1].

You can also use numfmt to convert between ordinary numbers and numbers with units. Unfortunately, there’s some ambiguity regarding what units like kilo and mega mean. A kilogram is 1,000 grams, but a kilobyte is 2¹⁰ = 1,024 bytes. (Units like kibi and mebi were introduced to remove this ambiguity, but the previous usage is firmly established.)

If you want to convert 2M to an ordinary number, you have to specify whether you mean 2 × 10⁶ or 2 × 2²⁰. For the former, use --from=si (for Système international d’unités) and for the latter use --from=iec (for International Electrotechnical Commission).

    $ numfmt --from=si 2M
    2000000
    $ numfmt --from=iec 2M
    2097152

One possible gotcha is that the symbol for kilo is capital K rather than lower case k; all units from kilo to Yotta use a capital letter. Another is that there must be no space between the numerals and the suffix, e.g. 2G is legal but 2 G is not.

You can use Ki for kibi, Mi for mebi etc. if you use --from=iec-i.

    $ numfmt --from=iec-i 2Gi  
    2147483648

To convert from ordinary numbers to numbers with units use the --to option.

    $ numfmt --to=iec 1000000 
    977K

Computing pi with bc

Posted on 29 October 2019 by John

I wanted to stress test the bc calculator a little and so I calculated π to 10,000 digits a couple different ways.

First I ran

    time bc -l <<< "scale=10000;4*a(1)"

which calculates π as 4 arctan(1). This took 2 minutes and 38 seconds.

I imagine bc is using some sort of power series to compute arctan, and so smaller arguments should converge faster. So next I used a formula due to John Machin (1680–1752).

    time bc -l <<< "scale=10000;16*a(1/5) - 4*a(1/239)"

This took 52 seconds.

Both results were correct to 9,998 decimal places.

When you set the scale variable to n, bc doesn’t just carry calculations out to n decimal places; it uses more and tries to deliver n correct decimal places in the final result.

Why bc

This quirky little calculator is growing on me. For one thing, I like its limitations. If I need to do something that isn’t easy to do with bc, that probably means that I should write a script rather than trying to work directly at the command line.

Another thing I like about it is that it launches instantly. It doesn’t give you a command prompt, and so if you launch it in quiet mode you could think that it’s still loading when in fact it’s waiting on you. And if you send bc code with a here-string as in the examples above, you don’t even have to launch it per se.

If you want to try bc, I’d recommend launching it with the options -lq. You might even want to alias bc to bc -lq. The -l option loads math libraries. You’d think that would be the default for a calculator, but bc was written in a more resource-constrained time when you didn’t load much by default. The -l option also sets scale to 20, i.e. you get twenty decimal places of precision; the default is zero!

The -q option isn’t necessary, but it starts bc in quiet mode, suppressing three lines of copyright and warranty announcements.

As part of its minimalist design, bc only includes a few math functions, and you have to bootstrap the rest. For example, it includes sine and cosine but not tangent. More on how to use the built-in functions to compute more functions here.

Splitting lines and numbering the pieces

Posted on 23 October 2019 by John

As I mentioned in my computational survivalist post, I’m working on a project where I have a dedicated computer with little more than basic Unix tools, ported to Windows. It’s given me new appreciation for how the standard Unix tools fit together; I’ve had to rely on them for tasks I’d usually do a different way.

I’d seen the nl command before for numbering lines, but I thought “Why would you ever want to do that? If you want to see line numbers, use your editor.” That way of thinking looks at the tools one at a time, asking what each can do, rather than thinking about how they might work together.

Today, for the first time ever, I wanted to number lines from the command line. I had a delimited text file and wanted to see a numbered list of the column headings. I’ve written before about how you can extract columns using cut, but you have to know the number of a column to select it. So it would be nice to see a numbered list of column headings.

The data I’m working on is proprietary, so I downloaded a PUMS (Public Use Microdata Sample) file named ss04hak.csv from the US Census to illustrate instead. The first line of this file is

RT,SERIALNO,DIVISION,MSACMSA,PMSA,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,FSP,GRNTP,GRPIP,HHL,HHT,HINCP,HUPAC,LNGI,MV,NOC,NPF,NRC,OCPIP,PSF,R18,R65,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,FTAXP,FTELP,FTENP,FVACSP,FVALP,FVEHP,FWATP,FYBLP

I want to grab the first line of this file, replace commas with newlines, and number the results. That’s what the following one-liner does.

    head -n 1 ss04hak.csv | sed "s/,/\n/g" | nl

The output looks like this:

     1  RT 
     2  SERIALNO 
     3  DIVISION  
     4  MSACMSA
     5  PMSA
...
   100  FWATP
   101  FYBLP

Now if I wanted to look at a particular field, I could see the column number without putting my finger on my screen and counting. Then I could use that column number as an argument to cut -f.

File character counts

Posted on 16 October 2019 by John

Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

    fold -w1 file | sort | uniq -c

The fold function breaks the content of a file in to lines 80 characters long by default, but you can specify the line width with the -w option. Setting that to 1 makes each character its own line. Then sort prepares the input for uniq, and the -c option causes uniq to display counts.

This works on ASCII files but not Unicode files. For a Unicode file, you might do something like the following Python code.

import collections

count = collections.Counter()
file = open("myfile", "r", encoding="utf8")
for line in file.readlines():
    for c in line.strip("\n"):
        count[ord(c)] += 1

for p in sorted(list(count)):
    print(chr(p), hex(p), count[p])

Computational survivalist

Posted on 9 October 2019 by John

survival gear

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

Command line