One of the fiddly parts of regular expressions is how to handle line breaks. Should regular expression searches be applied one line at a time, or should an entire file be treated as a single line?
This morning I was trying to track down a LaTeX file that said “discussed in the Section” rather than simply “discussed in Section.” I wanted to search on “the Section” to see whether I had a similar error in other files.
Line breaks don’t matter to LaTeX [1], so “the” could be at the end of one line and “Section” at the beginning of another. I found what I was after by using
grep -Pzo "the\s+Section" foo.tex
Here -P tells grep to use Perl regular expressions. That’s not necessary here, but I imprinted on Perl regular expressions long ago, and I use PCRE (Perl compatible regular expressions) whenever possible so I don’t have to remember the annoying little syntax differences between various regex implementations.
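The -z option tells grep to treat NUL bytes rather than newlines as line terminators. A text file ordinarily contains no NUL bytes, so the entire file is treated as one long line. This eliminates the line break issue.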
The -o option says to output only what the regular expression matches. Otherwise grep will return the matching line. Ordinarily that wouldn’t be so bad, but because of the -z option, the matching line is the entire file.
The \s+ between “the” and “Section” matches one or more whitespace characters, such as a space or a newline.
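To see the three options working together, here is a small self-contained sketch. The sample text and the temporary file are made up for illustration, and it assumes GNU grep built with PCRE support. One detail worth knowing: with -z, grep also ends each piece of output with a NUL byte rather than a newline, so the sketch pipes the result through tr to make it readable.

```shell
# Hypothetical sample: "the" ends one line and "Section" begins the next.
tmp=$(mktemp)
printf 'as discussed in the\nSection 3, we find\n' > "$tmp"

# Line-at-a-time search: count matching lines (grep exits nonzero on no match).
single=$(grep -cP 'the\s+Section' "$tmp" || true)

# Whole-file search: fold the newline inside the match into a space,
# then turn the NUL that -z appends to each match into a newline.
multi=$(grep -Pzo 'the\s+Section' "$tmp" | tr '\n' ' ' | tr '\0' '\n')

echo "matching lines without -z: $single"   # 0
echo "match with -z: $multi"                # the Section
rm -f "$tmp"
```

Without -z no single line contains both words, so nothing matches. With -z the match spans the line break.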
The -P flag is a GNU feature, so it works on Linux. But macOS ships with BSD-derived versions of its utilities, and its version of grep does not support the -P option. On my MacBook I have ggrep mapped to the GNU version of grep.
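If a script has to run on both kinds of machine, one approach is to pick the command at run time. The following is only a sketch: it assumes GNU grep, when present on macOS, is installed under the name ggrep (the name Homebrew's grep formula uses), and the variable names GREP and pcre are hypothetical.

```shell
# Prefer ggrep when it exists; otherwise fall back to plain grep.
if command -v ggrep >/dev/null 2>&1; then
    GREP=ggrep
else
    GREP=grep
fi

# Probe for -P support rather than assuming it.
if printf 'x\n' | "$GREP" -qP 'x' 2>/dev/null; then
    pcre=yes
else
    pcre=no
fi

echo "using $GREP (PCRE support: $pcre)"
```

Probing with a trivial pattern avoids guessing from the operating system name, since a Linux grep can also be built without PCRE.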
Another option is to use ripgrep rather than grep. It uses Perl-like regular expressions by default, so there is no need for anything like the -P flag. The analog of -z in ripgrep is -U, which turns on multiline matching. The ripgrep executable is named rg, so the counterpart of the command above would be
rg -Uo "the\s+Section" foo.tex
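Here is a hedged sketch of the same search on a throwaway file. The sample text is made up, and the script checks whether ripgrep's executable, rg, is on the path rather than assuming it. The -N flag suppresses line numbers so that only the matched text is printed.

```shell
# Hypothetical sample: "the" ends one line and "Section" begins the next.
tmp=$(mktemp)
printf 'as discussed in the\nSection 3, we find\n' > "$tmp"

if command -v rg >/dev/null 2>&1; then
    # -U enables multiline matching, -o prints only the match,
    # -N turns off line numbers; paste joins the output lines with a space.
    rg_out=$(rg -N -Uo 'the\s+Section' "$tmp" | paste -sd ' ' -)
else
    rg_out="rg not installed"
fi

echo "$rg_out"
rm -f "$tmp"
```

Unlike grep -z, ripgrep ends its output with ordinary newlines, so no NUL handling is needed.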
Usually regular expression searches are so fast that execution time doesn’t matter. But when it does matter, ripgrep can be an order of magnitude faster than grep.
[1] LaTeX decides how to break lines in the output independent of line breaks in the input. This allows you to arrange the source file logically rather than aesthetically.






