Regular expressions in R

Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns.

R supports two regular expression flavors: POSIX 1003.2 and Perl. By default R uses POSIX extended regular expressions. If the perl argument, which defaults to FALSE, is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library. Older versions of R also provided an extended argument, defaulting to TRUE, that selected basic POSIX regular expressions when set to FALSE. That argument has since been removed, so current versions of R no longer offer the basic flavor.
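As a small illustration of switching flavors, the sketch below uses a lookahead, which is a Perl-flavor construct, so the call passes perl = TRUE. The vector x and the pattern are made up for this example.

```r
# Hypothetical input; only the first element has "foo" followed by "bar".
x <- c("foobar", "foobaz", "barfoo")

# (?=bar) is a lookahead, a Perl/PCRE feature, hence perl = TRUE.
hits <- grep("foo(?=bar)", x, perl = TRUE)
print(hits)
```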

Regular expressions are represented as strings. Because the backslash is also the escape character in R strings, backslashes in a pattern must be doubled. For example, the metacharacter \w must be entered as \\w to prevent R from interpreting the leading backslash before sending the string to the regular expression parser.
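The following sketch shows the effect of the doubled backslash. The variable names and test strings are illustrative only.

```r
# In R source, "\\w+" is a three-character string: backslash, w, plus.
pattern <- "\\w+"
n <- nchar(pattern)
cat(pattern, "\n")   # shows the string as the regex parser receives it

# Elements containing at least one word character.
word_idx <- grep(pattern, c("abc", "!!!", "a_1"))

# A literal period must be escaped in the regex, so it is written "\\." in R.
dot_idx <- grep("\\.", c("a.b", "ab"))
```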

The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of indices. If the regular expression matches a particular vector component, that component's index is part of the result.

Example:

grep("apple", c("crab apple", "Apple jack", "apple sauce"))

returns the vector (1, 3) because the first and third elements of the input vector contain "apple". Note that grep is case-sensitive by default and so "apple" does not match "Apple". To perform a case-insensitive match, add ignore.case = TRUE to the function call.

There is an optional argument value that defaults to FALSE. If this argument is set to TRUE, grep will return the matching elements themselves (the whole strings, not just the matched portions) rather than their indices.
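The optional arguments can be sketched as follows, along with one way to select rows of a data set by pattern. The data frame df and its columns are hypothetical, made up for this illustration.

```r
fruit <- c("crab apple", "Apple jack", "apple sauce")

# Case-insensitive search: all three elements match.
idx_any_case <- grep("apple", fruit, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their indices.
matched <- grep("apple", fruit, value = TRUE)

# Hypothetical data frame: keep rows whose name starts with "apple",
# ignoring case.
df <- data.frame(name = fruit, price = c(1.50, 2.00, 0.75),
                 stringsAsFactors = FALSE)
apple_rows <- df[grep("^apple", df$name, ignore.case = TRUE), ]
print(apple_rows)
```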

The function sub replaces one pattern with another. It requires three arguments: a regular expression, a replacement pattern, and a vector of strings to process. It is analogous to s/// in Perl. Note that capture references such as \1 or \2 in the replacement pattern must be entered as \\1 or \\2. This applies even when you select the Perl regular expression flavor by adding perl = TRUE.

The sub function replaces only the first instance of a regular expression. To replace all instances of a pattern, use gsub. The gsub function is analogous to s///g in Perl.
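A minimal sketch of the difference between the two functions, plus a replacement that uses capture references. The strings are made up for illustration.

```r
s <- "one fish two fish"

# sub replaces only the first match.
first_only <- sub("fish", "cat", s)    # "one cat two fish"

# gsub replaces every match.
all_of_them <- gsub("fish", "cat", s)  # "one cat two cat"

# Capture references: swap two words. Note the doubled backslashes.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")  # "world hello"
```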

The function regexpr requires two arguments, a regular expression and a vector of text to process. It is similar to grep, but returns the starting position of the first match in each component, with the match lengths attached as an attribute. If a particular component does not match the regular expression, the return vector contains a -1 for that component. The function gregexpr is a variation on regexpr that returns a list with one entry per component, giving the starting positions of all matches in that component rather than only the first.
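The sketch below shows the raw return values of both functions and one way to derive a per-component match count from the gregexpr result. The input vector and the helper expression are illustrative assumptions.

```r
x <- c("banana", "apple", "cherry")

# Position of the first match in each element, -1 where there is none.
first_pos <- regexpr("an", x)
print(as.integer(first_pos))             # 2 -1 -1
print(attr(first_pos, "match.length"))   # 2 -1 -1

# Positions of all matches, as a list with one entry per element.
all_pos <- gregexpr("an", x)
print(as.integer(all_pos[[1]]))          # 2 4

# Count matches per element by counting the positive positions.
counts <- sapply(all_pos, function(m) sum(m > 0))
print(counts)                            # 2 0 0
```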

The function strsplit also uses regular expressions, splitting its input according to a specified regular expression.
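For instance, a string can be split on runs of digits or on commas with optional trailing spaces. This is a minimal sketch with made-up inputs; note that strsplit takes the text first and the pattern second, and returns a list with one entry per input string.

```r
# Split on one or more digits.
parts <- strsplit("a1b22c333d", "[0-9]+")
print(parts[[1]])   # "a" "b" "c" "d"

# Split each element of a vector on a comma followed by optional spaces.
fields <- strsplit(c("x, y,z", "p,q"), ", *")
print(fields)
```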

Resources

See also R for programmers coming from other languages.