Regular expressions in R
Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns.
R supports two regular expression flavors: POSIX 1003.2 and Perl.
In older versions of R, regular expression functions take two arguments that select the flavor: extended, which defaults to TRUE, and perl, which defaults to FALSE. By default R uses POSIX extended regular expressions, though if extended is set to FALSE, it will use basic POSIX regular expressions. If perl is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library. Current versions of R no longer accept the extended argument; the choice there is between the default POSIX extended flavor and perl = TRUE.
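As a rough illustration of what the Perl flavor adds, the sketch below uses a lookahead, a Perl construct that is not part of POSIX extended syntax, so the pattern is only passed with perl = TRUE. The input strings are made up for the example.

```r
# Hypothetical sample data for illustration.
x <- c("foobar", "foobaz")

# (?=...) is a Perl-style lookahead: match "foo" only when "bar" follows.
has_foo_before_bar <- grepl("foo(?=bar)", x, perl = TRUE)
print(has_foo_before_bar)  # TRUE FALSE
```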
Regular expressions are represented as strings. Metacharacters often need to be escaped. For example, the metacharacter \w must be entered as \\w to prevent R from interpreting the leading backslash before sending the string to the regular expression parser.
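A small sketch of the escaping rule, using made-up input strings: the R literal "\\w" is two characters long, and the regular expression parser sees \w.

```r
# The R string literal "\\w" holds two characters: a backslash and a w.
escaped_length <- nchar("\\w")
print(escaped_length)  # 2

# The regex engine receives ^\w+$ and tests whether the whole string
# consists of word characters (the space in the second string does not).
all_word_chars <- grepl("^\\w+$", c("hello", "hello world"))
print(all_word_chars)  # TRUE FALSE
```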
The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of indices. If the regular expression matches a particular vector component, that component's index is part of the returned vector.
Example:
grep("apple", c("crab apple", "Apple jack", "apple sauce"))
returns the vector (1, 3) because the first and third elements of the input vector contain "apple." Note that grep is case-sensitive by default and so "apple" does not match "Apple." To perform a case-insensitive match, add ignore.case = TRUE to the function call.
There is an optional argument value that defaults to FALSE. If this argument is set to TRUE, grep will return the matching elements themselves rather than their indices.
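The options above, and the row-selection use mentioned at the start, might look like this in practice. The data frame df and its columns name and price are hypothetical, invented only for this sketch.

```r
fruit <- c("crab apple", "Apple jack", "apple sauce")

# Case-insensitive search: all three elements match.
idx_ci <- grep("apple", fruit, ignore.case = TRUE)
print(idx_ci)  # 1 2 3

# value = TRUE returns the matching elements instead of their indices.
matched <- grep("apple", fruit, value = TRUE)
print(matched)  # "crab apple" "apple sauce"

# Hypothetical data frame: keep rows whose name starts with "apple".
df <- data.frame(name = fruit, price = c(1.5, 2.0, 0.75),
                 stringsAsFactors = FALSE)
rows <- df[grep("^apple", df$name), ]
print(rows)
```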
The function sub replaces one pattern with another. It requires three arguments: a regular expression, a replacement pattern, and a vector of strings to process. It is analogous to s/// in Perl. Note that capture references such as \1 or \2 in the replacement pattern must be entered as \\1 or \\2, including when you use the Perl regular expression flavor by adding perl = TRUE.
The sub function replaces only the first instance of a regular expression. To replace all instances of a pattern, use gsub. The gsub function is analogous to s///g in Perl.
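A short sketch contrasting sub and gsub, with a capture reference in the replacement. The input strings are arbitrary examples.

```r
s <- "banana bread"

# sub replaces only the first match in each string.
first_only <- sub("an", "AN", s)
print(first_only)  # "bANana bread"

# gsub replaces every match.
all_hits <- gsub("an", "AN", s)
print(all_hits)  # "bANANa bread"

# Capture references are written \\1 and \\2 inside an R string.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
print(swapped)  # "world hello"
```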
The function regexpr requires two arguments, a regular expression and a vector of text to process. It is similar to grep, but returns the location of the first regular expression match in each component. If a particular component does not match the regular expression, the return vector contains a -1 for that component. The function gregexpr is a variation on regexpr that returns the locations of all matches in each component, as a list with one entry per component, from which the number of matches can be counted.
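The sketch below shows what the two functions return for a made-up vector. Note that gregexpr gives a list of match positions, one entry per component; the count of matches here is derived from it with sapply, which is one possible approach rather than something gregexpr reports directly.

```r
x <- c("banana", "apple", "pecan")

# regexpr: position of the first match in each component, -1 if none.
first_pos <- regexpr("an", x)
print(as.integer(first_pos))  # 2 -1 4

# gregexpr: a list with the positions of all matches in each component.
all_pos <- gregexpr("an", x)
print(as.integer(all_pos[[1]]))  # 2 4

# Count matches per component; a lone -1 means no match, so count positives.
n_matches <- sapply(all_pos, function(m) sum(m > 0))
print(n_matches)  # 2 0 1
```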
The function strsplit also uses regular expressions, splitting its input according to a specified regular expression.
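For instance, a comma-separated line with uneven spacing could be split as follows. The input line is invented for the example, and strsplit returns a list with one element per input string.

```r
line <- "alpha, beta,gamma ,  delta"

# Split on a comma together with any whitespace around it.
fields <- strsplit(line, "\\s*,\\s*")
print(fields[[1]])  # "alpha" "beta" "gamma" "delta"
```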
Resources
See also R for programmers coming from other languages.