A.4 Looking ahead and back

A lookahead specifies a pattern that must be matched but is not returned. A lookahead is actually a subexpression and is formatted as such: the syntax is a subexpression that begins with ?=, and the pattern to look for follows the = sign. Some refer to this behaviour as “match but not consume”, in the sense that lookahead and lookbehind match a pattern after or before the text we actually want to extract, but do not return it.

In the following example, we only want to match “my homepage” when it is followed by </title>, and we do not want </title> in the results:

text <- c("<title>my homepage</title>", "<p>my homepage</p>")
str_extract(text, "my homepage(?=</title>)")
#> [1] "my homepage" NA
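# lookahead (and lookbehind) must be written as a subexpression, i.e. inside parentheses;
# without them, "e?=" is read as an optional "e" followed by a literal "="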
str_extract(text, "my homepage?=</title>")
#> [1] NA NA
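
To see what “match but not consume” means in practice, it may help to compare the lookahead with an ordinary pattern that spells out the closing tag. The following is a minimal sketch that reuses the text vector from the example above and assumes the stringr package is installed.

```r
library(stringr)

text <- c("<title>my homepage</title>", "<p>my homepage</p>")

# an ordinary pattern consumes the closing tag, so the tag is returned too
str_extract(text, "my homepage</title>")
#> [1] "my homepage</title>" NA

# the lookahead only asserts that the tag follows; the tag is not returned
str_extract(text, "my homepage(?=</title>)")
#> [1] "my homepage" NA
```

Both patterns match the same element of text; they differ only in how much of the match is returned.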

Similarly, ?<= is interpreted as the lookbehind (or lookback) operator, which specifies a pattern that must come before the text we actually want to extract. Following is an example. A database search lists products, and you need only the prices.

text <- c("ABC01: $23.45", 
          "HGG42: $5.31", 
          "CFMX1: $899.00", 
          "XTC99: $69.96", 
          "Total items found: 4")

str_extract(text, "(?<=\\$)[0-9]+")
#> [1] "23"  "5"   "899" "69"  NA
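
The pattern [0-9]+ stops at the decimal point, so the result above holds only the whole-dollar part of each price. If the cents are needed as well, one possible variant (a sketch, not the only way to write it) adds an optional decimal part after the digits:

```r
library(stringr)

text <- c("ABC01: $23.45",
          "HGG42: $5.31",
          "CFMX1: $899.00",
          "XTC99: $69.96",
          "Total items found: 4")

# digits, optionally followed by a literal dot and more digits;
# the lookbehind still requires a preceding $ without returning it
prices <- str_extract(text, "(?<=\\$)[0-9]+(\\.[0-9]+)?")
prices
#> [1] "23.45"  "5.31"   "899.00" "69.96"  NA
```

The last element is still NA: the 4 in “Total items found: 4” is not preceded by a dollar sign, so the lookbehind rejects it.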

Lookahead and lookbehind operations may be combined, as in the following example:

str_extract("<title>my homepage</title>", "(?<=<title>)my homepage(?=</title>)")
#> [1] "my homepage"
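
Because neither lookaround consumes text, the same combination can be handy for replacement: only the text between the tags is replaced, and the tags themselves are left untouched. A minimal sketch, assuming stringr is loaded (the replacement string "new title" is made up for illustration):

```r
library(stringr)

page <- "<title>my homepage</title>"

# only the text between the tags is matched, so only that text is replaced
new_page <- str_replace(page, "(?<=<title>).*(?=</title>)", "new title")
new_page
#> [1] "<title>new title</title>"
```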

Additionally, (?=) and (?<=) are known as positive lookahead and lookbehind. A less commonly used version is the negative form of those two operators, which looks for text that does not match the specified pattern.

class    description
(?=)     positive lookahead
(?!)     negative lookahead
(?<=)    positive lookbehind
(?<!)    negative lookbehind
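
As a quick illustration of the negative form, the sketch below reuses the title example from earlier in this section and simply swaps (?=) for (?!). It assumes stringr is loaded.

```r
library(stringr)

text <- c("<title>my homepage</title>", "<p>my homepage</p>")

# negative lookahead: match "my homepage" only when </title> does NOT follow
str_extract(text, "my homepage(?!</title>)")
#> [1] NA            "my homepage"
```

The result is the mirror image of the positive lookahead: the first element is now rejected and the second is returned.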

Suppose we want to extract just the quantities but not the prices in the following text:

text <- c("I paid $30 for 100 apples, 50 oranges, and 60 pears. I saved $5 on this order.")
# without the leading word boundary, the 0 after the 3 in $30 would also match,
# because that 0 is preceded by 3 rather than by $
str_view_all(text, "\\b(?<!\\$)\\d+\\b")
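
The viewer output is rendered as HTML, so it is not reproduced here. To inspect the matches as plain text, one option is str_extract_all(). The sketch below assumes stringr is loaded and contrasts the pattern with and without the word boundaries:

```r
library(stringr)

text <- "I paid $30 for 100 apples, 50 oranges, and 60 pears. I saved $5 on this order."

# with word boundaries: only the quantities are returned
with_boundary <- str_extract_all(text, "\\b(?<!\\$)\\d+\\b")[[1]]
with_boundary
#> [1] "100" "50"  "60"

# without them: the 0 in $30 slips through, since it follows 3 and not $
without_boundary <- str_extract_all(text, "(?<!\\$)\\d+")[[1]]
without_boundary
#> [1] "0"   "100" "50"  "60"
```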