A.5 Backreferences

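Backreferences solve the problem that one part of a pattern has no knowledge of what an earlier part matched. A backreference comes as a pair: a parenthesized subexpression (a capture group) and a \number that refers back to the text matched by that subexpression, so \1 means "the same text that group 1 just matched".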

Find all repeated words (often typos):

text <- "This is a block of of text, several words here are are repeated, and and they should not be."
# \\b keeps the group from matching part of a word (e.g. the "is is" inside "This is")
str_view_all(text, "\\b(\\w+) \\1\\b")
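
Since str_view_all() renders its result in a viewer, the matches do not show up in printed output. As a rough sketch, the same idea can be checked with str_extract_all(), and each repeated pair can be collapsed to one word with str_replace_all(). The variable names `pattern`, `repeated`, and `deduped` are illustrative only, and the \\b word-boundary anchors are there so that the group cannot match a word fragment such as the "is" at the end of "This".

```r
library(stringr)

text <- "This is a block of of text, several words here are are repeated, and and they should not be."

# Group 1 captures a whole word; \\1 requires the same word again after one space
pattern <- "\\b(\\w+) \\1\\b"

# List every repeated pair
repeated <- str_extract_all(text, pattern)[[1]]
repeated
#> [1] "of of"   "are are" "and and"

# Replace each whole match with group 1, leaving a single copy of the word
deduped <- str_replace_all(text, pattern, "\\1")
deduped
#> [1] "This is a block of text, several words here are repeated, and they should not be."
```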

Another example uses HTML data, where we want to match only well-formed heading tags, that is, an opening tag whose level matches that of its closing tag. Note that the last pair, <H2>...</H3>, is invalid:

text <- "<BODY>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<BR>
<H2>ColdFusion</H2>
Information about Macromedia ColdFusion.
<H2>Wireless</H2>
Information about Bluetooth, 802.11, and more.
<H2>This is not valid HTML</H3>
</BODY>"

str_extract_all(text, "<[Hh](\\d)>.+</[Hh]\\1>")
#> [[1]]
#> [1] "<H1>Welcome to my Homepage</H1>" "<H2>ColdFusion</H2>"            
#> [3] "<H2>Wireless</H2>"
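
To see what the backreference contributes here, it may help to compare against a pattern that accepts any digit in the closing tag. The following is a minimal sketch on a shortened copy of the input; the names `html`, `with_backref`, `without_backref`, and `heading_levels` are made up for illustration.

```r
library(stringr)

html <- "<H1>Welcome to my Homepage</H1>
<H2>ColdFusion</H2>
<H2>This is not valid HTML</H3>"

with_backref    <- "<[Hh](\\d)>.+</[Hh]\\1>"   # closing level must equal opening level
without_backref <- "<[Hh]\\d>.+</[Hh]\\d>"     # closing level can be any digit

n_with <- str_count(html, with_backref)
n_with
#> [1] 2

n_without <- str_count(html, without_backref)  # also accepts the <H2>...</H3> pair
n_without
#> [1] 3

# str_match_all() returns the full match in column 1 and group 1 in column 2
heading_levels <- str_match_all(html, with_backref)[[1]][, 2]
heading_levels
#> [1] "1" "2"
```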

Backreferences are particularly useful in replace operations, where \\1 in the replacement string inserts the text captured by the first group.

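text <- "user@gmail.com is my email address"
# \\S+ (non-whitespace) keeps the match from swallowing words around the address
str_replace(text, "(\\S+@\\S+\\.com)", "<a href=\"mailto:\\1\">\\1</a>")
#> [1] "<a href=\"mailto:user@gmail.com\">user@gmail.com</a> is my email address"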
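
A replacement string can refer to more than one group, which makes it possible to reorder the captured pieces. As an illustrative sketch (the `dates` vector and the `reordered` name are invented for this example), ISO-style dates can be rewritten as day/month/year:

```r
library(stringr)

dates <- c("2021-03-09", "1999-12-31")

# Group 1 = year, group 2 = month, group 3 = day; the replacement reverses the order
reordered <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
reordered
#> [1] "09/03/2021" "31/12/1999"
```

Groups are numbered by the position of their opening parenthesis, counting from the left.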