Search for Pattern in a Data Frame Character Column
detective.Rd
Find and modify strings containing a specified pattern in a data frame character column.
Usage
detective(.data, pattern, ..., .exclude = NULL, .arrange_by = desc(n))
detective(.data, pattern, ..., .exclude = NULL) <- value
Arguments
- .data
a data frame, or a data frame extension (e.g. a tibble).
- pattern
Pattern to look for.
The default interpretation is a regular expression, as described in vignette("regular-expressions"). Use regex() for finer control of the matching behaviour.
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Match character, word, line and sentence boundaries with boundary(). An empty pattern, "", is equivalent to boundary("character").
- ...
<tidy-select> character or factor columns to search and return.
- .exclude
a single character string signifying items to be excluded, interpreted as for pattern; default NULL.
- .arrange_by
<data-masking> quoted name(s) of column(s) for ordering results. Use desc() to sort by variables in descending order; default desc(n).
- value
a single character string providing the replacement value.
Value
detective() returns a tibble with columns selected using ... and n, giving the count of occurrences of each item.
Details
detective() finds and counts strings matching pattern but not matching .exclude in selected columns in .data, while detective()<- is the equivalent replacement function. Both function forms allow use of the various possibilities for the pattern argument of str_detect.
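As a rough illustration of the rule described above (keep strings that match pattern, drop those that also match .exclude, then count), the following base R sketch applies the same logic to a single character vector. It is not detective()'s implementation: the helper name count_matches and the sample data are hypothetical, and it uses base grepl() rather than str_detect().

```r
# Hypothetical helper: count strings matching `pattern` but not `exclude`.
count_matches <- function(x, pattern, exclude = NULL) {
  keep <- grepl(pattern, x)
  if (!is.null(exclude)) {
    keep <- keep & !grepl(exclude, x)
  }
  counts <- table(x[keep])
  out <- data.frame(
    item = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent items first, mirroring the default .arrange_by = desc(n)
  out[order(-out$n), , drop = FALSE]
}

# Made-up sample data, for illustration only
x <- c("Luke Skywalker", "Anakin Skywalker", "Leia Organa", "Anakin Skywalker")
res <- count_matches(x, "Sky", exclude = "Luke")
print(res)
```

Here only "Anakin Skywalker" survives both filters, with a count of 2.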
Use pattern = regex("xyz", ignore_case = TRUE) for a case-insensitive search. Use glob2rx() from the utils package to change a wildcard or globbing pattern into a regular expression.
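To see what glob2rx() would hand to the pattern argument, the conversion can be inspected directly. This is a minimal sketch using only base R and utils; the vector x is invented for illustration, and base grepl() stands in for the matching step.

```r
# glob2rx() ships with the utils package, which is attached by default.
rx_all <- glob2rx("*")        # "^"          : matches any string
rx_sky <- glob2rx("Sky*")     # "^Sky"       : strings starting with "Sky"
rx_org <- glob2rx("*Organa")  # "^.*Organa$" : strings ending with "Organa"

print(c(rx_all, rx_sky, rx_org))

# Hypothetical sample values, used only to illustrate matching
x <- c("Skywalker", "Leia Organa", "Darth Vader")
print(grepl(rx_sky, x))
print(grepl(rx_org, x))
```

Note that a glob is anchored at the start of the string, so "Sky*" does not find "Luke Skywalker"; use "*Sky*" or a plain regular expression for a match anywhere in the string.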
character or factor columns in .data are selected using ... with the <tidy-select> syntax of package dplyr, including use of selection helpers.
The output may be ordered by the values of selected columns using the syntax of arrange(), including use of across() or pick() to select columns with <tidy-select> (see examples).
Examples
## Find strings containing a specified pattern in a data frame
starwars |> detective("Sky", name)
#> # A tibble: 3 × 2
#>   name                 n
#>   <chr>            <int>
#> 1 Anakin Skywalker     1
#> 2 Luke Skywalker       1
#> 3 Shmi Skywalker       1
## Use regex() to make case insensitive
starwars |> detective(regex("WALKER", TRUE), name, .arrange_by = desc(name))
#> # A tibble: 3 × 2
#>   name                 n
#>   <chr>            <int>
#> 1 Shmi Skywalker       1
#> 2 Luke Skywalker       1
#> 3 Anakin Skywalker     1
## Use | for alternatives
starwars |> detective("Sky|Organa", name)
#> # A tibble: 5 × 2
#>   name                    n
#>   <chr>               <int>
#> 1 Anakin Skywalker        1
#> 2 Bail Prestor Organa     1
#> 3 Leia Organa             1
#> 4 Luke Skywalker          1
#> 5 Shmi Skywalker          1
## Replace strings containing a specified pattern
starwars |> detective("Darth", name)
#> # A tibble: 2 × 2
#>   name            n
#>   <chr>       <int>
#> 1 Darth Maul      1
#> 2 Darth Vader     1
starwars |> detective("Darth", name, .exclude = "Vader") <- "Darth The First"
starwars |> detective("Darth", name, .arrange_by = desc(name))
#> # A tibble: 2 × 2
#>   name                n
#>   <chr>           <int>
#> 1 Darth Vader         1
#> 2 Darth The First     1
## Exclude strings containing unwanted patterns
starwars |> detective("Sky", name, .exclude = "Luke")
#> # A tibble: 2 × 2
#>   name                 n
#>   <chr>            <int>
#> 1 Anakin Skywalker     1
#> 2 Shmi Skywalker       1
## Return multiple columns
starwars |> detective("Human", homeworld, species)
#> # A tibble: 15 × 3
#>    homeworld    species     n
#>    <chr>        <chr>   <int>
#>  1 Tatooine     Human       8
#>  2 NA           Human       6
#>  3 Naboo        Human       5
#>  4 Alderaan     Human       3
#>  5 Corellia     Human       2
#>  6 Coruscant    Human       2
#>  7 Bespin       Human       1
#>  8 Chandrila    Human       1
#>  9 Concord Dawn Human       1
#> 10 Eriadu       Human       1
#> 11 Haruun Kal   Human       1
#> 12 Kamino       Human       1
#> 13 Serenno      Human       1
#> 14 Socorro      Human       1
#> 15 Stewjon      Human       1
starwars |> detective("Human", homeworld, species, .exclude = "s")
#> # A tibble: 13 × 3
#>    homeworld    species     n
#>    <chr>        <chr>   <int>
#>  1 Tatooine     Human       8
#>  2 NA           Human       6
#>  3 Naboo        Human       5
#>  4 Alderaan     Human       3
#>  5 Corellia     Human       2
#>  6 Chandrila    Human       1
#>  7 Concord Dawn Human       1
#>  8 Eriadu       Human       1
#>  9 Haruun Kal   Human       1
#> 10 Kamino       Human       1
#> 11 Serenno      Human       1
#> 12 Socorro      Human       1
#> 13 Stewjon      Human       1
starwars |> detective("Human", homeworld, species, .exclude = regex("s", TRUE))
#> # A tibble: 10 × 3
#>    homeworld    species     n
#>    <chr>        <chr>   <int>
#>  1 Tatooine     Human       8
#>  2 NA           Human       6
#>  3 Naboo        Human       5
#>  4 Alderaan     Human       3
#>  5 Corellia     Human       2
#>  6 Chandrila    Human       1
#>  7 Concord Dawn Human       1
#>  8 Eriadu       Human       1
#>  9 Haruun Kal   Human       1
#> 10 Kamino       Human       1
## Select columns using <tidy-select> syntax from {dplyr},
## including use of “selection helpers”
starwars |> detective(
  "brown", contains("color"), species,
  .arrange_by = across(contains("color"))
)
#> # A tibble: 25 × 5
#>    hair_color skin_color eye_color species     n
#>    <chr>      <chr>      <chr>     <chr>   <int>
#>  1 black      brown      brown     Zabrak      1
#>  2 black      dark       brown     Human       2
#>  3 black      dark       brown     NA          1
#>  4 black      fair       brown     Human       2
#>  5 black      light      brown     Human       1
#>  6 black      tan        brown     Human       2
#>  7 brown      brown      blue      Wookiee     1
#>  8 brown      brown      brown     Ewok        1
#>  9 brown      fair       blue      Human       3
#> 10 brown      fair       blue      NA          1
#> # ℹ 15 more rows
starwars |> detective(
  "brown", name, contains("color"), species,
  .exclude = "Human", .arrange_by = across(contains("color"))
)
#> # A tibble: 12 × 6
#>    name                  hair_color skin_color       eye_color     species     n
#>    <chr>                 <chr>      <chr>            <chr>         <chr>   <int>
#>  1 Eeth Koth             black      brown            brown         Zabrak      1
#>  2 Gregar Typho          black      dark             brown         NA          1
#>  3 Tarfful               brown      brown            blue          Wookiee     1
#>  4 Wicket Systri Warrick brown      brown            brown         Ewok        1
#>  5 Jek Tono Porkins      brown      fair             blue          NA          1
#>  6 Cordé                 brown      light            brown         NA          1
#>  7 Chewbacca             brown      unknown          blue          Wookiee     1
#>  8 Dexter Jettster       none       brown            yellow        Besali…     1
#>  9 Ackbar                none       brown mottle     orange        Mon Ca…     1
#> 10 Grievous              none       brown, white     green, yellow Kaleesh     1
#> 11 Yoda                  white      green            brown         Yoda's…     1
#> 12 Jabba Desilijic Tiure NA         green-tan, brown orange        Hutt        1
starwars |> detective(
  "brown", contains("color"), species,
) <- "chestnut"
starwars |> detective("brown", name, contains("color"), species)
#> # A tibble: 0 × 6
#> # ℹ 6 variables: name <chr>, hair_color <chr>, skin_color <chr>,
#> # eye_color <chr>, species <chr>, n <int>
starwars |> detective("chestnut", name, contains("color"), species)
#> # A tibble: 35 × 6
#>    name                hair_color skin_color eye_color species          n
#>    <chr>               <chr>      <chr>      <chr>     <chr>        <int>
#>  1 Ackbar              none       chestnut   orange    Mon Calamari     1
#>  2 Arvel Crynyd        chestnut   fair       chestnut  Human            1
#>  3 Bail Prestor Organa black      tan        chestnut  Human            1
#>  4 Beru Whitesun Lars  chestnut   light      blue      Human            1
#>  5 Biggs Darklighter   black      light      chestnut  Human            1
#>  6 Boba Fett           black      fair       chestnut  Human            1
#>  7 Chewbacca           chestnut   unknown    blue      Wookiee          1
#>  8 Cliegg Lars         chestnut   fair       blue      Human            1
#>  9 Cordé               chestnut   light      chestnut  NA               1
#> 10 Dexter Jettster     none       chestnut   yellow    Besalisk         1
#> # ℹ 25 more rows
## Use {utils} glob2rx() to create a regular expression, in this instance
## from the wildcard *, which matches any string
starwars |> detective(glob2rx("*"), !c(name, contains("color")))
#> # A tibble: 65 × 5
#>    sex    gender    homeworld species      n
#>    <chr>  <chr>     <chr>     <chr>    <int>
#>  1 male   masculine Tatooine  Human        6
#>  2 male   masculine NA        Human        4
#>  3 male   masculine Naboo     Gungan       3
#>  4 male   masculine Naboo     Human        3
#>  5 female feminine  Mirial    Mirialan     2
#>  6 female feminine  Naboo     Human        2
#>  7 female feminine  Tatooine  Human        2
#>  8 female feminine  NA        Human        2
#>  9 male   masculine Alderaan  Human        2
#> 10 male   masculine Corellia  Human        2
#> # ℹ 55 more rows
## Equivalent using {stringr} regex(".")
identical(
  starwars |> detective(glob2rx("*"), !c(name, contains("color"))),
  starwars |> detective(regex("."), !c(name, contains("color")))
)
#> [1] TRUE
## Equivalent using caret "^" in pattern string
identical(
  starwars |> detective(glob2rx("*"), !c(name, contains("color"))),
  starwars |> detective("^", !c(name, contains("color")))
)
#> [1] TRUE