R - Easy regex in gsub() function is not working
I have started programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.
Inter alia, I want to extract the title of the persons in the dataset.
I have these:
montvila, rev. juozas
johnston, miss. catherine helen
and I want these:
rev.
miss.
My own approach is not working, and I can't figure out what the problem is:
gsub("([A-Za-z:space:]+, )|(\.[A-Za-z:space:]+)", "", data_raw$name)
I hope you can help me! That would be great.
Kind regards, Marcus
I suggest a regex that replaces the whole text with the last chunk of letters followed by a dot.
> x <- c("montvila, rev. juozas", "johnston, miss. catherine helen")
> sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
[1] "rev."  "miss."
Or a simpler regmatches solution:
> unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
[1] "rev."  "miss."
Or, if you need to check for the dot but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE allows using lookarounds in the pattern):
> unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
[1] "rev"  "miss"
Here, (?=\\.) is a positive lookahead that requires a . after the 1+ letters, but excludes it from the match.
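To put the extracted titles back into the data frame, the sketch below may help. It assumes a data frame called data_raw with a character column name, as in the question; the third row and the new column title are made up for illustration. One thing to watch: regmatches() returns only the elements that matched, so assigning its result directly to a column can fail when some rows have no title.

```r
# Toy stand-in for the real data; the third name is invented to show a non-match.
data_raw <- data.frame(
  name = c("montvila, rev. juozas",
           "johnston, miss. catherine helen",
           "no title here"),
  stringsAsFactors = FALSE
)

# regexpr() returns -1 for elements without a match.
m <- regexpr("[[:alpha:]]+(?=\\.)", data_raw$name, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched.
data_raw$title <- NA_character_
data_raw$title[m != -1] <- regmatches(data_raw$name, m)

print(data_raw$title)
# [1] "rev"  "miss" NA
```

Rows without a title simply stay NA, so the new column always has the same length as the data frame.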
Details:
^ - start of string
.* - any 0+ chars, as many as possible, up to the last...
\\b - word boundary
([[:alpha:]]+\\.) - Group 1: 1 or more letters followed by a literal .
.* - any 0+ chars up to the end of the string.
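One consequence of the greedy leading .* is easier to see on a name with more than one dotted chunk. The name below is made up for illustration: sub() with this pattern keeps the last chunk of letters followed by a dot, whereas regexpr() returns the first one.

```r
# Hypothetical name containing two "letters + dot" chunks.
y <- "smith, dr. j. john"

# Greedy .* consumes as much as possible, so group 1 is the LAST chunk.
last_chunk <- sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", y)

# regexpr() reports the FIRST match in the string.
first_chunk <- unlist(regmatches(y, regexpr("[[:alpha:]]+\\.", y)))

print(last_chunk)   # [1] "j."
print(first_chunk)  # [1] "dr."
```

If the data contains initials like this, the regmatches variant is probably the safer choice for extracting titles.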
Since the TRE regex engine is used, . matches any char, including line break chars.
Also, in your code, the . is escaped with a single \, which results in an error since \. is a wrong escape sequence in an R string. Regex escapes must be defined with double backslashes.
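For completeness, here is a sketch of how the original gsub() idea could be repaired. Besides doubling the backslash, it assumes two further changes that go beyond what is described above: [:space:] is wrapped in its own brackets inside the bracket expression (otherwise it is read as the literal characters :, s, p, a, c, e), and a lookbehind, which needs perl=TRUE, keeps the dot in the result.

```r
x <- c("montvila, rev. juozas", "johnston, miss. catherine helen")

# First alternative: the surname up to and including ", ".
# Second alternative: everything after the dot; (?<=\\.) checks for the dot
# without consuming it, so the dot stays in the result.
pattern <- "([A-Za-z[:space:]]+, )|((?<=\\.)[A-Za-z[:space:]]+)"

titles <- gsub(pattern, "", x, perl = TRUE)
print(titles)
# [1] "rev."  "miss."
```

Names containing other characters (quotes, parentheses, hyphens) would not be fully stripped by this pattern, so the sub() or regmatches() approaches are probably more robust on the full dataset.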