R - Easy regex in gsub() function is not working
I have started programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.
Inter alia, I want to extract the titles of the persons in the dataset.
I have these:
Montvila, Rev. Juozas
Johnston, Miss. Catherine Helen
and want these:
Rev.
Miss.
My own approach is not working. I can't figure out what the problem is:
gsub("([A-Za-z:space:]+, )|(\.[A-Za-z:space:]+)", "", data_raw$Name)
I hope you can help me! That would be great.
Kind regards, Marcus
I suggest a regex that replaces the whole text with the last chunk of letters followed by a dot.
> x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
> sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
[1] "Rev."  "Miss."
Or a simpler regmatches solution:
> unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
[1] "Rev."  "Miss."
Or, if you need to check for the dot but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE), which allows using lookarounds in the pattern:
> unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
[1] "Rev"  "Miss"
Here, (?=\\.) is a positive lookahead that requires a . right after the 1+ letters but excludes it from the match.
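To tie this back to the question, here is a minimal sketch of storing the extracted titles in a new column. The data frame below is a hypothetical stand-in for the asker's data_raw, and the Title column name is my own choice, not something from the question.

```r
# Hypothetical stand-in for the asker's data_raw; only the Name column matters here.
data_raw <- data.frame(
  Name = c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen"),
  stringsAsFactors = FALSE
)

# regexpr() locates the first run of letters followed by a dot in each name,
# and regmatches() extracts the matched text.
m <- regexpr("[[:alpha:]]+\\.", data_raw$Name)

# Assumption: every name contains such a run. regmatches() silently drops
# non-matching elements, so the assignment would fail on a length mismatch.
data_raw$Title <- regmatches(data_raw$Name, m)

data_raw$Title
```

If some names might lack a title, the sub() variant is probably the safer choice for a column, since it always returns one element per input (the unchanged name when nothing matches).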
Details:
^ - start of string
.* - any 0+ chars, as many as possible, up to the last...
\\b - word boundary
([[:alpha:]]+\\.) - Group 1: one or more letters followed by a literal .
.* - any 0+ chars up to the end of the string.
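The "up to the last" part only matters when a string has more than one dotted chunk. A small sketch with a made-up name (not from the Titanic data) shows how the sub() pattern and the regexpr() pattern can then disagree:

```r
y <- "Smith, Mr. J. Robert"  # made-up name with two dotted chunks

# The greedy ^.* gives back only as much as needed, so Group 1 ends up
# being the LAST run of letters followed by a dot.
last_chunk <- sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", y)

# regexpr() scans from the left, so it reports the FIRST such run.
first_chunk <- regmatches(y, regexpr("[[:alpha:]]+\\.", y))

last_chunk   # "J."
first_chunk  # "Mr."
```

For names shaped like the two in the question, both approaches return the same title, so the difference only shows up with initials or other abbreviations later in the name.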
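Since the TRE regex engine is used here (the default when perl=TRUE is not set), . matches any char, including line break chars.
Also, in your code, the . is escaped with a single \, which results in an error, since "\." is not a valid escape sequence in an R string literal. Regex escapes must be written with double backslashes, as in "\\.".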
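As a rough illustration of the escaping point, the sketch below runs the pattern from the question with the backslash doubled. It also writes the character class as [A-Za-z[:space:]], which is my guess at what was intended, because a POSIX class such as [:space:] is only recognized inside a bracket expression.

```r
x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# In R source, "\\." is a two-character string: a backslash and a dot.
# That two-character string is what the regex engine sees as "literal dot".
nchar("\\.")  # 2

# The question's pattern with the escape doubled and the class repaired
# (the class repair is an assumption about the asker's intent).
titles <- gsub("([A-Za-z[:space:]]+, )|(\\.[A-Za-z[:space:]]+)", "", x)
titles  # "Rev"  "Miss"
```

Note that this version also removes the dot, because the second alternative starts matching at the dot itself. That is one more reason to prefer the sub() or regmatches() approaches above when the result should be "Rev." and "Miss.".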