R - Easy regex in gsub() function is not working

I have started programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.

Inter alia, I want to extract the title of each person in the dataset.

I have these:

Montvila, Rev. Juozas
Johnston, Miss. Catherine Helen

and I want these:

Rev.
Miss.

My own approach is not working, and I can't figure out what the problem is:

gsub("([a-zA-Z:space:]+, )|(\.[a-zA-Z:space:]+)", "", data_raw$name)

I hope you can help me! That would be great.

Kind regards, Marcus

I suggest using sub() with a regex that replaces the whole string with the last chunk of letters that is followed by a dot:

> x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
> sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
[1] "Rev."  "Miss."

Or a simpler regmatches solution:

> unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
[1] "Rev."  "Miss."

Or, if you need to check for the dot but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE), which allows using lookarounds in the pattern:

> unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
[1] "Rev"  "Miss"

Here, (?=\\.) is a positive lookahead that requires a . right after the 1+ letters, but excludes it from the match.
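For feature engineering, the title usually ends up as a new column. The sketch below applies the sub() pattern to a data frame; the two-row data_raw is a hypothetical stand-in for the dataset in the question, and the column name title is an assumption. sub() returns one value per input row, whereas unlist(regmatches(...)) silently drops rows without a match, so sub() is the safer choice for direct assignment.

```r
# Hypothetical stand-in for the question's data_raw (only a `name` column)
data_raw <- data.frame(
  name = c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen"),
  stringsAsFactors = FALSE
)

# sub() yields exactly one result per element, so lengths always line up
data_raw$title <- sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", data_raw$name)

print(data_raw$title)
```

Note that a name with no "letters + dot" chunk is returned unchanged by sub(), so such rows may need a separate check.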

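Details:

By default, the TRE regex engine is used, and there . matches any char, including line break chars.

Also, in your code, the . is escaped with a single \, which results in an error since "\." is not a valid escape sequence in an R string literal. Regex escapes must be written with double backslashes.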
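For comparison, here is one possible repair of the original gsub() approach, as a sketch rather than the only fix. Besides doubling the backslash, it assumes two further changes that are not in the question: the POSIX classes get their own brackets ([[:alpha:][:space:]], since [a-zA-Z:space:] only matches letters plus the literal characters : s p a c e), and a lookbehind is used so that the dot itself is kept.

```r
x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# "\\." in the R string becomes the regex \. (a literal dot).
# First alternative: the surname and ", " at the start of the string.
# Second alternative: everything after the first dot (the dot is kept).
pattern <- "^[[:alpha:][:space:]]+, |(?<=\\.).*$"

# perl = TRUE is required for the lookbehind (?<=\\.)
titles <- gsub(pattern, "", x, perl = TRUE)

print(titles)
```

Surnames containing other characters, such as hyphens or apostrophes, would not be matched by the first alternative and would need a wider character class.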
