R - Easy regex in gsub() function is not working


I have started programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.

Inter alia, I want to extract the title of the persons in the dataset.

I have these:

    Montvila, Rev. Juozas
    Johnston, Miss. Catherine Helen

and I want these:

    Rev.
    Miss.

My own approach is not working, and I can't figure out what the problem is:

    gsub("([a-zA-Z:space:]+, )|(\.[a-zA-Z:space:]+)", "", data_raw$name)

I hope you can help me! That would be great.

Kind regards, Marcus

I suggest a regex that replaces the whole text with the last chunk of letters followed by a dot.

    > x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
    > sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
    [1] "Rev."  "Miss."

Or a simpler regmatches solution:

    > unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
    [1] "Rev."  "Miss."

Or, if you need to check for the dot but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE), which allows lookarounds in the pattern:

    > unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
    [1] "Rev"  "Miss"

Here, (?=\\.) is a positive lookahead that requires a . after the 1+ letters but excludes it from the match.
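Since the question is about a data frame column, here is a minimal sketch of how the regmatches approach might be applied to one. The data frame below is a hypothetical stand-in for data_raw (the column name is taken from the question, and the third row is made up to show a name without a title). One thing to watch for is that regmatches() returns only the matches, so rows without a match are dropped and the result can be shorter than the column.

```r
# Hypothetical stand-in for the Titanic data; the third row is invented
# to show what happens when a name has no "letters + dot" chunk.
data_raw <- data.frame(
  name = c("Montvila, Rev. Juozas",
           "Johnston, Miss. Catherine Helen",
           "No Title Here"),
  stringsAsFactors = FALSE
)

m <- regexpr("[[:alpha:]]+\\.", data_raw$name)  # -1 where nothing matches
hit <- as.integer(m) > 0                        # plain logical index

# regmatches() skips non-matching rows, so assign by index and
# leave NA where no title was found.
data_raw$title <- NA_character_
data_raw$title[hit] <- regmatches(data_raw$name, m)

print(data_raw$title)
```

With sub() the result always has the same length as the input, but a name without a match is returned unchanged rather than as NA, so which variant fits depends on how missing titles should be handled.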

Details:

  • ^ - start of string
  • .* - any 0+ chars, as many as possible, up to the last...
  • \\b - word boundary
  • ([[:alpha:]]+\\.) - Group 1: one or more letters followed by a literal .
  • .* - any 0+ chars up to the end of the string.
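To illustrate the "as many as possible up to the last" part, the sketch below uses a made-up name (not from the dataset) that contains two dotted chunks. It suggests that the sub() pattern and the regexpr() pattern can disagree on such input, and shows one possible way to anchor on the comma instead; the variable names are only for illustration.

```r
# Hypothetical name with two "letters + dot" chunks
y <- "Smith, Dr. J. Robert"

# Greedy ^.* gives back only as much as needed, so group 1 is the LAST chunk
last_chunk <- sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", y)

# regexpr() reports the FIRST match in the string
first_chunk <- regmatches(y, regexpr("[[:alpha:]]+\\.", y))

# One way to get the first chunk with sub(): anchor on the comma instead
after_comma <- sub("^[^,]*,[[:space:]]*([[:alpha:]]+\\.).*", "\\1", y)

print(c(last_chunk, first_chunk, after_comma))  # "J." "Dr." "Dr."
```

For names with a single dotted chunk, as in the two examples from the question, all three variants return the same value.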

Since the TRE regex engine is used here, . matches any char, including line break chars.

Also, in your code, the . is escaped with a single \, which results in an error since \. is not a valid escape sequence in an R string. Regex escapes must be written with double backslashes.
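As a rough sketch of what fixing the original gsub() call could look like: besides the backslash, the [:space:] class most likely needs attention too, because a POSIX class only works when nested inside a bracket expression, as in [[:space:]]. The variable names below are just for illustration.

```r
x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# Backslash fixed only: [a-zA-Z:space:] is an ordinary bracket expression
# that matches letters and ":" (plus s, p, a, c, e again), not whitespace,
# so the second alternative cannot match ". Juozas".
half_fixed <- gsub("([a-zA-Z:space:]+, )|(\\.[a-zA-Z:space:]+)", "", x)

# Both fixes: doubled backslash and POSIX classes nested in the brackets
fixed <- gsub("([[:alpha:][:space:]]+, )|(\\.[[:alpha:][:space:]]+)", "", x)

print(half_fixed)  # "Rev. Juozas" "Miss. Catherine Helen"
print(fixed)       # "Rev" "Miss"
```

Note that this repaired pattern removes the dot together with the rest of the name, so it yields "Rev" rather than "Rev.", and anything that is not a letter or whitespace after the title (quotes or parentheses, for instance) would be left behind. The sub() and regmatches() variants avoid both issues.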

