r - Easy regex in gsub() function is not working -


I have started programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.

Among other things, I want to extract the title of each person in the dataset.

I have these:

montvila, rev. juozas
johnston, miss. catherine helen

and I want these:

rev.
miss.

My own approach is not working, and I can't figure out what the problem is:

gsub("([A-Za-z:space:]+, )|(\.[A-Za-z:space:]+)", "", data_raw$name)

I hope you can help me! That would be great.

Kind regards, Marcus

I suggest a regex that replaces the whole text with the last chunk of letters followed by a dot.

    > x <- c("montvila, rev. juozas", "johnston, miss. catherine helen")
    > sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
    [1] "rev."  "miss."

Or, a simpler regmatches solution:

    > unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
    [1] "rev."  "miss."

Or, if you need to check for the dot but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE), which allows lookarounds in the pattern:

    > unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
    [1] "rev"  "miss"

Here, (?=\\.) is a positive lookahead that requires a . right after the 1+ letters but excludes it from the match.
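One thing to watch when writing the result back to a data frame column: regmatches() combined with regexpr() returns only the elements that matched, so the unlisted result can be shorter than the input. The sketch below is one possible way to keep the lengths aligned. The vector with a title-less entry and the variable names are made up for illustration.

```r
# Hypothetical input: the middle entry contains no "letters + dot" chunk.
x <- c("montvila, rev. juozas", "no title here", "johnston, miss. catherine helen")

# regexpr() returns -1 for elements without a match.
m <- regexpr("[[:alpha:]]+\\.", x)

# regmatches() silently drops the non-matching element.
matched_only <- unlist(regmatches(x, m))

# Pre-fill with NA and assign only where a match exists,
# so the result has the same length as the input.
titles <- rep(NA_character_, length(x))
titles[m != -1] <- regmatches(x, m)

print(length(matched_only))
print(titles)
```

The sub() approach shown earlier does not shorten the vector, but it returns the unchanged input string for entries that do not match, so either way non-matching rows need a deliberate decision.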

Details of the sub pattern:

  • ^ - start of string
  • .* - any 0+ chars, as many as possible, up to the last...
  • \\b - word boundary
  • ([[:alpha:]]+\\.) - Group 1: one or more letters followed by a literal .
  • .* - any 0+ chars up to the end of the string.
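The breakdown above can be checked with a small sketch. regexec() is used here, as an alternative to sub(), to show the full match and Group 1 side by side. The second part uses a made-up string with two dotted chunks to show that the greedy leading .* leaves the last one for the group. perl = TRUE is set there only so that the backtracking behaviour is the documented PCRE one, which is an assumption of this sketch and not part of the original answer.

```r
x <- c("montvila, rev. juozas", "johnston, miss. catherine helen")
pattern <- "^.*\\b([[:alpha:]]+\\.).*"

# regexec() reports the full match and each capture group separately.
parts <- regmatches(x, regexec(pattern, x))
full_match <- vapply(parts, `[`, character(1), 1)  # the whole string is consumed
group1 <- vapply(parts, `[`, character(1), 2)      # the captured title
print(full_match)
print(group1)

# Hypothetical string with two "letters + dot" chunks: the greedy .*
# swallows the first one, so Group 1 holds the last chunk.
two_titles <- "smith, dr. mrs. jane"
last_chunk <- sub(pattern, "\\1", two_titles, perl = TRUE)
print(last_chunk)
```

Because the pattern consumes the whole string, replacing the match with \\1 leaves only the captured title.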

Since the TRE regex engine is used here, . matches any char, including line break chars.

Also, in your code, the . is escaped with a single \, which results in an error since \. is not a valid escape sequence in an R string literal. Regex escapes must be written with double backslashes.
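A minimal sketch of that error, and of one possible repair of the original gsub() call, follows. parse() is used only so the error can be caught without stopping the script. The second change, writing the class as [[:space:]] inside the bracket expression, is an extra observation and not from the answer above: a bare :space: inside [...] is read as the literal characters :, s, p, a, c and e.

```r
x <- c("montvila, rev. juozas", "johnston, miss. catherine helen")

# The parsed text is gsub("\.", "", x): R rejects "\." when reading the
# string literal, before any regex is compiled.
res <- try(parse(text = 'gsub("\\.", "", x)'), silent = TRUE)
print(inherits(res, "try-error"))

# Repaired version of the question's pattern: double backslash for the dot,
# and the POSIX class written as [[:space:]] inside the bracket expression.
fixed <- gsub("([A-Za-z[:space:]]+, )|(\\.[A-Za-z[:space:]]+)", "", x)
print(fixed)
```

Even when repaired, this pattern removes the dot together with the trailing name, so it yields the title without the dot. To keep the dot, the sub() or regmatches() patterns shown earlier are the simpler route.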

