POS tagger - Selecting text from corresponding tags in a sequence in R
I am trying to extract the text that corresponds to a sequence of tags in a sentence. I am trying to get the part of speech that corresponds to each sentence in a text file. My code:
library(NLP)
library(openNLP)

postext <- "The verifone is not working, when customers slide card nothing happens. The screen is frozen. We rebooted but it did not help."
postext1 <- c("the verifone not working", "scanner not scanning", "printer offline", "when customers slide card nothing happens. screen frozen. rebooted did not help.")

# Tokenize the text into words and tag each word with its part of speech
tagpos <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole text as a single sentence annotation
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  postags <- unlist(lapply(a3w$features, `[[`, "POS"))
  postagged <- paste(sprintf("%s/%s", s[a3w], postags), collapse = " ")
  list(postagged = postagged, postags = postags)
}

dd1 <- do.call(rbind, strsplit(as.character(postext), ' '))
dd_v1 <- tagpos(dd1)$postagged
dd_v1
Output:
[1] "The/DT verifone/NNP is/VBZ not/RB working/VBG ,/, when/WRB customers/NNS slide/NN card/NN nothing/NN happens/VBZ ./. The/DT screen/NN is/VBZ frozen/VBN ./. We/PRP rebooted/VBD but/CC it/PRP did/VBD not/RB help/VB ./."
I want to extract the text for a given sequence of tags. For example, I want to extract the text for the tags 'NNP', 'VBZ', 'RB', 'VBG' in sequence, across the entire text file, wherever that sequence occurs in the sentences.
My desired output is:
[1] verifone is not working
Thanks for your help.
This is a rather naive approach and, in case you have plenty of strings, it will be slow, but give it a try:
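# Construct the sequence of tags (a regex would probably be nicer...)
tags <- sapply(strsplit(strsplit(dd_v1, "/")[[1]][-1], " "), "[", 1)
# Take the words from the tagged string too, so they line up with the tags
# (dd1 was split on spaces, so punctuation stays attached to its word)
words <- sapply(strsplit(strsplit(dd_v1, " ")[[1]], "/"), "[", 1)

# Define constants
matchseq <- c("NNP", "VBZ", "RB", "VBG")
totaltags <- length(tags)
searchlength <- length(matchseq)

# Loop through the subvectors and store the starting points of possible matches
startpoints <- c()
for (i in seq_len(max(0, totaltags - searchlength + 1))) {
  if (identical(tags[i:(i + searchlength - 1)], matchseq))
    startpoints <- c(startpoints, i)
}

# Print the first match, if there is one
if (!is.null(startpoints))
  paste(words[startpoints[1]:(startpoints[1] + searchlength - 1)], collapse = " ")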
If you find more than one location, you can, for example, loop over startpoints and print every single sequence separately.
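For the case with several matches, here is a minimal sketch of one way to collect every matching sequence in a single call. The helper name find_tag_sequences is hypothetical (it is not part of NLP or openNLP), and the sketch assumes the tagged string has the "word/TAG" form shown in the question, with tokens separated by single spaces and upper-case Penn Treebank tags. It uses base R only.

```r
# Hypothetical helper: return the words of every run of tokens whose tags
# equal `matchseq`, in order. Overlapping matches are all reported.
find_tag_sequences <- function(tagged, matchseq) {
  tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
  # Split each "word/TAG" token at its last slash
  words <- sub("/[^/]*$", "", tokens)
  tags  <- sub("^.*/", "", tokens)
  n <- length(matchseq)
  if (length(tags) < n) return(character(0))
  starts <- seq_len(length(tags) - n + 1)
  # Keep the start positions where the next n tags equal matchseq
  is_hit <- vapply(starts,
                   function(i) identical(tags[i:(i + n - 1)], matchseq),
                   logical(1))
  # Paste the words of each matching window back together
  vapply(starts[is_hit],
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Assumed sample input, copied from the first sentence of the tagged output
tagged <- "The/DT verifone/NNP is/VBZ not/RB working/VBG ,/, when/WRB customers/NNS slide/NN card/NN nothing/NN happens/VBZ ./."
find_tag_sequences(tagged, c("NNP", "VBZ", "RB", "VBG"))
# [1] "verifone is not working"
```

To cover a whole file, something like lapply(tagged_lines, find_tag_sequences, matchseq = matchseq) should work, where tagged_lines is an assumed character vector holding one tagged string per line.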