Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions
Noun phrases are of common interest in natural language processing. They can be extracted
from text by defining a sequence of Parts of Speech tags. For example, the following sequence of POS tags
can be seen as a noun phrase: Adjective, Noun, Preposition, Noun.
This function recodes Universal POS tags to one of the following 1-letter tags, in order to simplify writing regular expressions
to find Parts of Speech sequences:
A: adjective
C: coordinating conjunction
D: determiner
M: modifier of verb
N: noun or proper noun
P: preposition
O: other elements
After recoding, a simple noun phrase can be identified with the regular expression (A|N)*N(P+D*(A|N)*N)*. It reads: zero or more adjectives or nouns, followed by a noun, optionally followed by one or more groups that each consist of one or more prepositions, zero or more determiners, zero or more adjectives or nouns, and a closing noun.
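The matching step can be sketched in base R without any package. The tokens and one-letter tags below are written by hand for illustration (they are assumptions, not output of this function); the tags are pasted into a single string so that each character position corresponds to one token.

```r
# Hand-written example tokens and their one-letter tags (illustrative only).
tokens <- c("the", "quick", "fox", "of", "the", "forest", "runs")
tags   <- c("D",   "A",     "N",   "P",  "D",   "N",      "O")

# One character per token, so regex positions map back to token positions.
tag_string <- paste(tags, collapse = "")

# Simple noun phrase pattern from the description.
pattern <- "(A|N)*N(P+D*(A|N)*N)*"

m      <- gregexpr(pattern, tag_string, perl = TRUE)[[1]]
starts <- as.integer(m)
ends   <- starts + attr(m, "match.length") - 1L

# Keep real matches only (gregexpr returns -1 when nothing matches).
hits   <- starts > 0
starts <- starts[hits]
ends   <- ends[hits]

# Map each matched tag span back to the tokens it covers.
phrases <- mapply(function(s, e) paste(tokens[s:e], collapse = " "),
                  starts, ends)
print(phrases)
```

The leading determiner is not part of the match because the pattern has to start with an adjective or a noun; the determiner inside the prepositional group is allowed by D*.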
as_phrasemachine(x, type = c("upos", "penn-treebank"))
x: a character vector of POS tags
type: either 'upos' or 'penn-treebank'. Use 'upos' to recode Universal Parts of Speech tags, or 'penn-treebank' to recode Parts of Speech tags as known in the Penn Treebank, to the one-letter counterparts listed in the description.
For more information on extracting phrases see http://brenocon.com/handler2016phrases.pdf
The character vector x, where the respective POS tags are replaced with one-letter tags.
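To make the idea of recoding concrete, here is a minimal base R sketch. The helper `recode_upos_sketch` and its lookup table are hypothetical and deliberately incomplete: they cover only Universal POS tags whose letter follows directly from the list in the description and send everything else to O. This is not the mapping used by as_phrasemachine, which may handle more tags.

```r
# Hypothetical, simplified lookup; NOT the package's actual mapping.
upos_to_letter <- c(ADJ = "A", CCONJ = "C", DET = "D",
                    NOUN = "N", PROPN = "N", ADP = "P")

recode_upos_sketch <- function(x) {
  # Look up each tag by name; tags missing from the table come back as NA.
  out <- unname(upos_to_letter[x])
  # Treat every tag not covered by this sketch as "other".
  out[is.na(out)] <- "O"
  out
}

recoded <- recode_upos_sketch(c("DET", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT"))
print(recoded)
```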
x <- c("PROPN", "SCONJ", "ADJ", "NOUN", "VERB", "INTJ", "DET", "VERB",
       "PROPN", "AUX", "NUM", "NUM", "X", "SCONJ", "PRON", "PUNCT", "ADP",
       "X", "PUNCT", "AUX", "PROPN", "ADP", "X", "PROPN", "ADP", "DET",
       "CCONJ", "INTJ", "NOUN", "PROPN")
as_phrasemachine(x)