Aggregate rsyntax annotations
A method for aggregating rsyntax annotations. The intended purpose is to compute aggregate values for a given label in an annotation column.
For example, suppose you used annotate_rsyntax to add a column with subject-predicate labels, and now you want to concatenate the tokens with these labels. With aggregate_rsyntax you would first aggregate the subject tokens, then aggregate the predicate tokens. With txt = TRUE, a column with the concatenated tokens is added for each label.
You can specify any aggregation function using any column in tc$tokens. Say you want to perform a sentiment analysis on the quotes of politicians. You first used annotate_rsyntax to create an annotation column 'quote' that has the labels 'source', 'verb', and 'quote'. You also used code_dictionary to add a column with unique politician IDs and a column with sentiment scores. Now you can aggregate the source tokens to get a single unique ID, and aggregate the quote tokens to get a single sentiment score.
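The idea behind the quote example can be sketched with plain data.table operations. This is only an illustration of per-label aggregation, not the implementation used by aggregate_rsyntax; the token table and its column names (quote_id, label, politician, sentiment) are hypothetical stand-ins for the columns that annotate_rsyntax and code_dictionary would add.

```r
library(data.table)

# Hypothetical token table: 'label' stands in for the 'quote' annotation
# column, 'quote_id' identifies each annotation. All names are assumptions.
tokens <- data.table(
  quote_id   = c(1, 1, 1, 1, 2, 2, 2),
  label      = c("source", "verb", "quote", "quote", "source", "verb", "quote"),
  token      = c("Smith", "said", "great", "plan", "Jones", "called", "awful"),
  politician = c("P01", NA, NA, NA, "P02", NA, NA),
  sentiment  = c(NA, NA, 1, NA, NA, NA, -1)
)

# Aggregate the 'source' tokens: first non-missing politician ID per annotation
source_id <- tokens[label == "source",
                    list(source = na.omit(politician)[1]),
                    by = quote_id]

# Aggregate the 'quote' tokens: mean sentiment and concatenated text
quote_agg <- tokens[label == "quote",
                    list(sentiment = mean(sentiment, na.rm = TRUE),
                         quote_txt = paste(token, collapse = " ")),
                    by = quote_id]

# One row per annotation, with one column per aggregated value
result <- merge(source_id, quote_agg, by = "quote_id")
print(result)
```

Each label is aggregated separately and the results are joined on the annotation ID, which is why the output has one row per annotation rather than one row per token.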
aggregate_rsyntax(tc, annotation, ..., by_col = NULL, txt = F, labels = NULL, rm_na = T)
tc | A tCorpus
annotation | The name of the rsyntax annotation column
... | To aggregate columns for specific labels, pass agg_label calls, as in the examples (e.g. agg_label('predicate', n = length(token_id)))
by_col | A character vector with other column names in tc$tokens to aggregate by
txt | If TRUE, add columns with concatenated tokens for each label. Can also be a character vector naming the labels for which to create this column
labels | A character vector of labels to use, instead of all labels
rm_na | If TRUE, remove rows with only NA values
A data.table
## Not run:
tc = tc_sotu_udpipe$copy()
tc$udpipe_clauses()

subject_verb_predicate = aggregate_rsyntax(tc, 'clause', txt=TRUE)
head(subject_verb_predicate)

## We can also add specific aggregation functions

## count number of tokens in predicate
aggregate_rsyntax(tc, 'clause',
                  agg_label('predicate', n = length(token_id)))

## same, but with txt for only the subject label
aggregate_rsyntax(tc, 'clause', txt='subject',
                  agg_label('predicate', n = length(token_id)))

## example application: sentiment scores for specific subjects

# first use queries to code subjects
tc$code_features(column = 'who',
                 query = c('I# I~s <this president>',
                           'we# we americans <american people>'))

# then use dictionary to get sentiment scores
dict = melt_quanteda_dict(quanteda::data_dictionary_LSD2015)
dict$sentiment = ifelse(dict$code %in% c('negative','neg_positive'), -1, 1)
tc$code_dictionary(dict)

sent = aggregate_rsyntax(tc, 'clause', txt='predicate',
                         agg_label('subject', subject = na.omit(who)[1]),
                         agg_label('predicate', sentiment = mean(sentiment, na.rm=TRUE)))

head(sent)
sent[,list(sentiment=mean(sentiment, na.rm=TRUE), n=.N), by='subject']
## End(Not run)