zipfR: vec2xxx – R documentation


vec2xxx

Type-Token Statistics for Samples and Empirical Data (zipfR)

Description

Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.

Usage

vec2tfl(x)

vec2spc(x)

vec2vgc(x, steps=200, stepsize=NA, m.max=0)

Arguments

`x`	a vector of length N_0, representing a random sample or other observed data set of N_0 tokens. For each token, the corresponding element of `x` specifies the type that the token belongs to. Usually, `x` is a character vector, but it might also specify integer IDs in some cases.
`steps`	number of steps for which vocabulary growth data V(N) is calculated. The values of N will be evenly spaced (up to rounding differences) from N=1 to N=N_0.
`stepsize`	alternative way of specifying the steps of the vocabulary growth curve. In this case, vocabulary growth data will be calculated every `stepsize` tokens. The first step is chosen such that the last step corresponds to the full sample (N=N_0). Only one of the parameters `steps` and `stepsize` may be specified.
`m.max`	an integer in the range 1 ... 9, specifying how many spectrum elements V_m(N) to include in the vocabulary growth curve. With the default, `m.max=0`, only the vocabulary size V(N) is calculated.
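
The difference between `steps` and `stepsize` can be sketched in base R. The helper `vgc_sample_sizes` below is hypothetical and not part of zipfR; it only illustrates one plausible way to choose the sample sizes N as described above, and the exact rounding used by `vec2vgc` may differ.

```r
# Hypothetical helper (not part of zipfR): sample sizes N at which V(N)
# would be reported for a token vector of length n0.
vgc_sample_sizes <- function(n0, steps = 200, stepsize = NA) {
  if (is.na(stepsize)) {
    # roughly evenly spaced values from N = 1 to N = n0
    unique(round(seq(1, n0, length.out = min(steps, n0))))
  } else {
    # count backwards from n0 so that the last step is the full sample
    rev(seq(n0, 1, by = -stepsize))
  }
}

vgc_sample_sizes(1000, steps = 5)        # 5 values, from 1 to 1000
vgc_sample_sizes(1000, stepsize = 300)   # 100 400 700 1000
```

Unlike `vec2vgc`, which accepts only one of the two parameters, this sketch simply lets `stepsize` take precedence when it is given.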

Details

There are two main applications for the vec2xxx functions:

a) They can be used to calculate type-token statistics and vocabulary growth curves for random samples generated from an LNRE model (with the rlnre function).
b) They provide an easy way to process a user's own data without having to rely on external scripts to compute frequency spectra and vocabulary growth curves. All that is needed is a text file in one-token-per-line format (i.e. where each token is given on a separate line). See "Examples" below for further hints.

Both applications work well for samples of up to approx. 1 million tokens. For considerably larger data sets, specialized external software should be used, such as the Perl scripts provided on the zipfR homepage.
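
As a rough illustration of application b), the following sketch writes a tiny one-token-per-line file and reads it back as a token vector. The file contents and the variable name `token_file` are made up for illustration; the final call to `vec2spc` runs only if zipfR is installed.

```r
# Create a small one-token-per-line file so the sketch is self-contained
# (contents are invented for illustration).
token_file <- tempfile(fileext = ".txt")
writeLines(c("the", "cat", "sat", "on", "the", "mat"), token_file)

# Each line of the file becomes one element of the token vector
x <- readLines(token_file)
length(x)          # sample size N_0
length(unique(x))  # vocabulary size V

# If zipfR is available, the vector can be passed on directly
if (requireNamespace("zipfR", quietly = TRUE)) {
  sample.spc <- zipfR::vec2spc(x)
}
```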

Value

An object of class tfl, spc or vgc, representing the type frequency list, frequency spectrum or vocabulary growth curve of the token vector x, respectively.
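
To make the three statistics concrete, here is a minimal base-R sketch on a toy token vector. It is not how zipfR computes or stores its tfl, spc and vgc objects; it only shows the quantities those objects represent, using invented variable names.

```r
x <- c("a", "b", "a", "c", "a", "b", "d")   # toy token vector, N_0 = 7

# Type frequency list: how often each type occurs
type_freq <- sort(table(x), decreasing = TRUE)

# Frequency spectrum: V_m = number of types that occur exactly m times
spectrum <- table(as.vector(type_freq))

# Vocabulary growth: V(N) = number of distinct types among the first N tokens
v_of_n <- cumsum(!duplicated(x))
v_of_n   # 1 2 2 3 3 3 4
```

In this toy sample, type "a" occurs three times, two types occur once (so V_1 = 2), and the vocabulary reaches V = 4 at N = 7.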

Examples

## type-token statistics for random samples from an LNRE distribution

model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)

vec2tfl(x)
vec2spc(x)  # same as tfl2spc(vec2tfl(x))
vec2vgc(x)

sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)

sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)

## Not run: 
## load token vector from a file in one-token-per-line format
x <- readLines(filename)      # filename: path to your token file
x <- readLines(file.choose()) # with file selection dialog

## you can also perform whitespace tokenization and filter the data
brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)

## End(Not run)

zipfR

Statistical Models for Word Frequency Distributions

v0.6-70

GPL-3

Authors

Stefan Evert <stefan.evert@fau.de>, Marco Baroni <marco.baroni@unitn.it>

Initial release

2020-10-10