Type-Token Statistics for Samples and Empirical Data (zipfR)
Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.
vec2tfl(x)
vec2spc(x)
vec2vgc(x, steps=200, stepsize=NA, m.max=0)
x |
a vector of length N_0, representing a random sample or
other observed data set of N_0 tokens. For each token, the
corresponding element of |
steps |
number of steps for which vocabulary growth data V(N) is calculated. The values of N will be evenly spaced (up to rounding differences) from N=1 to N=N_0. |
stepsize |
alternative way of specifying the steps of the
vocabulary growth curve. In this case, vocabulary growth data will
be calculated every |
m.max |
an integer in the range $1 ... 9$, specifying how many
spectrum elements V_m(N) to include in the vocabulary growth
curve. By default only vocabulary size V(N) is calculated,
i.e. |
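The quantities controlled by steps and m.max can be illustrated in base R. This is only a sketch of the definitions of V(N) and V_1(N), not zipfR's implementation: the step grid built with seq() and round() is an assumption, and the names N_vals and vgc_sketch are hypothetical.

```r
# Toy token vector of N_0 = 9 tokens.
x <- c("a", "b", "a", "c", "a", "b", "d", "e", "a")
N0 <- length(x)

# Assumed step grid: 5 sample sizes spread evenly from 1 to N0, rounded
# to whole tokens. zipfR's own placement of steps may differ in detail.
steps <- 5
N_vals <- unique(round(seq(1, N0, length.out = steps)))

# For each N, count all types (V) and the types seen exactly once (V1)
# among the first N tokens. V1 corresponds to what m.max=1 would add.
vgc_sketch <- t(sapply(N_vals, function(n) {
  f <- table(x[1:n])
  c(N = n, V = length(f), V1 = sum(f == 1))
}))

print(vgc_sketch)
```

Recounting every prefix from scratch keeps the sketch short, but it scales poorly, so it is suitable only for small illustrations like this one.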
There are two main applications for the vec2xxx functions:
They can be used to calculate type-token statistics and vocabulary growth curves for random samples generated from a LNRE model (with the rlnre function).
They provide an easy way to process a user's own data without having to rely on external scripts to compute frequency spectra and vocabulary growth curves. All that is needed is a text file in one-token-per-line format (i.e. where each token is given on a separate line). See "Examples" below for further hints.
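As a minimal sketch of the second application, the following base R code reads a file in one-token-per-line format into a token vector. The file and its contents are made up so that the example is self-contained, and the names token_file, n_tokens and n_types are hypothetical.

```r
# Write a small hypothetical token file so the sketch is self-contained.
token_file <- tempfile(fileext = ".txt")
writeLines(c("the", "cat", "saw", "the", "dog"), token_file)

# One token per line: readLines() yields the token vector directly.
tokens <- readLines(token_file)

# Quick sanity checks before passing the vector on,
# e.g. to vec2spc() or vec2vgc().
n_tokens <- length(tokens)          # sample size N_0
n_types  <- length(unique(tokens))  # vocabulary size V

unlink(token_file)                  # remove the temporary file
```

Comparing the token and type counts against what is known about the corpus is a cheap way to catch stray blank lines or encoding problems before any statistics are computed.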
Both applications work well for samples of up to approx. 1 million tokens. For considerably larger data sets, specialized external software should be used, such as the Perl scripts provided on the zipfR homepage.
An object of class tfl, spc or vgc, representing the type frequency list, frequency spectrum or vocabulary growth curve of the token vector x, respectively.
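What the three return values represent can be sketched in base R on a toy token vector. This only illustrates the underlying definitions. It does not reproduce the tfl, spc or vgc classes, and the variable names are hypothetical.

```r
# Toy token vector: each element names the type of one token.
x <- c("a", "b", "a", "c", "a", "b", "d")

# Type frequency list: the frequency of each type in the sample.
type_freq <- table(x)

# Frequency spectrum: V_m = number of types that occur exactly m times.
spectrum <- table(as.integer(type_freq))

# Vocabulary growth: V(N) = number of distinct types among the first
# N tokens, for N = 1, ..., N_0.
growth <- cumsum(!duplicated(x))

N0 <- length(x)          # sample size
V  <- length(type_freq)  # vocabulary size

print(type_freq)
print(spectrum)
print(growth)
```

Here the spectrum is derived from the type frequency list, the same dependency the Examples section points out with tfl2spc(vec2tfl(x)).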
rlnre for generating random samples (in the form of the required token vectors) from a LNRE model
## type-token statistics for random samples from a LNRE distribution
model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)

vec2tfl(x)
vec2spc(x) # same as tfl2spc(vec2tfl(x))
vec2vgc(x)

sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)

sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)

## Not run:
## load token vector from a file in one-token-per-line format
x <- readLines(filename)
x <- readLines(file.choose()) # with file selection dialog

## you can also perform whitespace tokenization and filter the data
brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)
## End(Not run)