Generate substitution and indel costs
The function seqcost offers different ways to generate substitution costs (supposed to reflect state dissimilarities) and possibly indel costs. The available methods are "CONSTANT" (same cost for all substitutions), "TRATE" (derived from the observed transition rates), "FUTURE" (Chi-squared distance between conditional state distributions lag positions ahead), "FEATURES" (Gower distance between state features), and "INDELS" and "INDELSLOG" (based on estimated indel costs).
The substitution-cost matrix is intended to serve as the sm argument of the seqdist function, which computes distances between sequences. seqsubm is an alias that returns only the substitution-cost matrix, i.e., no indel.
seqcost(seqdata, method, cval = NULL, with.missing = FALSE,
    miss.cost = NULL, time.varying = FALSE, weighted = TRUE,
    transition = "both", lag = 1, miss.cost.fixed = NULL,
    state.features = NULL, feature.weights = NULL,
    feature.type = list(), proximities = FALSE)
seqsubm(...)
seqdata |
A sequence object as returned by the seqdef function. |
method |
String. How to generate the costs. One of |
cval |
Scalar. For method |
with.missing |
Logical. Should an additional entry be added in the matrix for the missing states?
If |
miss.cost |
Scalar or vector. Cost for substituting the missing state. Default is |
miss.cost.fixed |
Logical. Should the substitution cost for missing be set as the |
time.varying |
Logical. If |
weighted |
Logical. Should weights in |
transition |
String. Only used if |
lag |
Integer. For methods |
state.features |
Data frame with feature values for each state. |
feature.weights |
Vector of feature weights with length equal to the number of columns of |
feature.type |
List of feature types. See |
proximities |
Logical: should state proximities be returned instead of substitution costs? |
... |
Arguments passed to |
The substitution-cost matrix has dimension ns*ns, where ns is the number of states in the alphabet of the sequence object. The element (i,j) of the matrix is the cost of substituting state i with state j. It defines the dissimilarity between the states i and j.
With method CONSTANT, the substitution costs are all set equal to the cval value, the default value being 2.
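As a rough illustration of the CONSTANT method, the following base R sketch builds such a matrix by hand. The three-state alphabet is hypothetical, and setting the diagonal to 0 (a state substituted with itself costs nothing) is an assumption not stated in the text above.

```r
# Hypothetical three-state alphabet, for illustration only.
alphabet <- c("A", "B", "C")
cval <- 2  # the default value mentioned above

# Every substitution between two different states costs cval.
sm <- matrix(cval, nrow = length(alphabet), ncol = length(alphabet),
             dimnames = list(alphabet, alphabet))

# Assumption: substituting a state with itself costs nothing.
diag(sm) <- 0
sm
```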
With method TRATE (transition rates), the transition probabilities between all pairs of states are first computed (using the seqtrate function). Then, the substitution cost between states i and j is obtained with the formula
SC(i,j) = cval - P(i|j) - P(j|i)
where P(i|j) is the probability of transition from state j to state i lag positions ahead.
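The TRATE formula can be sketched in base R as follows. The transition probabilities below are made up for illustration (in practice they would come from seqtrate), cval = 2 is chosen only for the example, and the zero diagonal is an assumption.

```r
alphabet <- c("A", "B", "C")

# Made-up transition probabilities: tr[j, i] = P(i | j),
# i.e. rows are origin states, columns are destination states.
tr <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.7, 0.1,
               0.0, 0.3, 0.7),
             nrow = 3, byrow = TRUE,
             dimnames = list(alphabet, alphabet))

cval <- 2  # illustrative base value

# SC(i,j) = cval - P(i|j) - P(j|i): subtracting tr and its transpose
# removes the transition probabilities in both directions.
sm <- cval - tr - t(tr)
diag(sm) <- 0  # assumption: no cost for substituting a state with itself
sm
```

Pairs of states with frequent transitions between them end up with a lower substitution cost, which is the intent of the method.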
With method FUTURE, the cost between i and j is the Chi-squared distance between the vectors d(alphabet | i) and d(alphabet | j) of probabilities of transition from states i and j, respectively, to all the states in the alphabet lag positions ahead:
SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))
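A minimal base R sketch of this idea is given below. The text does not spell out the normalization used inside ChiDist, so the form used here (squared differences divided by the column sums of the transition matrix, then a square root) is an assumption and may differ from what seqcost actually computes. The transition probabilities and the helper chi_dist are hypothetical.

```r
alphabet <- c("A", "B", "C")

# Made-up lagged transition probabilities: row i is d(alphabet | i).
tr <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.7, 0.1,
               0.0, 0.3, 0.7),
             nrow = 3, byrow = TRUE,
             dimnames = list(alphabet, alphabet))

# Assumed weights: total probability mass arriving in each destination state.
col_mass <- colSums(tr)

# Hypothetical helper: chi-squared distance between two distributions p and q.
chi_dist <- function(p, q, w) sqrt(sum((p - q)^2 / w))

n <- length(alphabet)
sm <- matrix(0, n, n, dimnames = list(alphabet, alphabet))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sm[i, j] <- chi_dist(tr[i, ], tr[j, ], col_mass)
  }
}
sm
```

Two states are close under this measure when they lead to similar futures, regardless of how often one transitions directly into the other.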
With method FEATURES, each state is characterized by the variables state.features, and the cost between i and j is computed as the Gower distance between their vectors of state.features values.
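For intuition, the Gower distance can be sketched in base R for the simple case of numeric features with equal weights: each feature's absolute difference is divided by that feature's range, and the results are averaged. The feature names and values below are hypothetical, and mixed feature types or feature.weights are not handled here.

```r
# Hypothetical numeric features describing three states.
state.features <- data.frame(
  children = c(0, 0, 1),
  married  = c(0, 1, 1),
  row.names = c("A", "B", "C")
)

x <- as.matrix(state.features)

# Range of each feature, used to scale differences to [0, 1].
ranges <- apply(x, 2, function(v) diff(range(v)))
scaled <- sweep(x, 2, ranges, "/")

n <- nrow(x)
sm <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # Equal-weight average of the range-scaled absolute differences.
    sm[i, j] <- mean(abs(scaled[i, ] - scaled[j, ]))
  }
}
sm
```

For mixed numeric and categorical features, the daisy function of the cluster package with metric = "gower" computes the general form of this dissimilarity.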
With methods INDELS and INDELSLOG, values of indels are first derived from the state relative frequencies f_i. For INDELS, indel_i = 1/f_i is used, and for INDELSLOG, indel_i = log[2/(1 + f_i)].
Substitution costs are then set as SC(i,j) = indel_i + indel_j.
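The two indel formulas and the resulting substitution costs can be sketched in base R as follows. The state frequencies are made up for illustration, and the zero diagonal is an assumption not stated in the text.

```r
# Made-up relative state frequencies f_i (they sum to 1).
freq <- c(A = 0.5, B = 0.3, C = 0.2)

# INDELS: inverse of the state frequency.
indel_inv <- 1 / freq

# INDELSLOG: log[2 / (1 + f_i)].
indel_log <- log(2 / (1 + freq))

# SC(i,j) = indel_i + indel_j, shown here for INDELSLOG.
sm_log <- outer(indel_log, indel_log, "+")
diag(sm_log) <- 0  # assumption: no cost for substituting a state with itself

indel_inv
indel_log
sm_log
```

Both formulas decrease as f_i grows, so rarer states receive larger indel costs; the logarithmic version keeps the costs within a much narrower range.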
For all methods but INDELS and INDELSLOG, the indel is set as max(sm)/2 when time.varying=FALSE and as 1 otherwise.
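The max(sm)/2 rule for the non-time-varying case can be illustrated with a small hand-built matrix; the values below are hypothetical.

```r
alphabet <- c("A", "B", "C")

# Hypothetical substitution-cost matrix.
sm <- matrix(c(0.0, 1.7, 1.9,
               1.7, 0.0, 1.6,
               1.9, 1.6, 0.0),
             nrow = 3, byrow = TRUE,
             dimnames = list(alphabet, alphabet))

# Indel cost as half of the largest substitution cost.
indel <- max(sm) / 2
indel
```

With this choice, one deletion followed by one insertion (2 * indel) costs exactly as much as the most expensive substitution, so no substitution is ever more expensive than the corresponding indel pair.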
For seqcost, a list of two elements, indel and sm or prox:
indel |
The indel cost. Either a scalar or a vector of size ns. |
sm |
The substitution-cost matrix when |
prox |
The state proximity matrix when |
sm and prox are matrices of size ns * ns, where ns is the number of states in the alphabet of the sequence object.
For seqsubm, only one element, the matrix sm.
Gilbert Ritschard and Matthias Studer (and Alexis Gabadinho for first version of seqsubm)
Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.
Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: 10.1111/rssa.12125
Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland, 2014. DOI: 10.12682/lives.2296-1658.2014.33
library(TraMineR)

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' example data set.
data(biofam)
biofam.seq <- seqdef(biofam[501:600,10:25])

## Indel and substitution costs based on log of inverse state frequencies
lifcost <- seqcost(biofam.seq, method="INDELSLOG")
## Here lifcost$indel is a vector
biofam.om <- seqdist(biofam.seq, method="OM", indel=lifcost$indel, sm=lifcost$sm)

## Optimal matching using transition rates based substitution-cost matrix
## and the associated indel cost
## Here trcost$indel is a scalar
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM", indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1[,1:13], weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", indel=subm$indel, sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", indel=subm.w$indel, sm=subm.w$sm, with.missing=TRUE)

## Compare the unweighted and weighted distance matrices element by element
ex1.om == ex1.omw