Pedigree Reconstruction
Perform pedigree reconstruction based on SNP data, including parentage assignment and sibship clustering.
sequoia( GenoM = NULL, LifeHistData = NULL, SeqList = NULL, Module = "ped", MaxSibIter = 42, Err = 1e-04, ErrFlavour = "version2.0", MaxMismatch = NA, Tfilter = -2, Tassign = 0.5, MaxSibshipSize = 100, DummyPrefix = c("F", "M"), Complex = "full", Herm = "no", UseAge = "yes", args.AP = list(Flatten = NULL, Smooth = TRUE), FindMaybeRel = FALSE, CalcLLR = TRUE, quiet = FALSE, Plot = NULL )
GenoM |
numeric matrix with genotype data: One row per individual, and
one column per SNP, coded as 0, 1, 2 or -9 (missing). See also
|
LifeHistData |
dataframe with 3 columns (optionally 5):
If the species has multiple generations per year, use an integer coding such that the candidate parents' ‘Birth year’ is at least one smaller than their putative offspring's. Column names are ignored, so ensure column order is ID - sex - birth year (- BY.min - BY.max). Individuals do not need to be in the same order as in ‘GenoM’, nor do all genotyped individuals need to be included. |
SeqList |
list with output from a previous run, to be re-used in the
current run. Used are elements ‘PedigreePar’, ‘LifeHist’, ‘AgePriors’,
‘Specs’, and ‘ErrM’, and these override the corresponding input parameters.
Not all of these elements need to be present, and all other elements are
ignored. If |
Module |
one of
NOTE: Until 'MaxSibIter' is fully deprecated: if 'MaxSibIter' differs
from the default ( |
MaxSibIter |
[will be deprecated] number of iterations of sibship clustering, including assignment of grandparents to sibships and avuncular relationships between sibships. Clustering continues until convergence or until MaxSibIter is reached. Set to 0 for parentage assignment only. |
Err |
estimated genotyping error rate, as a single number or 3x3 matrix. Details below. The error rate is presumed constant across SNPs, and missingness is presumed random with respect to actual genotype. |
ErrFlavour |
function that takes |
MaxMismatch |
DEPRECATED AND IGNORED. Now calculated
automatically using |
Tfilter |
threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time. |
Tassign |
minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive. |
MaxSibshipSize |
maximum number of offspring for a single individual (a generous safety margin is advised). |
DummyPrefix |
character vector of length 2 with prefixes for dummy dams (mothers) and sires (fathers); maximum 20 characters each. Length 3 vector in case of hermaphrodites (or default prefix 'H'). |
Complex |
Breeding system complexity. Either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous). |
Herm |
Hermaphrodites, either "no", "A" (distinguish between dam and sire role, default if at least 1 individual with sex=4), or "B" (no distinction between dam and sire role). Both of the latter deal with selfing. |
UseAge |
either "yes" (default), "no", or "extra" (additional rounds with extra reliance on ageprior, may boost assignments but increased risk of erroneous assignments); used during full reconstruction only. |
args.AP |
list with arguments to be passed on to
|
FindMaybeRel |
DEPRECATED AND IGNORED, advised to run
|
CalcLLR |
TRUE/FALSE; calculate log-likelihood ratios for all assigned
parents (genotyped + dummy; parent vs. otherwise related). Time-consuming
in large datasets. Can be done separately with |
quiet |
suppress messages: TRUE/FALSE/"verbose". |
Plot |
display plots from |
For each pair of candidate relatives, the likelihoods are calculated of them being parent-offspring (PO), full siblings (FS), half siblings (HS), grandparent-grandoffspring (GG), full avuncular (niece/nephew - aunt/uncle; FA), half avuncular/great-grandparental/cousins (HA), or unrelated (U). Assignments are made if the likelihood ratio (LLR) between the focal relationship and the most likely alternative exceed the threshold Tassign.
Dummy parents of sibships are denoted by F0001, F0002, ... (mothers) and M0001, M0002, ... (fathers), are appended to the bottom of the pedigree, and may have been assigned real or dummy parents themselves (i.e. sibship-grandparents). A dummy parent is not assigned to singletons.
The genotyping error rate 'Err' is by default at locus level, not allele level: the probability to observe true homozygote aa as heterozygote Aa is \approx E, and as alternate homozygote AA (E/2)^2; the probability to observe a true heterozygote as aa = the probability to observe it as AA = E/2. This error structure can be fully customised by providing a 3x3 matrix of observed genotype (columns) conditional on actual genotype (rows) instead.
Full explanation of the various options and interpretation of the output is provided in the vignette.
A list with some or all of the following components:
AgePriors |
Matrix with age-difference based probability ratios for
each relationship, used for full pedigree reconstruction; see
|
DummyIDs |
Dataframe with pedigree for dummy individuals, as well as
their sex, estimated birth year (point estimate, upper and lower bound of
95% confidence interval; see also |
DupGenotype |
Dataframe, duplicated genotypes (with different IDs, duplicate IDs are not allowed). The specified number of maximum mismatches is used here too. Note that this dataframe may include pairs of closely related individuals, and monozygotic twins. |
DupLifeHistID |
Dataframe, row numbers of duplicated IDs in life history dataframe. For convenience only, but may signal a problem. The first entry is used. |
ErrM |
Error matrix; probability of observed genotype (columns) conditional on actual genotype (rows) |
ExcludedInd |
Individuals in GenoM which were excluded because of a too low genotyping success rate (<50%). |
ExcludedSNPs |
Column numbers of SNPs in GenoM which were excluded because of a too low genotyping success rate (<10%). |
LifeHist |
Provided dataframe with sex and birth year data. |
LifeHistPar |
LifeHist with additional columns 'Sexx' (inferred Sex when assigned as part of parent-pair), 'BY.est' (mode of birth year probability distribution), 'BY.lo' (lower limit of 95% highest density region), 'BY.hi' (higher limit), inferred after parentage assignment. 'BY.est' is NA when the probability distribution is flat between 'BY.lo' and 'BY.hi'. |
LifeHistSib |
as LifeHistPar, but estimated after full pedigree reconstruction |
MaybeParent |
Dataframe with pairs of individuals who are more likely parent-offspring than unrelated, but which could not be phased due to unknown age difference or sex, or for whom LLR did not pass Tassign. |
MaybeRel |
Dataframe with pairs of individuals who are more likely to be first or second degree relatives than unrelated, but which could not be assigned. |
MaybeTrio |
Dataframe with non-assigned parent-parent-offspring trios (both parents are of unknown sex), with similar columns as the pedigree |
NoLH |
Vector, IDs in genotype data for which no life history data is provided. |
Pedigree |
Dataframe with assigned genotyped and dummy parents from Sibship step; entries for dummy individuals are added at the bottom. |
PedigreePar |
Dataframe with assigned parents from Parentage step. |
Specs |
Named vector with parameter values. |
TotLikParents |
Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Parentage. |
TotLikSib |
Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Sibship clustering. |
AgePriorExtra |
As AgePriors, but including columns for grandparents and avuncular pairs. NOT updated after parentage assignment, but returned as used during the run. |
DummyClones |
Hermaphrodites only: female-male dummy ID pairs that refer to the same non-genotyped individual |
List elements PedigreePar and Pedigree both have the following columns:
id |
Individual ID |
dam |
Assigned mother, or NA |
sire |
Assigned father, or NA |
LLRdam |
Log10-Likelihood Ratio (LLR) of this female being the mother,
versus the next most likely relationship between the focal individual and
this female. See Details below for relationships considered, and see
|
LLRsire |
idem, for male parent |
LLRpair |
LLR for the parental pair, versus the next most likely configuration between the three individuals (with one or neither parent assigned) |
OHdam |
Number of loci at which the offspring and mother are opposite homozygotes |
OHsire |
idem, for father |
MEpair |
Number of Mendelian errors between the offspring and the parent pair, includes OH as well as e.g. parents being opposing homozygotes, but the offspring not being a heterozygote. The offspring being OH with both parents is counted as 2 errors. |
While every effort has been made to ensure that sequoia provides what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of sequoia is entirely at your own risk.
Jisca Huisman, jisca.huisman@gmail.com
Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009–1024.
GenoConvert
to read in various data formats,
CheckGeno
, SnpStats
to calculate
missingness and allele frequencies,
SimGeno
to simulate SNP data from a pedigree
MakeAgePrior
to estimate effect of age on relationships,
GetMaybeRel
to find pairs of potential relatives,
SummarySeq
and PlotAgePrior
to visualise
results,
GetRelM
to turn a pedigree into pairwise relationships,
CalcOHLLR
to calculate Mendelian errors and LLR for any
pedigree,
CalcPairLL
for likelihoods of various relationships
between specific pairs,
CalcBYprobs
to estimate birth years,
PedCompare
and ComparePairs
to compare to
two pedigrees,
EstConf
to estimate assignment errors,
writeSeq
to save results,
vignette("sequoia") for detailed manual & FAQ.
# === EXAMPLE 1: simulated data === data(SimGeno_example, LH_HSg5, package="sequoia") head(SimGeno_example[,1:10]) head(LH_HSg5) # parentage assignment: SeqOUT <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="par", Plot=TRUE) names(SeqOUT) SeqOUT$PedigreePar[34:42, ] # compare to true (or old) pedigree: PC <- PedCompare(Ped_HSg5, SeqOUT$PedigreePar) PC$Counts["GG",,] # parentage assignment + full pedigree reconstruction: SeqOUT2 <- sequoia(GenoM = SimGeno_example, Err = 0.005, LifeHistData = LH_HSg5, Module="ped", quiet="verbose") SeqOUT2$Pedigree[34:42, ] PC2 <- PedCompare(Ped_HSg5, SeqOUT2$Pedigree) PC2$Counts["GT",,] PC2$Counts[,,"dam"] # different kind of pedigree comparison: ComparePairs(Ped1=Ped_HSg5, Ped2=SeqOUT$PedigreePar, patmat=TRUE) # results overview: SummarySeq(SeqOUT2) # important to run with approx. correct genotyping error rate: SeqOUT2.b <- sequoia(GenoM = SimGeno_example, # Err = 1e-4 by default LifeHistData = LH_HSg5, Module="ped", Plot=FALSE) PC2.b <- PedCompare(Ped_HSg5, SeqOUT2.b$Pedigree) PC2.b$Counts["GT",,] ## Not run: # === EXAMPLE 2: real data === # ideally, select 400-700 SNPs: high MAF & low LD # save in 0/1/2/NA format (PLINK's --recodeA) GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw", InFormat = "raw") # can also do Colony format SNPSTATS <- SnpStats(GenoM) # perhaps after some data-cleaning: write.table(GenoM, file="MyGenoData.txt", row.names=T, col.names=F) # later: GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=F)) LHdata <- read.table("LifeHistoryData.txt", header=T) # ID-Sex-birthyear SeqOUT <- sequoia(GenoM, LHdata, Err=0.005) SummarySeq(SeqOUT) writeSeq(SeqOUT, folder="sequoia_output") # several text files # runtime: SeqOUT$Specs$TimeEnd - SeqOUT$Specs$TimeStart ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.