protti: create_synthetic_data – R documentation

create_synthetic_data

Creates a synthetic limited proteolysis proteomics dataset

Description

This function creates a synthetic limited proteolysis proteomics dataset that can be used to test functions while knowing the ground truth.

Usage

create_synthetic_data(
  n_proteins,
  frac_change,
  n_replicates,
  n_conditions,
  method = "random_effect",
  concentrations = NULL,
  median_offset_sd = 0.05,
  mean_protein_intensity = 16.88,
  sd_protein_intensity = 1.4,
  mean_n_peptides = 12.75,
  size_n_peptides = 0.9,
  mean_sd_peptides = 1.7,
  sd_sd_peptides = 0.75,
  mean_log_replicates = -2.2,
  sd_log_replicates = 1.05,
  effect_sd = 2,
  dropout_curve_inflection = 14,
  dropout_curve_sd = -1.2,
  additional_metadata = TRUE
)

Arguments

`n_proteins`	Numeric, specifies the number of proteins in the synthetic dataset.
`frac_change`	Numeric, the fraction of proteins that have a peptide changing in abundance. Currently, only one peptide per protein changes.
`n_replicates`	Numeric, the number of replicates per condition.
`n_conditions`	Numeric, the number of conditions.
`method`	Character, specifies the method type for the random sampling of significantly changing peptides. If `method = "random_effect"`, the effect for each condition is randomly sampled and conditions do not depend on each other. If `method = "dose_response"`, the effect is sampled based on a dose response curve and conditions are related to each other depending on the curve shape. In this case the concentrations argument needs to be specified.
`concentrations`	Numeric vector with one entry per condition (length `n_conditions`); only needs to be specified if `method = "dose_response"`. This allows equal sampling of peptide intensities: it ensures that the same positions of the dose response curves are sampled for each peptide based on the provided concentrations.
`median_offset_sd`	Numeric, standard deviation of the normal distribution that is used for sampling inter-sample differences. Default: 0.05.
`mean_protein_intensity`	Numeric, mean of the protein intensity distribution. Default: 16.88.
`sd_protein_intensity`	Numeric, standard deviation of the protein intensity distribution. Default: 1.4.
`mean_n_peptides`	Numeric, mean number of peptides per protein. Default: 12.75.
`size_n_peptides`	Numeric, dispersion parameter (the shape parameter of the gamma mixing distribution) of the negative binomial distribution of peptide counts. In theory it can be calculated as `mean^2 / (variance - mean)`, because the variance of the negative binomial distribution is `mean + mean^2 / size`. In practice it is better obtained by fitting the negative binomial distribution to real data, which can be done with the `optim` function (see Examples section). Default: 0.9.
`mean_sd_peptides`	Numeric, mean of peptide intensity standard deviations within a protein. Default: 1.7.
`sd_sd_peptides`	Numeric, standard deviation of the peptide intensity standard deviations within a protein. Default: 0.75.
`mean_log_replicates, sd_log_replicates`	Numeric, `meanlog` and `sdlog` values of the log normal distribution of replicate standard deviations. Can be obtained by fitting a log normal distribution to the distribution of replicate standard deviations from a real dataset. This can be done using the `optim` function (see Examples section). Default: -2.2 and 1.05.
`effect_sd`	Numeric, standard deviation of a normal distribution around `mean = 0` that is used to sample the effect of significantly changing peptides. Default: 2.
`dropout_curve_inflection`	Numeric, intensity inflection point of a probabilistic dropout curve that is used to sample intensity dependent missing values. This argument determines how many missing values there are in the dataset. Default: 14.
`dropout_curve_sd`	Numeric, standard deviation of the probabilistic dropout curve. Needs to be negative so that dropout is sampled towards low intensities. Default: -1.2.
`additional_metadata`	Logical, determines if metadata such as protein coverage, missed cleavages and charge state should be sampled and added to the output. Default: TRUE.
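
The role of `dropout_curve_inflection` and `dropout_curve_sd` can be illustrated with a small sketch. It assumes an inverse-probit dropout curve, which is a common form for probabilistic dropout models; the helper `dropout_probability()` is hypothetical and not part of protti, and the package's internal implementation may differ in detail.

```r
# Hypothetical helper (not a protti function): inverse-probit dropout curve.
# Assumption: the probability of a missing value is pnorm((x - inflection) / sd).
# With a negative sd, the probability rises as the intensity falls.
dropout_probability <- function(intensity, inflection = 14, sd = -1.2) {
  stats::pnorm((intensity - inflection) / sd)
}

# At the inflection point half of the values are expected to be missing
dropout_probability(14)

# Low intensities drop out far more often than high intensities
dropout_probability(c(10, 12, 14, 16, 18))

# Sample missingness for simulated log2 intensities drawn with the
# default protein intensity parameters from the Usage section
set.seed(1)
intensities <- stats::rnorm(1000, mean = 16.88, sd = 1.4)
is_missing <- stats::runif(1000) < dropout_probability(intensities)
mean(is_missing) # overall fraction of missing values in this sketch
```

Under this assumption, raising `dropout_curve_inflection` shifts the curve towards higher intensities and therefore increases the number of missing values, while a `dropout_curve_sd` closer to zero makes the transition between observed and missing values sharper.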

Value

A data frame that contains complete peptide intensities and peptide intensities with missing values that were created based on a probabilistic dropout curve.

Examples

create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.1,
  n_replicates = 3,
  n_conditions = 2
)
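
A dose response dataset can be created in the same way. The following is a minimal sketch that uses only arguments listed in the Usage section; the concentration values are illustrative, and the call is skipped if protti is not installed.

```r
# Illustrative concentrations, one per condition (arbitrary units)
concentrations <- c(0, 1, 10, 50, 100, 500, 1000, 5000)

if (requireNamespace("protti", quietly = TRUE)) {
  set.seed(123) # make the random sampling reproducible
  dose_response_data <- protti::create_synthetic_data(
    n_proteins = 10,
    frac_change = 0.1,
    n_replicates = 3,
    n_conditions = length(concentrations), # must match the vector length
    method = "dose_response",
    concentrations = concentrations,
    additional_metadata = FALSE
  )
  str(dose_response_data)
}
```

Deriving `n_conditions` from `length(concentrations)` keeps the two arguments consistent when the concentration series is changed.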

# determination of mean_n_peptides and size_n_peptides parameters based on real data (count)
# example peptide count per protein
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
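# starting values for the optimization (mu = mean, k = size)
theta <- c(mu = 1, k = 1)
# negative log-likelihood of the negative binomial distribution
negbinom <- function(theta) {
  -sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}
fit <- stats::optim(theta, negbinom)
# fit$par contains the estimates for mean_n_peptides (mu) and size_n_peptides (k)
fit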
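
As a cross-check of such a fit, the sketch below compares a maximum likelihood estimate with the method of moments, using the negative binomial relationship variance = mean + mean^2 / size. Optimizing on the log scale is a choice made here to keep both parameters positive; the object names are illustrative and not part of protti.

```r
# example peptide count per protein (same values as above)
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)

# negative log-likelihood with parameters on the log scale,
# so that mu and size stay positive during the optimization
negbinom_log <- function(log_theta) {
  -sum(stats::dnbinom(
    count,
    mu = exp(log_theta[1]),
    size = exp(log_theta[2]),
    log = TRUE
  ))
}
fit_nb <- stats::optim(c(0, 0), negbinom_log)

# back-transform to the scale expected by create_synthetic_data()
mean_n_peptides <- exp(fit_nb$par[1])
size_n_peptides <- exp(fit_nb$par[2])

# method of moments: size = mean^2 / (variance - mean),
# only defined if the variance is larger than the mean
size_moments <- mean(count)^2 / (stats::var(count) - mean(count))

c(mean = mean_n_peptides, size_mle = size_n_peptides, size_moments = size_moments)
```

The fitted mean should agree closely with `mean(count)`, while the two size estimates can differ noticeably for small samples, which is one reason to prefer the fitted value.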

# determination of mean_log_replicates and sd_log_replicates parameters
# based on real data (standard_deviations)

# example standard deviations of replicates
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)
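# starting values for the optimization
theta2 <- c(meanlog = 1, sdlog = 1)
# negative log-likelihood of the log normal distribution
lognorm <- function(theta2) {
  -sum(stats::dlnorm(standard_deviations, meanlog = theta2[1], sdlog = theta2[2], log = TRUE))
}
fit2 <- stats::optim(theta2, lognorm)
# fit2$par contains the estimates for mean_log_replicates and sd_log_replicates
fit2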
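
For the log normal distribution the maximum likelihood estimates also have a closed form: `meanlog` is the mean of the log-transformed values and `sdlog` is their standard deviation computed with divisor n. The sketch below uses this as a sanity check for a numerical fit; the object names are illustrative and not part of protti.

```r
# example standard deviations of replicates (same values as above)
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)

# closed-form maximum likelihood estimates
log_sd <- log(standard_deviations)
mean_log_replicates <- mean(log_sd)
sd_log_replicates <- sqrt(mean((log_sd - mean_log_replicates)^2))

# numerical fit for comparison; sdlog is optimized on the log scale
# so that it stays positive
lognorm_log <- function(theta) {
  -sum(stats::dlnorm(
    standard_deviations,
    meanlog = theta[1],
    sdlog = exp(theta[2]),
    log = TRUE
  ))
}
fit_check <- stats::optim(c(0, 0), lognorm_log)

c(
  meanlog_closed = mean_log_replicates,
  meanlog_optim = fit_check$par[1],
  sdlog_closed = sd_log_replicates,
  sdlog_optim = exp(fit_check$par[2])
)
```

Either pair of estimates could then be supplied as `mean_log_replicates` and `sd_log_replicates` when calling `create_synthetic_data()`.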

protti

Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools

v0.1.1

MIT + file LICENSE

Authors

Jan-Philipp Quast [aut, cre], Dina Schuster [aut], ETH Zurich [cph, fnd]

Initial release