
create_synthetic_data

Creates a synthetic limited proteolysis proteomics dataset


Description

This function creates a synthetic limited proteolysis proteomics dataset that can be used to test functions while knowing the ground truth.

Usage

create_synthetic_data(
  n_proteins,
  frac_change,
  n_replicates,
  n_conditions,
  method = "random_effect",
  concentrations = NULL,
  median_offset_sd = 0.05,
  mean_protein_intensity = 16.88,
  sd_protein_intensity = 1.4,
  mean_n_peptides = 12.75,
  size_n_peptides = 0.9,
  mean_sd_peptides = 1.7,
  sd_sd_peptides = 0.75,
  mean_log_replicates = -2.2,
  sd_log_replicates = 1.05,
  effect_sd = 2,
  dropout_curve_inflection = 14,
  dropout_curve_sd = -1.2,
  additional_metadata = TRUE
)

Arguments

n_proteins

Numeric, specifies the number of proteins in the synthetic dataset.

frac_change

Numeric, the fraction of proteins that have a peptide changing in abundance. Currently only one peptide per protein changes.

n_replicates

Numeric, the number of replicates per condition.

n_conditions

Numeric, the number of conditions.

method

Character, specifies the method type for the random sampling of significantly changing peptides. If method = "random_effect", the effect for each condition is randomly sampled and conditions do not depend on each other. If method = "dose_response", the effect is sampled based on a dose response curve and conditions are related to each other depending on the curve shape. In this case the concentrations argument needs to be specified.

concentrations

Numeric vector with one entry per condition (length n_conditions). Only needs to be specified if method = "dose_response". It ensures that the same positions of the dose response curves are sampled for each peptide, so that peptide intensities are sampled equally across peptides.
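As a usage sketch, a dose response dataset could be requested as shown below. The concentration values are illustrative assumptions; the only constraint stated above is that the vector has one entry per condition. The call is guarded so that the snippet also runs where protti is not installed.

```r
# Illustrative concentrations (arbitrary units), one per condition.
concentrations <- c(0, 1, 10, 50, 100, 500, 1000, 5000)
n_conditions <- length(concentrations)

synthetic <- NULL
if (requireNamespace("protti", quietly = TRUE)) {
  synthetic <- protti::create_synthetic_data(
    n_proteins = 10,
    frac_change = 0.1,
    n_replicates = 3,
    n_conditions = n_conditions,
    method = "dose_response",
    concentrations = concentrations
  )
}
```

Deriving n_conditions from the vector keeps the two arguments consistent when concentrations are added or removed.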

median_offset_sd

Numeric, standard deviation of the normal distribution that is used for sampling inter-sample differences. Default: 0.05.

mean_protein_intensity

Numeric, mean of the protein intensity distribution. Default: 16.88.

sd_protein_intensity

Numeric, standard deviation of the protein intensity distribution. Default: 1.4.

mean_n_peptides

Numeric, mean number of peptides per protein. Default: 12.75.

size_n_peptides

Numeric, dispersion parameter (the shape parameter of the gamma mixing distribution) of the negative binomial distribution of peptide counts. In theory it can be calculated as mean^2 / (variance - mean), because the variance of the negative binomial distribution is mean + mean^2 / size. In practice it is better obtained by fitting the negative binomial distribution to real data, for example with the optim function (see Examples section). Default: 0.9.
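For a quick first guess, the dispersion can be estimated by the method of moments. The sketch below relies on the variance relation documented for stats::dnbinom in the mu/size parametrization, variance = mu + mu^2 / size, which rearranges to size = mu^2 / (variance - mu). The helper name estimate_size_moments is hypothetical and not part of protti.

```r
# Method-of-moments estimate of the negative binomial dispersion (size).
# estimate_size_moments is a hypothetical helper, not a protti function.
estimate_size_moments <- function(counts) {
  m <- mean(counts)
  v <- stats::var(counts)
  if (v <= m) {
    # No overdispersion: the negative binomial approaches the Poisson limit.
    return(Inf)
  }
  m^2 / (v - m)
}

# Example peptide counts per protein (same vector as in the Examples section)
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
mean_estimate <- mean(count)
size_estimate <- estimate_size_moments(count)
```

Moment estimates are noisy for small samples, so they are best used as starting values for a likelihood fit rather than as final parameters.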

mean_sd_peptides

Numeric, mean of peptide intensity standard deviations within a protein. Default: 1.7.

sd_sd_peptides

Numeric, standard deviation of peptide intensity standard deviation within a protein. Default: 0.75.

mean_log_replicates, sd_log_replicates

Numeric, meanlog and sdlog values of the log-normal distribution of replicate standard deviations. They can be obtained by fitting a log-normal distribution to the replicate standard deviations of a real dataset, for example with the optim function (see Examples section). Defaults: -2.2 and 1.05.
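The log-normal distribution also has closed-form maximum likelihood estimates, which can serve as a cross-check for a numerical fit: meanlog is the mean of the log-transformed values, and sdlog is their standard deviation computed with n rather than n - 1 in the denominator. A minimal sketch in base R, using made-up standard deviations:

```r
# Example standard deviations of replicates (illustrative values)
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)

# Closed-form maximum likelihood estimates of the log-normal parameters
log_sd <- log(standard_deviations)
meanlog_hat <- mean(log_sd)
sdlog_hat <- sqrt(mean((log_sd - meanlog_hat)^2)) # divides by n, not n - 1

# Negative log-likelihood, to compare against other parameter values
lognorm_nll <- function(meanlog, sdlog) {
  -sum(stats::dlnorm(standard_deviations, meanlog = meanlog, sdlog = sdlog, log = TRUE))
}
nll_at_estimate <- lognorm_nll(meanlog_hat, sdlog_hat)
```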

effect_sd

Numeric, standard deviation of a normal distribution around mean = 0 that is used to sample the effect of significantly changing peptides. Default: 2.

dropout_curve_inflection

Numeric, intensity inflection point of a probabilistic dropout curve that is used to sample intensity dependent missing values. This argument determines how many missing values there are in the dataset. Default: 14.

dropout_curve_sd

Numeric, standard deviation of the probabilistic dropout curve. Needs to be negative to sample a dropout towards low intensities. Default: -1.2.
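To illustrate how an inflection point and a negative standard deviation can interact, the sketch below models the dropout probability as a normal cumulative distribution function of the standardized intensity. This functional form is an assumption made for illustration; this page does not state the exact curve used by the function, and dropout_probability is a hypothetical helper.

```r
# ASSUMED curve shape, not taken from the package: a normal CDF of the
# standardized intensity. A negative curve_sd flips the curve so that
# low intensities get a high probability of being missing.
dropout_probability <- function(intensity, inflection = 14, curve_sd = -1.2) {
  stats::pnorm((intensity - inflection) / curve_sd)
}

intensities <- c(10, 12, 14, 16, 18)
p_missing <- dropout_probability(intensities)

# Sample missing values according to the dropout probabilities
set.seed(1)
is_missing <- stats::runif(length(intensities)) < p_missing
observed <- ifelse(is_missing, NA_real_, intensities)
```

Under this assumption the probability of a missing value is 0.5 at the inflection point, so raising dropout_curve_inflection increases the overall number of missing values.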

additional_metadata

Logical, determines if metadata such as protein coverage, missed cleavages and charge state should be sampled and added to the output. Default: TRUE.

Value

A data frame that contains the complete peptide intensities as well as peptide intensities with missing values that were created based on a probabilistic dropout curve.

Examples

create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.1,
  n_replicates = 3,
  n_conditions = 2
)

# determination of mean_n_peptides and size_n_peptides parameters based on real data (count)
# example peptide count per protein
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
theta <- c(mu = 1, k = 1)
negbinom <- function(theta) {
  -sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}
fit <- stats::optim(theta, negbinom)
fit

# determination of mean_log_replicates and sd_log_replicates parameters
# based on real data (standard_deviations)

# example standard deviations of replicates
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)
theta2 <- c(meanlog = 1, sdlog = 1)
lognorm <- function(theta2) {
  -sum(stats::dlnorm(standard_deviations, meanlog = theta2[1], sdlog = theta2[2], log = TRUE))
}
fit2 <- stats::optim(theta2, lognorm)
fit2

protti

Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools

v0.1.1
MIT + file LICENSE
Authors
Jan-Philipp Quast [aut, cre], Dina Schuster [aut], ETH Zurich [cph, fnd]
Initial release
