Fitting four-parameter dose response curves (using parallel processing)
This function is a wrapper around fit_drc_4p that allows the use of all system cores for model fitting. It should only be used on systems that have enough memory available. Workers can either be set up manually before running the function with future::plan(multisession) or automatically by the function (in which case the maximum number of workers is 12). If workers are set up manually, the number of cores should be provided to n_cores. Workers can be terminated after completion with future::plan(sequential). Unlike the non-parallel function, this function cannot export the individual fit objects, as they are too large to be exported efficiently from the workers.
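The manual worker setup described above could look roughly like the following sketch. It assumes the future package is installed and uses its multisession backend; the cap of 12 workers is only an illustrative choice that mirrors the automatic limit, and the call to parallel_fit_drc_4p is commented out because it needs real input data.

```r
# Sketch only: assumes the 'future' package is available.
if (requireNamespace("future", quietly = TRUE)) {
  # Use the available cores, capped at 12 (illustrative cap, an assumption).
  n_workers <- min(future::availableCores(), 12)

  # Set up the workers manually before calling the fitting function.
  future::plan(future::multisession, workers = n_workers)

  # The fitting call would go here, passing the worker count to n_cores:
  # fits <- parallel_fit_drc_4p(
  #   data,
  #   sample = r_file_name,
  #   grouping = eg_precursor_id,
  #   response = intensity,
  #   dose = concentration,
  #   n_cores = n_workers
  # )

  # Terminate the workers once fitting is complete.
  future::plan(future::sequential)
}
```

Resetting the plan to sequential at the end releases the background R sessions, which matters on systems where memory is tight.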
parallel_fit_drc_4p(
  data,
  sample,
  grouping,
  response,
  dose,
  filter = "post",
  replicate_completeness = 0.7,
  condition_completeness = 0.5,
  correlation_cutoff = 0.8,
  log_logarithmic = TRUE,
  retain_columns = NULL,
  n_cores = NULL
)
data |
A data frame containing at least the input variables. |
sample |
The name of the column containing the sample names. |
grouping |
The name of the column containing precursor, peptide or protein identifiers. |
response |
The name of the column containing response values, e.g. log2 transformed intensities. |
dose |
The name of the column containing dose values, e.g. the treatment concentrations. |
filter |
A character vector indicating if models should be filtered. The option |
replicate_completeness |
Similar to |
condition_completeness |
This argument determines how many conditions need to at least fulfill the "complete enough" criteria
set with |
correlation_cutoff |
A numeric vector specifying the correlation cutoff used for data filtering. |
log_logarithmic |
A logical value indicating whether a logarithmic or a log-logarithmic model is fitted. If response values form a symmetric curve for non-log transformed dose values, a logarithmic model should be used instead of a log-logarithmic model. Biological dose response data usually has a log-logarithmic distribution, which is why this is the default. Log-logarithmic models are symmetric if dose values are log transformed. |
retain_columns |
A vector specifying columns that should be retained from the input data frame. By default, no additional columns are retained. |
n_cores |
Optional, the number of cores used if workers are set up manually. |
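To illustrate the two curve shapes selected by log_logarithmic, here is a minimal sketch of a four-parameter curve in both forms. The parameterisation is an assumption borrowed from the LL.4 and L.4 models of the drc package (what this page calls log-logarithmic and logarithmic appears to correspond to log-logistic and logistic there); the function names and parameter values are hypothetical.

```r
# Four-parameter log-logistic curve (parameterisation assumed from drc::LL.4).
# b: slope, c: lower limit, d: upper limit, e: inflection point (e.g. EC50).
log_logistic_4p <- function(dose, b, c, d, e) {
  c + (d - c) / (1 + exp(b * (log(dose) - log(e))))
}

# Four-parameter logistic curve (parameterisation assumed from drc::L.4).
# Symmetric around e on the untransformed dose scale.
logistic_4p <- function(dose, b, c, d, e) {
  c + (d - c) / (1 + exp(b * (dose - e)))
}

# Hypothetical doses spaced evenly on the log10 scale.
dose <- c(0.01, 0.1, 1, 10, 100)
response <- log_logistic_4p(dose, b = 1, c = 0, d = 10, e = 1)

# At dose == e the response lies halfway between the lower and upper limit,
# and doses equally far from e on the log scale mirror each other.
response
```

With these values the curve is symmetric around the inflection point only when doses are viewed on the log scale, which matches the description of the default model above.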
If data filtering options are selected, data is filtered based on multiple criteria. In general, curves are only fitted if there are at least 5 conditions with data points present, to ensure that there is potential for a good curve fit. This also applies if no filtering option is selected.
Furthermore, a completeness cutoff is defined for filtering. By default, each entity (e.g. precursor) is filtered to contain at least 70% of replicates (adjusted downward) in at least 50% of all conditions (adjusted downward). This can be adjusted with the replicate_completeness and condition_completeness arguments.
In addition to the completeness cutoff, a significance cutoff is applied. ANOVA is used to compute the statistical significance of the change for each entity. The resulting p-value is adjusted using the Benjamini-Hochberg method and a cutoff of q <= 0.05 is applied. Curve fits whose minimal value is higher than their maximal value are excluded, as they were likely fitted wrongly. Curves with a correlation below 0.8 do not pass the filter.
If a fit does not fulfill the significance or completeness cutoff, it can still be considered if half of its values (+/-1 value) pass the replicate completeness criteria and half do not. To fall into this category, the values that fulfill the completeness cutoff and the ones that do not need to be consecutive, meaning located next to each other based on their concentration values. Furthermore, the values that do not pass the completeness cutoff need to be lower in intensity. Lastly, the difference between the two groups is tested for statistical significance using Welch's t-test with a cutoff of p <= 0.1 (the aim is mainly to discard curves that falsely fit the other criteria but have clearly non-significant differences in mean). This allows curves to be considered that have missing values in half of their observations due to a decrease in intensity. These can be thought of as conditions that are missing not at random (MNAR). Such entities often do not have a significant p-value, since half of their conditions are not considered due to data missingness.
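The cutoffs described above can be sketched in base R as follows. This is not the function's internal code: all data values are made up, and reading "adjusted downward" as floor() is an assumption.

```r
# Hypothetical design: 4 replicates per condition, 8 conditions.
n_replicates <- 4
n_conditions <- 8

# Assumption: "adjusted downward" means rounding down with floor().
min_replicates <- floor(0.7 * n_replicates) # replicate_completeness = 0.7
min_conditions <- floor(0.5 * n_conditions) # condition_completeness = 0.5

# Hypothetical number of observed replicates per condition for one entity.
observed <- c(4, 3, 2, 1, 0, 4, 2, 1)
complete_enough <- observed >= min_replicates
passes_completeness <- sum(complete_enough) >= min_conditions

# Benjamini-Hochberg adjustment of hypothetical ANOVA p-values (one per entity),
# followed by the q <= 0.05 cutoff.
p_values <- c(0.001, 0.004, 0.02, 0.2, 0.6)
q_values <- p.adjust(p_values, method = "BH")
significant <- q_values <= 0.05

# Welch's t-test between hypothetical "complete" and "incomplete" groups;
# t.test() uses var.equal = FALSE (Welch) by default.
complete_group <- c(20.1, 19.8, 20.4, 20.0)
incomplete_group <- c(17.2, 16.9)
welch <- t.test(complete_group, incomplete_group)
mnar_candidate <- welch$p.value <= 0.1
```

In this toy case five of the eight conditions have at least two replicates, so the entity passes the completeness cutoff, and three of the five hypothetical entities pass the significance cutoff.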
The final filtered list is ranked based on a score calculated for entities that pass the filter. The score is the mean of two values, each scaled between 0 and 1: the negative log10 of the adjusted ANOVA p-value and the correlation. Thus, the highest score an entity can have is 1, which requires both the highest correlation and the lowest adjusted p-value. The rank corresponds to this score. Please note that entities with MNAR conditions might have a lower score due to a missing or non-significant ANOVA p-value. Curves that are TRUE for dose_MNAR should be inspected in more detail.
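The scoring described above can be sketched as follows. The column names and values are hypothetical, and min-max scaling is an assumption about what "scaled between 0 and 1" means here.

```r
# Min-max scaling to the range [0, 1] (assumed scaling method).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical entities that passed the filter.
fits <- data.frame(
  id = c("A", "B", "C"),
  anova_adj_pval = c(1e-6, 1e-3, 0.04),
  correlation = c(0.99, 0.90, 0.85)
)

# Score: mean of the scaled -log10 adjusted p-value and the scaled correlation.
fits$score <- (rescale01(-log10(fits$anova_adj_pval)) +
  rescale01(fits$correlation)) / 2

# Rank 1 is assigned to the highest score.
fits$rank <- rank(-fits$score, ties.method = "min")

fits
```

Entity A has both the lowest adjusted p-value and the highest correlation, so it reaches the maximum score of 1 and is ranked first.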
A data frame is returned that contains correlations of predicted to measured values as a measure of the goodness of the curve fit, an associated p-value and the four parameters of the model for each group. Furthermore, input data for plots is returned in the columns plot_curve (curve and confidence interval) and plot_points (measured points).
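As a rough illustration of the goodness-of-fit measure, the correlation between predicted and measured values can be computed as below. The values are invented, and using the Pearson correlation (the default of cor()) is an assumption, since the page does not name the method.

```r
# Hypothetical measured responses and the values predicted by a fitted curve.
measured <- c(10.1, 9.6, 8.2, 5.3, 2.4, 1.1)
predicted <- c(9.9, 9.5, 8.0, 5.0, 2.0, 0.9)

# Pearson correlation between predicted and measured values (assumed method).
fit_correlation <- cor(predicted, measured)

# Compare against the default correlation_cutoff of 0.8.
passes_cutoff <- fit_correlation >= 0.8
```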
## Not run:
parallel_fit_drc_4p(
  data,
  sample = r_file_name,
  grouping = eg_precursor_id,
  response = intensity,
  dose = concentration
)
## End(Not run)