SimDesign: runSimulation – R documentation

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

runSimulation

Run a Monte Carlo simulation given a data.frame of conditions and simulation functions

Description

This function runs a Monte Carlo simulation study given a set of predefined simulation functions, design conditions, and number of replications. Results can be saved as temporary files in case of interruptions and may be restored by re-running runSimulation, provided that the respective temp file can be found in the working directory. runSimulation supports parallel and cluster computing, global and local debugging, error handling (including fail-safe stopping when functions fail too often, even across nodes), provides bootstrap estimates of the sampling variability (optional), and automatic tracking of error and warning messages and their associated .Random.seed states. For convenience, all functions available in the R work-space are exported across all computational nodes so that they are more easily accessible (however, other R objects are not, and therefore must be passed to the fixed_objects input to become available across nodes). For an in-depth tutorial of the package please refer to Chalmers and Adkins (2020; doi: 10.20982/tqmp.16.4.p248). For an earlier didactic presentation of the package refer to Sigal and Chalmers (2016; doi: 10.1080/10691898.2016.1246953). Finally, see the associated wiki on Github (https://github.com/philchalmers/SimDesign/wiki) for tutorial material, examples, and applications of SimDesign to real-world simulation experiments.

Usage

runSimulation(
  design,
  replications,
  generate,
  analyse,
  summarise,
  fixed_objects = NULL,
  packages = NULL,
  filename = NULL,
  debug = "none",
  load_seed = NULL,
  save_results = FALSE,
  parallel = FALSE,
  ncores = parallel::detectCores(),
  cl = NULL,
  notification = NULL,
  boot_method = "none",
  boot_draws = 1000L,
  CI = 0.95,
  seed = rint(nrow(design), min = 1L, max = 2147483647L),
  save_seeds = FALSE,
  save = TRUE,
  store_results = FALSE,
  store_warning_seeds = FALSE,
  warnings_as_errors = FALSE,
  max_errors = 50L,
  allow_na = FALSE,
  allow_nan = FALSE,
  stop_on_fatal = FALSE,
  MPI = FALSE,
  save_details = list(),
  progress = TRUE,
  verbose = TRUE
)

## S3 method for class 'SimDesign'
summary(object, ...)

## S3 method for class 'SimDesign'
print(x, list2char = TRUE, ...)

Arguments

`design`	a `tibble` or `data.frame` object containing the Monte Carlo simulation conditions to be studied, where each row represents a unique condition and each column a factor to be varied. See `createDesign` for the standard approach to create this simulation design object
`replications`	number of independent replications to perform per condition (i.e., each row in `design`). Must be greater than 0
`generate`	user-defined data and parameter generating function. See `Generate` for details. Note that this argument may be omitted by the user if they wish to generate the data with the `analyse` step, but for real-world simulations this is generally not recommended
`analyse`	user-defined analysis function that acts on the data generated from `Generate` (or, if `generate` was omitted, can be used to generate and analyses the simulated data). See `Analyse` for details
`summarise`	optional (but highly recommended) user-defined summary function from `Summarise` to be used to compute meta-statistical summary information after all the replications have completed within each `design` condition. Omitting this function will return a list of `data.frame`s (or a single `data.frame`, if only one row in `design` is supplied) or, for more general objects (such as `list`s), a `list` containing the results returned form `Analyse`. Alternatively, the value `NA` can be passed to let the generate-analyse-summarise process to run as usual, where the summarise components are instead included only as a placeholder. Omitting this input is only recommended for didactic purposes because it leaves out a large amount of information (e.g., try-errors, warning messages, saving files, etc), can witness memory related issues, and generally is not as flexible internally. If users do not wish to supply a summarise function then it is is recommended to pass `NA` to this argument while also supplying passing `save_results = TRUE` to save the results to the hard-drive during the simulation. This provides a more RAM friendly alternative to storing all the Generate-Analyse results in the working environment, where the Analysis results can be summarised at a later time
`fixed_objects`	(optional) an object (usually a named `list`) containing additional user-defined objects that should remain fixed across conditions. This is useful when including large vectors/matrices of population parameters, fixed data information that should be used across all conditions and replications (e.g., including a common design matrix for linear regression models), or simply control constant global elements (e.g., a constant for sample size)
`packages`	a character vector of external packages to be used during the simulation (e.g., `c('MASS', 'extraDistr', 'simsem')` ). Use this input when `parallel = TRUE` or `MPI = TRUE` to use non-standard functions from additional packages, otherwise the functions must be made available by using explicit `library` or `require` calls within the provided simulation functions. Alternatively, functions can be called explicitly without attaching the package with the `::` operator (e.g., `extraDistr::rgumbel()`)
`filename`	(optional) the name of the `.rds` file to save the final simulation results to. If the extension `.rds` is not included in the file name (e.g. `"mysimulation"` versus `"mysimulation.rds"`) then the `.rds` extension will be automatically added to the file name to ensure the file extension is correct. Note that if the same file name already exists in the working directly at the time of saving then a new file will be generated instead and a warning will be thrown. This helps to avoid accidentally overwriting existing files. Default is `NULL`, indicating no file will be saved by default
`debug`	a string indicating where to initiate a `browser()` call for editing and debugging, and pairs particularly well with the `load_seed` argument for precice debugging. General options are `'none'` (default; no debugging), `'error'`, which starts the debugger when any error in the code is detected in one of three generate-analyse-summarise functions, and `'all'`, which debugs all the user defined functions regardless of whether an error was thrown or not. Specific options include: `'generate'` to debug the data simulation function, `'analyse'` to debug the computational function, and `'summarise'` to debug the aggregation function. Alternatively, users may place `browser` calls within the respective functions for debugging at specific lines, which is useful when debugging based on conditional evaluations (e.g., `if(this == 'problem') browser()`). Note that parallel computation flags will automatically be disabled when a `browser()` is detected or when a debugging argument other than `'none'` is supplied
`load_seed`	used to replicate an exact simulation state, which is primarily useful for debugging purposes. Input can be a character object indicating which file to load from when the `.Random.seed`s have be saved (after a call with `save_seeds = TRUE`), or an integer vector indicating the actual `.Random.seed` values. E.g., `load_seed = 'design-row-2/seed-1'` will load the first seed in the second row of the `design` input, or explicitly passing the elements from `.Random.seed` (see `SimExtract` to extract the seeds associated explicitly with errors during the simulation, where each column represents a unique seed). If the input is a character vector then it is important NOT to modify the `design` input object, otherwise the path may not point to the correct saved location, while if the input is an integer vector (or single column `tbl` object) then it WILL be important to modify the `design` input in order to load this exact seed for the corresponding design row. Default is `NULL`
`save_results`	logical; save the results returned from `Analyse` to external `.rds` files located in the defined `save_results_dirname` directory/folder? Use this if you would like to keep track of the individual parameters returned from the `analysis` function. Each saved object will contain a list of three elements containing the condition (row from `design`), results (as a `list` or `matrix`), and try-errors. See `SimResults` for an example of how to read these `.rds` files back into R after the simulation is complete. Default is `FALSE`. WARNING: saving results to your hard-drive can fill up space very quickly for larger simulations. Be sure to test this option using a smaller number of replications before the full Monte Carlo simulation is performed. See also `reSummarise` for applying summarise functions from saved simulation results
`parallel`	logical; use parallel processing from the `parallel` package over each unique condition?
`ncores`	number of cores to be used in parallel execution. Default uses all available
`cl`	cluster object defined by `makeCluster` used to run code in parallel. If `NULL` and `parallel = TRUE`, a local cluster object will be defined which selects the maximum number cores available and will be stopped when the simulation is complete. Note that supplying a `cl` object will automatically set the `parallel` argument to `TRUE`
`notification`	an optional, empty argument function to be executed upon completion of the simulation. This can be used, for instance, to trigger email or SMS notifications that indicate the simulation has been completed. For example, to utilize the `RPushbullet` package (and assuming users have previously a) registered for a Pushbullet account, and b) installed the application on their mobile device and computer), use the following: Prior Setup Prior to defining `notification`, load the `RPushbullet` library via `library(RPushbullet)`. If this is the first time you have used the package then a suitable `rpushbullet.json` file will not exist on your computer, and you'll need to supply a suitable token and the devise to push the notification to via the `pbSetup()` setup Execution Supply a definition of `notification` that utilizes the `pbPost` function. E.g., `runSimulation(..., notification = function() pbPost(type = "note", title = "SimDesign", body = "Simulation Complete"))` Alternatively, if users wish to have an email sent upon completion then the following template that uses the `sendmailR` package could be used: Using `sendmailR` `runSimulation(..., notification = function() sendmailR::sendmail(from="<sendmailR@your.computer>", to="<your.email@address>", subject="SimDesign", msg="Simulation Complete", control=list(smtpServer="ASPMX.L.GOOGLE.COM")))`. However, note that this may be less reliable since the email message could be directed to a spam folder.
`boot_method`	method for performing non-parametric bootstrap confidence intervals for the respective meta-statistics computed by the `Summarise` function. Can be `'basic'` for the empirical bootstrap CI, `'percentile'` for percentile CIs, `'norm'` for normal approximations CIs, or `'studentized'` for Studentized CIs (should only be used for simulations with lower replications due to its computational intensity). Default is `'none'`, which performs no bootstrapping
`boot_draws`	number of non-parametric bootstrap draws to sample for the `summarise` function after the generate-analyse replications are collected. Default is 1000
`CI`	bootstrap confidence interval level (default is 95%)
`seed`	a vector of integers to be used for reproducibility. The length of the vector must be equal the number of rows in `design`. This argument calls `set.seed` or `clusterSetRNGStream` for each condition, respectively, but will not be run when `MPI = TRUE`. Default randomly generates seeds within the range 1 to 2147483647 for each condition.
`save_seeds`	logical; save the `.Random.seed` states prior to performing each replication into plain text files located in the defined `save_seeds_dirname` directory/folder? Use this if you would like to keep track of every simulation state within each replication and design condition. This can be used to completely replicate any cell in the simulation if need be. As well, see the `load_seed` input to load a given `.Random.seed` to exactly replicate the generated data and analysis state (mostly useful for debugging). When `TRUE`, temporary files will also be saved to the working directory (in the same way as when `save = TRUE`). Default is `FALSE` Note, however, that this option is not typically necessary or recommended since the `.Random.seed` states for simulation replications that throw errors during the execution are automatically stored within the final simulation object, and can be extracted and investigated using `SimExtract`. Hence, this option is only of interest when all of the replications must be reproducible (which occurs very rarely), otherwise the defaults to `runSimulation` should be sufficient
`save`	logical; save the temporary simulation state to the hard-drive? This is useful for simulations which require an extended amount of time, though for shorter simulations can be disabled to slightly improve computational efficiency. When `TRUE`, a temp file will be created in the working directory which allows the simulation state to be saved and recovered (in case of power outages, crashes, etc). As well, triggering this flag will save any fatal `.Random.seed` states when conditions unexpectedly crash (where each seed is stored row-wise in an external .rds file), which provides a much easier mechanism to debug issues (see `load_seed` for details). Upon completion, this temp file will be removed. To recover your simulation at the last known location (having patched the issues in the previous execution code) simply re-run the code you used to initially define the simulation and the external file will automatically be detected and read-in. Default is `TRUE`
`store_results`	logical; store the complete tables of simulation results in the returned object? This is `FALSE` by default to help avoid RAM issues (see `save_results` as a more suitable alternative). However, if the `Design` object is omitted from the call to `runSimulation()`, or the number of rows in `Design` is exactly 1, then this argument is automatically set to `TRUE` as RAM storage will no longer be an issue. To extract these results pass the returned object to `SimExtract(..., what = 'results')`, which will return a named list of all the simulation results for each condition if `nrow(Design) > 1`; otherwise, if `nrow(Design) == 1` or `Design` was missing the `results` object will be stored as-is
`store_warning_seeds`	logical; in addition to storing the `.Random.seed` states whenever error messages are raised, also store the `.Random.seed` states when warnings are raised? This is disabled by default since warnings are generally less problematic than errors, and because many more warnings messages may be raised throughout the simulation (potentially causing RAM related issues when constructing the final simulation object as any given simulation replicate could generate numerous warnings, and storing the seeds states could add up quickly). Set this to `TRUE` when replicating warning messages is important, however be aware that too many warnings messages raised during the simulation implementation could cause RAM related issues.
`warnings_as_errors`	logical; treat warning messages as error messages during the simulation? Default is FALSE, therefore warnings are only collected and not used to restart the data generation step, and the seeds associated with the warning message conditions are not stored within the final simulation object
`max_errors`	the simulation will terminate when more than this number of consecutive errors are thrown in any given condition, causing the simulation to continue to the next unique `design` condition. This is included to avoid getting stuck in infinite re-draws, and to indicate that something fatally problematic is going wrong in the generate-analyse phases. Default is 50
`allow_na`	logical; should `NA`s be allowed in the analyse step as a valid result from the simulation analysis? Default is FALSE
`allow_nan`	logical; should `NaN`s be allowed in the analyse step as a valid result from the simulation analysis? Default is FALSE
`stop_on_fatal`	logical; should the simulation be terminated immediately when the maximum number of consecutive errors (`max_errors`) is reached? If `FALSE`, the simulation will continue as though errors did not occur, however a column `FATAL_TERMINATION` will be included in the resulting object indicating the final error message observed, and `NA` placeholders will be placed in all other row-elements. Default is `FALSE`
`MPI`	logical; use the `foreach` package in a form usable by MPI to run simulation in parallel on a cluster? Default is `FALSE`
`save_details`	a list pertaining to information regarding how and where files should be saved when the `save` or `save_results` flags are triggered. `safe` logical; trigger whether safe-saving should be performed. When `TRUE` files will never be overwritten accidentally, and where appropriate the program will either stop or generate new files with unique names. Default is `TRUE` `compname` name of the computer running the simulation. Normally this doesn't need to be modified, but in the event that a manual node breaks down while running a simulation the results from the temp files may be resumed on another computer by changing the name of the node to match the broken computer. Default is the result of evaluating `unname(Sys.info()['nodename'])` `out_rootdir` root directory to save all files to. Default uses the current working directory `save_results_dirname` a string indicating the name of the folder to save result objects to when `save_results = TRUE`. If a directory/folder does not exist in the current working directory then a unique one will be created automatically. Default is `'SimDesign-results_'` with the associated `compname` appended `save_seeds_dirname` a string indicating the name of the folder to save `.Random.seed` objects to when `save_seeds = TRUE`. If a directory/folder does not exist in the current working directory then one will be created automatically. Default is `'SimDesign-seeds_'` with the associated `compname` appended
`progress`	logical; display a progress bar (using the `pbapply` package) for each simulation condition? This is useful when simulations conditions take a long time to run (see also the `notifications` argument). Default is `TRUE`
`verbose`	logical; print messages to the R console? Default is `TRUE`
`object`	SimDesign object returned from `runSimulation`
`...`	additional arguments
`x`	SimDesign object returned from `runSimulation`
`list2char`	logical; for `tibble` object re-evaluate list elements as character vectors for better printing of the levels? Note that this does not change the original classes of the object, just how they are printed. Default is TRUE

Details

The strategy for organizing the Monte Carlo simulation work-flow is to

1): Define a suitable Design object (a tibble or data.frame) containing fixed conditional information about the Monte Carlo simulations. Each row or this design object pertains to a unique set of simulation to study, while each column the simulation factor under investigation (e.g., sample size, distribution types, etc). This is often expedited by using the createDesign function, and if necessary the argument subset can be used to remove redundant or non-applicable rows
2): Define the three step functions to generate the data (Generate; see also https://CRAN.R-project.org/view=Distributions for a list of distributions in R), analyse the generated data by computing the respective parameter estimates, detection rates, etc (Analyse), and finally summarise the results across the total number of replications (Summarise).
3): Pass the design object and three defined R functions to runSimulation, and declare the number of replications to perform with the replications input. This function will return a suitable tibble object with the complete simulation results and execution details
4): Analyze the output from runSimulation, possibly using ANOVA techniques (SimAnova) and generating suitable plots and tables

Expressing the above more succinctly, the functions to be called have the following form, with the exact functional arguments listed:

Design <- createDesign(...)
Generate <- function(condition, fixed_objects = NULL) {...}
Analyse <- function(condition, dat, fixed_objects = NULL) {...}
Summarise <- function(condition, results, fixed_objects = NULL) {...}
res <- runSimulation(design=Design, replications, generate=Generate, analyse=Analyse, summarise=Summarise)

The condition object above represents a single row from the design object, indicating a unique Monte Carlo simulation condition. The condition object also contains two additional elements to help track the simulation's state: an ID variable, indicating the respective row number in the design object, and a REPLICATION element indicating the replication iteration number (an integer value between 1 and replication). This setup allows users to easily locate the rth replication (e.g., REPLICATION == 500) within the jth row in the simulation design (e.g., ID == 2). The REPLICATION input is also useful when temporarily saving files to the hard-drive when calling external command line utilities (see examples on the wiki).

For a template-based version of the work-flow, which is often useful when initially defining a simulation, use the SimFunctions function. This function will write a template simulation to one/two files so that modifying the required functions and objects can begin immediately. This means that users can focus on their Monte Carlo simulation details right away rather than worrying about the repetitive administrative code-work required to organize the simulation's execution flow.

Finally, examples, presentation files, and tutorials can be found on the package wiki located at https://github.com/philchalmers/SimDesign/wiki.

Value

a tibble from the dplyr package (also of class 'SimDesign') with the original design conditions in the left-most columns, simulation results in the middle columns, and additional information in the right-most columns (see below).

The right-most column information for each condition are: REPLICATIONS to indicate the number of Monte Carlo replications, SIM_TIME to indicate how long (in seconds) it took to complete all the Monte Carlo replications for each respective design condition, COMPLETED to indicate the date in which the given simulation condition completed, SEED for the integer values in the seed argument, and, if applicable, ERRORS and WARNINGS which contain counts for the number of error or warning messages that were caught (if no errors/warnings were observed these columns will be omitted). Note that to extract the specific error and warnings messages see SimExtract. Finally, if boot_method was a valid input other than 'none' then the final right-most columns will contain the labels BOOT_ followed by the name of the associated meta-statistic defined in summarise() and and bootstrapped confidence interval location for the meta-statistics.

Saving data, results, seeds, and the simulation state

To conserve RAM, temporary objects (such as data generated across conditions and replications) are discarded; however, these can be saved to the hard-disk by passing the appropriate flags. For longer simulations it is recommended to use the save_results flag to write the analysis results to the hard-drive.

The use of the save_seeds option can be evoked to save R's .Random.seed state to allow for complete reproducibility of each replication within each condition. These individual .Random.seed terms can then be read in with the load_seed input to reproduce the exact simulation state at any given replication. Most often though, save_seeds is less useful since problematic seeds are automatically stored in the final simulation object to allow for easier replicability of potentially problematic errors (which incidentally can be extracted using SimExtract(res, 'error_seeds') and passed to the load_seed argument). Finally, providing a vector of seeds is also possible to ensure that each simulation condition is macro reproducible under the single/multi-core method selected.

Finally, when the Monte Carlo simulation is complete it is recommended to write the results to a hard-drive for safe keeping, particularly with the filename argument provided (for reasons that are more obvious in the parallel computation descriptions below). Using the filename argument supplied is safer than using, for instance, saveRDS directly because files will never accidentally be overwritten, and instead a new file name will be created when a conflict arises; this type of implementation safety is prevalent in many locations in the package to help avoid unrecoverable (yet surprisingly common) mistakes during the process of designing and executing Monte Carlo simulations.

Resuming temporary results

In the event of a computer crash, power outage, etc, if save = TRUE was used (the default) then the original code used to execute runSimulation() need only be re-run to resume the simulation. The saved temp file will be read into the function automatically, and the simulation will continue one the condition where it left off before the simulation state was terminated. If users wish to remove this temporary simulation state entirely so as to start anew then simply pass SimClean(temp = TRUE) in the R console to remove any previously saved temporary objects.

A note on parallel computing

When running simulations in parallel (either with parallel = TRUE or MPI = TRUE) R objects defined in the global environment will generally not be visible across nodes. Hence, you may see errors such as Error: object 'something' not found if you try to use an object that is defined in the workspace but is not passed to runSimulation. To avoid this type or error, simply pass additional objects to the fixed_objects input (usually it's convenient to supply a named list of these objects). Fortunately, however, custom functions defined in the global environment are exported across nodes automatically. This makes it convenient when writing code because custom functions will always be available across nodes if they are visible in the R workspace. As well, note the packages input to declare packages which must be loaded via library() in order to make specific non-standard R functions available across nodes.

Author(s)

Phil Chalmers rphilip.chalmers@gmail.com

References

Chalmers, R. P., & Adkins, M. C. (2020). Writing Effective and Reliable Monte Carlo Simulations with the SimDesign Package. The Quantitative Methods for Psychology, 16(4), 248-280. doi: 10.20982/tqmp.16.4.p248

Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136-156. doi: 10.1080/10691898.2016.1246953

Examples

#-------------------------------------------------------------------------------
# Example 1: Sampling distribution of mean

# This example demonstrate some of the simpler uses of SimDesign,
# particularly for classroom settings. The only factor varied in this simulation
# is sample size.

# skeleton functions to be saved and edited
SimFunctions()

#### Step 1 --- Define your conditions under study and create design data.frame

Design <- createDesign(N = c(10, 20, 30))


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

# help(Generate)
Generate <- function(condition, fixed_objects = NULL) {
    dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
    dat
}

# help(Analyse)
Analyse <- function(condition, dat, fixed_objects = NULL) {
    ret <- mean(dat) # mean of the sample data vector
    ret
}

# help(Summarise)
Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- c(mu=mean(results), SE=sd(results)) # mean and SD summary of the sample means
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# run the simulation
Final <- runSimulation(design=Design, replications=1000,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
Final

# reproduce exact simulation
Final_rep <- runSimulation(design=Design, replications=1000, seed=Final$SEED,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
Final_rep


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Extras
## Not run: 
# compare SEs estimates to the true SEs from the formula sigma/sqrt(N)
5 / sqrt(Design$N)

# To store the results from the analyse function either
#   a) omit a definition of of summarise(), or
#   b) pass save_results = TRUE to runSimulation() and read the results in with SimResults()
# Note that the latter method should be adopted for longer simulations

# e.g., the a) approach
res <- runSimulation(design=Design, replications=1000,
                     generate=Generate, analyse=Analyse)
str(res)
head(res[[1]])

# or b) approach
Final <- runSimulation(design=Design, replications=1000, save_results=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
res <- SimResults(Final)
str(res)
head(res[[1]]$results)


# obtain empirical bootstrapped CIs during an initial run
# the simulation was completed (necessarily requires save_results = TRUE)
res <- runSimulation(design=Design, replications=1000, boot_method = 'basic',
                     generate=Generate, analyse=Analyse, summarise=Summarise)
res

# alternative bootstrapped CIs that uses saved results via reSummarise().
# Default directory save to:
dirname <- paste0('SimDesign-results_', unname(Sys.info()['nodename']), "/")
res <- reSummarise(summarise=Summarise, dir=dirname, boot_method = 'basic')
res

# remove the saved results from the hard-drive if you no longer want them
SimClean(results = TRUE)


## End(Not run)


#-------------------------------------------------------------------------------
# Example 2: t-test and Welch test when varying sample size, group sizes, and SDs

# skeleton functions to be saved and edited
SimFunctions()

## Not run: 
# in real-world simulations it's often better/easier to save
# these functions directly to your hard-drive with
SimFunctions('my-simulation')

## End(Not run)

#### Step 1 --- Define your conditions under study and create design data.frame

Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_size_ratio = c(1, 4, 8),
                       standard_deviation_ratio = c(.5, 1, 2))
Design

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

Generate <- function(condition, fixed_objects = NULL) {
    N <- condition$sample_size      # alternatively, could use Attach() to make objects available
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio
    if(grs < 1){
        N2 <- N / (1/grs + 1)
        N1 <- N - N2
    } else {
        N1 <- N / (grs + 1)
        N2 <- N - N1
    }
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)

    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- c(welch = welch$p.value, independent = ind$p.value)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    #find results of interest here (e.g., alpha < .1, .05, .01)
    ret <- EDR(results, alpha = .05)
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# first, test to see if it works
res <- runSimulation(design=Design, replications=5,
                     generate=Generate, analyse=Analyse, summarise=Summarise)
res

## Not run: 
# complete run with 1000 replications per condition
res <- runSimulation(design=Design, replications=1000, parallel=TRUE,
                     generate=Generate, analyse=Analyse, summarise=Summarise)
res
View(res)

## save final results to a file upon completion (not run)
runSimulation(design=Design, replications=1000, parallel=TRUE, filename = 'mysim',
              generate=Generate, analyse=Analyse, summarise=Summarise)



## Debug the generate function. See ?browser for help on debugging
##   Type help to see available commands (e.g., n, c, where, ...),
##   ls() to see what has been defined, and type Q to quit the debugger
runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE, debug='generate')

## Alternatively, place a browser() within the desired function line to
##   jump to a specific location
Summarise <- function(condition, results, fixed_objects = NULL) {
    #find results of interest here (e.g., alpha < .1, .05, .01)
    browser()
    ret <- EDR(results[,nms], alpha = .05)
    ret
}

runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE)



## EXTRA: To run the simulation on a MPI cluster, use the following setup on each node (not run)
# library(doMPI)
# cl <- startMPIcluster()
# registerDoMPI(cl)
# Final <- runSimulation(design=Design, replications=1000, MPI=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise)
# saveRDS(Final, 'mysim.rds')
# closeCluster(cl)
# mpi.quit()


## Similarly, run simulation on a network linked via ssh
##  (two way ssh key-paired connection must be possible between master and slave nodes)
##
## define IP addresses, including primary IP
# primary <- '192.168.2.20'
# IPs <- list(
#     list(host=primary, user='phil', ncore=8),
#     list(host='192.168.2.17', user='phil', ncore=8)
# )
# spec <- lapply(IPs, function(IP)
#                    rep(list(list(host=IP$host, user=IP$user)), IP$ncore))
# spec <- unlist(spec, recursive=FALSE)
#
# cl <- parallel::makeCluster(type='PSOCK', master=primary, spec=spec)
# res <- runSimulation(design=Design, replications=1000, parallel = TRUE,
#                      generate=Generate, analyse=Analyse, summarise=Summarise, cl=cl)

#~~~~~~~~~~~~~~~~~~~~~~~~
###### Post-analysis: Analyze the results via functions like lm() or SimAnova(), and create
###### tables(dplyr) or plots (ggplot2) to help visualize the results.
###### This is where you get to be a data analyst!

library(dplyr)
res %>% summarise(mean(welch), mean(independent))
res %>% group_by(standard_deviation_ratio, group_size_ratio) %>%
   summarise(mean(welch), mean(independent))

# quick ANOVA analysis method with all two-way interactions
SimAnova( ~ (sample_size + group_size_ratio + standard_deviation_ratio)^2, res,
  rates = TRUE)

# or more specific ANOVAs
SimAnova(independent ~ (group_size_ratio + standard_deviation_ratio)^2,
    res, rates = TRUE)

# make some plots
library(ggplot2)
library(tidyr)
dd <- res %>%
   select(group_size_ratio, standard_deviation_ratio, welch, independent) %>%
   pivot_longer(cols=c('welch', 'independent'), names_to = 'stats')
dd

ggplot(dd, aes(factor(group_size_ratio), value)) + geom_boxplot() +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    facet_wrap(~stats)

ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
    geom_boxplot() + geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    facet_grid(stats~standard_deviation_ratio) +
    theme(legend.position = 'none')


## End(Not run)

SimDesign

Structure for Organizing Monte Carlo Simulation Designs

v2.3

GPL (>= 2)

Authors

Phil Chalmers [aut, cre] (<https://orcid.org/0000-0001-5332-2810>), Matthew Sigal [ctb], Ogreden Oguzhan [ctb]

Initial release