Package Basics

Introduction

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. There are currently over 13,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: chances are that someone has already solved the problem you are working on, and you can benefit from their work by downloading their package. Packages thus allow for easy, transparent and cross-platform extension of the R base system.

Packages can also depend on other packages, as stated in the fields Depends, Imports and Suggests of each package’s metadata file DESCRIPTION. Through their dependencies, packages form a large graph, which is sometimes referred to as the package universe. Below you can find an example graph of the most downloaded packages on the RStudio CRAN mirror:

Figure: Network of the most popular CRAN packages.

Reading Materials

If you aren’t already familiar with the basics of R package development, the following links provide additional documentation and tutorials:

For this course we recommend reading R Packages (Hadley Wickham) for an (opinionated) overview of the R package system, and Writing R Extensions as the main reference for package authors. The other resources are somewhat outdated but still offer a useful quick overview.

Software Prerequisites

There are two main prerequisites for building R packages:

  1. GNU software development tools including a C/C++ compiler; and
  2. LaTeX for building R manuals and vignettes.

If you don’t already have these tools installed on your system, please consult the article on Package Development Prerequisites for details on how to install them.

Install Packages

How to install R packages
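
As a minimal sketch of the usual workflow, the snippet below checks which packages are missing, installs only those, and then attaches one of them. The package names are placeholders chosen for illustration: both ship with R, so nothing is downloaded when the code is run as written.

```r
# Packages a (hypothetical) analysis needs. Both ship with R itself,
# so nothing is downloaded when this sketch is run as-is.
needed <- c("stats", "utils")

# requireNamespace() returns FALSE instead of failing when a package is absent.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only what is missing (from the repositories set in getOption("repos")).
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Attach a package so that its functions can be called without a prefix.
library(stats)
```

Replacing the placeholder names with contributed packages makes install.packages() download them from the configured repository.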


Terminology

What is a package?

The term R package has several definitions, but most generally it denotes a collection of R source code, data and other files that can easily be installed with install.packages() and loaded with library(). Definitions from several sources are given below:

R Core Team, Writing R Extensions:

Packages provide a mechanism for loading optional code, data and documentation as needed.

Chambers, 2008:

An R package is a collection of source code and other files that, when installed by R, allows the user to attach the related software by a call to the library() function.

Leisch, 2009:

Packages allow for easy, transparent and cross-platform extension of the R base system. An R package can be thought of as the software equivalent of a scientific article: Articles are the de facto standard to communicate scientific results, and readers expect them to be in a certain format.

Terms and definitions

R distinguishes between the terms package and library in a way that differs from many other programming languages. The most important R terms and their definitions, as described by Leisch, 2009, are:

  • Package: An extension of the R base system with code, data and documentation in standardized format.
  • Library: A directory containing installed packages.
  • Repository: A website providing packages for installation.
  • Source: The original version of a package with human-readable text and code.
  • Binary: A compiled version of a package with computer-readable text and code, may work only on a specific platform.
  • Base packages: Part of the R source tree, maintained by R Core.
  • Recommended packages: Part of every R installation, but not necessarily maintained by R Core.
  • Contributed packages: All the rest. This does not mean that these packages are necessarily of lesser quality than the above. The goal is to keep the base distribution as lean as possible. These packages need to be installed using the install.packages() command.

First, let’s find out which packages belong to R’s base and recommended packages. Use the commands below to list all of them. Hint: Remove either "base" or "recommended" from the filter to see only the packages of the other set.

pkg <- installed.packages()
sel <- pkg[, "Priority"] %in% c("base", "recommended")
rownames(pkg[sel, ])
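
The distinction between a library (a directory) and a package (what is installed into it) can be checked directly in an R session. The following is a small sketch that uses the stats package only because it is part of every R installation.

```r
# A library is a directory: .libPaths() lists the libraries R searches.
lib_dirs <- .libPaths()
print(lib_dirs)

# Each installed package is a sub-directory of one of these libraries.
# find.package() returns the directory a given package was installed to.
stats_dir <- find.package("stats")
print(stats_dir)

# The parent directory of the package is the library that holds it.
print(dirname(stats_dir))

# library() attaches a package from a library to the search path.
library(stats)
print("package:stats" %in% search())
```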

Package structure

The extracted sources of an R package are simply a directory somewhere on your hard drive. The directory has the same name as the package and the following contents (all of which are described in more detail below).

Figure: Basic package structure.

All but the DESCRIPTION file are optional, though any useful package will have man/ and at least one of R/ and data/. Note that the capitalization of file and directory names is important: R is case-sensitive, as are the file systems of most operating systems (Windows and, by default, macOS being notable exceptions).

DESCRIPTION

A file named DESCRIPTION holds the metadata of the package, such as authors, dependencies and license conditions, in a structured text format that is readable by both computers and people.

The DESCRIPTION file contains basic information about the package in the following format:

Package: pkgname
Version: 0.5-1
Date: 2015-01-01
Title: My First Collection of Functions
Authors@R: c(person("Joe", "Developer", role = c("aut", "cre"),
                     email = "Joe.Developer@some.domain.net"),
              person("Pat", "Developer", role = "aut"),
              person("A.", "User", role = "ctb",
                     email = "A.User@whereever.net"))
Author: Joe Developer [aut, cre],
  Pat Developer [aut],
  A. User [ctb]
Maintainer: Joe Developer <Joe.Developer@some.domain.net>
Depends: R (>= 3.1.0)
Imports: nlme
Suggests: MASS
Description: A (one paragraph) description of what
  the package does and why it may be useful.
License: GPL (>= 2)
URL: https://www.r-project.org, http://www.another.url
BugReports: https://pkgname.bugtracker.url

The format is that of a version of a Debian Control File (see also ?read.dcf).
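
Because DESCRIPTION is a DCF file, it can be read back into R with read.dcf(). The sketch below reads the DESCRIPTION of the installed stats package, chosen only because it ships with every R installation; the selection of fields is arbitrary.

```r
# Locate the DESCRIPTION file of the installed 'stats' package (ships with R).
desc_file <- system.file("DESCRIPTION", package = "stats")

# read.dcf() returns a character matrix: one row per record, one column per field.
desc <- read.dcf(desc_file, fields = c("Package", "Version", "Priority"))
print(desc)

# Individual fields are accessed by column name.
print(desc[1, "Package"])
```

For installed packages, utils::packageDescription("stats") gives access to the same fields.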

Dependencies on other packages can be expressed in the fields Depends, Imports and Suggests.

  • The Depends field should only state the required R version. Adding packages to Depends is nowadays discouraged, because such packages are attached to the user’s search path whenever the package is loaded, which can lead to name clashes.
  • Other package dependencies should typically be added to the Imports field. Package authors then either import specific functions with importFrom directives in the NAMESPACE file or call them with the pkg::fun() syntax.
  • Packages which are only needed for development tasks such as testing or documentation (e.g. testthat, roxygen2) are listed in the Suggests field. In rare cases, when a package has many dependencies that are not critical for most use cases, those packages can also be moved to the Suggests field. See e.g. the machine learning packages caret or mlr, which wrap many other R packages behind a common interface.
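
The practical difference between Imports and Suggests shows up in the package code. The sketch below assumes the DESCRIPTION shown earlier (Imports: nlme, Suggests: MASS); the function robust_fit() and the NAMESPACE line are hypothetical and serve only as an illustration. A package under Suggests may be missing on the user's machine, so its use has to be guarded at run time.

```r
# NAMESPACE (assumed content for the Imports field, shown here as a comment):
#   importFrom(nlme, lme)

# A function that uses a package listed under Suggests has to check for it
# at run time, because it may not be installed on the user's machine.
robust_fit <- function(formula, data) {
  if (requireNamespace("MASS", quietly = TRUE)) {
    # MASS is available: call the function with an explicit pkg:: prefix.
    MASS::rlm(formula, data = data)
  } else {
    # Fall back to ordinary least squares from the always-available stats package.
    message("Package 'MASS' is not installed; falling back to stats::lm().")
    stats::lm(formula, data = data)
  }
}

# Usage with the built-in 'cars' data set.
fit <- robust_fit(dist ~ speed, data = datasets::cars)
print(class(fit))
```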

See also Writing R Extensions and R Packages for more information.

R

A sub-directory of R code, see also Writing R Extensions.

The R sub-directory contains only R code files. The uppercase .R extension is recommended for these files since it does not seem to be used by any other software. It is also the default for files created in RStudio via File→New File→R Script. For functions with many lines of code (LOC), it is recommended to put only ONE function in a file named after the function. This makes it easier for other users to find functions through the file browser. If multiple functions are put together in one file, they should be related to each other, and the file name should describe all contained functions in a general way.

Ideally, the R code files should directly assign R objects and definitely should not call functions with side effects such as require and options. If computations are required to create objects these can use code ‘earlier’ in the package (see the ‘Collate’ field) plus functions in the ‘Depends’ packages provided that the objects created do not depend on those packages except via namespace imports.

Exercise

Please inspect the functions below which are stored in the package directory R/mypackagefunctions.R of the package named mypackage.

importSeparatedFormat <- function(file, sep = ",", ...) {
  read.table(file, sep = sep, ...)
}

CREATE_model <- function(mydata) {
  mod <<- lm(y ~ ., data = mydata)
}

plotModel <- function(data) {
  plot(data$x, data$y)
  abline(mod)
}

The functions are supposed to be executed as follows:

mydata <- importSeparatedFormat("myfile.csv")
CREATE_model_Coefficients(mydata)
plotModel()
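
For comparison after working through the exercise, here is one possible cleanup, offered as a sketch rather than a reference solution. The snake_case names, the header = TRUE default and the temporary CSV file are assumptions made for this illustration: the functions use one naming scheme, the model is returned instead of being assigned globally with <<-, and every function receives what it needs as an argument.

```r
# Read a delimited text file; header = TRUE is an added assumption so that
# the columns keep their names (x, y).
import_separated_format <- function(file, sep = ",", header = TRUE, ...) {
  utils::read.table(file, sep = sep, header = header, ...)
}

# Fit the model and return it instead of writing to the global environment.
create_model <- function(data) {
  stats::lm(y ~ ., data = data)
}

# Plot the data and add the fitted line; the model is passed in explicitly.
plot_model <- function(data, model) {
  graphics::plot(data$x, data$y)
  graphics::abline(model)
  invisible(model)
}

# Usage with a small temporary CSV file standing in for "myfile.csv".
csv_file <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = 1:10, y = 2 * (1:10) + 1), csv_file,
                 row.names = FALSE)
mydata <- import_separated_format(csv_file)
model <- create_model(mydata)
plot_model(mydata, model)
```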

man

A man/ sub-directory of documentation files. R objects are documented in files written in R documentation (.Rd) format, a simple markup language much of which closely resembles (La)TeX, which can be processed into a variety of formats, including LaTeX, HTML and plain text. Each documentation file should include a title, short description, documentation of each parameter and working examples. Typically, .Rd files should have the same name as the .R source file to be documented.

As an example, let us look at a simplified version of src/library/base/man/load.Rd which documents the R function load.

% File src/library/base/man/load.Rd
\name{load}
\alias{load}
\title{Reload Saved Datasets}
\description{
  Reload the datasets written to a file with the function
  \code{save}.
}
\usage{
load(file, envir = parent.frame())
}
\arguments{
  \item{file}{a connection or a character string giving the
    name of the file to load.}
  \item{envir}{the environment where the data should be
    loaded.}
}
\seealso{
  \code{\link{save}}.
}
\examples{
## save all data
save(list = ls(), file= "all.RData")

## restore the saved values to the current environment
load("all.RData")

## restore the saved values to the workspace
load("all.RData", .GlobalEnv)
}
\keyword{file}

R documentation (.Rd) files can also be generated automatically with the roxygen2 package, which keeps code and documentation together in a single .R source file. This makes the code easier to maintain, as shown in one of the next sections covering Documentation.
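
As a brief sketch of what this looks like, roxygen2 reads specially marked comments (starting with #') placed directly above a function. The function add_numbers() is hypothetical and used only for illustration.

```r
#' Add two numbers
#'
#' A hypothetical function used only to illustrate roxygen2 comments.
#'
#' @param x A numeric vector.
#' @param y A numeric vector of the same length as `x`.
#' @return The element-wise sum of `x` and `y`.
#' @examples
#' add_numbers(1, 2)
#' @export
add_numbers <- function(x, y) {
  x + y
}
```

Running roxygen2::roxygenise() in the package directory would then generate the corresponding .Rd file in man/ from these comments.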

data

A sub-directory of data sets, see also Writing R Extensions.

The data sub-directory is for data files, either to be made available via lazy-loading or for loading using data(). It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files.

Data files can have one of three types as indicated by their extension: plain R code (.R or .r), tables (.tab, .txt, or .csv, see ?data for the file formats, and note that .csv is not the standard CSV format), or save() images (.RData or .rda). The files should not be hidden (have names starting with a dot).
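
The save() image format can be tried out without a real package. The sketch below uses a temporary directory as a stand-in for the data/ directory of a hypothetical package named mypackage; the data set heights is made up for the example.

```r
# Stand-in for the data/ directory of a hypothetical package.
data_dir <- file.path(tempdir(), "mypackage", "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# A small example data set.
heights <- data.frame(name = c("a", "b", "c"), height_cm = c(170, 182, 165))

# save() writes an .rda image; by convention the file is named after the object.
save(heights, file = file.path(data_dir, "heights.rda"))

# load() restores the object under its original name into the given environment
# and returns the names of the restored objects.
restored <- new.env()
loaded_names <- load(file.path(data_dir, "heights.rda"), envir = restored)
print(loaded_names)
print(identical(restored$heights, heights))
```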

Less Common Package Elements

  • exec/ for other executables (e.g. Perl or Java).
  • inst/ for miscellaneous other stuff. The contents of this directory are completely copied to the installed version of a package.
  • A configure script to check for other required software or handle differences between systems.

Distribution of packages

A package can be distributed either as a source package or as a binary build of the corresponding source package. Although packages can be shared via network drives, e-mail, etc., the most popular and user-friendly way to share packages is through public or private package repositories.

Public Package Repositories

Public package repositories distribute packages through mirrors around the world and require packages to be thoroughly tested, e.g. with R CMD check.

With the exception of GitHub, all repositories listed above also offer packages in binary form (at least for Windows and macOS), and their packages can be installed with install.packages() from the R console. For GitHub packages we recommend the function remotes::install_github().

The default package repository for all R installations is CRAN. RStudio uses the Global CDN by default, which is probably the fastest option in most parts of the world; it can be changed at Tools→Global Options→Packages→CRAN mirror. With RStudio v1.2+ you can also add a secondary repository in the same dialog.

Figure: RStudio package options.

Alternatively, you can change the Rprofile.site file located at /etc/R/Rprofile.site for Linux or C:\Program Files\R\R-3.x.y\etc\Rprofile.site for Windows with the following entry:

local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.cnr.berkeley.edu/"
  r["Private"] <- "file://my-private-repo"
  options(repos = r)
})

This setting is required if you would like to add private repositories, e.g. from your organisation.
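
The same option can also be set for the current session only, which is handy for trying out a repository before editing Rprofile.site. This is a sketch: the private repository path is a placeholder, and https://cloud.r-project.org is used here as the CRAN URL.

```r
# Start from the current setting so that other entries are preserved.
repos <- getOption("repos")
repos["CRAN"] <- "https://cloud.r-project.org"
# Hypothetical location of a private repository on a local or network drive.
repos["Private"] <- "file:///path/to/my-private-repo"

# options() returns the previous values, which makes the change easy to undo.
old <- options(repos = repos)
print(getOption("repos"))
```

Calling options(old) afterwards restores the previous repositories.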

Private Package Repositories

Private repositories are typically used either to share packages within an organisation that are not publicly available, or to create stable environments in which packages have been tested and are not supposed to change often.

The easiest way to create a private package repository is miniCRAN, which creates a package repository on your local machine or a network share. miniCRAN can host source and binary packages, and the resulting repository is typically referenced as file://path/my-private-repo.

Running your own web server with Apache or Nginx is another way to create a private repository within your team/organisation. This is also how the CRAN servers themselves are set up. You can then reference your repository with a URL such as https://mydomain.com/cranpath.
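
Whether a repository lives on a file share or behind a web server, R expects the same directory layout below the repository URL. As a small sketch, contrib.url() shows where source packages are looked up for the example URL above (a placeholder domain, not a real server).

```r
# Hypothetical repository URL taken from the text above.
repo <- "https://mydomain.com/cranpath"

# Source packages are looked up in the src/contrib sub-directory of the repository.
src_url <- contrib.url(repo, type = "source")
print(src_url)
```

A call such as install.packages("mypackage", repos = repo), with mypackage as a placeholder name, would query that location for source packages.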