Quantargo Blog

Vienna<-R 2022 November Meetup (live/virtual)

Tue, 08 Nov 2022 22:00:00 +0000

After a long COVID break we are happy to announce the upcoming ViennaR Meetup on Thursday, November 10! 🙌🎉🥳

The (live) Meetup is hosted at TU Vienna, the legendary Goldenes Lamm, Seminarraum 107/1 - where some R-Core magic happened.

👉REGISTER FOR LIVE MEETUP

Note that this meetup is hybrid and also available virtually via Zoom. Please register separately at https://www.meetup.com/viennar/events/289309745/ if you want to attend virtually. International guests welcome!

👉REGISTER FOR VIRTUAL MEETUP

AGENDA

18:00 Doors Open
18:15 Introduction (15min, Start of Virtual Meetup)
18:30 pdfmole - Extracting Tables from PDF files (Florian Schwendinger)
19:15 holiglm - Holistic Generalized Linear Models (Benjamin Schwendinger)
20:00 (End)

DETAILS

pdfmole

To read in the data, any of the following packages can be used:

pdfminer
pdfboxr
tesseract

In principle, any package which returns the data in a similar format could be used. The packages pdfminer and pdfboxr can be used if the PDF file already stores the text (which is the case for most PDFs); if the PDF contains only images of the tables, tesseract can be used.

👉Github

holiglm

Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The R package holiglm provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art conic mixed-integer solvers, the package can reliably solve GLMs for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the stats::glm() function.

👉Github

Please feel free to join the networking session at a pub nearby.

Greetings,

Your ViennaR organizers

👉REGISTER FOR LIVE MEETUP 👉REGISTER FOR VIRTUAL MEETUP

Make code, not war! ✌❤️

Dashboard Framework Part 2: Running Shiny in AWS Fargate with CDK

Fri, 11 Mar 2022 12:00:00 +0000

In the previous post we outlined the architecture of a dashboard framework to run dashboards based on multiple technologies, including Shiny and Flask, in production. We will now show how to run a basic Shiny dashboard in AWS Fargate behind an Application Load Balancer in fewer than 60 lines of CDK code. To define our stack in a reproducible manner we will make use of the AWS Cloud Development Kit (CDK) with TypeScript. Starting from a basic CDK stack we now specify the most important components of our stack:

The Application Load Balancer (ALB) to route traffic to our dashboards.
The Fargate cluster to run our dashboard tasks in a scalable manner.

The deployed stack will finally run an example Shiny dashboard behind an Application Load Balancer. Note that the resulting stack will only run a single dashboard and does not use encryption yet; multiple dashboards and encryption follow in the next post. The resulting CDK code can also be downloaded from GitHub at https://github.com/quantargo/dashboards.

Prerequisites

To run the following code examples make sure to have

an AWS Account
locally configured AWS credentials, e.g. by running aws configure with the AWS CLI
a local Node.js installation (version >= 14.15.0)
TypeScript: npm install -g typescript
CDK (version >= 2.0): npm install -g aws-cdk

Initialize CDK and Deploy first App

To initialize a sample project we first create a project folder and within the folder execute cdk init:

mkdir dashboards
cd dashboards
cdk init app --language typescript

This command creates a new CDK TypeScript project and installs all required packages. The following 2 files are relevant for stack development:

bin/dashboards.ts: Main file which initializes CDK stack class. You can explicitly set the environment env if you use a different account or region for deployment.
lib/dashboards-stack.ts: CDK Stack class to which all components of our stack will be added.

Specify Application Load Balancer (ALB)

Next, we need to create an Application Load Balancer (ALB) within a new VPC which is responsible for secure connections and routing. We create a new VPC and add an internetFacing load balancer to it. This means that the load balancer will be accessible from the public internet and will therefore be placed into a public subnet. Within the lib/dashboards-stack.ts file we put the following lines:

// Put imports on top of the file
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'

// Put below lines within the DashboardsStack constructor
const vpc = new ec2.Vpc(this, 'MyVpc');

const lb = new elbv2.ApplicationLoadBalancer(this, 'LB', {
  vpc: vpc,
  internetFacing: true,
  loadBalancerName: 'DashboardBalancer'
});

Specify Dashboard Cluster

Next, we need to add an ECS cluster to our VPC to run our dashboards efficiently:

// Put imports on top of the file
import * as ecs from 'aws-cdk-lib/aws-ecs'

// Put below lines within the DashboardsStack constructor
const cluster = new ecs.Cluster(this, 'DashboardCluster', {
  vpc: vpc
});

Add First Fargate Task Definition

We can now add our first Fargate dashboard to the cluster by specifying a task definition. We use the rocker/shiny Docker container as an example running on port 3838. This also requires respective port mappings in the container definition. Additionally, we use half a virtual CPU (512)—1024 would equal a full one—and a memory size of 1024 MiB. By specifying the Fargate service we are already finished with the specification to run our first container in the cluster:

const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDefinition', {
  cpu: 512,
  memoryLimitMiB: 1024,
});

const port = 3838

const container = taskDefinition.addContainer('Container', {
  image: ecs.ContainerImage.fromRegistry('rocker/shiny'),
  portMappings: [{ containerPort: port }],
})

const service = new ecs.FargateService(this, 'FargateService', {
  cluster: cluster,
  taskDefinition: taskDefinition,
  desiredCount: 1,
  serviceName: 'FargateService'
})
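
The CPU and memory values cannot be combined freely: Fargate expects the CPU size in units of 1/1024 vCPU and only accepts certain memory sizes per CPU size. Below is a minimal sketch of a sanity check that could be run before cdk synth. The helper names vcpuToCpuUnits and isValidFargateSize are hypothetical (they are not part of aws-cdk-lib), and the table is an assumption that only covers sizes up to 4 vCPU as we understand the AWS documentation, so please verify it against the current Fargate docs.

```typescript
// Hypothetical helpers for illustration; they are not part of aws-cdk-lib.
const CPU_UNITS_PER_VCPU = 1024;

// Convert a vCPU fraction (e.g. 0.5) to the CPU units used in the task definition.
function vcpuToCpuUnits(vcpu: number): number {
  return Math.round(vcpu * CPU_UNITS_PER_VCPU);
}

// Build an inclusive list of memory sizes (MiB) in 1024 MiB steps.
function gibSteps(minMiB: number, maxMiB: number): number[] {
  const sizes: number[] = [];
  for (let m = minMiB; m <= maxMiB; m += 1024) {
    sizes.push(m);
  }
  return sizes;
}

// Assumed table of accepted memory sizes (MiB) per CPU size, up to 4 vCPU.
const VALID_MEMORY_MIB: { [cpuUnits: number]: number[] } = {
  256: [512, 1024, 2048],
  512: gibSteps(1024, 4096),
  1024: gibSteps(2048, 8192),
  2048: gibSteps(4096, 16384),
  4096: gibSteps(8192, 30720),
};

// True if the CPU/memory pair appears in the assumed table above.
function isValidFargateSize(cpuUnits: number, memoryMiB: number): boolean {
  const allowed = VALID_MEMORY_MIB[cpuUnits];
  return allowed !== undefined && allowed.indexOf(memoryMiB) !== -1;
}

// The combination used in this post: half a vCPU with 1024 MiB of memory.
const cpuUnits = vcpuToCpuUnits(0.5);
console.log(cpuUnits, isValidFargateSize(cpuUnits, 1024));
```

Keeping the check as a plain function means it can also be reused in unit tests for the stack without deploying anything.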

Put Service Behind ALB

Next, we put the Fargate service into an ALB target group so that traffic can be routed through the ALB:

// Requires `import * as cdk from 'aws-cdk-lib'` at the top of the file (for cdk.Duration)
const tg1 = new elbv2.ApplicationTargetGroup(this, 'TargetGroup', {
  vpc: vpc,
  targets: [service],
  protocol: elbv2.ApplicationProtocol.HTTP,
  stickinessCookieDuration: cdk.Duration.days(1),
  port: port,
  healthCheck: {
    path: '/',
    port: `${port}`
  }
})

Note that we added 2 parameters to the ALB target group definition:

stickinessCookieDuration: Since Shiny sessions are stateful we need to prevent the ALB from switching instances (in case there is more than one) during a session. A cookie duration of one day should be sufficient.
healthCheck: The health check needs to specify the port (as a string), which is set to the container port 3838 as well.

Finally, we add an HTTP listener which directly forwards all incoming traffic to our dashboard:

const listener = lb.addListener(`HTTPListener`, {
  port: 80,
  defaultAction: elbv2.ListenerAction.forward([tg1]) 
})

Deploy

Before deployment you should also bootstrap your CDK environment:

cdk bootstrap

Now the stack should be ready for deployment. As an extra step, you can now check if the stack can be successfully synthesized using

cdk synth

Any errors popping up during cdk synth need to be fixed immediately. By continuously using cdk synth we make sure that the feedback cycles during development are as short as possible. If cdk synth is successful we can now run

cdk deploy

Finally, you should see the successful output message including the DashboardsStack.LoadBalancerDNSName, which you can open directly in the browser. This output is defined by the cdk.CfnOutput statement in the full code in the appendix:

Outputs:
DashboardsStack.LoadBalancerDNSName = DashboardBalancer-<9-DIGIT-NUMBER>..elb.amazonaws.com
Stack ARN:
arn:aws:cloudformation:::stack/DashboardsStack/

✨  Total time: 297.67s

Destroy

If you no longer need the stack, you can remove it to reduce cloud costs by running:

cdk destroy

Conclusion

We have shown how to run your first basic Shiny dashboard behind an Application Load Balancer in very few lines of CDK TypeScript code. In the next post we will cover end-to-end encryption through SSL/TLS and host-based routing to add multiple dashboards to the ALB.

Make code, not war! ✌️

Get in Touch

Interested in creating your own dashboard framework or other data science cloud stacks? Just get in touch:

E-Mail: info@quantargo.com

Appendix - Full Code

The full CDK code stack for this post is available on Github.

Below you find the full code specifying the stack from lib/dashboards-stack.ts:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2'
import * as ecs from 'aws-cdk-lib/aws-ecs'
import * as cdk from 'aws-cdk-lib'

export class DashboardsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'MyVpc');

    const lb = new elbv2.ApplicationLoadBalancer(this, 'LB', {
      vpc: vpc,
      internetFacing: true,
      loadBalancerName: 'DashboardBalancer'
    });

    const cluster = new ecs.Cluster(this, 'DashboardCluster', {
      vpc: vpc
    });

    const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDefinition', {
      cpu: 512,
      memoryLimitMiB: 1024,
    });

    const port = 3838

    const container = taskDefinition.addContainer('Container', {
      image: ecs.ContainerImage.fromRegistry('rocker/shiny'),
      portMappings: [{ containerPort: port }],
    })
    
    const service = new ecs.FargateService(this, 'FargateService', {
      cluster: cluster,
      taskDefinition: taskDefinition,
      desiredCount: 1,
      serviceName: 'FargateService'
    })

    const tg1 = new elbv2.ApplicationTargetGroup(this, 'TargetGroup', {
      vpc: vpc,
      targets: [service],
      protocol: elbv2.ApplicationProtocol.HTTP,
      stickinessCookieDuration: cdk.Duration.days(1),
      port: port,
      healthCheck: {
        path: '/',
        port: `${port}`
      }
    })

    const listener = lb.addListener(`HTTPListener`, {
      port: 80,
      defaultAction: elbv2.ListenerAction.forward([tg1]) 
    })

    new cdk.CfnOutput(this, 'LoadBalancerDNSName', { value: lb.loadBalancerDnsName });
  }
}

Creating a Dashboard Framework with AWS (Part 1)

Wed, 09 Mar 2022 12:00:00 +0000

R-Shiny is an excellent framework to create interactive dashboards for data scientists with no extensive web development experience. Similar technologies in other languages include the Python frameworks Flask, Dash and Streamlit. Bringing all these different dashboards under one roof, including unified authentication and user management, can be a challenging task. In this blog series we will show how we’ve implemented such a framework with AWS.

Use Case and Requirements

The dashboard framework was created for a research department at a major financial institution. Analysts and data scientists already had created dashboards covering different topics based on numerous technologies including R-Shiny and Python-Flask. However, a secure and unified user authentication mechanism is crucial to put the dashboards into production and restrict access only to selected users. Additionally, most analysts and data scientists do not have much dev-ops experience such as Docker containers and thus needed an easy and automated way to adapt their existing dashboards. Last but not least, the team head count was limited on the system operations side, so a simple solution with low maintenance was needed. The entire solution needed to be implemented through Amazon Web Services (AWS) as the cloud provider of choice.

Based on this situation we were asked to create a dashboard framework architecture with these requirements in mind:

Secure, end-to-end encrypted (SSL, TLS) access to dashboards.
Secure authentication through E-mail and Single-Sign-On (SSO).
Horizontal scalability of dashboards according to usage, fail-safe.
Easy adaptability by analysts through automation and continuous integration (CI/CD).
Easy maintenance and extensibility for system operators.

System Architecture

All considerations above led to a simple yet effective system architecture based on selected managed AWS services including

Application Load Balancer (ALB) to handle secure end-to-end (SSL) encrypted access to the dashboards based on different host names (host-based-routing).
AWS Cognito for user authentication based on E-mail and SSO through Ping Federate.
AWS Fargate for horizontal scalability, fail-safe operations and easy maintenance.
AWS CodePipeline and CodeBuild for automated builds of dashboard Docker containers.
Extensive usage of managed services requiring low maintenance (Fargate, Cognito, ALB) and the AWS Cloud Development Kit (CDK) to define infrastructure as code, managed in Git and deployed via CodePipeline.

The figure below illustrates the resulting architecture in more detail:

1. Application Load Balancer

A central piece of the system architecture is the Application Load Balancer (ALB) to route traffic securely to each dashboard. We configured the ALB with host-based routing, so that requests to e.g. https://dashboard1.domain.com or https://dashboard2.domain.com are routed to the respective dashboards. The ALB handles SSL-offloading so that all communication between clients and the load balancer is end-to-end SSL or TLS encrypted. Additionally, we use a feature of the ALB to authenticate users through an OIDC-compliant identity provider, such as Amazon Cognito. Thus, all users without an authentication token are redirected to a login page, as provided by a Cognito Hosted UI. After successful authentication users are allowed to access the respective dashboard of choice.
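
To make the request flow more tangible, here is a small conceptual model of the decision the load balancer takes for each request. This is only a sketch and not CDK code: the function route, the mapping hostRules, the target group names and the notFound fallback for unknown hosts are hypothetical and merely illustrate host-based routing combined with the authentication redirect described above.

```typescript
// Conceptual model only; the real ALB is configured through listener rules.
type RoutingDecision =
  | { kind: 'forward'; target: string }
  | { kind: 'redirectToLogin' }
  | { kind: 'notFound' };

// Hypothetical mapping of host names to dashboard target groups.
const hostRules: { [host: string]: string } = {
  'dashboard1.domain.com': 'dashboard1-target-group',
  'dashboard2.domain.com': 'dashboard2-target-group',
};

function route(host: string, hasAuthToken: boolean): RoutingDecision {
  const key = host.toLowerCase();
  // Only accept hosts that were explicitly configured.
  if (!Object.prototype.hasOwnProperty.call(hostRules, key)) {
    return { kind: 'notFound' };
  }
  // Users without an authentication token are sent to the login page first.
  if (!hasAuthToken) {
    return { kind: 'redirectToLogin' };
  }
  return { kind: 'forward', target: hostRules[key] };
}

console.log(route('dashboard1.domain.com', false));
console.log(route('dashboard1.domain.com', true));
```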

2. AWS Cognito for User Authentication

We used Cognito as a managed identity provider by AWS supporting all important authentication mechanisms like e-mail/password (plus MFA) and federation providers like Google, Facebook or Apple. Most importantly, Cognito also supports SAML providers like Ping Federate for SSO within large corporations. The login form is also hosted by Cognito and presented to users who have not yet logged into any dashboard:

3. AWS Fargate

All Dashboards are running within Docker containers and hosted as Fargate Tasks within a common cluster. This makes it possible to create dashboards independently from each other including different versions of R (or Python), packages and even operating systems. The pricing of Fargate tasks is comparable to EC2 depending on the CPU/Memory configuration but comes with the advantage of being completely managed. This also makes auto-scaling a breeze which adds new tasks depending on the current workload.

4. Code Pipeline

Time to market is essential in many industries to get changes and features to the customer as fast as possible. Additionally, many dashboard developers do not want to be occupied with dev-ops tasks like Docker containers and bash scripting. By using CodePipeline we made sure that dashboard developers only need to push changes to the repository: the pipeline builds the Docker container, pushes it to the Elastic Container Registry (ECR) and subsequently deploys the new container to the cluster using CodeDeploy. The deployment ensures that users have a seamless experience by redirecting new sessions to new instances and dropping old instances once no open sessions are left.

5. CDK and CI/CD

The AWS Cloud Development Kit (CDK) was a very important tool to quickly set up the entire stack including infrastructure components, build pipelines, and even domain entries. TypeScript was our language of choice since (1) it has the best community support (followed by Python) and (2) CDK itself is written in TypeScript, which makes debugging much easier. Since CDK code gets synthesized to AWS CloudFormation templates through the command cdk synth, developers get immediate feedback if something goes wrong, which shortens the feedback cycle. Through cdk deploy the template can be uploaded and deployed as a CloudFormation stack. Thanks to the infrastructure-as-code principle it is very easy to track changes in Git version control and deploy stacks to multiple accounts for development/staging/production.

Conclusion

We have given an overview of the system architecture to deploy a simple yet powerful dashboard framework within Amazon Web Services. The presented framework fulfilled all security requirements and requires little maintenance effort thanks to many integrated managed AWS services. In the next post we will show how such a framework can be built from scratch using CDK including dashboard templates for R-Shiny, Flask, Dash and Streamlit.

Stay tuned! ✌️

Get in Touch

Interested in creating your own dashboard framework or other data science cloud stacks? Just get in touch:

E-Mail: info@quantargo.com

Quantargo Workspace Now Out of Beta

Tue, 28 Sep 2021 12:30:00 +0000

We’re thrilled to announce that Quantargo Workspace is now out of Beta and generally available! Quantargo Workspace lets you easily create and manage data science projects using R and Python, with advanced features like publishing, scheduling and credential management. Get started here for free.

New Features

In tandem with the launch we also added awesome new features which enable a host of new use-cases:

📝 Publishing

Publishing makes it dead simple to quickly share outputs of your workspace like reports, plots or data sets. Simply hit the “Publish” button and let the magic happen: the file is executed and all outputs are automatically published to a unique URL that you can share! This URL is always up-to-date, so if you re-publish your file the publication will reflect this automatically. This works with any R or Python code as well as RMarkdown documents!

Published outputs can then be viewed and shared via a standalone link:

⏱️ Scheduling

You can now create schedules from a new panel in the workspace editor. Schedules allow you to automate tedious tasks like report generation and data aggregation by running your code in regular intervals. Different intervals are supported like daily, weekly and monthly:

You can create multiple schedules, each with different intervals and times. This makes it perfect for report generation, and together with Auto-Publish you get an always up-to-date link for your reports. Scheduling has been in the works for quite some time and it is finally ready, so please try it out and let us know what you think!

🔑 Credential Management

With this latest addition, you can now store confidential credentials like API keys and service credentials. Secrets allow you to securely store and use secrets in your code without exposing them. They are encrypted at rest and never shared.

Together with scheduling this allows you to securely connect to third party APIs. Check out the new Twitter Bot template for how to connect to the Twitter API through Quantargo Workspace.

➡️ Get Started for Free Now

Limited time coupon code for our Developer and PRO plans: Use the code FREEWORKSPACE at checkout to get the first month completely free! Our paid plans allow you to create private workspaces and give you a lot more API calls.

That’s it for now. Stay safe and healthy! ✌️😃

Data Science Conference Austria 2021

Thu, 23 Sep 2021 17:00:00 +0000

Data Science Conference (DSC) Austria is knocking on YOUR door! This time the theme is AI-powered sustainability: Save the world through data! And best of all, we still have free tickets until Sept 25, so be quick! 👌💪🤞

DSC Austria will happen on September 27-28th and during the event, you will get a chance to listen to over 3 Keynotes, 25 high-quality talks and 6 tech tutorials on the topic of Sustainability, AI & ML, Data-Driven Decision Making and Data & AI Literacy—but that’s not all!

With the DSC Austria ticket you get:

✅ Full access to DSC Austria 2021 talks and sessions

✅ Entry to virtual networking sessions

✅ Online certificate of attendance

Check it out and reserve your spot:

RESERVE FREE TICKET • CHECK FULL PROGRAM

Introducing the 30 Day Sustainability Data Challenge

As part of Quantargo’s Tech tutorial on Sept 27 at 9 AM CET we will start the 30 Day Sustainability Data Challenge. The challenge is inspired by the 30 Day Chart Challenge and asks participants to post interesting visualizations covering sustainability on Twitter. Anyone is welcome to contribute, no matter which data source or tool you use.

The only rules are:

Include the hashtag #30DaySustainabilityDataChallenge.
Include a link to the source code of your analysis/visualization.
And, most importantly, add an interesting visualization, animation or meme on sustainability.

You can also consider adding other hashtags like #rstats or #sustainability to reach more people.

At the end of the challenge we will sum the number of likes and retweets of each Twitter account that participated and posted according to the above guidelines. We will also post rankings as the challenge progresses. It is allowed and even encouraged to create scheduled Twitter bots using our Quantargo workspace (see next section).
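
For those who want to track their own standing, the scoring rule can be sketched as follows. This is only an illustration under assumptions: the Tweet shape and the functions qualifies and rankAccounts are hypothetical, a tweet is treated as linking to source code if it contains any http(s) link, and the quality of the visualization obviously cannot be checked automatically.

```typescript
// Hypothetical tweet record; field names are assumptions for this sketch.
interface Tweet {
  account: string;
  text: string;
  likes: number;
  retweets: number;
}

const CHALLENGE_HASHTAG = '#30daysustainabilitydatachallenge';

// A tweet counts if it carries the hashtag and at least one link.
function qualifies(tweet: Tweet): boolean {
  const text = tweet.text.toLowerCase();
  return text.indexOf(CHALLENGE_HASHTAG) !== -1 && /https?:\/\//.test(text);
}

// Sum likes and retweets per account and sort by descending score.
function rankAccounts(tweets: Tweet[]): { account: string; score: number }[] {
  const totals: { [account: string]: number } = Object.create(null);
  for (const tweet of tweets) {
    if (!qualifies(tweet)) continue;
    totals[tweet.account] = (totals[tweet.account] || 0) + tweet.likes + tweet.retweets;
  }
  return Object.keys(totals)
    .map(account => ({ account, score: totals[account] }))
    .sort((a, b) => b.score - a.score);
}

const exampleRanking = rankAccounts([
  { account: 'alice', text: 'CO2 map #30DaySustainabilityDataChallenge https://example.com/code', likes: 10, retweets: 2 },
  { account: 'bob', text: 'No hashtag here https://example.com/code', likes: 50, retweets: 9 },
]);
console.log(exampleRanking);
```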

🏆🏆🏆 And here come the prizes:

The first place receives a yearly subscription to our Quantargo workspace, a yearly subscription to all online courses and a seat at our next Advanced Data Transformation workshop worth €950 including 4 dates in November and lifetime access to all materials. Yes, we will also send a Quantargo goodie-bag including sweets and LOTs of hex-stickers.
The second and third places each get a seat at the Advanced Data Transformation workshop and a yearly subscription to our Quantargo workspace.
The fourth and fifth places each get a yearly subscription to the Quantargo workspace.

Quantargo Workspace

In the tech tutorial during the conference on Sept 27 at 9 AM CET, we will also introduce the brand new scheduling and (encrypted) secrets features of the Quantargo workspace. With these new features it is very easy to create scheduled R Bots which tweet new messages at a specified time and interval. We will show some examples of how to create bots tweeting about sustainability.

Additionally, the workspace is great to seamlessly create APIs. We will show an example covering an Airbnb dataset in Vienna to

Create an XGBoost model using tidymodels to predict apartment prices based on attributes in listings.
Use the model to find cheap apartments in Vienna and plot them with the leaflet package.
Use the API to programmatically create plots and find apartments in that area.

CHECK AIRBNB WORKSPACE

So stay tuned and healthy, see you at the conference and happy to see your posts! #30DaySustainabilityDataChallenge.

The Elon Musk Tweet Effect on Dogecoin (DOGE)

Fri, 16 Jul 2021 19:00:00 +0000

Unveil the Dogefather

Elon Musk is known for his regular tweets about many different topics, in particular his companies Tesla and SpaceX. With close to 60 million followers he truly is a Twitter celebrity and his opinions have a big impact on technologies and companies. Most recently his tweets also covered Dogecoin, a cryptocurrency featuring a dog. With a little R code we checked the effect of his tweets on the Dogecoin price and discovered significant spikes.

Ingredients

As data sets we use the tweet timeline from @elonmusk. With the rtweet package the timeline can be downloaded as

library(rtweet)

## Visit https://developer.twitter.com/ to get access key
token <- create_token(
  app = "YourAppName",
  consumer_key = "",
  consumer_secret = "",
  access_token = "",
  access_secret = ""
  )

tmls <- get_timelines("elonmusk",
  n = 3200,
  token = token)

Note that the token creation can be a bit tricky since you first need to register an App at the Twitter developer page https://developer.twitter.com. It’s important to fill in not only the consumer_key/consumer_secret but also the access_token/access_secret to successfully create the token. After the successful retrieval we get a data frame with (an excerpt of) the tweets from the @elonmusk timeline:

library(dplyr)

tmls %>%
  select(created_at, text)

# A tibble: 600 x 2
   created_at          text
   <dttm>              <chr>
 1 2021-07-13 03:05:20 "those who attack space\nmaybe don’t realize …
 2 2021-07-13 02:39:11 "@Rogozin 👏👏"                               
 3 2021-07-13 02:37:57 "Loki is pretty good. Basically, live-action …
 4 2021-07-13 02:33:53 "@dogeofficialceo 🤣"                         
 5 2021-07-13 02:33:26 "@CGDaveMac Maybe if it sees a Shiba Inu, the…
 6 2021-07-13 02:30:16 "🤯 https://t.co/Z11qszTY4v"                  
 7 2021-07-12 22:18:34 "@OwenSparks_ @jeremyjudkins Haha Buzz Corp –…
 8 2021-07-12 22:07:39 "@ErcXspace @kimpaquette Interesting idea"    
 9 2021-07-12 21:40:43 "@kimpaquette Not yet, but they will. It’s ne…
10 2021-07-12 21:38:29 "@cleantechnica OPP? https://t.co/muZdxKdUXz" 
# … with 590 more rows

We executed get_timelines() multiple times to get most tweets out of the timeline.

To study the price effect of his tweets the binancer package was used to download intraday open-high-low-close (OHLC) data in 1-minute intervals from the Binance cryptocurrency exchange. We decided on the Dogecoin vs. Bitcoin (DOGE/BTC) currency pair to also adjust for overall market movements and to better see the price effect. The function binance_klines() returns a data.table containing all intraday pricing data:

# Install through `remotes::install_github("daroczig/binancer")`
library(binancer) 
binance_klines("DOGEBTC", interval = "1m") %>%
  select(open_time, open, high, low, close, volume)

               open_time     open     high      low    close volume
  1: 2021-07-16 07:51:00 5.78e-06 5.78e-06 5.77e-06 5.78e-06  80201
  2: 2021-07-16 07:52:00 5.78e-06 5.78e-06 5.77e-06 5.77e-06   2337
  3: 2021-07-16 07:53:00 5.78e-06 5.78e-06 5.77e-06 5.78e-06   6161
  4: 2021-07-16 07:54:00 5.77e-06 5.78e-06 5.77e-06 5.78e-06  31250
  5: 2021-07-16 07:55:00 5.78e-06 5.78e-06 5.77e-06 5.77e-06  67220
 ---                                                               
496: 2021-07-16 16:06:00 5.65e-06 5.65e-06 5.63e-06 5.65e-06 145786
497: 2021-07-16 16:07:00 5.65e-06 5.65e-06 5.64e-06 5.64e-06 135197
498: 2021-07-16 16:08:00 5.65e-06 5.65e-06 5.64e-06 5.65e-06  14782
499: 2021-07-16 16:09:00 5.65e-06 5.65e-06 5.64e-06 5.65e-06 254613
500: 2021-07-16 16:10:00 5.65e-06 5.66e-06 5.64e-06 5.66e-06  85440

Twitter Event Study

A typical way to study such events on financial markets is to look at the price movements right before and after each tweet happened. Especially the price action around each tweet can give us an indication of its market effect. For this analysis it is critical to correctly join our 2 data sources, containing Twitter and price data. We also need to add the time relative to the tweet event (e.g. minutes relative to the tweet timestamp), which can be used as a common scale for plotting. For this task the function find_price_window() was created to return the relative price changes around a specified date:

library(lubridate)

find_price_window <- function(date, sym = "DOGEBTC", window_length = 20) {
  # Round the tweet timestamp down to the full minute
  date_rounded <- floor_date(date, unit = "minute")
  # Window of `window_length` minutes before and after the tweet
  start_time <- date_rounded - window_length * 60
  end_time <- date_rounded + window_length * 60
  dodgebtc <- binance_klines(sym, interval = '1m', start_time = start_time, end_time = end_time)
  # Round candle close times up to the full minute so they line up with date_rounded
  dodgebtc$close_time <- ceiling_date(dodgebtc$close_time, unit = "minute")
  # Reference price: close of the candle ending at the tweet minute
  close_zero <- dodgebtc$close[dodgebtc$close_time == date_rounded]
  out <- dodgebtc %>%
    mutate(timediff = difftime(close_time, date_rounded, units = "mins")) %>%
    # Price change relative to the reference price
    mutate(price_rel = close/close_zero - 1) %>%
    mutate(date_rounded = date_rounded) %>%
    select(date_rounded, time = timediff, price = price_rel, volume = taker_buy_base_asset_volume) %>%
    arrange(time)
  out
}
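
For readers who want to port the window logic to another stack, the same calculation can be sketched in TypeScript. This is an illustration under assumptions, not the code used for this analysis: the Candle and WindowPoint types and the function relativePriceWindow are hypothetical, and the candles are expected to be downloaded beforehand (the sketch does not call the Binance API).

```typescript
// Hypothetical types; only close time and close price are needed here.
interface Candle {
  closeTime: Date;
  close: number;
}

interface WindowPoint {
  minutes: number;  // minutes relative to the (rounded) event time
  priceRel: number; // price change relative to the close at minute 0
}

const MS_PER_MINUTE = 60000;

function floorToMinute(d: Date): number {
  return Math.floor(d.getTime() / MS_PER_MINUTE) * MS_PER_MINUTE;
}

function ceilToMinute(d: Date): number {
  return Math.ceil(d.getTime() / MS_PER_MINUTE) * MS_PER_MINUTE;
}

function relativePriceWindow(candles: Candle[], eventTime: Date, windowMinutes: number = 20): WindowPoint[] {
  const t0 = floorToMinute(eventTime);
  // Align candle close times to full minutes, as in the R function above.
  const aligned = candles.map(c => ({ t: ceilToMinute(c.closeTime), close: c.close }));
  const reference = aligned.filter(c => c.t === t0)[0];
  if (reference === undefined) {
    throw new Error('No candle closes at the event minute');
  }
  const base = reference.close;
  return aligned
    .filter(c => Math.abs(c.t - t0) <= windowMinutes * MS_PER_MINUTE)
    .map(c => ({ minutes: (c.t - t0) / MS_PER_MINUTE, priceRel: c.close / base - 1 }))
    .sort((a, b) => a.minutes - b.minutes);
}

// Made-up prices around an event at 19:49:56 UTC.
const sampleCandles: Candle[] = [
  { closeTime: new Date(Date.UTC(2021, 4, 24, 19, 48, 59, 999)), close: 5.0e-6 },
  { closeTime: new Date(Date.UTC(2021, 4, 24, 19, 49, 59, 999)), close: 5.1e-6 },
];
console.log(relativePriceWindow(sampleCandles, new Date(Date.UTC(2021, 4, 24, 19, 49, 56))));
```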

Now we can create the data table dogetweets which contains the tweets AND the price action around each tweet by the relative time:

library(purrr)

dogetweets <- tmls %>%
  filter(grepl("doge", text)) %>%
  mutate(date = as.Date(created_at)) %>% 
  mutate(price_window = map(created_at, find_price_window))  %>%
  mutate(event_num = 1:nrow(.))

dogetweets %>%
  select(created_at, text, price_window)

# A tibble: 12 x 3
   created_at          text                           price_window
   <dttm>              <chr>                          <list>
 1 2021-07-13 02:33:53 "@dogeofficialceo 🤣"

We can finally unnest() the price data from price_window and create a ggplot containing the price movements around each @elonmusk Doge tweet:

library(ggplot2)
library(tidyr)

dogetweets %>%
  unnest(price_window) %>%
  select(created_at, date_rounded, time, price, volume) %>%
  mutate(time = as.numeric(time)) %>%
  mutate(created_at = format(created_at, "%Y-%m-%d %H:%M:%S")) %>%
  ggplot(mapping = aes(x = time, y = price)) + 
  geom_line(aes(color = created_at, group = created_at)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_vline(xintercept = 0) + 
  ylab("") + 
  xlab("Minutes to Tweet Creation") + 
  ggtitle("Price Impact DOGE/BTC around @elonmusk Tweet") +
  theme_minimal()

We can also show a table containing the top 10 tweets by absolute price movement 10 minutes after the tweet:

dogetweets %>%
  unnest(price_window) %>%
  filter(time == 10) %>%
  arrange(desc(abs(price))) %>%
  select(created_at, price, text) %>%
  head(10)

# A tibble: 10 x 3
   created_at             price text
   <dttm>                 <dbl> <chr>
 1 2021-05-24 19:49:56  0.0494  "If you’d like to help develop Doge,…
 2 2021-05-25 05:37:12  0.0144  "@heydave7 @dogecoin_devs Doge has d…
 3 2021-05-20 13:57:18  0.0131  "@thatdogegirl @WhatsupFranks @Tesla…
 4 2021-06-01 23:54:20  0.0100  "@dogeofficialceo @SouthPark When I …
 5 2021-06-09 20:07:38  0.00757 "@dogeofficialceo @MattWallace888 No…
 6 2021-07-13 02:33:53  0.00647 "@dogeofficialceo 🤣"                
 7 2021-06-25 02:00:20 -0.00508 "@hiddin2urleft @ItsDogeCoin @Invest…
 8 2021-07-08 22:34:41  0.00321 "@dogeofficialceo @newscientist Kind…
 9 2021-05-22 22:07:12 -0.00221 "@flcnhvy @thatdogegirl @WhatsupFran…
10 2021-06-05 08:21:59  0.00195 "@lexfridman @VitalikButerin @ethere…

Results

For some tweets we indeed see a slight price effect on the DOGE/BTC quote on the Binance exchange. In our sample, the tweet with the strongest effect, a positive price reaction of almost 5% versus Bitcoin within 10 minutes, seems to be this one:


If you’d like to help develop Doge, please submit ideas on GitHub & https://t.co/liAPQMFaQB @dogecoin_devs

— Elon Musk (@elonmusk) May 24, 2021


Reproducing Results, QBit Workspace

If you’re interested in a fully reproducible workspace including all data sets for download, check out the workspace created for this post.

In the next post we will investigate how the sentiment of tweets may affect the price direction of specific markets.

Happy coding!

Full Workspace Automation through a Programmatic Interface (API) Available Now

Mon, 05 Jul 2021 15:00:00 +0000

Each workspace already is an API

QBit Workspace is a new service to immediately deploy data science results at scale. You can think of it as an online data science editor (like RStudio) which can also be controlled and automated from any programming language through a REST API. Once a workspace has been created (including code, environment objects and files), there is no need for a separate (API) deployment step any more. Each workspace already is an API. Through its REST interface it can be embedded into any application or programming language without running and managing your own R or Python server.

We’re now happy to announce the launch of our API service in public beta, which lets you control every aspect of the workspace programmatically, including actions like:

Workspace creation
Workspace deployment
Code execution
Rendering of RMarkdown documents
File up- and downloads
Package install/remove

The API thus enables completely new use cases which can be easily embedded with any programming language into web applications or mobile apps. No API packages like R plumber or Python Flask are needed!

Get Started with the QBit Workspaces API

To use the API from R, first install the qbit package from the Quantargo GitHub repository:

remotes::install_github("quantargo/qbit")

Next, you need to retrieve your free API key from the Quantargo page settings section:

For more information about API key creation and usage also see our detailed step-by-step guide. Ideally, set your API key QKEY through the options() settings as

options(QKEY = "")

so that all further API calls use the key accordingly. Now you are ready to interact with QBit workspace! As a first example, we’ll show how to create an API-ready RMarkdown report within R.
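Before moving on, one way to keep the key out of your scripts is to read it from an environment variable. The following is a minimal sketch: the variable name QBIT_API_KEY is a hypothetical choice made for illustration, only the QKEY option itself comes from the description above.

```r
# Assumption: the key was stored beforehand in an environment variable
# named QBIT_API_KEY (hypothetical name), e.g. in ~/.Renviron.
api_key <- Sys.getenv("QBIT_API_KEY", unset = "")

if (nzchar(api_key)) {
  # Register the key under the QKEY option described above
  options(QKEY = api_key)
} else {
  message("QBIT_API_KEY is not set; QKEY option left unchanged.")
}

# TRUE once a non-empty key has been registered
key_is_set <- nzchar(getOption("QKEY", default = ""))
```

This way the key never appears in code that is shared or committed to version control.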

Creating RMarkdown Documents through the qbit R-API

RMarkdown combines markdown text with R outputs (e.g. plots, tables) to create reproducible documents in multiple output formats (e.g. HTML, PDF, Word, PowerPoint, see also here). Most R data scientists use their local (RStudio) environment to produce these reports. But what if we want to render these reports through a web application on the fly, maybe even parametrized or with updated input data sets? In the following section we’ll create a QBit workspace for RMarkdown to quickly render an HTML document through the API.

Let’s start by creating a new workspace based on the RMarkdown template:

qbit_id <- qbit::create(qbit_name = "RMarkdown Example HTML document")

qbit_id

[1] "qbit-rmarkdown-example-html-document-eGJWV404T"

The created workspace received a new and unique qbit_id based on its qbit_name title. You can also visit the new workspace online and even share its link with your friends/co-workers. Further changes to your workspace can now be done through the API using the qbit::deploy() function or directly within the online editor.

Once you are satisfied with your workspace you can run specific R commands, retrieve their respective outputs and integrate them into your application. Most typically, you might want to execute specific commands like predict() (for model predictions) or any kind of user-defined function through qbit::run. The qbit::run interface executes arbitrary R code, so it is very general and can support complex API use cases. For our RMarkdown use case we would like to render the main.Rmd file as an HTML document with qbit::render:

render_out <- qbit::render(qbit_id)

The $console_output element contains a data frame (tibble) of all content created by the call:

render_out$console_output

# A tibble: 8 x 3
  type       content                                          name   
  <chr>      <chr>                                            <chr>  
1 code-input "rmarkdown::render(\"main.Rmd\")"                   
2 code-mess… "Warning message: \n\nprocessing file: main.Rmd…    
3 code-outp… "\r  |                                         …    
4 code-outp… "\r  |                                         …    
5 code-mess… "Warning message: output file: main.knit.md\n\n"    
6 code-outp… "/usr/bin/pandoc +RTS -K512m -RTS main.utf8.md …    
7 code-mess… "Warning message: \nOutput created: main.html\n"    
8 file       "https://cdn.quantargo.com/assets/user/courses/… main.h…

The link to the rendered RMarkdown document is located in the row where the type column equals "file":

library(dplyr)
render_out$console_output %>%
  filter(type == "file")

# A tibble: 1 x 3
  type  content                                              name    
  <chr> <chr>                                                <chr>   
1 file  https://cdn.quantargo.com/assets/user/courses/b8451… main.ht…

The included link in the content column can be easily integrated into your own web application (via an <iframe> tag) or just downloaded locally (e.g. via download.file() in R).
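As a rough sketch of the download route, the snippet below pulls the file link out of the console output with base R. To stay runnable offline it uses a small stand-in data frame instead of a real render_out$console_output, and the URL is a placeholder, not an actual rendered document.

```r
# Stand-in for render_out$console_output so the sketch runs offline;
# the URL below is a placeholder, not a real rendered document.
console_output <- data.frame(
  type    = c("code-input", "code-message", "file"),
  content = c('rmarkdown::render("main.Rmd")',
              "Output created: main.html",
              "https://example.com/main.html"),
  name    = c(NA, NA, "main.html"),
  stringsAsFactors = FALSE
)

# Keep only the rows that describe generated files
files     <- console_output[console_output$type == "file", ]
file_url  <- files$content[1]
file_name <- files$name[1]

# Download the document next to the script (needs network access,
# therefore not executed here):
# download.file(file_url, destfile = file_name, mode = "wb")
```

With a real result, replacing the stand-in by render_out$console_output should be the only change needed, assuming the columns are named as in the output shown above.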

Thanks to the serverless (AWS Lambda) back-end, QBit Workspace scales quickly to thousands of concurrent requests. The service is now available in public beta and can be deployed into your own infrastructure (Docker/container-based, including Lambda, Kubernetes and OpenShift) upon request.

Happy deploying!

Create and Preview RMarkdown Documents with QBit Workspace

Fri, 25 Jun 2021 16:35:00 +0000

RMarkdown is an excellent format to create documents which combine code outputs with text, a programming paradigm called Literate Programming first introduced by Donald Knuth. Although RMarkdown documents are mostly used by the R community, preferably within the RStudio IDE, the format is not restricted to the R language. Other language engines like Python, SQL or Julia can also be used with RMarkdown. The current knitr package version 1.33 even lists 44 available engines:

names(knitr::knit_engines$get())

 [1] "awk"       "bash"      "coffee"    "gawk"      "groovy"   
 [6] "haskell"   "lein"      "mysql"     "node"      "octave"   
[11] "perl"      "psql"      "Rscript"   "ruby"      "sas"      
[16] "scala"     "sed"       "sh"        "stata"     "zsh"      
[21] "highlight" "Rcpp"      "tikz"      "dot"       "c"        
[26] "cc"        "fortran"   "fortran95" "asy"       "cat"      
[31] "asis"      "stan"      "block"     "block2"    "js"       
[36] "css"       "sql"       "go"        "python"    "julia"    
[41] "sass"      "scss"      "R"         "bslib"

Thanks to the pandoc document converter RMarkdown also supports many different output formats which can be set with the output parameter in the YAML header, including:

HTML: Static HTML files, output: html_document
PDF: PDF documents generated through LaTeX, output: pdf_document
Word: Microsoft Word documents, output: word_document
Presentations: Presentation formats like MS PowerPoint, output: powerpoint_presentation
Dashboards: flexdashboard, output: flexdashboard::flex_dashboard
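To show where the output parameter lives, here is a small sketch that writes a minimal RMarkdown file with a YAML header to a temporary location and renders it if the tooling is present. The title and the document body are made up for illustration.

```r
# Write a minimal R Markdown file to a temporary location
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Output format demo\"",   # hypothetical title
  "output: html_document",           # swap for pdf_document, word_document, ...
  "---",
  "",
  "The mean of `mpg` in `mtcars` is `r mean(mtcars$mpg)`."
), rmd_file)

# Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)
}
```

Changing the single output line in the header is enough to switch formats; PDF output additionally needs a LaTeX installation.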

QBit Workspace facilitates the authoring of RMarkdown documents directly within the browser thanks to instant previews in the Viewer pane. The instant preview functionality leads to faster development of RMarkdown documents. See below a short presentation of how RMarkdown authoring works:

Create New RMarkdown Document

In the Workspaces section of your Dashboard you can create a New Workspace, enter its name and select the RMarkdown template:

Create HTML Document

Set output: html_document in the YAML header of the document and hit the Render button:

RMarkdown Example HTML document

Create PDF Document

Set output: pdf_document and Render:

RMarkdown Example PDF document

Create Word Document

Set output: word_document and Render:

RMarkdown Example Word document

Create Powerpoint Presentation

Set output: powerpoint_presentation and Render:

RMarkdown Example Powerpoint document

Give it a try by either creating a new workspace from scratch or by copying one of the existing QBit Workspace examples.

Happy reporting, feedback welcome! ✌️

Get your unique certificate with online assessments

Thu, 06 May 2021 11:10:00 +0000

At Quantargo, teaching is a big part of what we do. You can use our platform to dive into new data science skills and to understand previously untouched subjects–all in an easy-to-use and interactive environment. Our online data science courses provide a pre-configured data science environment so that you can focus solely on the content.

However, for many of you this may not be enough. Assessments are a crucial prerequisite to prove your skills, show your strengths and detect your weak spots. You may need to prove your skills to a third party, such as an institution or an employer, in which case being able to prove your practical and theoretical skills is very important. This is exactly what we’re tackling with this latest update to our course platform, by introducing online assessments.

How it works

Our course lessons give learners a friendly and interactive environment to dive into new topics. If you forgot something while you’re doing an exercise, you can always go back to get a good grasp of it.

The new online assessments raise the bar compared to the course mode. First you need to unlock the assessment by finishing the course. This makes sure you have all the information you need to pass the test. Then, during the assessment you have only limited time available to finish it. Furthermore, while the assessment is ongoing all course materials are locked, so you can’t cheat by opening the course on a different device. Your personal cheat sheets, however, remain accessible during the assessment.

After passing the final course assessment, you get a unique certificate.

All in all, you really need to know what you’re doing in order to pass it–which is exactly what you’d want from an assessment!

Consolidate Your Knowledge with Trainings

The final course assessment unlocks your unique certificate, but also tests your progress for the whole course! Trainings, on the other hand, allow you to start assessments per lesson. This is a perfect way to solidify your knowledge and find weak spots in your understanding of the subject matter. All your attempts are saved for easy review afterwards.

Available Now

Online assessments, as well as trainings, are available for all 15+ courses and lessons. If you’re a PRO subscriber already you can go to your dashboard right now and start training!

If you’re just getting started, check out the FREE introduction course lessons Basics and Data Frames and Tibbles. After completion you’ll find the new “Trainings” button enabled in your course dashboard!

Happy learning ✌️

New Course Available Now: Machine Learning with Tidymodels

Tue, 20 Apr 2021 09:30:00 +0000

The ever increasing application of machine learning models in industry and academia requires tools which are easy to use and ensure a reliable model fitting process. The R package universe covers practically all statistical models on the planet including all relevant machine learning models like neural nets, support vector machines, decision trees, and random forests. However, most of these packages do not provide a consistent interface, which makes it hard to fit and compare models from different families. Even worse, it is hard to create standardized workflows for typical machine learning projects which ensure that

no information leaks from the test data into model training, which would lead to overly optimistic performance numbers.
models are compared on the same re-sampling procedures.
performance metrics are calculated correctly.
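The first point can be illustrated without any modeling package. The following base R sketch (deliberately not tidymodels code) estimates centering and scaling parameters on the training rows only and reuses them for the test rows. The 80/20 split, the mtcars data set and the single predictor are assumptions chosen for illustration.

```r
set.seed(42)  # make the random split reproducible

# Hold out roughly 20% of the rows as test data
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Estimate preprocessing parameters on the training data only
hp_mean <- mean(train$hp)
hp_sd   <- sd(train$hp)

# Apply the training parameters to both sets
train$hp_scaled <- (train$hp - hp_mean) / hp_sd
test$hp_scaled  <- (test$hp  - hp_mean) / hp_sd

# Fit on training data, evaluate on unseen test data
fit  <- lm(mpg ~ hp_scaled, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Computing hp_mean and hp_sd on the full data set instead would let the test rows influence the preprocessing, which is exactly the kind of leak a standardized workflow is meant to rule out.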

The tidymodels framework is a new package ecosystem, in which all steps of the machine learning workflow are implemented through dedicated R packages. The consistency of these packages ensures their interoperability and ease of use. Most importantly, the framework makes your machine learning workflow easier to understand and faster to implement. tidymodels should definitely be part of every R data scientist’s tool box. Additionally, it fits perfectly into the tidyverse package ecosystem and provides excellent compatibility with packages like dplyr or ggplot2.

Each lesson in the Machine Learning with Tidymodels course module covers one essential skill which together completes the entire machine learning workflow:

The tidymodels Machine Learning Workflow: Start your machine learning journey and learn the most fundamental building blocks of the tidymodels framework.
Data Preprocessing with recipes: Learn why data preprocessing is crucial in your machine learning workflow and create your first data transformations with the recipes package.
Model Fitting with parsnip: Fit machine learning models using the parsnip package including linear regression, decision trees and boosting trees.
Model Evaluation and Performance Metrics with yardstick: Estimate model quality based on different performance metrics using the yardstick package.
Resampling techniques using rsample: Avoid overfitting by using resampling techniques including cross-validation and bootstrap using the rsample package.
Model optimization using tune: Optimize your model parameters using the tune package to find models which predict new data well.

➔ Get started for Free: Machine Learning with Tidymodels

Get Your Personalized Cheat Sheets

With the latest update on our course platform you can create your own personalized cheat-sheets based on your progress. See also this blog post for more information.

Get Your Certificate with PRO

After completing Machine Learning with Tidymodels you get a unique certificate, which you can download as PDF and include in your portfolio!

Learn more about PRO

Free Data Science Training for People with Disabilities

Tue, 13 Apr 2021 09:30:00 +0000

At Quantargo we’re on a mission to provide people with the best data science knowledge and (cloud-powered) tools so that they can find new jobs as data scientists, improve their skills in their current roles or do research in a powerful yet reproducible way.

However, we have realized that more often than not people with disabilities are left out of the equation in many ways. I’m very grateful to Iva Tsolova from Jamba who approached us and explained how she was able to organise trainings and successful job placements for Jamba’s students.

We are therefore very happy to announce that we are offering—together with Jamba and AI4DA—a completely free data science training for people with disabilities. The agenda of the trainings, quite similar to our corporate offerings, is centered around one course module with weekly onboarding/mentoring sessions, a final preparation/project session and a final exam.

The Introduction to R Module starts on April 22 at 4 PM CET and continues on a weekly basis until May 20. The live workshops will be held over Zoom.

There are still FREE SPOTS LEFT. If you want to join the training and additionally get a 6-month PRO subscription for free, please create an account at www.quantargo.com and send an email with your short résumé to courses@quantargo.com by April 20.

➔ Get started for Free: Introduction to R

Get Your Personalized Cheat Sheets

With the newest update on our course platform you can create your own personalized cheat-sheets based on your progress. See also this blog post for more information.

Get Your Certificate with PRO

After completing Machine Learning with Tidymodels you get a unique certificate, which you can download as PDF and include in your portfolio!

Learn more about PRO

Create Your Personal Cheat Sheets

Thu, 25 Mar 2021 12:00:00 +0000

Cheat Sheets are a handy way to have the most important facts right at your fingertips. Especially when learning new concepts or a whole programming language, cheat sheets can help to stay on top of all the new things you’ve just learned. When talking to learners we immediately sensed their big interest in getting additional materials to better keep track of important key concepts and code patterns.

Creating good cheat sheets is hard but there are many great examples in the R community. The best known are probably the ones created by RStudio, which are uniquely designed and very helpful. We carefully considered all the options concerning cheat sheets for our courses at Quantargo and ultimately took a quite different route.

The content in each of our course modules is structured through lessons which are further divided into different chapters. The key concept of each chapter is represented by a so-called recipe, which typically focuses on one code fragment at a time. For example, the recipe in the chapter Create a scatter plot with ggplot consists of the following fragment:

library(ggplot2)
ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

We took advantage of our unique course structure and now show all of the completed recipes in one unified, interactive view in the course dashboard. This gives you a grand overview of the whole course – your personalized cheat sheet² 😮.

And yes, you can also download the cheat sheets as PDFs which are not only personalized but also reflect your current learning progress. With this latest update you can now download cheat sheets for all of our 15+ course lessons. The current PDF cheat sheets might not compare to hand-crafted ones in terms of design but we think that they are even more helpful for repeating and memorizing key concepts. We’re glad to hear your feedback!

If you’re not subscribed to PRO yet, the new cheat sheets and course dashboard updates are also available for our free lessons!

Start your data science journey now at quantargo.com/courses and join our community of 2000+ learners. If you have already started this journey, head over to your course dashboard to download your new cheat sheets.

Inspecting Data Structures

Wed, 17 Mar 2021 20:00:00 +0000

The first step of any data related task is to inspect the data we are dealing with. This is crucial for data wrangling as well, since we need to explore the current structure of the data, in order to identify the required transformations.

Inspect tabular data interactively with View()
Examine the data structure of each object using str()

View(___)
str(___)

Interactive Inspection with View()

Before starting with any kind of data analysis, it is crucial to understand the data we are dealing with. Plotting is a very important tool to get a quick overview of the statistical properties of data and to detect possible outliers. However, visualization might not always be possible, due to the size or complexity of the data set.

As an alternative solution, it might be convenient to interactively dig through the data set. This could be done with a spreadsheet-like interface, similar to Microsoft Excel, which lets you filter, sort and inspect tabular data structures.

R provides the function View(), which shows an interactive data viewer. Depending on the platform and editor used, this viewer might look different. Below you can see an example of the View() function in RStudio:

View(gapminder)

Quiz: Interactive Inspection with View()

Why should you inspect data sets with View() before starting with your analysis?

Get a first impression of the data quality.
Find outliers and missing values.
Interactively inspect the data set.
Create reproducible outputs for reports.

Start Quiz

Exercise: Interactive Inspection with View()

Use the View() function on the gapminder data set and determine the country with the highest life expectancy. Also pay attention to the year the projection was made. Set the variables country and year accordingly!

Start Exercise

Examining Data Structures with str()

Sometimes we need to analyze very large and complex data structures. Displaying these data sources may already be overwhelming and simply not possible with interactive tools. In these cases, the str() function comes to the rescue and prints the structure, as well as the first few values of any R object. Even very large and complex data structures can easily be displayed in the console that way.

As an example, let’s take a look at the structure of the TitanicSurvival data set:

library(carData)
str(TitanicSurvival)

'data.frame':   1309 obs. of  4 variables:
 $ survived      : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
 $ sex           : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age           : num  29 0.917 2 30 25 ...
 $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...

It consists of three factor columns (survived, sex and passengerClass) and one numeric column age. Note that for factor columns both the labels (e.g. "no","yes") and the integer values are displayed.
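str() is not limited to data frames. As a small sketch, the nested list below (made-up values standing in for a more complex object) is inspected once in full and once with the nesting depth limited through the max.level argument.

```r
# A nested list standing in for a more complex object (hypothetical data)
experiment <- list(
  id      = 1L,
  tags    = c("pilot", "2021"),
  results = data.frame(run = 1:3, score = c(0.8, 0.9, 0.7)),
  config  = list(seed = 42, verbose = FALSE)
)

# Print the full structure, including nested elements
str(experiment)

# Limit the nesting depth for a more compact overview
str(experiment, max.level = 1)
```

Limiting the depth is handy for deeply nested objects, for example parsed JSON, where the full output would fill several screens.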

Quiz: Examining Data Structures with str()

In which cases is it beneficial to use the str() function?

Get an overview of highly complex data sets.
Create summary statistics describing the data set.
Plot histograms.
Only for data.frames. str() can only handle data.frames and cannot be used for other objects.

Start Quiz

Quiz: Interpret the Output of str()

library(babynames)
str(babynames)

tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
 $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
 $ sex : chr [1:1924665] "F" "F" "F" "F" ...
 $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
 $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
 $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...

Examine the output of the str() function with the babynames dataset above. Which statements about the data set are correct?

The data set has five rows.
The data set has five columns.
The prop column is of type numeric.
The column sex is of type factor.

Start Quiz

Inspecting Data Structures is an excerpt from the course Advanced Data Transformation, which is available at quantargo.com

VIEW FULL COURSE

Complete the Introduction to Machine Learning Course for Free until March 21

Tue, 09 Mar 2021 08:30:00 +0000

To all of you who want to get started with machine learning we have a special offer! Until March 21 you can finish all of the new Introduction to Machine Learning course lessons for free and collect the following recipes:

What is Machine Learning?: Differentiate between artificial intelligence, machine learning and deep learning. Identify machine learning use cases.
Machine Learning Techniques: Supervised-, unsupervised- and reinforcement learning: Learn about supervised-, unsupervised- and reinforcement learning techniques.
Supervised Learning with Regression and Classification: Know what predictors and outcome variables are. See how predictors differ in regression- and classification tasks.

The introduction to machine learning is an ideal preparation for our upcoming Machine Learning with Tidymodels course.

Stay tuned and have fun with the new course!

Get Your Free Certificate

Each lesson covers key concepts in small understandable chunks. After finishing all lessons you receive a unique certificate for completing the course. Download your certificate as a PDF and include it in your portfolio!

Happy learning!

START COURSE

The 3 Doors of Data Transformation

Thu, 04 Mar 2021 08:30:00 +0000

This course covers the three most popular package ecosystems for data transformation in R: base R, tidyverse and data.table. You will see which options are better suited for specific use cases in terms of stability, features, speed and consistency.

Get familiar with the main approaches for data handling in R
Understand the advantages and disadvantages of each option

Introduction

Data can come in many shapes and formats from various sources. The first step before any statistical analysis can be done is to transform the data into the most suitable format. Depending on the use case, this step might require different packages.

In R, there exist three different package ecosystems to transform data, namely base R, tidyverse and data.table. Although functions can often be combined across these ecosystems, this is not always possible due to subtle differences.

The most important difference lies in the fact that each ecosystem defines its own data frame object: data frames, tibbles and data tables. Although tibbles and data tables inherit behavior from their common ancestor data frame, some small differences make them hard to re-use in different ecosystems. Choose your door wisely.
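The shared ancestry can be checked directly with class() and inherits(). The sketch below uses a tiny made-up data frame. The tibble and data.table parts are optional and only run if the respective package is installed.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df)

# tibble and data.table are optional here; each branch runs only
# if the package is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(df)
  print(class(tbl))
  print(inherits(tbl, "data.frame"))
}

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  print(class(dt))
  print(inherits(dt, "data.frame"))
}

# One of the subtle differences: single-column subsetting
df[, "x"]   # a base data frame drops to a plain vector here
# tbl[, "x"] would stay a one-column tibble instead
```

Code that relies on the drop-to-vector behavior of base data frames is a typical example of something that breaks when it is handed a tibble.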

The base R Package Ecosystem

The base R package is already integrated into the basic R installation. Thus, it can be easily used even within very restrictive IT landscapes. It is also an appropriate choice for environments where frequent package installations and updates might be infeasible.

The base R package has already stood the test of time and is considered to be very stable, with only very few changes even over major version updates. Chances are high that some dated R code will still work after years, even on different machines or operating systems.

However, base R does not have the fastest performance for large data sets, compared to other packages and tools. In addition, due to its long history, some base R functions lack consistency and make common workflows harder to integrate. The feature set of base R for data manipulation tasks like joins or reshaping/pivoting is also lagging behind other packages.

Since base R is installed on every machine running R, it is important for every data scientist to know its features. Its power might surprise you, and you never know which machine you end up working with.

The tidyverse Package Ecosystem

The tidyverse package ecosystem provides many packages for data manipulation, most importantly dplyr and tidyr. These packages are well maintained and already widely adopted in the R community. Their clear and consistent syntax makes learning a breeze. Moreover, all common functions (or verbs) can be combined using the pipe %>% operator.

The feature set of tidyverse for data reshaping and joins is unparalleled in the R ecosystem. Through extension packages like dbplyr and sparklyr, you can even write queries for database or Spark cluster back ends. The respective queries get translated for the specific back end.

On the other hand, tidyverse has many package dependencies and it might be hard to install and maintain these dependencies in specific IT environments and production systems. The tidyverse packages are still subject to change but should become more stable in future versions.

The data.table Package Ecosystem

data.table is a highly optimized, in-memory transformation and query interface for tabular data. It is very well suited for operations like joins, value updates and filters on large tables (e.g. 10M rows+). The main reason for the large speed gains lies in the fact that data.table is very memory-efficient and tries to avoid copies of large tables as much as possible.

Data tables have some additional features compared to conventional data frames. For example, one can apply data transformation functions directly inside the subset operator [. However, these additional features might lead to constructs which are hard to understand for beginners or non-data.table users.

data.table is still one of the fastest in-memory tabular formats on the planet. The data.table function fread() is one of the fastest functions for reading large comma-separated files within R (and also compares well with other languages). The biggest reason for using data.table is simple: speed.

Pros and Cons

Depending on the requirements of the use case, specific package ecosystems stand out against their peers:

In terms of stability of the code (over years), the base R package should be considered.
The feature set for data manipulation seems to be broadest in the tidyverse ecosystem.
The data.table package is (still) the speed champion.
Interoperability and consistency for different data transformation problems seem to be best handled by the tidyverse ecosystem.

	base R	tidyverse	data.table
Stability	✅✅	✅	✅
Features	🆗	✅✅	✅
Speed	❌	✅	✅✅
Consistency	❌	✅✅	🆗
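To get a feeling for how the three doors differ in syntax, here is a small sketch that computes the same grouped mean in each ecosystem. The task (mean mpg per cyl in mtcars) is an arbitrary choice for illustration. The dplyr and data.table parts only run if those packages are installed.

```r
# Base R: always available
base_res <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# tidyverse (dplyr), if installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_res <- dplyr::summarise(
    dplyr::group_by(mtcars, cyl),
    mpg = mean(mpg)
  )
  print(dplyr_res)
}

# data.table, if installed: the aggregation happens inside [
if (requireNamespace("data.table", quietly = TRUE)) {
  dt     <- data.table::as.data.table(mtcars)
  dt_res <- dt[, list(mpg = mean(mpg)), by = cyl]
  print(dt_res)
}

base_res
```

All three variants return one row per cyl value. Row order and the class of the result differ, which is worth keeping in mind when mixing ecosystems.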

Quiz: Which Package Ecosystem to Choose with Storage Backends?

Which R package ecosystem should be chosen if data transformation code needs to be clean, fast and extensible through many storage backends?

base R
tidyverse
data.table

Start Quiz

Quiz: Which Package Ecosystem to Choose for Stability?

Which R package ecosystem should be chosen if data transformation code needs to be very stable and not many features are required?

base R
tidyverse
data.table

Start Quiz

Quiz: Which Package Ecosystem to Choose for Large Data Sets?

Which R package ecosystem should be chosen if huge data sets need to be processed and therefore maximum performance is required?

base R
tidyverse
data.table

Start Quiz

The 3 Doors of Data Transformation is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

New Course Available Now: Advanced Data Transformation

Fri, 26 Feb 2021 08:30:00 +0000

Data comes in many shapes and forms from all kinds of data sources. The first step before any statistical analysis can be done is to bring the data into a suitable format. In R, there are three different package ecosystems to transform data, namely base R, tidyverse and data.table.

Advanced Data Transformation covers the most popular ways of transforming data into all kinds of shapes and forms.

base R is already integrated into the R language itself
tidyverse provides many packages for data manipulation, most importantly dplyr and tidyr
data.table is a highly optimized, in-memory transformation and query interface for tabular data

There is no one-size-fits-all solution to a problem, so in Advanced Data Transformation you will learn how to use the right tool for your data use cases. For each available package ecosystem it covers all essentials, including:

Data Filtering
Grouping and Aggregating
Pivoting
Joins
and more!

➔ View Course: Advanced Data Transformation

Get Your Certificate with PRO

After completing Advanced Data Transformation you get a unique certificate, which you can download as PDF and include in your portfolio!

Learn more about PRO

Complete the Introduction to R Course for Free until March 7

Tue, 23 Feb 2021 08:30:00 +0000

To all of you who want to get started with data science and R we have a special offer! Until March 7 you can finish all of the new Introduction to R course lessons for free and collect the following badges:

R Basics: Start your R journey and learn the most fundamental building blocks
Data Frames and Tibbles: Create tabular data structures with data frames and see how they compare to tibbles.
Data Transformation with dplyr: Filter rows, select columns and sort/arrange datasets in combination with the pipe %>% operator.
Data Visualization with ggplot2: Understand the core principles of creating expressive visualizations.

The topics have been carefully selected to give you a contemporary introduction to data science with the R programming language. You will also learn how to create beautiful visualizations, plots and charts!

Get Your Free Certificate

The course gives a friendly introduction to data science topics with 4 in-depth lessons explaining key concepts and getting hands-on with code! After March 7th the course will still be available as part of our new PRO subscription, and some chapters will remain free even afterwards.

Start Course and Get Certificate

Happy learning!

Create your first bar chart

Tue, 16 Feb 2021 08:30:00 +0000

Create your first bar chart using geom_col()
Fill bars with color using the fill aesthetic

ggplot(___) + 
  geom_col(
    mapping = aes(x = ___, y = ___, 
                  fill = ___)
 )

Introduction to bar charts

Bar charts visualize numeric values grouped by categories. Each category is represented by one bar with a height defined by each numeric value.

Bar charts are well suited to compare values among different groups, e.g. number of votes by party, number of people in different countries or GDP per capita in different countries. Bar charts take up a fair amount of space and work best if the number of groups to compare is rather small.

Below you can find an example showing the number of people (in millions) in the five biggest countries by population in 2007:

Creating a simple bar chart

ggplot(___) + 
  geom_col(
    mapping = aes(x = ___, y = ___, 
                  fill = ___)
 )

In ggplot2, bar charts are created using the geom_col() geometric layer. The geom_col() layer requires the x aesthetic mapping, which defines the different bars to be plotted. The height of each bar is defined by the variable specified in the y aesthetic mapping. Both mappings, x and y, are required for geom_col().

Let’s create our first bar chart with the gapminder_top5 dataset. It contains population (in millions) and life expectancy data for the biggest countries by population in 2007.

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop))

We see that the resulting bars are sorted by the country names in alphabetical order by default.
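To run the example outside the course environment, a dataset like gapminder_top5 first has to be built. The following is a minimal sketch, assuming the gapminder package is installed and that gapminder_top5 is simply the five most populous countries of 2007 with pop converted to millions. The exact construction used in the course is not shown here, so treat this as an approximation.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the `gapminder` data frame

# Keep the 2007 rows and express the population in millions
gm_2007 <- subset(gapminder, year == 2007)
gm_2007$pop <- gm_2007$pop / 1e6

# Take the five most populous countries (assumed definition of gapminder_top5)
gapminder_top5 <- head(gm_2007[order(-gm_2007$pop), ], 5)

# country is a factor with alphabetically ordered levels,
# which is why the bars appear in alphabetical order
ggplot(gapminder_top5) +
  geom_col(aes(x = country, y = pop))
```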

Exercise: Plot life expectancy by country

Create a bar chart showing the life expectancy of the five biggest countries by population in 2007.

Use the ggplot() function and specify the gapminder_top5 dataset as input
Add a geom_col() layer to the plot
Plot one bar for each country (x aesthetic)
Use life expectancy lifeExp as bar height (y aesthetic)

Start Exercise

Filling bars with color

ggplot(___) + 
  geom_col(
    mapping = aes(x = ___, y = ___, 
                  fill = ___)
 )

Like other geoms, geom_col() allows users to map additional dataset variables to the color of the bars. The fill aesthetic fills the entire bar with color. A common source of confusion is the color aesthetic, which sets the line color of each bar’s border rather than the fill color.
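The difference between fill and color can be checked with a small experiment. The sketch below uses a hypothetical toy data frame (toy, with made-up values purely for illustration) and the ggplot2 helper layer_data() to inspect which colors each layer ends up with.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 2))

# fill: the inside of each bar is colored by group
p_fill <- ggplot(toy) +
  geom_col(aes(x = group, y = value, fill = group))

# color: only the border of each bar is colored by group,
# the inside keeps the default grey fill
p_color <- ggplot(toy) +
  geom_col(aes(x = group, y = value, color = group))

# Inspect the computed colors of both layers
layer_data(p_fill)$fill
layer_data(p_color)$colour
layer_data(p_color)$fill
```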

Based on the gapminder_top5 dataset we plot the population (in millions) of the biggest countries and use the continent variable to color each bar:

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop, fill = continent))

Since the continent variable is a categorical variable the bars have a clear color scheme for each continent. Let’s see what happens if we use a numeric variable like life expectancy lifeExp instead:

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop, fill = lifeExp))

The bar colors now follow the continuous color scale shown in the legend on the right. Numeric variables can therefore be used to fill bars as well.

Exercise: Plot population size by country

Create a bar chart showing the population (in millions) of the five biggest countries by population in 2007.

Use the ggplot() function and specify the gapminder_top5 dataset as input
Add a geom_col() layer to the plot
Plot one bar for each country (x aesthetic)
Use population pop as bar height (y aesthetic)
Use the GDP per capita gdpPercap as fill aesthetic

Start Exercise

Stacked bar charts

ggplot(___) + 
  geom_col(
    mapping = aes(x = ___, y = ___, 
                  fill = ___)
 )

In some circumstances it might be useful to plot multiple numeric values within each bar. Examples are numeric values describing one specific entity (e.g. customers) split among various categories (e.g. customer segments) so that the bar height represents the total number (all customers).

The plot below shows the number of phones (in thousands) by world region from 1956 to 1961 as a stacked bar chart:

ggplot(world_phones) + 
  geom_col(aes(x = year, y = phones,
               fill = region))
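The world_phones data frame is provided by the course environment. One way to obtain a similar long-format table is to reshape the WorldPhones matrix that ships with R. This is a sketch under the assumption that world_phones was derived from that matrix; the column names year, region and phones are chosen here to match the plot code.

```r
library(ggplot2)

# WorldPhones is a built-in matrix: rows are years, columns are regions
world_phones <- as.data.frame(as.table(WorldPhones),
                              stringsAsFactors = FALSE)
names(world_phones) <- c("year", "region", "phones")

# Keep the consecutive years 1956 to 1961 (the matrix also contains 1951)
world_phones <- subset(world_phones, year != "1951")

# One bar per year, stacked by region
ggplot(world_phones) +
  geom_col(aes(x = year, y = phones, fill = region))
```

Because several rows share the same year, geom_col() stacks their phones values on top of each other, so the total bar height is the sum over all regions.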

Exercise: Plot number of crimes by US states

Create a bar chart showing the number of crimes by US state per 100,000 residents in 1973.

Use the ggplot() function and specify the us_arrests dataset as input
Add a geom_col() layer to the plot
Plot one bar for each state (x aesthetic)
Use the number of cases as bar height (y aesthetic)
Use the crime type as fill aesthetic.

Start Exercise

Create your first bar chart is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Create a line graph with ggplot

Sat, 05 Sep 2020 09:56:42 +0000

Use the geom_line() layer to draw line graphs and customize their styling using the color parameter. Specify which observations belong to each line with the group parameter.

Create your first line graph using geom_line()
Define how different lines are connected using the group parameter
Change the line color of a line graph using the color parameter

ggplot(___) + 
  geom_line(
    mapping = aes(x = ___, y = ___, 
                  group = ___, 
                  color = ___)
)

Introduction to line graphs

Line graphs are used to visualize the trajectory of one numeric variable against another. Unlike scatter plots the x- and y-coordinates are not visualized through points but are instead connected through lines. Line graphs are most typically used if one variable changes continuously against another numeric variable which is the case for most time series charts (e.g. prices, customers, CO2 concentration, temperature over time), continuous functions (e.g. sine sin(x)) or other near-continuous relationships (real-world supply/demand curves).
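As a small illustration of the continuous-function case, the sine curve mentioned above can be drawn with geom_line(). This is a sketch with a generated data frame (sine_wave is a name chosen here, not part of the course material).

```r
library(ggplot2)

# Evaluate sin(x) on a fine grid between 0 and 2*pi
sine_wave <- data.frame(x = seq(0, 2 * pi, length.out = 100))
sine_wave$y <- sin(sine_wave$x)

# Neighbouring points are connected, which gives a smooth-looking curve
ggplot(sine_wave) +
  geom_line(aes(x = x, y = y))
```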

Quiz: Line Graphs

Which of the following statements about line graphs are correct?

Line graphs are typically used to plot the relationship between categorical and numeric variables.
Line graphs are typically used to plot variables of type numeric.
For line graphs it is not necessary that the relationship between two variables shows continuity.
Line graphs can be used to plot time series.

Start Quiz

Creating a simple line graph

ggplot(___) + 
  geom_line(
    mapping = aes(x = ___, y = ___, 
                  group = ___, 
                  color = ___)
)

Japan is among the countries with the highest life expectancy. Using the gapminder_japan dataset we determine how the life expectancy in Japan has developed over time. We need to:

Specify the dataset within ggplot()
Define the geom_line() plot layer
Map the year to the x-axis and the life expectancy lifeExp to the y-axis with the aes() function

Note that the ggplot2 library needs to be loaded first with library(ggplot2).

library(ggplot2)
ggplot(gapminder_japan) + 
  geom_line(
    mapping = aes(x = year, y = lifeExp)
)

Exercise: Plot life expectancy of Brazil

Create your first line graph showing the life expectancy of people from Brazil over time.

Use the ggplot() function and specify the gapminder_brazil dataset as input
Add a geom_line() layer to the plot
Map the year to the x-axis and the life expectancy lifeExp to the y-axis with the aes() function

Start Exercise

Adding more lines

ggplot(___) + 
  geom_line(
    mapping = aes(x = ___, y = ___, 
                  group = ___, 
                  color = ___)
)

So far we only focused on single lines, but what if we have multiple countries in the dataset and want to somehow differentiate them?

Line graphs are often extended and used for the comparison of two or more lines. Multiple line graphs show the absolute differences between observations but also how the specific trajectories relate to each other. For example, let’s answer the question: How has life expectancy changed in the countries Austria and Hungary over time?

We first filter the dataset for both countries of interest. Then, we set the variable country as the group argument for the aesthetic mapping. The group argument tells ggplot which observations belong together and should be connected through lines. By specifying the country variable ggplot creates a separate line for each country. To make the lines easier to distinguish we also map color to the country so that each country line has a different color.

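library(dplyr)      # provides filter()
library(ggplot2)
library(gapminder)  # provides the gapminder dataset

gapminder_comparison <- 
  filter(gapminder, country %in% c("Austria", "Hungary"))

ggplot(data = gapminder_comparison) + 
  geom_line(mapping = aes(x = year, y = lifeExp, 
                          group = country, 
                          color = country))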

Note that ggplot also separates the lines correctly if only the color mapping is specified (the group parameter is implicitly set).
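The implicit grouping can be verified by inspecting the data ggplot2 computes for the layer. The sketch below assumes the gapminder package is installed and uses base R's subset() instead of filter() so that it only depends on ggplot2 and gapminder.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the `gapminder` data frame

gapminder_comparison <- subset(gapminder,
                               country %in% c("Austria", "Hungary"))

# Only color is mapped, group is not specified
p_lines <- ggplot(gapminder_comparison) +
  geom_line(aes(x = year, y = lifeExp, color = country))

# layer_data() returns the computed layer data, including the group column
n_groups <- length(unique(layer_data(p_lines)$group))
n_groups
```

If the grouping works as described, n_groups equals the number of countries in the data, i.e. one line per country.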

Exercise: Compare life expectancy

Create a line graph to compare the life expectancy lifeExp in the countries Japan, Brazil and India.

Use the data set gapminder_comparison in your ggplot() function which contains only data for the countries Japan, Brazil and India.
Create a line graph with the geom_line() function
Map the year to the x-axis and the life expectancy lifeExp to the y-axis with the aes() function
Map the group and the color parameter to the country variable.

Start Exercise

Exercise: Compare populations

Compare the population growth over the last decades in the countries Austria, Hungary and Serbia.

Use the data set gapminder_comparison in your ggplot() function which contains only data for the countries in question.
Create a line graph with geom_line()
Map the year to the x-axis and the population pop to the y-axis with aes()
Map the group and the color parameter to the country variable.

Start Exercise

Quiz: Malformed Plot

gapminder_comparison <- filter(gapminder, country %in% c("Brazil", "China", "India"))
ggplot(data = gapminder_comparison) + 
  geom_line(mapping = aes(x = year, y = pop))

What has gone wrong in this plot?

The population numbers are scaled differently in the plotted countries
The group aesthetic should be used to map the population pop variable.
The color aesthetic should be used to map the life expectancy lifeExp variable.
The group aesthetic should be used to map the year variable.
The group aesthetic should be used to map the country variable.

Start Quiz

Create a line graph with ggplot is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Data Science Conference Austria 2020

Sat, 05 Sep 2020 00:00:00 +0000

Data Science Conference Austria 2020

Data Science Conference (DSC) Austria is knocking on YOUR door - and it is all for free! 👌💪🤞

DSC Austria will happen on September 8-9th and during the event, you will get a chance to listen to over 15 high-quality talks and 8 tech tutorials on the topic of AI & ML, Data-Driven Decision and Data & AI Literacy - but that is not all!

With the DSC Austria ticket you will get:

✅ Full access to DSC Austria 2020 talks and sessions

✅ Entry to virtual networking sessions

✅ Online certificate of attendance

Check it out and reserve your spot

RESERVE FREE TICKET

DSC Austria 2020 Program

On September 8th you can attend 2 Tech Tutorials & 3 Data Discussions. The tutorials are Use Julia for your Scientific Computing Jobs! by Przemyslaw Szufel from Nunatak Capital and Recommender Systems using Deep Graph Library and Apache MXNet by Cyrus Vahid from AWS. The data discussions are: Are Robo Bankers on our Doorstep?, May AI be Profitable and Ethical at the Same Time? and How AI is Fostering Dehumanization of Decision Making?. Our panelists will be Martin Moessler, Craig Matthews, Georg Koldorfer, Aleksandra Przegalinska & Wolfgang Kienreich.

On September 9th you will be able to listen to 7 excellent talks & participate in 2 networking sessions. We will start our day with the keynote talk Automated Machine Learning for Fast Experiments and Prototypes delivered by Philipp Singer & Dmitry Gordeev, from h2o.ai. After that, you will be able to listen to experts such as Dragan Pleskonjic, Ronald Hochreiter, Valentina Djordjevic and others.

CHECK FULL PROGRAM

Specify additional aesthetics for points

Tue, 28 Jul 2020 12:33:47 +0000

ggplot2 implements the grammar of graphics to map attributes from a data set to plot features through aesthetics. This framework can be used to adjust the point size, color and transparency alpha of points in a scatter plot.

Add additional plotting dimensions through aesthetics
Adjust the point size of a scatter plot using the size parameter
Change the point color of a scatter plot using the color parameter
Set a parameter alpha to change the transparency of all points
Differentiate between aesthetic mappings and constant parameters

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___, 
                  color = ___, 
                  size  = ___),
    alpha  = ___
  )

Adding more plot aesthetics

In their most basic form scatter plots can only visualize datasets in two dimensions through the x and y aesthetics of the geom_point() layer. However, most data sets have more than two variables and thus might require additional plotting dimensions. ggplot() makes it very easy to map additional variables to different plotting aesthetics like size, transparency alpha and color.

Let’s consider the gapminder_2007 dataset which contains the variables GDP per capita gdpPercap and life expectancy lifeExp for 142 countries in the year 2007:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp))

Mapping the continent variable through the point color aesthetic and the population pop (in millions) through the point size we obtain a much richer plot including 4 different variables from the data set:
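A plot along these lines could be produced as follows. This is a sketch that assumes the gapminder package is installed and rebuilds gapminder_2007 from it, with pop converted to millions; the course's own dataset may have been prepared differently.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the `gapminder` data frame

# Assumed construction of gapminder_2007: all rows of the year 2007
gapminder_2007 <- subset(gapminder, year == 2007)
gapminder_2007$pop <- gapminder_2007$pop / 1e6  # population in millions

# Four variables in one plot: x, y, color and size
ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp,
                 color = continent,
                 size  = pop))
```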

Quiz: geom_point() Aesthetics

Which aesthetics can be specified for geom_point()?

geom_line
color
point
alpha
size

Start Quiz

Adjusting point color

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___, 
                  color = ___, 
                  size  = ___),
    alpha  = ___
  )

Typically, the point color is used to introduce a new dimension to a scatter plot. In ggplot we use the color aesthetic to specify the mapping of a variable to the color of the points.

For the gapminder_2007 dataset we can plot the GDP per capita gdpPercap vs. the life expectancy lifeExp as follows:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp))

To color each point based on the continent of each country we can use:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp,
                 color = continent))

We see that in the resulting plot each point is colored according to the continent of its country. Because continent is a categorical variable, ggplot uses a discrete color scheme with one color per continent.

By contrast, let’s see what the plot looks like if we color the points by the numeric variable population pop:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp,
                 color = pop))

The scale immediately changes to continuous, as can be seen in the legend, and the light-blue points are now the countries with the largest populations (China and India).

Exercise: Reconstruct Gapminder graph

Reconstruct the following graph which shows the relationship between GDP per capita and life expectancy for the year 2007:

Use the ggplot() function and specify the gapminder_2007 dataset as input
Add a geom_point layer to the plot and create a scatter plot showing the GDP per capita gdpPercap on the x-axis and the life expectancy lifeExp on the y-axis
Make the color aesthetic of the points unique for each continent

Start Exercise

Exercise: Create a colored scatter plot with DavisClean

The DavisClean dataset contains the height and weight measurements of 199 people.

Use the ggplot() function and specify the DavisClean dataset as input
Add a geom_point() layer to the plot and create a scatter plot showing the weight on the x- and the height on the y-axis
Make the color aesthetic of the points unique by the sex of each individual.

Start Exercise

Adjusting point size

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___, 
                  color = ___, 
                  size  = ___),
    alpha  = ___
  )

For the gapminder_2007 dataset we can plot the GDP per capita gdpPercap vs. the life expectancy as follows:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp))

To adjust the point size based on the population (pop) of each country we can use:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp,
                 size = pop))

We see that the point sizes in the plot above do not clearly reflect the population differences between countries. If we compare the point size representing a population of 250 million people with the one representing 750 million, we can see that their sizes are not proportional. By default, the size scale maps the smallest value to a minimum point size rather than mapping zero to size zero, so point areas are not proportional to the values. To reflect the actual population differences through the point size we can use the scale_size_area() function instead. The scaling information can be added like any other ggplot object with the + operator:

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp,
                 size = pop)) + 
  scale_size_area(max_size = 10)

Note that we have also set max_size to 10, which results in bigger points.
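The effect of scale_size_area() can be made visible with two points only. The sketch below uses a hypothetical toy data frame with populations of 250 and 750 (made-up values for illustration) and compares the sizes ggplot2 computes with and without scale_size_area(). Since the area of a point grows with the square of its size, the squared size ratio is what should match the population ratio.

```r
library(ggplot2)

# Hypothetical toy data: two points with populations 250 and 750
toy <- data.frame(x = c(1, 2), y = c(1, 1), pop = c(250, 750))

p_default <- ggplot(toy) +
  geom_point(aes(x = x, y = y, size = pop))
p_area <- p_default +
  scale_size_area(max_size = 10)

# Sizes as computed by ggplot2 for each point
size_default <- layer_data(p_default)$size
size_area    <- layer_data(p_area)$size

# Ratio of point areas (size squared) between the two points
ratio_default <- (size_default[2] / size_default[1])^2
ratio_area    <- (size_area[2] / size_area[1])^2
ratio_default
ratio_area
```

With scale_size_area() the area ratio corresponds to the population ratio 750 / 250, while the default scale exaggerates the difference.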

Exercise: Create a Gapminder scatter plot using size

Create a scatter plot with ggplot2 which shows the relationship between GDP per capita and life expectancy for the year 2007 using the gapminder_2007 dataset.

Use the ggplot() function and specify the gapminder_2007 dataset as input
Add a geom_point() layer to the plot and create a scatter plot showing the GDP per capita gdpPercap on the x-axis and the life expectancy lifeExp on the y-axis
Use the size aesthetic to adjust the point size by the population pop
Use the scale_size_area() function so that the point sizes reflect actual population differences and set the max_size of each point to 10

Start Exercise

Setting global aesthetics: transparency

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___, 
                  color = ___, 
                  size  = ___),
    alpha  = ___
  )

Plotting many points with similar x- and y-coordinates in one graph can produce dense point clouds. Many points in these clouds are overplotted and the true number of observations in a certain area is no longer visible. As a solution, we can set the transparency of each point using the alpha parameter.

Since we do not want to set the point transparency individually for each point but globally for all points we do not set the alpha parameter as an aesthetic mapping (within aes()) but outside.

We set the opacity of each point to 50% by passing alpha = 0.5 as a constant parameter outside of aes():

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp, size = pop), 
             alpha = 0.5)

With the opacity of each point set to 0.5, we can now clearly see where points overlap.
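That a constant parameter really applies to every point can be checked on the computed layer data. The following sketch uses the built-in cars dataset instead of gapminder_2007 so that it runs without additional packages; the color "red" is an arbitrary choice for illustration.

```r
library(ggplot2)

# alpha and color are set outside of aes(), so they are constants
p_const <- ggplot(cars) +
  geom_point(aes(x = speed, y = dist),
             alpha = 0.5, color = "red")

# Every point carries the same alpha and colour value
const_data <- layer_data(p_const)
unique(const_data$alpha)
unique(const_data$colour)
```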

Quiz: Gapminder Plot

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, 
                 alpha = 0.5, 
                 color = "red"))

Which statements about the plot above are correct?

Constant plot parameters should be set outside of an aesthetic mapping aes().
The legend entries for alpha and color appear because they are set as aesthetic mappings instead of global parameters.
The parameter lifeExp should be set as a global parameter.
The parameter gdpPercap should be set as a global parameter.

Start Quiz

Exercise: Reproduce Gapminder scatter plot

Try to reproduce the following plot:

Use the ggplot() function and specify the gapminder_2007 dataset as input
Add a geom_point layer to the plot and create a scatter plot showing the GDP per capita gdpPercap on the x-axis and the life expectancy lifeExp on the y-axis
Use the color aesthetic to indicate each continent by a different color
Use the size aesthetic to adjust the point size by the population pop
Use scale_size_area() so that the point sizes reflect the actual population differences and set the max_size of each point to 15
Set the opacity/transparency of each point to 70% using the alpha parameter

Start Exercise

Specify additional aesthetics for points is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Create a scatter plot with ggplot

Wed, 22 Jul 2020 07:09:41 +0000

Make your first steps with the ggplot2 package to create a scatter plot. Use the grammar-of-graphics to map data set attributes to your plot and connect different layers using the + operator.

Define a dataset for the plot using the ggplot() function
Specify a geometric layer using the geom_point() function
Map attributes from the dataset to plotting properties using the mapping parameter
Connect different ggplot objects using the + operator

library(ggplot2)
ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

Introduction to scatter plots

Scatter plots use points to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x- and y-axis. Let’s see an example of a scatter plot to understand the relationship between the speed and the stopping distance of cars:

Each point represents a car. Each car starts to brake at the speed given on the y-axis and travels the distance shown on the x-axis until it comes to a full stop. Looking at all points in the plot, we can clearly see that faster cars need a longer distance to come to a complete stop.

Quiz: Scatter Plot Facts

Which of the following statements about scatter plots are correct?

Scatter plots visualize the relation of two numeric variables
In a scatter plot we only interpret single points and never the relationship between the variables in general
Scatter plots use points to visualize observations
Scatter plots visualize the relation of categorical and numeric variables

Start Quiz

Specifying a dataset

library(ggplot2)
ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

To create plots with ggplot2 you first need to load the package using library(ggplot2).

After the package has been loaded specify the dataset to be used as an argument of the ggplot() function. For example, to specify a plot using the cars dataset you can use:

library(ggplot2)
ggplot(cars)

Note that this command does not plot anything yet except a grey canvas. It just defines the dataset for the plot and creates an empty base on top of which we can add additional layers.

Exercise: Specify the gapminder dataset

To start with a ggplot visualizing the gapminder dataset we need to:

Load the ggplot2 package
Load the gapminder package
Define the gapminder dataset to be used in the plot with the ggplot() function

Start Exercise

Specifying a geometric layer

library(ggplot2)
ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

We can use ggplot’s geometric layers (or geoms) to define how we want to visualize our dataset. Geoms use geometric objects to visualize the variables of a dataset. The objects can have multiple forms like points, lines and bars and are specified through the corresponding functions geom_point(), geom_line() and geom_col():
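The three geoms can be compared side by side on the same data. This is a sketch with a hypothetical toy data frame (made-up values for illustration); only the geometric layer changes between the three plots.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Same data and mapping, three different geometric objects
p_points <- ggplot(toy) + geom_point(aes(x = x, y = y))  # points
p_lines  <- ggplot(toy) + geom_line(aes(x = x, y = y))   # connected line
p_bars   <- ggplot(toy) + geom_col(aes(x = x, y = y))    # bars

p_points
p_lines
p_bars
```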

Quiz: Scatter Plot Layers

Which geometric layer should be used to create scatter plots in ggplot2?

point_geom()
geom()
geom_scatter()
geom_point()

Start Quiz

Creating aesthetic mappings

library(ggplot2)
ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

ggplot2 uses the concept of aesthetics, which map dataset attributes to the visual features of the plot. Each geometric layer requires a different set of aesthetic mappings, e.g. the geom_point() function uses the aesthetics x and y to determine the x- and y-axis coordinates of the points to plot. The aesthetics are mapped within the aes() function to construct the final mappings.

To specify a layer of points which plots the variable speed on the x-axis and distance dist on the y-axis we can write:

geom_point(
  mapping = aes(x=speed, y=dist)
)

The expression above constructs a geometric layer. However, this layer is currently not linked to a dataset and does not produce a plot. To link the layer with a ggplot object specifying the cars dataset we need to connect the ggplot(cars) object with the geom_point() layer using the + operator:

ggplot(cars) + 
  geom_point(
    mapping = aes(x=speed, y=dist)
  )

Through this link, ggplot() knows that the mapped speed and dist variables are taken from the cars dataset. geom_point() instructs ggplot to plot the mapped variables as points.

The required steps to create a scatter plot with ggplot can be summarized as follows:

Load the package ggplot2 using library(ggplot2).
Specify the dataset to be plotted using ggplot().
Use the + operator to add layers to the plot.
Add a geometric layer to define the shapes to be plotted. In case of scatter plots, use geom_point().
Map variables from the dataset to plotting properties through the mapping parameter in the geometric layer.

Exercise: Visualize the “cars” dataset

Create a scatter plot using ggplot() and visualize the cars dataset with the speed of the car on the x-axis and the car’s stopping distance dist on the y-axis.

The ggplot2 package is already loaded. Follow these steps to create the plot:

Specify the dataset through the ggplot() function
Specify a geometric point layer with the geom_point() function
Map the speed to the x-axis and the dist to the y-axis with aes()

Start Exercise

Exercise: Visualize the Gapminder dataset

Create a scatter plot using ggplot() and visualize the gapminder_2007 dataset with the GDP per capita gdpPercap on the x-axis and the life expectancy lifeExp of each country on the y-axis.

The ggplot2 package is already loaded. Follow these steps to create the plot:

Specify the gapminder_2007 dataset through the ggplot() function
Specify a geometric point layer with geom_point().
Map the gdpPercap to the x-axis and the lifeExp to the y-axis with aes()

Start Exercise

Create a scatter plot with ggplot is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Why data visualization is important

Wed, 15 Jul 2020 13:24:16 +0000

Data visualization is not only important to communicate results but also a powerful technique for exploratory data analysis. Each plot type like scatter plots, line graphs, bar charts and histograms has its own purpose and can be leveraged in a powerful way using the ggplot2 package.

Understand the different roles of data visualization
Understand the different plot types available
Get an overview of the ggplot2 package.

Introduction to data visualization

A picture is worth a thousand words.

Data visualization is the quickest and most powerful technique to understand new and existing information. During an initial exploration phase data scientists try to reveal the underlying features of a dataset like different distributions, correlations or other visible patterns. This process is also called exploratory data analysis (EDA) and marks the starting point of each data science project.

The graphs produced during the EDA show the data scientist the directions of the journey ahead. Revealed patterns can inspire hypotheses about the underlying processes, features of the dataset to be extracted or modelling techniques to be tested. Last but not least, visualizations uncover outliers and data errors which the data scientist needs to take care of.

The biggest role for data visualization is the communication of data science findings to colleagues and customers through presentations, reports or dashboards. Effort used for EDA and visualizations is time well spent since results can be directly used to communicate findings.

Quiz: Visualization Phase

For which phases is data visualization important in the data science workflow?

Explorative Data Analysis (EDA).
Detection of outliers.
Communication of Results.

Start Quiz

Available Plot Types

There are many plot types available which help to understand different features and relationships in the dataset.

During the exploratory data analysis phase we typically want to detect the most obvious patterns by looking at each variable in isolation or by detecting relationships of variables against others. The used plot type is also determined by the data type of the input variables like numeric or categorical.

Scatter Plots

Scatter plots are used to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x and y-axis.

Line Graphs

Line graphs are used to visualize the trajectory of one numeric variable against another, with the data points connected through lines. They are well suited to values that change continuously, like temperature over time.

Bar Charts and Histograms

Bar charts visualize numeric values grouped by categories. Each category is represented by one bar with a height defined by each numeric value. Histograms are specific bar charts to summarize the number of occurrences of numeric values over a set of value ranges (or bins). They are typically used to determine the distribution of numeric values.
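A histogram can be sketched with geom_histogram(), which bins a numeric variable and counts the observations per bin. The example below uses the built-in faithful dataset; the choice of 20 bins is arbitrary and only for illustration.

```r
library(ggplot2)

# Distribution of waiting times between eruptions (built-in dataset)
p_hist <- ggplot(faithful) +
  geom_histogram(aes(x = waiting), bins = 20)
p_hist

# Each row of the layer data is one bin with its count
hist_data <- layer_data(p_hist)
sum(hist_data$count)
```

The counts over all bins add up to the number of rows in the dataset, since each observation falls into exactly one bin.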

Others

Other frequently used plot types in data science include:

Box plots: Show distributional information of numeric values grouped in categories as boxes. Great to quickly compare multiple distributions.
Violin plots: Same as box plots but show distributions as violins.
Heat Maps: Show interactions of variables - typically correlations - as a rastered image highlighting areas of high interaction.
Network Graphs: Show connections between nodes
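Box plots and violin plots from the list above can be sketched with geom_boxplot() and geom_violin(). The example uses the built-in iris dataset, chosen here only because it ships with R.

```r
library(ggplot2)

# One box per species, summarizing the distribution of sepal length
p_box <- ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length))

# Same comparison, with the distribution shape drawn as violins
p_violin <- ggplot(iris) +
  geom_violin(aes(x = Species, y = Sepal.Length))

p_box
p_violin
```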

Quiz: Distribution Comparison Plots

Which plot types are typically used to compare distributions of numeric variables?

Box plots
Network graphs
Violin plots
Line Graphs

Start Quiz

Introducing: ggplot2

Due to the importance of visualization for data science and statistics, R offers a rich set of tools and packages. The core R language already provides many plotting functions and plot types. These plotting functions require users to specify how to plot each element on the canvas step by step. By contrast, the ggplot2 package allows the specification of plots through a set of plotting layers. The package itself then figures out the steps required to produce the graph.
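The contrast between the two approaches can be sketched with the built-in cars dataset. The base R version draws the plot and then adds further elements with separate calls, while the ggplot2 version describes the dataset, the layer and the mapping in one expression. The choice of a background grid as the extra element is just an example.

```r
library(ggplot2)

# Base R: draw the points first, then add further elements step by step
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
grid()  # add a background grid to the existing plot

# ggplot2: describe what to plot, the package works out how to draw it
p_cars <- ggplot(cars) +
  geom_point(aes(x = speed, y = dist))
p_cars
```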

Through the pre-defined set of geometric layers, facets and themes ggplot2 enables users to create beautiful graphs in a very short time. ggplot2 is also the most widely adopted plotting library in the R community.

Quiz: ggplot2 Facts

Which statements about data visualization and ggplot2 are correct?

ggplot2 is the only way to create plots in R.
ggplot2 facilitates the creation of good looking graphs quickly.
ggplot2 requires users to specify the plotting commands in a step-by-step fashion.
ggplot2 enables users to specify plots in a declarative way.

Start Quiz

Why data visualization is important is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Create a data transformation pipeline

Mon, 06 Jul 2020 21:59:52 +0000

All data transformation functions in dplyr can be connected through the pipe %>% operator to create powerful and yet expressive data transformation pipelines.

Use the pipe operator %>% to combine multiple dplyr functions into one pipeline

___ %>%
  filter(___) %>%
  select(___) %>%
  arrange(___)

Using the %>% operator

The pipe operator %>% is a special part of the tidyverse. It is used to combine multiple functions and run them one after the other, so that the input of each function is the output of the previous function. Imagine we have the pres_results data frame and want to create a smaller, clearer data frame for answering the question: In which states was the Democratic Party the most popular choice in the 2016 US presidential election? To accomplish this task we would need to take the following steps:

filter() the data frame for the rows, where the year variable equals 2016
select() the two variables state and dem, since we are not interested in the rest of the columns.
arrange() the filtered and selected data frame based on the dem column in a descending way.

The steps and functions described above should be run one after the other, where the input of each function is the output of the previous step. Applying the things you learned so far, you could accomplish this task by taking the following steps:

result <- filter(pres_results, year==2016)
result <- select(result, state, dem)
result <- arrange(result, desc(dem))
result

# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows

The first function takes the pres_results data frame, filters it according to the task description and assigns it to the variable result. Then, each subsequent function takes the result variable as input and overwrites it with its own output.

The %>% operator provides a practical way to combine the steps above into what looks like a single step. It takes a data frame as the initial input, applies a sequence of functions, and passes the output of each function on as the input of the next one. The same task can be accomplished with the pipe operator %>% like this (here we additionally keep the rep column, so that the Republican share is shown next to the Democratic one):

pres_results %>%
  filter(year==2016) %>%
  select(state, dem, rep) %>%
  arrange(desc(dem))

# A tibble: 51 x 3
  state   dem    rep
     
1 DC    0.905 0.0407
2 CA    0.617 0.316 
3 HI    0.610 0.294 
# … with 48 more rows

We can interpret the code in the following way:

We define the original data set as a starting point.
Using the %>% operator right after the data frame signals that a function follows which takes the previously defined data frame as input.
We use each function as usual, but skip the first parameter. The data frame input is automatically provided by the output of the previous step.
As long as we add the %>% operator after a step, R will expect an additional step.
In our example the pipeline closes with an arrange() function. It gets the filtered and selected version of the pres_results data frame as input and sorts it by the dem column in descending order. Finally, it returns the output.

One difference between the two approaches is that the %>% operator does not store the intermediate or final results anywhere. To save the resulting data frame we need to assign the output to a variable:

result <- pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

result

# A tibble: 51 x 2
  state   dem
   
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
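
To check that the step-by-step version and the pipeline really give the same result, the comparison can be sketched on a small hand-made tibble. The tibble pres_small below is a hypothetical stand-in for pres_results; it contains only a handful of dem values quoted in the outputs of these chapters.

```r
library(dplyr)

# Hypothetical mini version of pres_results (not the real data set)
pres_small <- tibble(
  year  = c(2012, 2012, 2016, 2016, 2016),
  state = c("CA", "DC", "CA", "HI", "DC"),
  dem   = c(0.602, 0.909, 0.617, 0.610, 0.905)
)

# Step by step, overwriting an intermediate variable
step_result <- filter(pres_small, year == 2016)
step_result <- select(step_result, state, dem)
step_result <- arrange(step_result, desc(dem))

# The same three steps written as one pipeline
pipe_result <- pres_small %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

# Compare both results and show the order of the states
all.equal(step_result, pipe_result)
pipe_result$state
```

Note that pres_small itself is left untouched by the pipeline; only the assignment to pipe_result stores the output.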

Exercise: Austrian Life Expectancy

Use the %>% operator on the gapminder data set and create a simple data frame to answer the following question: How did the life expectancy in Austria change over the last decades? Required packages are already loaded.

Define the gapminder data frame as the base data frame
Filter only the rows where the country column is equal to Austria by piping gapminder to the filter() function.
Select only the columns: year and lifeExp from the filtered result.
Arrange the selected result by the year column.

Exercise: European GDP Per Capita

Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which European country had the highest GDP per capita in 2007? Required packages are already loaded.

Define the gapminder tibble as the input
Filter only the rows where the year column is equal to 2007
Use a second layer of filter and keep only the rows where the continent column is equal to Europe
Select only the columns: country and gdpPercap
Arrange the results based on the gdpPercap column in a descending way

Exercise: Americas Population

Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which country on the continent Americas had the largest population in 2007?

Define the gapminder tibble as the input
Filter only the rows where the year column is equal to 2007
Use a second layer of filter and keep only the rows where the continent column is equal to Americas
Select only the columns: country and pop
Arrange the results based on the pop column in a descending way

Quiz: Malformed Code

gapminder %>%
  filter(year == 2007, continent == "Americas") %>%
  select(gapminder, country, pop) %>%
  arrange(desc(pop)) %>%

Take a look at the code above. What mistakes does it contain?

The gapminder tibble should not be defined in the select() function.
There should be no %>% applied after the last line.
There will be no output, because you cannot use these functions in this order.
The desc() function should be applied on the whole arrange() function and not on a single column.

Create a data transformation pipeline is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Sort data frames by columns

Thu, 02 Jul 2020 10:48:36 +0000

To find areas of interest in a data frame, its rows often need to be ordered by specific columns. The dplyr arrange() function supports ordering data frames by multiple columns in ascending and descending order.

Use the arrange() function to sort data frames.
Sort data frames by multiple columns using arrange().

arrange(my_data_frame, column)
arrange(my_data_frame, column_one, column_two, ...)

The arrange() function with a single column

The arrange() function orders the rows of a data frame. It takes a data frame or a tibble as the first parameter and the names of the columns by which the rows should be ordered as additional parameters. Let’s assume we want to answer the question: Which states had the highest share of Republican voters in the 2016 US presidential election? To answer this question, the following example uses the pres_results_2016 data frame, which contains only the results of the 2016 US presidential election. We arrange() the data frame by the rep column (share of Republican votes):

arrange(pres_results_2016, rep)

# A tibble: 51 x 6
   year state total_votes   dem    rep  other
               
1  2016 DC         312575 0.905 0.0407 0.0335
2  2016 HI         437664 0.610 0.294  0.0958
3  2016 VT         320467 0.557 0.298  0.0737
# … with 48 more rows

As you can see in the output, the data frame is sorted in an ascending order based on the rep column. However, we would prefer to have the results in a descending order, so that we can instantly see the state with the highest rep percentage. To sort a column in a descending order, all we need to do is apply the desc() function on the given column inside the arrange() function:

arrange(pres_results_2016, desc(rep))

# A tibble: 51 x 6
   year state total_votes   dem   rep  other
              
1  2016 WV         713051 0.265 0.686 0.0489
2  2016 WY         258788 0.216 0.674 0.0830
3  2016 OK        1452992 0.289 0.653 0.0575
# … with 48 more rows

Arranging is not only possible on numeric values, but on character values as well. In that case, dplyr sorts the rows in alphabetic order. We can arrange character columns just like numeric ones:

arrange(pres_results_2016, state)

# A tibble: 51 x 6
   year state total_votes   dem   rep  other
              
1  2016 AK         318608 0.366 0.513 0.0928
2  2016 AL        2123372 0.344 0.621 0.0254
3  2016 AR        1130635 0.337 0.606 0.0577
# … with 48 more rows
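
The three sorting variants above can be tried on a small example. The tibble rep_small is a hypothetical subset made up for illustration; its rep values are the ones shown in the outputs above.

```r
library(dplyr)

# Hypothetical subset of pres_results_2016 (not the real data set)
rep_small <- tibble(
  state = c("WV", "DC", "OK", "HI", "WY", "VT"),
  rep   = c(0.686, 0.0407, 0.653, 0.294, 0.674, 0.298)
)

lowest_first  <- arrange(rep_small, rep)        # ascending order is the default
highest_first <- arrange(rep_small, desc(rep))  # desc() reverses the order
alphabetical  <- arrange(rep_small, state)      # character columns sort alphabetically

lowest_first$state
highest_first$state
alphabetical$state
```

Since ascending order is the default, no helper function is needed for it; only the descending case requires desc().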

Exercise: Use arrange() based on a single column

The gapminder_2007 dataset contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect which country had the lowest life expectancy lifeExp in 2007! The dplyr package is already loaded.

Apply the arrange() function on the gapminder_2007 tibble
Order the tibble based on the lifeExp column

Exercise: Use arrange() in combination with desc()

The gapminder_2007 dataset contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect which countries had the largest population in 2007! The dplyr package is already loaded.

Apply the arrange() function on the gapminder_2007 tibble.
Sort the tibble in a descending order based on the pop column.

The arrange() function with multiple columns

We can use the arrange() function on multiple columns as well. In this case the order of the columns in the function parameters sets a hierarchy of ordering. The function starts by ordering the rows based on the first column defined in the parameters. In case there are several rows with the same value, the function decides the order based on the second column defined in the parameters. If there are still multiple rows with the same values, the function decides based on the third column defined in the parameters (if defined) and so on.

In the following example we use the pres_results_subset data frame, containing election results only for the states "TX" (Texas), "UT" (Utah) and "FL" (Florida). First we sort the data frame in ascending order based on the year column. Then we add a second level and order the data frame based on the dem column:

arrange(pres_results_subset, year, dem)

# A tibble: 33 x 6
   year state total_votes   dem   rep   other
               
1  1976 UT         541218 0.336 0.624 0.0392 
2  1976 TX        4071884 0.511 0.480 0.00817
3  1976 FL        3150631 0.519 0.466 0.0143 
# … with 30 more rows

As you can see in the output, the data frame is overall ordered based on the year column. However, when the value of year is the same, the order of the rows is decided by the dem column.
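
Ascending and descending orderings can also be mixed within one call. The following sketch uses a hypothetical four-row tibble (values for CA and DC as quoted in the filter outputs of this course) and sorts by year first, then by dem with the highest value on top within each year.

```r
library(dplyr)

# Hypothetical example data (not the real pres_results_subset)
elections <- tibble(
  year  = c(2016, 2012, 2016, 2012),
  state = c("CA", "CA", "DC", "DC"),
  dem   = c(0.617, 0.602, 0.905, 0.909)
)

# year in ascending order; ties within a year are broken by dem, highest first
sorted <- arrange(elections, year, desc(dem))
sorted
```

Swapping the two sorting columns would change the result: the rows would then be ordered by dem overall, and year would only matter for rows with identical dem values.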

Exercise: Use arrange() based on multiple columns

The gapminder_2007 tibble contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect for each continent, which countries had the highest life expectancy in 2007! The dplyr package is already loaded.

Apply the arrange() function on the gapminder_2007 tibble.
Order the tibble based on the continent column!
In case there are rows with the same continent, sort the tibble in a descending order based on the lifeExp column!

Quiz: arrange() Function

Which of the following statements are true about the arrange() function?

The arrange() function orders the rows of a data frame.
To arrange() the values of a column in ascending order, we need to use the asc() function.
To arrange() the values of a column in descending order, we need to use the desc() function.
You can only arrange() a data frame based on one column.

Sort data frames by columns is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Filter data frame rows

Fri, 26 Jun 2020 18:56:10 +0000

We often want to operate only on a specific subset of rows of a data frame. The dplyr filter() function provides a flexible way to extract the rows of interest based on multiple conditions.

Use the filter() function to extract the rows of a data frame that fulfill a specified condition
Filter a data frame by multiple conditions

filter(my_data_frame, condition)
filter(my_data_frame, condition_one, condition_two, ...)

The filter() function

The filter() function takes a data frame and one or more filtering expressions as input parameters. It processes the data frame and keeps only the rows that fulfill the defined filtering expressions. These expressions can be seen as rules for the evaluation and keeping of rows. In the majority of the cases, they are based on relational operators. As an example, we could filter the pres_results data frame and keep only the rows, where the state variable is equal to "CA" (California):

filter(pres_results, state == "CA")

# A tibble: 11 x 6
    year state total_votes   dem   rep  other
               
 1  1976 CA        7803770 0.480 0.497 0.0230
 2  1980 CA        8582938 0.359 0.527 0.114 
 3  1984 CA        9505041 0.413 0.575 0.0122
 4  1988 CA        9887065 0.476 0.511 0.0131
 5  1992 CA       11131721 0.460 0.326 0.213 
 6  1996 CA       10019469 0.511 0.382 0.107 
 7  2000 CA       10965822 0.534 0.417 0.0490
 8  2004 CA       12421353 0.543 0.444 0.0117
 9  2008 CA       13561900 0.610 0.370 0.0188
10  2012 CA       13038547 0.602 0.371 0.0246
11  2016 CA       14181595 0.617 0.316 0.0581

In the output, we can compare the election results in California for different years.

As another example, we could filter the pres_results data frame and keep only those rows, where the dem variable (percentage of votes for the Democratic Party) is greater than 0.85:

filter(pres_results, dem > 0.85)

# A tibble: 7 x 6
   year state total_votes   dem    rep   other
                
1  1984 DC         211288 0.854 0.137  0.00886
2  1996 DC         185726 0.852 0.0934 0.0513 
3  2000 DC         201894 0.852 0.0895 0.0563 
4  2004 DC         227586 0.892 0.0934 0.0125 
5  2008 DC         265853 0.925 0.0653 0.00582
6  2012 DC         293764 0.909 0.0728 0.0155 
7  2016 DC         312575 0.905 0.0407 0.0335

In the output we can see for each election year the states where the Democratic Party got over 85% of the votes. Based on the results, we could say that the Democratic Party has a solid voter base in the District of Columbia (known as Washington, D.C.).
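
Besides == and >, other relational operators work in the same way. The sketch below uses a hypothetical six-row tibble, built from the CA and DC values shown above, to illustrate a few of them.

```r
library(dplyr)

# Hypothetical mini version of pres_results (not the real data set)
pres_small <- tibble(
  year  = c(2008, 2012, 2016, 2008, 2012, 2016),
  state = c("CA", "CA", "CA", "DC", "DC", "DC"),
  dem   = c(0.610, 0.602, 0.617, 0.925, 0.909, 0.905)
)

only_dc   <- filter(pres_small, state == "DC")             # equal to
not_dc    <- filter(pres_small, state != "DC")             # not equal to
high_dem  <- filter(pres_small, dem > 0.9)                 # greater than
two_years <- filter(pres_small, year %in% c(2008, 2016))   # matches any value in a vector

nrow(only_dc)
nrow(not_dc)
nrow(high_dem)
nrow(two_years)
```

Each expression is evaluated row by row and yields TRUE or FALSE; filter() keeps the rows for which the result is TRUE.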

Exercise: Use filter() with a single expression

The gapminder dataset contains economic and demographic data about various countries since 1952.

Inspect the data for a single year by using the filter() function.

Apply the filter() function on the gapminder dataset
Keep only the rows where the year is equal to 2007

Note that the dplyr and gapminder packages are already loaded.

Quiz: filter() Function

Which of the following statements about the filter() function are correct?

Relational operators, such as == or >, are frequently part of the filtering expressions.
The filter() function comes in the dplyr package.
Only numeric variables can be filtered.
The filter() function works only on data frames, not on tibbles.

Multiple filter expressions

The filter() function can take multiple filtering rules as input as well. These can be seen as a combination of rules with the & operator. In order for a row to be included in the output, all filtering rules must be fulfilled by it. In the following example, we filter the pres_results data frame for all rows where the state variable is equal to "CA" and the year variable is equal to 2016:

filter(pres_results, state == "CA", year==2016)

# A tibble: 1 x 6
   year state total_votes   dem   rep  other
              
1  2016 CA       14181595 0.617 0.316 0.0581

We get a single row as output, containing the 2016 US presidential election results for the state of California.
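
That comma-separated rules behave like a combination with the & operator can be checked directly. The sketch below again relies on a hypothetical six-row tibble instead of the full pres_results data; the | operator is shown as a contrast, since it keeps rows where at least one rule holds.

```r
library(dplyr)

# Hypothetical mini version of pres_results (not the real data set)
pres_small <- tibble(
  year  = c(2008, 2012, 2016, 2008, 2012, 2016),
  state = c("CA", "CA", "CA", "DC", "DC", "DC"),
  dem   = c(0.610, 0.602, 0.617, 0.925, 0.909, 0.905)
)

# Both calls keep only rows that fulfill all rules
with_comma <- filter(pres_small, state == "CA", year == 2016)
with_and   <- filter(pres_small, state == "CA" & year == 2016)
all.equal(with_comma, with_and)

# With | a row is kept as soon as one of the rules is fulfilled
either <- filter(pres_small, state == "CA" | year == 2016)
nrow(either)
```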

Exercise: Use filter() with multiple rules

The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the tibble and inspect which countries had a life expectancy over 80 years in the year 2007! The required packages are already loaded.

Use the filter() function on the gapminder tibble.
Filter all rows where the year variable is equal to 2007 and the life expectancy lifeExp is greater than 80!

Exercise

The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the gapminder tibble and inspect which countries had a population of over one billion (1,000,000,000) in the year 2007! The required packages are already loaded.

Use the filter() function on the gapminder tibble.
Filter all rows where the year variable is equal to 2007 and the population pop is greater than 1000000000!

Filter data frame rows is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Select columns from a data frame

Fri, 19 Jun 2020 17:02:06 +0000

To select only a specific set of interesting data frame columns dplyr offers the select() function to extract columns by names, indices and ranges. You can even rename extracted columns with select().

Learn to use the select() function
Select columns from a data frame by name or index
Rename columns from a data frame

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

Selecting by name

In this chapter we will have a look at the pres_results dataset from the politicaldata package. It contains data about US presidential elections since 1976, converted to a tibble for nicer printing.

# A tibble: 561 x 6
   year state total_votes   dem   rep   other
               
1  1976 AK         123574 0.357 0.579 0.0549 
2  1976 AL        1182850 0.557 0.426 0.0163 
3  1976 AR         767535 0.650 0.349 0.00134
# … with 558 more rows

For this example, we will have a look at the number of total votes in different states at different elections. Since we are only interested in the number of people who voted, we would like to create a custom version of the pres_results data frame that only contains the columns year, state and total_votes. For this kind of column selection, we can use the select() function from the dplyr package.

The select() function takes a data frame as an input parameter and lets us decide which of the columns we want to keep from it. The output of the function is a data frame with all rows, but containing only the columns we explicitly select.

We can reduce our dataset to only year, state and total_votes in the following way:

select(pres_results, year, state, total_votes)

# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

As the first parameter we passed the pres_results data frame. As the remaining parameters we passed the columns we want to keep.

Apart from keeping the columns we want, the select() function also keeps them in the same order as we specified in the function parameters.

If we change the order of the parameters when we call the function, the columns of the output change accordingly:

select(pres_results, total_votes, year, state)

# A tibble: 561 x 3
  total_votes  year state
          
1      123574  1976 AK   
2     1182850  1976 AL   
3      767535  1976 AR   
# … with 558 more rows

Exercise: Life expectancy in Austria

The gapminder_austria dataset contains information about the economic and demographic change in Austria over the last decades. To inspect how the life expectancy in Austria changed over time, create a subset of the tibble that contains only the necessary columns for this task:

Use the dplyr select() function and define gapminder_austria as the input tibble.
Keep only the columns year and lifeExp in the output dataset.

Note that the dplyr package is already loaded.

Renaming columns

In addition to defining the columns we want to keep, we can also rename them. To do this, we set the new column name inside the select() function using the syntax

new_column_name = current_column

In the following example, we select the columns year, state and total_votes but rename the year column to Election in the output:

select(pres_results, Election = year, state, total_votes)

# A tibble: 561 x 3
  Election state total_votes
             
1     1976 AK         123574
2     1976 AL        1182850
3     1976 AR         767535
# … with 558 more rows
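
Renaming inside select() always drops the columns that are not listed. As a contrast, the sketch below also uses the dplyr rename() function, which is not covered in this chapter and which keeps all columns. The tibble pres_small is a hypothetical stand-in containing the first three rows printed above.

```r
library(dplyr)

# Hypothetical mini version of pres_results (first three rows only)
pres_small <- tibble(
  year        = c(1976, 1976, 1976),
  state       = c("AK", "AL", "AR"),
  total_votes = c(123574, 1182850, 767535),
  dem         = c(0.357, 0.557, 0.650)
)

# select() keeps only the listed columns and renames year on the way
selected <- select(pres_small, Election = year, state)
names(selected)   # "Election" "state"

# rename() changes the name but keeps every column
renamed <- rename(pres_small, Election = year)
names(renamed)    # "Election" "state" "total_votes" "dem"
```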

Exercise: Rename columns

The gapminder_india dataset contains information about the economic and demographic change in India over the last decades. Inspect how the population in India changed over time:

Use the select() function and define gapminder_india as the input tibble.
Keep only the columns year and pop.
Rename the pop column to population in the output tibble.

Note that the dplyr package is already loaded.

Selecting by name range

When we use the select() function and define the columns we want to keep, dplyr does not actually use the names of the columns but their indices in the data frame. This means that when we name the first three columns of the pres_results data frame, year, state and total_votes, dplyr converts these names to the index values 1, 2 and 3. We can therefore also apply the : operator to column names and define ranges of columns that we want to keep:

select(pres_results, year:total_votes)

# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

The expression year:total_votes is translated to 1:3, which simply creates a vector of numerical values from 1 to 3. The select() function then takes the pres_results data frame and outputs a subset of it, keeping only the first three columns.
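
The equivalence of the name range and the index range can be verified on a small example. The tibble pres_small is a hypothetical stand-in that holds the first three rows of pres_results as printed at the beginning of this chapter.

```r
library(dplyr)

# Hypothetical mini version of pres_results (first three rows only)
pres_small <- tibble(
  year        = c(1976, 1976, 1976),
  state       = c("AK", "AL", "AR"),
  total_votes = c(123574, 1182850, 767535),
  dem         = c(0.357, 0.557, 0.650),
  rep         = c(0.579, 0.426, 0.349),
  other       = c(0.0549, 0.0163, 0.00134)
)

by_name  <- select(pres_small, year:total_votes)
by_index <- select(pres_small, 1:3)
all.equal(by_name, by_index)

# A range does not have to start at the first column
names(select(pres_small, dem:other))   # "dem" "rep" "other"
```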

Exercise: Select a name range

The gapminder_europe_2007 dataset contains economic and demographic information about European countries for the year 2007:

# A tibble: 30 x 6
  country continent  year lifeExp      pop gdpPercap
                      
1 Albania Europe     2007    76.4  3600523     5937.
2 Austria Europe     2007    79.8  8199783    36126.
3 Belgium Europe     2007    79.4 10392226    33693.
# … with 27 more rows

Create a subset of the tibble and compare the life expectancy in different European countries for the year 2007:

Apply the select() function on the gapminder_europe_2007 tibble.
Use the : operator and select the columns from country to lifeExp.

Note that the dplyr package is already loaded.

Selecting by indices

The select() function can be used with column indices as well. Instead of names, we specify the columns we want to select by their positions. Unlike many other programming languages, indexing in R starts at one instead of zero. To select the first, fourth and fifth column from the pres_results dataset we can write

select(pres_results, 1,4,5)

# A tibble: 561 x 3
   year   dem   rep
    
1  1976 0.357 0.579
2  1976 0.557 0.426
3  1976 0.650 0.349
# … with 558 more rows

Similarly to defining ranges of columns using their names, we can define ranges (or vectors) of index values instead:

select(pres_results, 1:3)

# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows
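
If it is unclear which index belongs to which column, the positions can be looked up from the column names. The sketch below does this with the base R functions names() and match() on a hypothetical stand-in for pres_results that has the same column order.

```r
library(dplyr)

# Hypothetical mini version of pres_results (first three rows only)
pres_small <- tibble(
  year        = c(1976, 1976, 1976),
  state       = c("AK", "AL", "AR"),
  total_votes = c(123574, 1182850, 767535),
  dem         = c(0.357, 0.557, 0.650),
  rep         = c(0.579, 0.426, 0.349),
  other       = c(0.0549, 0.0163, 0.00134)
)

# Look up the positions of some columns by name
positions <- match(c("year", "dem", "rep"), names(pres_small))
positions   # 1 4 5

# Selecting by these indices gives the same columns as selecting by name
by_index <- select(pres_small, 1, 4, 5)
by_name  <- select(pres_small, year, dem, rep)
all.equal(by_index, by_name)
```

Selecting by name is usually more robust, because indices silently point to different columns as soon as the column order of the data frame changes.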

Exercise: Select by indices

The gapminder_europe_2007 dataset contains economic and demographic information about European countries for the year 2007.

# A tibble: 30 x 6
  country continent  year lifeExp      pop gdpPercap
                      
1 Albania Europe     2007    76.4  3600523     5937.
2 Austria Europe     2007    79.8  8199783    36126.
3 Belgium Europe     2007    79.4 10392226    33693.
# … with 27 more rows

Create a subset of the dataset and compare the GDP per capita of the European countries for the year 2007:

Apply the select() function on the gapminder_europe_2007 tibble.
Keep the columns country and gdpPercap, but use only the index of the columns (1 and 6) for this step.

Note that the dplyr package is already loaded.

Select columns from a data frame is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Introduction to dplyr

Tue, 16 Jun 2020 06:12:20 +0000

dplyr facilitates the data transformation process by providing a rich framework to manipulate data frames. dplyr functions can be chained into powerful transformation pipelines to select, filter, sort, join and aggregate data.

Learn what dplyr does
Get an overview of Select, Filter and Sort
Learn what Joins, Aggregations and Pipelines are

What is dplyr

There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.

Anthony Goldbloom, Founder and CEO of Kaggle

Having clean data is essential in any data science project, because the results can only be as good as the data they are based on. Cleaning data is also the part which usually consumes most of the time and causes the biggest pains for data scientists. R already offers a broad set of tools and functions to manipulate data frames. However, due to its long history, the available base R tool set is fragmented and hard to use for new users.

The dplyr package facilitates the data transformation process through a consistent collection of functions. These functions support different transformations on data frames, including

filter rows
select columns
sort data
aggregate data

Multiple data frames can also be joined together by common attribute values.

The consistency of dplyr functions improves usability and enables users to connect transformations to form data pipelines. These pipelines can also be seen as a high-level query language, much like SQL for database queries. Additionally, it is even possible to translate data pipelines to other back-ends, including databases.

Quiz: dplyr Facts

Which of the below statements are correct?

dplyr provides a consistent set of functions for data visualization
dplyr functions can be connected to data pipelines
dplyr queries can be translated to database queries
dplyr supports data transformations like aggregations and joins
dplyr is built for vector transformations

Function Framework

Every data transformation function in dplyr accepts a data frame as its first input parameter and returns the transformed data frame back as an output. A blueprint for a typical dplyr function looks like this:

transformed <- dplyr_function(my_data_frame, 
                              param_one, 
                              param_two, 
                              ...)

The dplyr_function can be customized further through additional arguments (param_one, param_two) placed after the first data frame parameter (my_data_frame).

The real power of dplyr comes with the pipe operator %>%, which allows users to chain dplyr functions into data pipelines. The pipe injects the data frame resulting from the previous calculation as the first argument of the next one. A data transformation consisting of three functions looks like

dplyr_function_three(
  dplyr_function_two(
    dplyr_function_one(my_data_frame)))

but can be written with the pipe as

my_data_frame %>%
  dplyr_function_one() %>%
  dplyr_function_two() %>%
  dplyr_function_three()

Because the functions appear in the order in which they are applied, pipelines are easier to read than nested function calls, which have to be read from the inside out.
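
Any function that takes a data frame as its first argument and returns a data frame fits into this framework, including functions we write ourselves. The following sketch compares the nested and the piped notation; the helper keep_year() and the tibble pres_small are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical helper following the dplyr convention:
# data frame in as the first argument, data frame out
keep_year <- function(data, wanted_year) {
  filter(data, year == wanted_year)
}

# Hypothetical example data
pres_small <- tibble(
  year  = c(2012, 2016, 2016),
  state = c("CA", "CA", "DC"),
  dem   = c(0.602, 0.617, 0.905)
)

# Nested calls are read from the inside out
nested <- arrange(select(keep_year(pres_small, 2016), state, dem), desc(dem))

# The pipeline lists the same steps in the order they are applied
piped <- pres_small %>%
  keep_year(2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

all.equal(nested, piped)
```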

Quiz: Valid Functions

dplyr_function specifies the transformation function, param_one the parameter for the dplyr function and input_data_frame the data frame to be transformed. Which of the code lines below are valid according to the dplyr function framework?

dplyr_function(param_one, input_data_frame)
dplyr_function(input_data_frame, param_one)
input_data_frame(dplyr_function, param_one)
param_one(dplyr_function, input_data_frame)

Introduction to dplyr is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Select first or last rows of a data frame

Fri, 12 Jun 2020 07:18:46 +0000

We often do not need to look at the entire contents of a data frame in the console. Often it is sufficient to look at the top or the bottom rows, which can be retrieved through the head() and tail() functions.

Select the top of a data frame
Select the bottom of a data frame
Specify the number of lines to select through the parameter n

head(___, n = ___)
tail(___, n = ___)

Selecting the top of a data frame

Data frames can span a large number of rows and columns. Based on the printed output in the console it can be hard to get an initial impression of the data inside the data frame. This issue is not so much of a problem for tibbles which have a nicer console output. Additionally, it can be helpful to easily retrieve the first rows in one command without any indexing or additional packages.

The TitanicSurvival dataset contains data of 1309 passengers represented as rows. A simple print of the dataset would print all passengers, filling up the entire console. Instead, the head() function shows by default only the first six rows of a data frame, including its column names:

head(TitanicSurvival)

                                survived    sex     age
Allen, Miss. Elisabeth Walton        yes female 29.0000
Allison, Master. Hudson Trevor       yes   male  0.9167
Allison, Miss. Helen Loraine          no female  2.0000
Allison, Mr. Hudson Joshua Crei       no   male 30.0000
Allison, Mrs. Hudson J C (Bessi       no female 25.0000
Anderson, Mr. Harry                  yes   male 48.0000
                                passengerClass
Allen, Miss. Elisabeth Walton              1st
Allison, Master. Hudson Trevor             1st
Allison, Miss. Helen Loraine               1st
Allison, Mr. Hudson Joshua Crei            1st
Allison, Mrs. Hudson J C (Bessi            1st
Anderson, Mr. Harry                        1st

The number of rows can be tuned using the parameter n. To extract only the first three rows from the data set you can write:

head(TitanicSurvival, n = 3)

                               survived    sex     age
Allen, Miss. Elisabeth Walton       yes female 29.0000
Allison, Master. Hudson Trevor      yes   male  0.9167
Allison, Miss. Helen Loraine         no female  2.0000
                               passengerClass
Allen, Miss. Elisabeth Walton             1st
Allison, Master. Hudson Trevor            1st
Allison, Miss. Helen Loraine              1st
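
The effect of the parameter n can be checked quickly by counting rows. The sketch below uses mtcars, a data frame that ships with base R, as a stand-in for TitanicSurvival so that no additional package is required.

```r
# mtcars is a built-in data frame with 32 rows and 11 columns
nrow(mtcars)                # 32

# Without n, head() returns the first six rows
nrow(head(mtcars))          # 6

# With n = 3 only the first three rows are returned
first_three <- head(mtcars, n = 3)
nrow(first_three)           # 3

# head() only limits the rows, all columns are kept
ncol(first_three) == ncol(mtcars)
```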

Exercise: Select the top of a data frame

The salaries_sort dataset contains the 2008-09 nine-month academic salary for professors from a college in the US. The dataset is sorted by salary in ascending order.

Inspect the 10 lowest paid professors by selecting the first 10 rows using the head() function.

Selecting the bottom of a data frame

The tail() function can be used to select the bottom rows of a data frame. Similar to the head() function, it also accepts a parameter n to specify the number of rows to be returned.

For example, to select the last five rows from the TitanicSurvival dataset you can write:

tail(TitanicSurvival, n = 5)

                          survived    sex  age passengerClass
Zabour, Miss. Hileni            no female 14.5            3rd
Zabour, Miss. Thamine           no female   NA            3rd
Zakarian, Mr. Mapriededer       no   male 26.5            3rd
Zakarian, Mr. Ortin             no   male 27.0            3rd
Zimmerman, Mr. Leo              no   male 29.0            3rd

The head and tail functions can also be combined to select a fragment of the data set from the middle. To select the first five rows from the bottom 500 rows you can write:

head(tail(TitanicSurvival, n = 500), n = 5)

                                survived    sex age passengerClass
Ford, Mr. Edward Watson               no   male  18            3rd
Ford, Mr. William Neal                no   male  16            3rd
Ford, Mrs. Edward (Margaret Ann       no female  48            3rd
Fox, Mr. Patrick                      no   male  NA            3rd
Franklin, Mr. Charles (Charles        no   male  NA            3rd
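
The head() and tail() combination above can also be checked against plain row positions. The following is a minimal sketch on a small, made-up data frame (the name scores and its values are hypothetical and not part of the course data):

```r
# Hypothetical data frame with 20 rows, sorted in ascending order
scores <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Keep the bottom 10 rows (ids 11 to 20), then the first 3 of those
middle <- head(tail(scores, n = 10), n = 3)
middle$id

# The same rows selected directly by position
by_position <- scores[11:13, ]
by_position$id
```

Both selections return the rows with ids 11, 12 and 13, because tail() keeps the original row order before head() is applied.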

Exercise: Select the bottom of a data frame

The salaries_sort dataset contains the 2008-09 nine-month academic salary for professors from a college in the US. The dataset is sorted by salary in ascending order.

Inspect the 20 highest paid professors by selecting the last 20 rows using the tail() function.

Start Exercise

Exercise: Select the top rows from the bottom of a data frame

The salaries_sort dataset contains the 2008-09 nine-month academic salary for 397 professors from a college in the US. The dataset is sorted by salary in ascending order.

Inspect the 10 professors around the median salary by

Selecting the bottom 200 professors using the tail() function
Selecting the top 10 professors out of the bottom 200

Start Exercise

Select first or last rows of a data frame is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Determine the size of a data frame

Tue, 09 Jun 2020 10:26:51 +0000

The size of a data frame, like the number of rows or columns, is often required and can be determined in various ways.

Get number of rows of a data frame
Get number of columns of a data frame
Get dimensions of a data frame

nrow(___)
ncol(___)
dim(___)
length(___)

Data Frame Dimensions

nrow(___)
ncol(___)
dim(___)
length(___)

The number of rows and columns in a data frame can be guessed through the printed output of the data frame. However, it is much easier to get this information directly through functions. Additionally, you might want to use this information in some parts of the code.

Data frames have two dimensions. The number of rows is considered to be the first dimension. It typically defines the number of observations in a data set. To get the number of rows of the Davis data frame from the carData package use the nrow() function:

nrow(Davis)

[1] 200

Similarly, the number of columns or attributes of the data frame can be retrieved with ncol():

ncol(Davis)

[1] 5

Exercise: Determine number of elements in data frame

                              survived    sex age passengerClass
Allen, Miss. Elisabeth Walton      yes female  29            1st
 [ reached 'max' / getOption("max.print") -- omitted 1308 rows ]

Determine the number of data values in the TitanicSurvival data frame above given as the number of rows multiplied by the number of columns.

Start Exercise

Retrieving data frame dimensions

nrow(___)
ncol(___)
dim(___)
length(___)

To retrieve the size of all dimensions from a data frame at once you can use the dim() function. dim() returns a vector with two elements, the first element is the number of rows and the second element the number of columns.

For example, the dimensions of the Davis dataset can be retrieved as

dim(Davis)

[1] 200   5

In addition to data frames, dim() can also be used for other multi-dimensional R objects such as matrices or arrays. However, when used with vectors, dim() simply returns NULL:

dim(c(1, 3, 5, 7))

NULL

Instead, the length of a vector is determined through length():

length(c(1, 3, 5, 7))

[1] 4

In the case of a data frame length() returns its number of columns:

length(Davis)

[1] 5
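
To see how the four functions relate, here is a short sketch using the built-in cars data frame (50 rows, 2 columns). It was chosen only because it ships with every R installation:

```r
# cars is a built-in data frame with 50 rows and 2 columns
size <- dim(cars)
size

# dim() combines the results of nrow() and ncol()
size[1] == nrow(cars)
size[2] == ncol(cars)

# For a data frame, length() counts columns, not rows
length(cars) == ncol(cars)
```

All three comparisons return TRUE.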

Quiz: Data Frame Dimensions

dim(Florida)

What does the above command return for the data set Florida from the carData package which has 11 columns and 67 rows?

67 11
11 67
11
67

Start Quiz

Determine the size of a data frame is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Extract or replace columns in a data frame using `$`

Tue, 02 Jun 2020 21:35:52 +0000

Columns in a data frame can be easily extracted and manipulated with the $ operator. Even new columns can be added by assigning a vector.

Extract columns from a data frame with the $.
Replace values of existing columns in a data frame.
Add new columns to a data frame.

___$___
___$___  <- ___

Extract columns with the $

___$___
___$___  <- ___

Data frames are tables resulting from the combination of column vectors. Users can interact with data frames through numerous operators to extract, add or recombine values. To extract single columns from a data frame R offers a very specific operator: the dollar $. It returns the column vector as indicated by its name based on a data frame preceding the $.

To see the $ operator in action let’s extract the population pop (in 1,000) from different states in the US based on the States dataset (from 1992) in the carData package:

States$pop

 [1]  4041   550  3665  2351 29760  3294  3287   666   607 12938
[11]  6478  1108  1007 11431  5544  2777  2478  3685  4220  1228
[21]  4781  6016  9295  4375  2573  5117   799  1578  1202  1109
[31]  7730  1515 17990  6629   639 10847  3146  2842 11882  1003
[41]  3487   696  4877 16987  1723   563  6187  4867  1793  4892
[51]   454

The command extracts the population column as a vector from the data frame. From this vector we can calculate the total population with sum():

sum(States$pop)

[1] 248709

Similarly, the average salary (in $1,000) for teachers can be calculated as the mean() from the pay column:

mean(States$pay)

[1] 30.94118
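
The second pattern listed above, ___$___ <- ___, replaces an existing column or adds a new one. A minimal sketch, assuming a small hypothetical data frame named people (it is not part of carData):

```r
# Hypothetical example data
people <- data.frame(
  id = c(1, 2, 3, 4),
  female = c(TRUE, FALSE, FALSE, TRUE)
)

# Replace an existing column: the name after $ already exists
people$id <- people$id * 10

# Add a new column: the name after $ does not exist yet
people$male <- !people$female

people
```

After both assignments people has three columns: id, female and male.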

Quiz: Extract column from a data frame

      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
4     Prof          B            45          39 Male 115000
5     Prof          B            40          41 Male 141500
 [ reached 'max' / getOption("max.print") -- omitted 392 rows ]

Which R command can be used to calculate the average salary of professors in the Salaries dataset from the carData package?

mean(Salaries$salary)
mean(salary$Salaries)
Salaries(mean$salary)
TitanicSurvival(age$mean)

Start Quiz

Exercise: Extract column from a data frame

Calculate the average age of passengers in the TitanicSurvival dataset from the carData package. The carData package is already loaded.

Start Exercise

Extract or replace columns in a data frame using `$` is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

QBits Workspace: A New Online Editor to Share and Deploy R Code

Wed, 27 May 2020 00:00:00 +0000

QBits Workspace: A New Online Editor to Share and Deploy R Code

Today we are excited to announce the QBits Workspace to run and deploy R code in the browser. QBits enable you to run R in a serverless cloud environment and provide an easy and cost-effective way to develop, run, deploy and share data science projects at scale without the need to manage servers, software setup and package installations. They start up instantly, have very quick deployment times and can handle all sorts of data science projects. In fact, QBits already power our online course platform and even more exciting use cases will follow soon.

Why QBits

We created QBits to make the deployment experience for data scientists easier. Too many projects fail because data scientists struggle to deploy their results. Think of a simple ggplot2 example to reproduce the gapminder plots from Hans Rosling’s excellent presentation:

library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2007 <- filter(gapminder, year == 2007)
gapminder_2007$pop <- gapminder_2007$pop/1e6
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp, 
                 color = continent,
                 size = pop),
  alpha = 0.7) + 
  scale_size_area(max_size = 15)

This plot runs fine locally. However, to reproduce the plot in some interactive web application allowing users to filter the dataset by e.g. year == 1952 we need to

Create a docker container choosing the right operating system.
Install the correct language runtime, e.g. R 4.0.0.
Install all package dependencies (e.g. ggplot2, dplyr, gapminder)
Create a Shiny application or Plumber API for interactive or programmatic use.

You see that even for this simple example the deployment overhead is considerable. This leads to a deployment bottleneck leaving many data science projects unfinished and frustrated data scientists behind. The big difference with QBits is that they already provide the correct container, language runtime and packages. The only thing you have to do is to put your code on top. That’s it.

The QBits Workspace provides a development environment to rapidly develop your custom QBits. Since you are already working within your custom container the final deployment is then only a matter of a second—not weeks.

Check out the previous example Reproduce Gapminder scatter plot within the QBits Workspace here.

What’s Next

We are hard at work to expand the editor to fit more workflows and implement new features. Further updates will introduce the possibility to

Create your own QBits
Add and remove packages (all 15,000+ CRAN packages are available)
Deploy QBits, including versioning
… and more (yes, Python is coming as well)

For now, head over to our playgrounds and give them a try.

We would love to hear your feedback and feature requests:

Write us at support@quantargo.com or
Hit us up on Twitter

Cheers,

Your Quantargo Team

Create and convert tibbles

Fri, 22 May 2020 18:12:30 +0000

Tibbles are the modern reimagination of data frames and share many commonalities with their ancestors. The most visible difference is how tibble contents are printed to the console. Tibbles are part of the tidyverse and used for their more consistent behaviour compared to data frames.

Learn the difference between data frames and tibbles
Create tibbles from vectors
Convert data frames into tibbles

tibble(___ = ___, 
       ___ = ___, 
       ...)
as_tibble(___)

Introduction to Tibbles

A modern reimagining of the data frame

https://tibble.tidyverse.org

Tibbles are in many ways similar to data frames. In fact, they inherit from data frames, which means that all functions and features available for data frames also work for tibbles. Therefore, when we speak of data frames we also mean tibbles.

In addition to everything a data frame has to offer, tibbles behave more consistently and are easier to use in many cases. Most importantly, when a tibble object is printed to the console it automatically shows only the first 10 rows and condenses additional columns. By contrast, a data frame fills up the entire console screen with values, which can lead to confusion. Let’s take a look at the gapminder dataset from the gapminder package:

gapminder

# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

The top line immediately tells us that the gapminder dataset is a tibble with 1,704 rows and 6 columns. The second line shows the column names, and their data types appear directly below.

For example, the column country has the type <fct> (which is short for “factor”), year is an integer <int> and life expectancy lifeExp is a <dbl>—a decimal number.

Quiz: Tibbles versus Data Frames

Which answers about data frames and tibbles are correct?

The printed output to the console is the same for tibbles and data frames
All functions defined for data frames also work on tibbles.
Tibbles also show the data types in the console output.
To use tibble objects the tibbles package needs to be loaded.
The table dimensions are not shown in the console output for tibbles.

Start Quiz

Creating Tibbles

tibble(___ = ___, 
       ___ = ___, 
       ...)
as_tibble(___)

The creation of tibbles works exactly the same as for data frames. We can use the tibble() function from the tibble package to create a new tabular object.

For example, a tibble containing data from four different people and three columns can be created like this:

library(tibble)
tibble(
  id = c(1, 2, 3, 4),
  name = c("Louisa", "Jonathan", "Luigi", "Rachel"),
  female = c(TRUE, FALSE, FALSE, TRUE)
)

# A tibble: 4 x 3
     id name     female
  <dbl> <chr>    <lgl> 
1     1 Louisa   TRUE  
2     2 Jonathan FALSE 
3     3 Luigi    FALSE 
4     4 Rachel   TRUE

Converting data frames to Tibbles

If you prefer tibbles to data frames for their additional features, you can convert existing data frames with the as_tibble() function.

For example, the Davis data frame from the carData package can be converted to a tibble like so:

as_tibble(Davis)

# A tibble: 200 x 5
   sex   weight height repwt repht
   <fct>  <int>  <int> <int> <int>
 1 M         77    182    77   180
 2 F         58    161    51   159
 3 F         53    161    54   158
 4 M         68    177    70   175
 5 F         59    157    59   155
 6 M         76    170    76   165
 7 M         76    167    77   165
 8 M         69    186    73   180
 9 M         71    178    71   175
10 M         65    171    64   170
# … with 190 more rows
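
Since tibbles inherit from data frames, the usual data frame functions keep working after the conversion. A short sketch, assuming the tibble package is installed and using the built-in mtcars data frame as a stand-in (it does not appear in the course text):

```r
library(tibble)

# Convert a built-in data frame (32 rows, 11 columns) to a tibble
mt <- as_tibble(mtcars)

# A tibble carries the data.frame class in addition to its own classes
class(mt)
is.data.frame(mt)

# Data frame functions such as dim() work unchanged
dim(mt)
```

The class vector ends with "data.frame", which is why functions written for data frames accept tibbles as well.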

Exercise: Convert data frame to Tibble

  speed dist
1     4    2
2     4   10
3     7    4
 [ reached 'max' / getOption("max.print") -- omitted 47 rows ]

The data frame cars reports the speed of cars and distances taken to stop. To have a nicer printed output in the console use the as_tibble() function and create a tibble object out of it.

Start Exercise

Create and convert tibbles is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Build a data frame from vectors

Mon, 18 May 2020 17:18:16 +0000

Tabular data is the most common format used by data scientists. In R, tables are represented through data frames. They can be inspected by printing them to the console.

Understand why data frames are important
Interpret console output created by a data frame
Create a new data frame using the data.frame() function
Define vectors to be used for single columns
Specify names of data frame columns

data.frame(___ = ___, 
           ___ = ___, 
           ...)

Introduction to Data Frames

In analysis and statistics, tabular data is the most important data structure. It is present in many common formats like Excel files, comma separated values (CSV) or databases. R integrates tabular data objects as first-class citizens into the language through data frames. Data frames allow users to easily read and manipulate tabular data within the R language.

Let’s take a look at a data frame object named Davis, from the package carData, which includes height and weight measurements for 200 men and women:

Davis

  sex weight height repwt repht
1   M     77    182    77   180
2   F     58    161    51   159
3   F     53    161    54   158
 [ reached 'max' / getOption("max.print") -- omitted 197 rows ]

From the printed output we can see that the data frame spans over 200 rows (3 printed, 197 omitted) and 5 columns. In the example above, each row contains data of one person through attributes, which correspond to the columns sex, weight, height, reported weight repwt and reported height repht.

For example, the first row in the table describes a male weighing 77 kg with a height of 182 cm. The reported values are very close, at 77 kg and 180 cm, respectively.

The rows in a data frame are further identified by row names on the left which are simply the row numbers by default. In the case of the Davis dataset above the row names range from 1 to 200.

Quiz: Data Frame Output

      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]

The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.

Which answers about the data frame printed above are correct?

The data frame has 3 rows.
The data frame has 394 rows.
The data frame has 397 rows.
The data frame has 6 attributes.
The attribute names contain Prof and AsstProf

Start Quiz

Quiz: Data Frame Output (2)

      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]

The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.

Which answers about the first three faculty members are correct?

All three are male.
The salaries of all three members are about the same.
The Professor in row three is most probably the oldest.
All shown professors are from the same discipline.
The highest salary amongst the three Professors is $139,750.

Start Quiz

Creating Data Frames

data.frame(___ = ___, 
           ___ = ___, 
           ...)

Data frames hold tabular data in various columns or attributes. Each column is represented by a vector of different data types like numbers or characters. The data.frame() function supports the construction of data frame objects by combining different vectors to a table. To form a table, vectors are required to have equal lengths. A data frame can also be seen as a collection of vectors connected together to form a table.

Let’s create our first data frame with four different people including their ids, names and indicators if they are female or not. Each of these attributes is created by a different vector of different data types (numeric, character and logical). The attributes are finally combined to a table using the data.frame() function:

data.frame(
  c(1, 2, 3, 4),
  c("Louisa", "Jonathan", "Luigi", "Rachel"),
  c(TRUE, FALSE, FALSE, TRUE)
)

  c.1..2..3..4. c..Louisa....Jonathan....Luigi....Rachel..
1             1                                     Louisa
2             2                                   Jonathan
3             3                                      Luigi
4             4                                     Rachel
  c.TRUE..FALSE..FALSE..TRUE.
1                        TRUE
2                       FALSE
3                       FALSE
4                        TRUE

The resulting data frame stores the values of each vector in a different column. It has four rows and three columns. However, the column names printed on the first line seem to include the column values separated by dots which is a very strange naming scheme!

Column names can be included into the data.frame() construction as argument names preceding the values of column vectors. To improve the column naming of the previous data frame we can write

data.frame(
  id = c(1, 2, 3, 4),
  name = c("Louisa", "Jonathan", "Luigi", "Rachel"),
  female = c(TRUE, FALSE, FALSE, TRUE)
)

  id     name female
1  1   Louisa   TRUE
2  2 Jonathan  FALSE
3  3    Luigi  FALSE
4  4   Rachel   TRUE

The resulting data frame includes the column names needed to see the actual meaning of the different columns.
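
To confirm that each column keeps the data type of its vector, the data frame can be stored in a variable and inspected. This is only a sketch: the variable name people and the stringsAsFactors = FALSE argument are additions that do not appear above. The argument keeps the name column as character in R versions before 4.0, where text columns were converted to factors by default.

```r
people <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Louisa", "Jonathan", "Luigi", "Rachel"),
  female = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One class per column: numeric, character and logical
sapply(people, class)

# Column names as given in the data.frame() call
names(people)
```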

Exercise: Creating Your First Data Frame

weekday	temperature	hot
Monday	28	FALSE
Tuesday	31	TRUE
Wednesday	25	FALSE

Let’s create a data frame as shown above using the data.frame() function. The resulting data frame should consist of the three columns weekday, temperature and hot:

The first column named weekday contains the weekday names "Monday", "Tuesday", "Wednesday".
The second column named temperature contains the temperatures (in degrees Celsius) as 28, 31, 25.
The third column named hot contains the logical values FALSE, TRUE, FALSE.

Store the final data frame in the variable temp and print its output to the console:

Start Exercise

Quiz: Which statements are true about this data frame?

price <- c(28, 31, 25)
data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday", "Thursday"),
  price = price,
  expensive = price > 30
)

Which statements are true about the data frame above?

The data.frame() function will fail because the column expensive is not a vector.
The data.frame() function will not fail
The data.frame() function fails because the lengths of the vectors are different
The command would work if weekday had the values c("Monday", "Tuesday", "Wednesday")

Start Quiz

Build a data frame from vectors is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Use existing functions and data through packages

Thu, 14 May 2020 09:51:58 +0000

Packages give you access to a huge set of functions and datasets, most of which are provided by the generous R community. They are the secret sauce which makes it possible to use R for pretty much anything you can imagine. Additionally, lots of packages are open source which can be a great learning resource.

Get to know the concept of packages in R
Learn how to call functions from packages

library(___)
data(___)

Introduction to packages

Packages are one of the best things in R. They add new functions and features to the language environment and extend its applications over many different use cases and domains. Packages are supported by a large community of developers and allow R to connect to many different external algorithms and libraries—many of them even written in different programming languages.

Contributors all over the world including developers or domain experts in physics, finance, statistics etc. create a lot of additional content, such as custom functions for specific use cases. These functions, together with documentation, help files and datasets can be gathered into packages. Packages can be made public through package repositories so that anyone can install and use them. The most popular package repository is CRAN which hosts over 15,000 packages.

Calling a package

As a demonstration we will use the generate_primes() function from the primes package. This function takes two numbers as parameters and outputs all prime numbers inside their range.

In order to use a package we first need to load it. This can be done by applying the library() function and inserting the name of the package as the first argument of the function. After that, we have access to all of the content in the package and can use functions from it as usual.

library(primes)
generate_primes(min = 500, max = 550)

[1] 503 509 521 523 541 547
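
The data() function from the overview above works hand in hand with library(). A hedged sketch using the datasets package, which ships with R, so nothing needs to be installed (the cars dataset is picked here purely for illustration):

```r
# Load the package (datasets is usually attached already)
library(datasets)

# Make the cars dataset from the package available in the workspace
data(cars)

# The dataset can now be used like any other data frame
head(cars, n = 3)
```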

Exercise: Check for leap year

Load the lubridate package.
Use the leap_year() function to check whether 2020 is a leap year. (Hint: the function takes the year as a number in its first parameter, date.)

Start Exercise

Use existing functions and data through packages is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Call existing R code through functions

Mon, 11 May 2020 20:11:16 +0000

When you write code, functions are your best friends. They can make hard things very easy or provide new functionality in a nice way. Through functions you gain access to all the powerful features R has to offer.

Call functions with function names and round brackets
Use basic mathematical functions on vectors
Customize functions through parameters
Create number sequences using seq()
Create random numbers using runif()
Sample vectors using sample()

abs(___)
sqrt(___)
seq(___)
runif(___)

Introduction to functions

Functions in any programming language can be described as predefined, reusable code intended to accomplish a specific task. Functions in R are called by writing their name followed directly by round brackets. Inside the brackets, we can specify parameters for the function. One function we have already used extensively is the concatenate function c().

A simple function for example is abs() which is used to get the absolute value of a number. In the following example, the function is given -3 as input and returns the result 3:

abs(-3)

[1] 3

Exercise: Use the sqrt() function

Use the sqrt() function to get the square-root of 8.

Start Exercise

Customizing functions through parameters

Functions take parameters that customize them for the given task. For example, the runif() function generates uniformly distributed values, which means that all outcomes have the same probability. By default, it takes the following parameters:

runif(n, min = 0, max = 1)

The first parameter n is the number of values we want to generate. It is mandatory: we need to define it for the function to work.

On the other hand, we can see that some of the parameters have default values, defined with the equals sign =. This means that if we don’t explicitly specify these parameters in the brackets, the function will use the default values. Let’s take a look at an example:

runif(n = 5)

[1] 0.08988000 0.07848433 0.59898103 0.57674865 0.62216434

The output is a numeric vector of 5 numbers. Each of them is between 0 and 1, since we did not change the default setting. If we changed the parameters min and max as well, we could further customize the output:

runif(n = 5, min = 8, max = 9)

[1] 8.963653 8.789039 8.520760 8.614895 8.852204

It is also possible to leave out the name of the parameters and simply type in the input values like this:

runif(5, 8, 9)

[1] 8.714105 8.409777 8.189146 8.849575 8.224963

However, in this case we must be careful about the order of inputs, since each function defines a fixed order for its parameters. If we don’t explicitly name the parameters we are setting, R will assume that we set them in the predefined order.
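
The difference between naming parameters and relying on their position can be sketched as follows. The set.seed() calls are an addition not used above: set.seed() is a base R function that makes the random numbers reproducible, so the three calls can be compared directly.

```r
# Named parameters
set.seed(1)
by_name <- runif(n = 3, min = 8, max = 9)

# The same values passed by position: n, min, max
set.seed(1)
by_position <- runif(3, 8, 9)

# Named parameters can be given in any order
set.seed(1)
reordered <- runif(max = 9, min = 8, n = 3)

identical(by_name, by_position)
identical(by_name, reordered)
```

Both comparisons return TRUE. Naming the parameters makes a call independent of their order, which is why it is the safer choice for functions with many parameters.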

Exercise: Use the sample() function

The sample() function takes a vector and returns a random sample from it. The first two of its parameters are:

x, which defines the vector
size, which defines the number of elements we want to include in the random sample

Use the sample() function and sample 5 random values from the full variable.

Start Exercise

Exercise: Use the seq() function

The seq() function creates a sequence of numbers. The first three of its parameters are: from, to and by.

from defines the start of the sequence
to defines the end of the sequence
by sets the steps between the single values

Use the seq() function and create a sequence of numbers from 2 to 10 but only include every second value. Thus, the output should be: 2, 4, 6, 8, 10.

Start Exercise

Call existing R code through functions is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

Use basic operators

Fri, 08 May 2020 11:31:23 +0000

R is not only good for analysing and visualizing data, but also for solving maths problems or comparing data with each other. Plus you can use it just like a pocket calculator.

Use R as a pocket calculator
Use arithmetic operators on vectors
Use relational operators on vectors
Use logical operators on vectors

___ + ___
___ - ___
___ / ___
___ * ___
___ ^ ___

___ == ___
___ != ___
___ < ___
___ > ___
___ <= ___
___ >= ___

___ & ___
___ | ___

___ %in% ___

Using R as a pocket calculator

___ + ___
___ - ___
___ / ___
___ * ___

R is a programming language mainly developed for statistics and data analysis. Within R you can use mathematical operators just like you would use on a calculator. For example, you can add + and subtract - numbers from each other:

5 + 5

[1] 10

7 - 3.5

[1] 3.5

Similarly, you can multiply * or divide / numbers:

5 * 7

[1] 35

8 / 4

[1] 2

You can take the power of a number by using the ^ sign:

2 ^ 3

[1] 8

According to the rules of mathematics, you can use round brackets to specify the order of evaluation in more complex tasks:

5 * (2 + 4 / 2)

[1] 20
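
To see what the round brackets change, the same numbers can be evaluated with and without them. This is a small illustrative sketch; the variable names are chosen here and carry no special meaning:

```r
# Without brackets: multiplication and division first, then addition
without_brackets <- 5 * 2 + 4 / 2
without_brackets

# With brackets: the bracketed part is evaluated first
with_brackets <- 5 * (2 + 4 / 2)
with_brackets
```

The first expression evaluates to 12 (10 + 2), the second to 20 (5 * 4).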

Exercise: Use basic arithmetic

To calculate the mean of the numbers 2, 3, 7 and 8:

Add all the numbers together using +.
Divide the result by the number of elements.
Make sure that the result of the addition is calculated first by using round brackets ().

Start Exercise

Applying arithmetic operators on vectors

___ + ___
___ - ___
___ / ___
___ * ___

Operations such as addition, subtraction, multiplication and division are called arithmetic operations. They work not only on single values but also on vectors. If you use an arithmetic operation on two vectors, it is applied element by element: each number in the first vector is combined with the number at the same position in the second vector.

In the following example we create two numeric vectors and assign them to the variables a and b. We then add them together:

a <- c(1, 3, 6, 9, 12, 15)
b <- c(2, 4, 6, 8, 10, 12)
a + b

[1]  3  7 12 17 22 27

As the output shows, the first elements of the two vectors were added together and resulted in 1 + 2 = 3. The second elements added up to 3 + 4 = 7, the third elements to 6 + 6 = 12 and so on.

We can apply any other arithmetic operation in a similar way:

a <- c(22, 10, 7, 3, 14, 4)
b <- c(4, 5, 2, 6, 14, 8)
a / b

[1] 5.5 2.0 3.5 0.5 1.0 0.5

Using the same principle, the first element of the result is 22 / 4 = 5.5, the second is 10 / 5 = 2 and so on.
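
As a further sketch of the same element-by-element rule, here is subtraction applied to the two vectors from the division example:

```r
a <- c(22, 10, 7, 3, 14, 4)
b <- c(4, 5, 2, 6, 14, 8)

# Subtraction, element by element: 22 - 4, 10 - 5, 7 - 2, ...
difference <- a - b
difference

# The result has the same length as the inputs
length(difference) == length(a)
```

The result is 18, 5, 5, -3, 0 and -4, one value per position.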

Quiz: Vector Multiplication

odd <- c(1, 3, 5)
even <- c(2, 4, 6)
odd * even

Inspect the code chunk above. What is the result of the multiplication?

108
54
15, 48
2, 12, 30
18, 36, 54

Start Quiz

Exercise: Multiply numeric vectors

Multiply the numeric vectors ascending and descending:

Create a vector with the numbers 1, 2, 3 and 4 and assign it to the variable ascending.
Create a vector with the numbers 4, 3, 2 and 1 and assign it to the variable descending.
Multiply (*) the variable ascending with the variable descending.

Start Exercise

Using relational operators

___ == ___
___ != ___
___ < ___
___ > ___
___ <= ___
___ >= ___

Relational operators are used to compare two values. The output of these operations is always a logical value TRUE or FALSE. We distinguish six different types of relational operators, as we’ll see below.

The equal == and not equal != operators check whether two values are the same (or not):

2 == 1 + 1

[1] TRUE

2 != 3

[1] TRUE

The less than < and greater than > operators check whether a value is less or greater than another one:

2 > 4

[1] FALSE

2 < 4

[1] TRUE

The less than or equal to <= and the greater than or equal to >= operators combine the check for equality with either the less or the greater than comparison:

2 >= 2

[1] TRUE

2 <= 3

[1] TRUE

All of these operators can be used on vectors with one or more elements as well. In that case, each element of one vector is compared with the element at the same position in the other vector, just as with the mathematical operators:

vector1 <- c(3, 5, 2, 7, 4, 2)
vector2 <- c(2, 6, 3, 3, 4, 1)
vector1 > vector2

[1]  TRUE FALSE FALSE  TRUE FALSE  TRUE

Therefore, the output of this example is based on the comparisons 3 > 2, 5 > 6, 2 > 3 and so on.
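
The equality operators follow the same position-by-position rule. A brief sketch reusing the two vectors from above (the result names same and different are chosen here for illustration):

```r
vector1 <- c(3, 5, 2, 7, 4, 2)
vector2 <- c(2, 6, 3, 3, 4, 1)

# TRUE only where both vectors hold the same value
same <- vector1 == vector2
same

# != gives exactly the opposite answer at every position
different <- vector1 != vector2
different
```

Only the fifth position holds the same value (4) in both vectors, so same is TRUE there and FALSE everywhere else.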

Exercise: Compare numeric values

Use the appropriate relational operator and check whether 3 is greater than or equal to 2

Start Exercise

Exercise: Compare temperatures

In the following exercise, we make use of the weather data gathered by the city of Innsbruck over the last decades. You are given two variables, avgtemp_1997_2006 and avgtemp_2007_2016, each containing the monthly average temperatures in Innsbruck for the years 1997 to 2006 and 2007 to 2016.

Use an appropriate relational operator and check in which months there was an increase in the average temperature.

Start Exercise

Using logical operators

___ & ___
___ | ___

The AND operator & is used to check whether multiple statements are TRUE at the same time. As a simple example, we can check whether 3 is greater than 1 and, at the same time, whether 4 is smaller than 2:

3 > 1 & 4 < 2

[1] FALSE

3 is in fact greater than 1, but 4 is not smaller than 2. Since one of the statements is FALSE, the output of this joined evaluation is also FALSE.

The OR operator | checks whether at least one of the statements is TRUE.

3 > 1 | 4 < 2

[1] TRUE

In an OR statement, not all elements have to be TRUE. Since 3 is greater than 1, the output of this evaluation is TRUE as well.

The ! operator is used for the negation of logical values, which means it turns TRUE values to FALSE and FALSE values to TRUE. If we have a statement resulting in a logical TRUE or FALSE value, we can negate the result by applying the ! operator on it. In the following example we check whether 3 is greater than 2 and then negate the result of this comparison:

!3 > 2

[1] FALSE
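
One detail worth knowing: R evaluates the comparison before the negation, so !3 > 2 is read as !(3 > 2). A minimal sketch with a different pair of numbers (chosen only for illustration) shows why explicit parentheses can make the intent clearer:

```r
# Negation is applied to the result of the comparison
!1 > 2      # read as !(1 > 2)
!(1 > 2)    # explicit version: 1 > 2 is FALSE, negated to TRUE

# Forcing the negation first gives a different result:
# !1 is FALSE, and FALSE > 2 is FALSE
(!1) > 2
```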

Logical operators, just like arithmetic and relational operators, can be used on longer vectors as well. In the following example we use three different vectors a, b and c and try to evaluate multiple relations in combination.

a <- c(1, 21, 3, 4)
b <- c(4, 2, 5, 3)
c <- c(3, 23, 5, 3)

a > b & b < c

[1] FALSE  TRUE FALSE FALSE

First, both relational comparisons a > b and b < c are evaluated and result in two logical vectors. Therefore, we essentially compare the following two vectors:

c(FALSE, TRUE, FALSE, TRUE) & c(FALSE, TRUE, FALSE, FALSE)

[1] FALSE  TRUE FALSE FALSE

The & operator checks whether both values at the same position in the vectors are TRUE. If any value of the pairs is FALSE, the combination is FALSE as well.

The | operator checks whether any of the values at the same position in the vectors is TRUE.

c(FALSE, TRUE, FALSE, TRUE) | c(FALSE, TRUE, FALSE, FALSE)

[1] FALSE  TRUE FALSE  TRUE
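
Putting the pieces together, here is a short sketch that applies both operators directly to a, b and c. The calls to any() and all() are an addition not covered above; they collapse a logical vector into a single TRUE or FALSE:

```r
a <- c(1, 21, 3, 4)
b <- c(4, 2, 5, 3)
c <- c(3, 23, 5, 3)

both   <- a > b & b < c   # TRUE where both comparisons hold
either <- a > b | b < c   # TRUE where at least one comparison holds

both
either

any(both)    # is at least one element TRUE?
all(either)  # are all elements TRUE?
```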

Exercise: Use the & operator

You are given three variables alpha, beta and gamma. Use an appropriate logical operator and check whether alpha is greater than beta and at the same time gamma is smaller than beta.

Exercise: Use the | operator

You are given three variables alpha, beta and gamma. Each contains a numeric vector of two elements. Use the appropriate logical operator and check whether alpha is greater than beta OR gamma is less than beta. (Hint: use the logical OR operator |)

Using the %in% operator

___ %in% ___

Another frequently used special operator is the %in% operator. It checks whether the elements of one vector are present in another one.

In the following example we use the variable EU containing the abbreviations of all countries in the European Union. Then, we check whether the character "AT" is present in the EU variable.

EU <- c("AT","BE","BG","CY","CZ","DE","DK","EE","ES","FI","FR","GR","HR","HU",
        "IE","IT","LT","LU","LV","MT","NL","PL","PT","RO","SE","SI","SK")

"AT" %in% EU

[1] TRUE

The following example extends the search and compares multiple elements with the contents of the EU variable. It outputs a logical vector containing one logical value for each element:

c("AT","HU","UK") %in% EU

[1]  TRUE  TRUE FALSE

As the output shows, the first two character elements "AT" and "HU" are present in the variable EU, whereas the third element "UK" is not.
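
Because %in% returns one logical value per element on its left-hand side, the result can also be used to pick out the matching elements. The following sketch uses a small made-up vector of codes (members and candidates are hypothetical names, not part of the course data) and square-bracket subsetting, which is not introduced in this chapter:

```r
members    <- c("BE", "DE", "FR", "HU", "IT")
candidates <- c("HU", "UK", "FR")

is_member <- candidates %in% members
is_member

candidates[is_member]    # keep only the elements that were found
candidates[!is_member]   # negate to keep the ones that were not found
```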

Exercise: Use the %in% operator

You are standing in the supermarket and need to determine which items you can check off your shopping_list:

Use the %in% operator and determine which shopping_list items you can check off your list based on the items in your basket.
Print the output of the resulting vector to the console.

Use basic operators is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Create variables through assignments

Tue, 05 May 2020 08:25:31 +0000

Usually you want to store vectors and other objects into variables so you can work with them more easily. Variables are like a box with a name. You can then refer to the name to see what is stored inside.

Learn how to create a variable
Use variables to store objects and vectors
Reuse assigned objects through a variable name

___ <- ___

Assigning variables

Usually you want to use objects like vectors more than once. In order to save the trouble of retyping and recreating them all the time we would like to save them somewhere and reuse them later.

To do this we can assign them to a variable name. R uses the special arrow operator <- for assigning values to a variable. The arrow is simply the combination of a less-than character (<) and a minus sign (-).

Let’s take a look at an example, in which we assign a numeric vector to a variable named numbers:

numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

Now we can use the variable’s name below to see its contents:

numbers

[1] 1 2 3 4 5 6 7 8 9

Note that when we assign something to a variable that already exists, it gets overwritten. All previous contents are automatically removed:

numbers <- c(10, 11, 12, 13)
numbers

[1] 10 11 12 13

Once you have defined a variable you can use it just like you would use the underlying vector itself. In the following example we create two numeric vectors and assign them to the variables low and high. Then we use these variables and concatenate the two vectors into a single one and assign it to the variable named sequence. Finally we call the sequence variable and inspect its contents:

low <- c(1, 2, 3)
high <- c(4, 5, 6)
sequence <- c(low, high)
sequence

[1] 1 2 3 4 5 6

As you can see, the vectors 1, 2, 3 and 4, 5, 6 stored in the variables low and high were combined into a single vector that is now the content of the variable sequence.

Exercise: Assign numeric vector to variable

Use the concatenate function c() and create a vector containing the numbers 2, 3, 5 and 7.
Assign this vector to a variable named primes.

Exercise: Assign character vector to variable

Use the concatenate function c() and create a vector containing the words

"programming"
"R" and
"variables"

Assign this vector to the variable fun.

Quiz: Variable Overriding

fun <- c("programming", "in", "R") 
fun <- c("Have", "fun")
fun

Inspect the code chunk above. What is the content of the variable fun in the last step?

"programming" "in" "R"
"Have" "fun"
"programming" "in" "R" "Have" "fun"
There is no output, only an error message.

Quiz: Vector Concatenation

fun <- c("programming", "in", "R") 
fun2 <- c("Have", "fun")
fun3 <- c(fun2, fun)
fun3

Inspect the code chunk above. What is the content of the variable fun3 in the last step?

"programming" "in" "R" "Have" "fun"
"Have" "fun"
"Have" "fun" "programming" "in" "R"
There is no output, only an error message.

Naming rules

There are a few rules we need to consider when creating variables.

Variable rules

Can contain letters: example
Can contain numbers: example1
Can contain underscores: example_1
Can contain dots: example.1
Cannot start with numbers: 2example
Cannot start with underscores: _example
Cannot start with a dot if directly followed by a number: .2example
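
These rules can also be checked from within R. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name; a name that comes back unchanged was valid to begin with. The helper is_valid_name() is a hypothetical convenience function written for this example:

```r
# Hypothetical helper: TRUE if the string is already a valid variable name
is_valid_name <- function(x) make.names(x) == x

candidates <- c("example", "example1", "example_1", "example.1",
                "2example", "_example", ".2example")

is_valid_name(candidates)

# make.names() shows how R would repair the invalid names
make.names(c("2example", "_example", ".2example"))
```

Note that make.names() also alters reserved words such as if or TRUE, so this check treats them as invalid names too.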

Quiz: Naming Rules

Which of the following variable names are valid?

weekly+tasks
task2Do
24hour
.task

Create variables through assignments is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Combine values into a vector

Fri, 01 May 2020 08:24:40 +0000

R always creates lists of values—even when there is only one value in a list. These lists are called vectors and they make working with data much easier.

Everything is a vector
Get to know different data types in R
Learn how to create vectors
Use the : operator to create numeric sequences
Use the concatenate function c() to create vectors of different data types

1:100
c(1, 2, 3, 4)
c("abc", "def", "ghi")
c(TRUE, FALSE, TRUE)

Introduction to Vectors

A vector is a collection of elements of the same kind and the most basic data structure in R. For example, a vector could hold the four numbers 1, 3, 2 and 5. Another vector could be formed with the three text strings "Welcome", "Hi" and "Hello". These different kinds of values (numbers, text) are called data types.

A single value is also treated as a vector - a vector with only one element in it. As we will see throughout the course, this concept makes R very special. We can manipulate vectors and their values through plenty of operations that are provided by R.

One key advantage of vectors is that we can apply an operation (e.g. a multiplication) to all its values at once instead of going through each item individually. This is called vectorization.
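
As a minimal sketch of vectorization, using the four numbers mentioned above (the variable name values and the second vector are chosen only for illustration):

```r
values <- c(1, 3, 2, 5)

# The multiplication is applied to every element at once
values * 2

# Two vectors of the same length are combined element by element
values + c(10, 20, 30, 40)
```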

Types of vectors

Vectors can only hold elements of the same data type. In this course we will work with the following three main data types:

Numeric values are numbers. Although they can be further split into whole numbers (integers) and numbers with decimals (doubles), R automatically converts between these sub-types if needed. Therefore, we will collectively refer to them as just numeric values.

Character values contain textual content. These can be letters, symbols, spaces and numbers as well. They must be enclosed by quotation marks - either single quotes '___' or double quotes "___".

Logical values can either be TRUE or FALSE. They are also often referred to as boolean or binary values. Because logical values can only be TRUE or FALSE, they are most often used to answer simple questions like “Is 1 greater than 2?” or “Is it past 3 o’clock?”. These kinds of questions only need answers like “Yes” (TRUE) or “No” (FALSE). Importantly, in R logical values are case sensitive, which means they have to be written with capital letters.
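
To see which data type a vector has, base R offers class() and typeof(). Neither function is covered in this chapter, so treat the following as a small exploratory sketch with made-up values:

```r
class(c(1.5, 2, 3))       # numeric
class(c("Hi", "Hello"))   # character
class(c(TRUE, FALSE))     # logical

# The numeric sub-types become visible with typeof()
typeof(1:3)          # whole numbers created with : are integers
typeof(c(1.5, 2))    # numbers with decimals are doubles
typeof(1:3 + 0.5)    # mixing both converts the result to double
```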

Quiz: Data Types

Which of the following options are valid data types in R?

Numeric
Bytes
Logical
Simples

Creating a sequence of numbers

In R, even a single value is considered a vector. Creating a vector of one element is as simple as typing its value:

4

[1] 4

To create a sequence of numeric values we can use the : operator, which takes two numbers and outputs a vector of all whole numbers in that range:

2:11

 [1]  2  3  4  5  6  7  8  9 10 11

The : operator creates a vector from the number on the left-hand side to the number on the right-hand side. Therefore, the order of numbers is important. If we define the previous example the other way around, we get a vector of descending numbers, instead of ascending:

11:2

 [1] 11 10  9  8  7  6  5  4  3  2
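
A quick way to convince yourself of this is to compare both directions. The sketch below uses the base R functions length() and rev(), which are not introduced in this chapter; the variable names are chosen for illustration:

```r
ascending  <- 2:11
descending <- 11:2

length(ascending)                      # number of elements in the range
identical(rev(ascending), descending)  # reversing 2:11 gives 11:2
```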

The : operator comes in handy when we need a vector of every whole number in a given range. However, if we need a vector whose numbers do not follow such a consecutive sequence, we require something different.

Exercise: Use the : operator

Use the : operator and create a vector from 2 to 6

Concatenating numeric values to a vector

We can combine multiple numbers into a single vector using the concatenate function c(), which links the elements between the parentheses together into a chain. Multiple elements need to be separated by commas.

To create our first vector holding seven different numbers we can use the concatenate function c() like so:

c(7, 4, 2, 5, 5, 22, 1)

[1]  7  4  2  5  5 22  1

Note that the “[1]” prefix before the output above is added automatically by R whenever it prints a vector. It shows the position (index) of the first element on that line, so for longer vectors that wrap over several lines you will see more of these prefixes. They are only there for your orientation and are not part of the vector itself.

You can see this more clearly when the output spans multiple lines:

1:60

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
[22] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
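
The bracketed prefix is easier to tell apart from the data when the values differ from their positions. In this sketch the range 101:160 is an arbitrary choice, and the square-bracket indexing x[22] is used ahead of its introduction in the course:

```r
x <- 101:160

x        # the prefix at the start of each printed line is a position
x[22]    # the element at position 22
```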

Exercise: Concatenate numbers

Use the concatenate function c() and create a vector containing the numbers 2, 3, 6 and 7

Creating character vectors

To create a character vector of one element, all we need to do is to type out the text. Remember that we need to use quotation marks (" ") around character values:

"golden retriever"

[1] "golden retriever"

To create a character vector of multiple elements, we can again use the concatenate function c(). This time we will use it with characters instead of numbers:

c("golden retriever", "labrador is a family dog", "beagle")

[1] "golden retriever"         "labrador is a family dog"
[3] "beagle"

Exercise: Create a character vector

Create a character vector with the single element: "R is awesome!"

Exercise: Concatenate text

Use the concatenate function c() and create a vector containing four elements:

"wombat",
"fennec fox",
"bearded dragon" and
"tasmanian devil"

Creating logical vectors

Logical vectors can only hold the values TRUE and FALSE. To create a logical vector with a single value, type out one of the valid values TRUE or FALSE. Remember that they must be written with capital letters:

TRUE

[1] TRUE

Similarly to other types of vectors, we can use the concatenate function c() to create a logical vector of multiple elements:

c(TRUE, FALSE, TRUE, FALSE, TRUE)

[1]  TRUE FALSE  TRUE FALSE  TRUE

Exercise: Concatenate logical values

Use the concatenate function c() and create a vector containing the three elements: TRUE, FALSE and TRUE

Quiz: Vectors Recap

Which of the following statements about vectors are correct?

In R a single value is a vector as well
A vector can contain numbers and characters simultaneously
Elements of a character vector must be enclosed by quotation marks
TRUE and true are both the same logical value

Combine values into a vector is an excerpt from the course Introduction to R, which is available for free at quantargo.com

R is everywhere

Mon, 27 Apr 2020 08:07:34 +0000

R is widely popular and incredibly useful for people working as Data Scientists or in companies. But you can also use R for simpler things, like creating a nice chart or making a quick calculation. Getting started is pretty straightforward, too.

Learn what R is all about
Get an overview of why R is useful
Submit your first code exercise

Introduction to R

The most powerful statistical computing language on the planet.

Norman Nie, Founder of SPSS

R is a programming language and environment to work with data. It is loved by statisticians and data scientists for its expressive code syntax and plentiful external libraries and tools and works on all major operating systems.

It is the Swiss army knife for data analysis and statistical computing (and you can make some pretty charts, too!). The R language is easily extensible with packages written by a large and growing community of developers around the world. You can find it pretty much anywhere—it is used by academic institutions, start-ups, international corporations and many more.

This is also reflected by looking at its adoption. Here we can see a large increase in both downloads and number of packages available over the years:

In 2020 R celebrates its 20th birthday with the release of version 4.0. And yes, it’s free and open source 😀

Quiz: R Facts

Which of the following statements about R are correct?

R only works on the Linux operating system.
R cannot be used in corporate environments.
R is a programming language geared towards data analysis.
R is extensible through packages developed by the community.

Why Use R?

R is a popular language for solving data analysis problems and is also used by people who traditionally do not consider themselves programmers. When creating charts and visualizations with R, you will find that you have far greater creative possibilities than with graphical applications such as Excel.

Here are some of the features R is most famous for:

Visualization: Creating beautiful graphs and visualizations is one of its biggest strengths. The core language already provides a rich set of tools used for plotting charts and for all kinds of graphics. The sky’s the limit.

Reproducibility: Unlike spreadsheet software, R code is not coupled to specific datasets and can easily be reused across different projects, even for datasets with more than 1 million rows. Easily build reusable reports and automatically generate new versions as the data changes.

Advanced modelling: R provides the biggest and most powerful code base for data analysis in the world. The richness and depth of available statistical models is unparalleled and growing by the day, thanks to the huge community of open source package developers and contributors.

Automation: R code can also be used to automate reports or to perform data transformations and model computations. It can also be integrated in automated production workflows, cloud computing environments and modern database systems.

Quiz: Using R

What are the main reasons to use R compared to spreadsheet software?

Easy to reproduce results
Use huge datasets with more than 1 million rows
Support for advanced modelling techniques including Machine Learning
Create beautiful visualizations

You R in Good Company

R is the de facto standard for statistical computing at academic institutions and companies around the world. Its great support for literate programming (code that can be combined with human-readable text) enables researchers and data scientists to create publication-ready reports which are easy to reproduce for reviewers.

The language has seen a wide adoption in various industries—see some examples below:

Information Technology

Microsoft: Microsoft R Open, TrueSkill(TM), more here
Google: R for Marketing Research and Analytics, Predicting the Present with Google Trends
Facebook: Visualizing Friendships, The Formation of Love, Prophet Package for time series forecasting.
Others (with links to projects): AirBnB, Uber, Oracle, IBM, Twitter

Pharma: Merck, Genentech (Roche), Novartis, Pfizer

Newspapers: The Economist, The New York Times, Financial Times

Finance

Banks: Bank of America, J.P.Morgan, Goldman Sachs, Credit Suisse, UBS, Deutsche Bank
Insurances: Lloyd’s, Allianz

See also the R Consortium page for further information about industrial partners and initiatives.

Building Blocks

In the next chapters we will have a look at the most important features and concepts:

Vectors
Variables
Operators
Functions
Packages

So, let’s write your first code in R!

Exercise: Submit your first code

This course has code exercises to help you learn and quickly explore new concepts. After entering code in the editor, hit the “Submit” button to execute it. The editor will give you feedback on your submission and display any output below the editor. If you need some additional help, use the “Get Hint” button.

To finish your first exercise, press the “Submit” button.

R is everywhere is an excerpt from the course Introduction to R, which is available for free at quantargo.com

Launch of New Course Platform

Tue, 21 Apr 2020 00:00:00 +0000

After months of hard work we are really excited to launch our brand-new course platform to learn and apply data science. Together with the new platform we also developed the first new course Introduction to R which is available for free now!

This release is a big milestone for us on our path to provide people with the best knowledge and tools available to apply data science. We think that programming, and data science in particular, should be taught interactively by seeing and writing real code. Our online course platform is built around a large number of interactive coding exercises and quizzes.

But the technical achievement is not the only novelty. We diverged from a traditional course outline and completely changed how we structure our course content. Each chapter features a so-called recipe which learners can collect by finishing exercises. Recipes depend on each other and together form a knowledge graph. In the future, learners will be able to create their own learning paths based on the dependency structure of the graph and their progress. Collected recipes are available in your cookbook which gives you an overview of your progress.

Another cool feature is the achievement system. Code recipes can be collected in a cookbook so that learners can review their achievements. Once all recipes from a topic have been collected users get a badge. The course is finally finished when all the badges have been collected.

New Course Available for Free: Introduction to R

The first course module Introduction to R is perfect for newcomers who want to get started with data science. The course teaches the programming language R and covers the language basics so that you can transform data and make professional looking graphs and charts with little effort.

We would love to hear your feedback - either through the feedback buttons on each page (visible for logged-in users) or via e-mail:

Course Content: courses@quantargo.com
Technical Issues: support@quantargo.com

Cheers,

Your Quantargo Team

Create your first bar chart

Tue, 11 Feb 2020 08:30:00 +0000

Create your first bar chart using geom_col()
Fill bars with color using the fill aesthetic

ggplot(___) + 
  geom_col(
    mapping = aes(x = ___, y = ___, 
                  fill = ___)
 )

Introduction to bar charts

Bar charts visualize numeric values grouped by categories. Each category is represented by one bar whose height is defined by the corresponding numeric value.

Below you can find an example showing the number of people (in millions) in the five biggest countries by population in 2007:

Creating a simple bar chart

Let’s create our first bar chart with the gapminder_top5 dataset. It contains population (in millions) and life expectancy data for the biggest countries by population in 2007.

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop))

We see that the resulting bars are sorted by the country names in alphabetical order by default.
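
If a different order is preferred, one common approach is to turn the category into a factor whose levels are sorted by the value. The sketch below assumes the ggplot2 package is installed and, since gapminder_top5 is provided by the course, uses a stand-in data frame top5 with rounded population figures that are only illustrative:

```r
# Stand-in for gapminder_top5 (hypothetical, rounded values in millions)
top5 <- data.frame(
  country = c("Brazil", "China", "India", "Indonesia", "United States"),
  pop     = c(190, 1319, 1110, 224, 301)
)

# reorder() turns country into a factor with levels sorted by pop
top5$country <- reorder(top5$country, top5$pop)
levels(top5$country)

# Draw the chart only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top5) +
    geom_col(aes(x = country, y = pop))
  print(p)
}
```

ggplot2 draws the bars in the order of the factor levels, so the bars now run from the smallest to the largest population.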

Exercise: Plot life expectancy by country

Create a bar chart showing the life expectancy of the five biggest countries by population in 2007.

Use the ggplot() function and specify the gapminder_top5 dataset as input
Add a geom_col() layer to the plot
Plot one bar for each country (x aesthetic)
Use life expectancy lifeExp as bar height (y aesthetic)

Filling bars with color

Based on the gapminder_top5 dataset we plot the population (in millions) of the biggest countries and use the continent variable to color each bar:

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop, fill = continent))

Next, we use the numeric variable lifeExp instead of continent to fill the bars:

ggplot(gapminder_top5) + 
  geom_col(aes(x = country, y = pop, fill = lifeExp))

The bar colors have now changed according to the continuous legend on the right. We see that numeric variables can also be used to fill bars.

Exercise: Plot population size by country

Create a bar chart showing the population (in millions) of the five biggest countries by population in 2007.

Use the ggplot() function and specify the gapminder_top5 dataset as input
Add a geom_col() layer to the plot
Plot one bar for each country (x aesthetic)
Use population pop as bar height (y aesthetic)
Use the GDP per capita gdpPercap as fill aesthetic

Stacked bar charts

The plot below shows the number of phones (in thousands) by continent from 1956 to 1961 as a stacked bar chart:

ggplot(world_phones) + 
  geom_col(aes(x = year, y = phones,
               fill = region))
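
A stacked bar chart needs the data in long format: one row per combination of x value and fill group. The world_phones dataset is provided by the course; as an assumption, the sketch below builds a similar table from the WorldPhones matrix that ships with R, reusing the column names year, region and phones from the plot call above (phones_long is a hypothetical name):

```r
# WorldPhones ships with R as a matrix: one row per year, one column per region
phones_long <- as.data.frame(as.table(WorldPhones))
names(phones_long) <- c("year", "region", "phones")

head(phones_long)
nrow(phones_long)   # one row for every year/region combination
```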

Exercise: Plot number of crimes by US states

Create a bar chart showing the number of crimes by US state per 100,000 residents in 1973.

Use the ggplot() function and specify the us_arrests dataset as input
Add a geom_col() layer to the plot
Plot one bar for each state (x aesthetic)
Use the number of cases as bar height (y aesthetic)
Use the crime type as fill aesthetic.

Create your first bar chart is an excerpt from the course Introduction to R, which is available for free at quantargo.com

ViennaR Meetup March - Full Talks Online

Mon, 29 Apr 2019 00:00:00 +0000

Introduction

The full talks of the ViennaR March Meetup are finally online: A short Introduction to ViennaR, Laura Vana introducing R-Ladies Vienna and Hadley Wickham with a great introduction to tidy(er) data and the new functions pivot_wider() and pivot_longer(). Stay tuned for the next ViennaR Meetups!

You can download the slides of the introduction here.

Ronald Hochreiter, Mario Annau, Walter Djuric

Laura Vana: R-Ladies

Laura introduced the R-Ladies Vienna - a program initiated by the R-Consortium to

achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the R community. R-Ladies’ primary focus, therefore, is on supporting the R enthusiasts who identify as an underrepresented minority to achieve their programming potential, by building a collaborative global network of R leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.

from https://rladies.org.

Visit the Meetup site to find out more about upcoming R-Ladies events including workshops covering an R Introduction and Bayesian Statistics.

Laura Vana

Hadley Wickham: Tidy(er) Data

Hadley Wickham’s talk covered the tidyr package and two new functions, pivot_wide() and pivot_long(), which have since been renamed to pivot_wider() and pivot_longer() as a result of a survey; see also the posts on Twitter and Github. These functions are meant to replace gather() and spread(), whose names seem to be hard to remember for most users coming to tidyr.
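
For readers who have not seen the two functions yet, here is a minimal sketch of a round trip. It assumes a tidyr version that includes pivot_longer() and pivot_wider() (1.0.0 or later); the table wide and its values are made up for illustration and are not taken from the talk:

```r
if (requireNamespace("tidyr", quietly = TRUE)) {
  # Hypothetical wide table: one column per year
  wide <- data.frame(
    country = c("A", "B"),
    `2018`  = c(10, 20),
    `2019`  = c(11, 21),
    check.names = FALSE
  )

  # Wide to long: the year columns become rows
  long <- tidyr::pivot_longer(wide, cols = -country,
                              names_to = "year", values_to = "value")
  print(long)

  # Long to wide: back to one column per year
  back <- tidyr::pivot_wider(long, names_from = year, values_from = value)
  print(back)
}
```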

Hadley Wickham

See you at one of our next Meetups/Courses/Get-Togethers!

Cheers,

Your Quantargo Team

ViennaR Meetup March - Impressions

Thu, 11 Apr 2019 00:00:00 +0000

Introduction

For all who couldn’t make it to our last ViennaR Meetup on March 18, 2019 at Webster Vienna Private University, here is a short summary of the talks and takeaways.

The Introduction covered a short history of the ViennaR Meetup and the special relationship between R and Vienna through the R-foundation, 2 R-core members (Kurt Hornik and Fritz Leisch) and one of the first organized R conferences - the DSC 1999.

You can download the slides of the introduction here.

Ronald Hochreiter, Mario Annau, Walter Djuric

R-Ladies

Laura Vana introduced the R-Ladies Vienna - a program initiated by the R-Consortium to

achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the R community. R-Ladies’ primary focus, therefore, is on supporting the R enthusiasts who identify as an underrepresented minority to achieve their programming potential, by building a collaborative global network of R leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.

from https://rladies.org.

Visit the Meetup site to find out more about upcoming R-Ladies events including workshops covering an R Introduction and Bayesian Statistics.

Laura Vana

Tidy(er) Data

Hadley Wickham

First Impressions

Below you can find the first impressions of the Meetup on our new Youtube channel. The full talks are still being cut and optimized - so stay tuned for the full talks from Laura and Hadley - to be released on our Youtube Channel by next week - feel free to subscribe :-)

See you at one of our next Meetups/Courses/Get-Togethers!

Cheers,

Your Quantargo Team

ViennaR Meetup Announcement March 2019

Mon, 25 Feb 2019 00:00:00 +0000

For the next ViennaR Meetup on March 18 we are excited to announce Laura Vana (R-Ladies) and Hadley Wickham (RStudio). The meetup will take place at Webster Vienna Private University (http://webster.ac.at).

Registration is required at the Meetup Page: https://www.meetup.com/ViennaR/events/259235903/

Feel free to join the networking session with food and drinks afterwards. See you at the Meetup and happy R coding!

Laura Vana: R-Ladies

Laura is a post-doc researcher at WU Vienna and chapter lead of R-Ladies Vienna. She will present ongoing initiatives and members of R-Ladies Vienna.

About R-Ladies

R-Ladies Vienna welcomes members of all R proficiency levels, whether you’re a new or aspiring R user, or an experienced R programmer interested in mentoring, networking & expert upskilling. Our community is designed to develop our members’ R skills & knowledge through social, collaborative learning & sharing. Supporting minority identity access to STEM skills & careers, the Free Software Movement, and contributing to the global R community! A local chapter of R-Ladies Global, R-Ladies Vienna exists to promote gender diversity in the R community worldwide. Please visit https://www.meetup.com/rladies-vienna/ for more information.

Hadley Wickham

Hadley is chief scientist at RStudio and well known for his contributions to the R community to make the life of data scientists easier. His work on the tidyverse includes popular packages like ggplot2, dplyr and readr, as well as the software development package devtools. He is the author of numerous (online) books like R for Data Science and Advanced R. Last but not least, he won the John Chambers award in 2006 for his earlier versions of the reshape and ggplot packages¹.

Questions for the Q&A Session

If you want to ask Laura or Hadley any questions during the Q&A session or if you have some other question regarding the Meetup please e-mail your topic to viennar@quantargo.com.

See you at the Meetup and happy R-coding!

Why Management Loves Overfitting

Wed, 23 Jan 2019 00:00:00 +0000

The role of a data scientist involves building and fine-tuning models to improve processes and products in various business areas. Typical use cases involve marketing campaigns, customer churn prediction or fraud detection. Trained models should not only work on (seen) training data but also on new (unseen) real-world data. However, this requirement is typically not obvious to most decision makers involved, who tend to favour overfitted models and delude themselves with fabulous numbers and promises. The problems arise straight after implementation, when the results do not follow suit. It is thus the task of every responsible data scientist to manage expectations properly and employ industry best practices as covered in our course on Machine Learning with R.

To see the problem of overfitting in action let’s look at a simple relationship in the famous mtcars dataset between the weight (wt) of a car in 1000 lbs and its range per gallon (mpg, miles-per-gallon). Obviously, the heavier the car the fewer miles per gallon it goes (or the higher its fuel consumption). We have modeled the relationship using the smooth.spline() function in R and used the smoothing parameter (spar) as a parameter in the slider. We see that a spar close to one models the relationship quite well (smooth). By decreasing the spar of the model it begins to fit observations more closely, thus its variance is increased. However, once spar gets closer to zero the spline starts to lose its smooth shape and zig-zag, a sign of overfitting.
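
The effect of spar can be checked numerically. The following sketch fits two splines with smooth.spline() and compares their in-sample error and effective degrees of freedom; the two spar values (1.0 and 0.2) and the helper rss() are arbitrary choices for illustration, not the settings behind the slider:

```r
x <- mtcars$wt    # weight in 1000 lbs
y <- mtcars$mpg   # miles per gallon

smooth_fit <- smooth.spline(x, y, spar = 1.0)   # heavy smoothing
wiggly_fit <- smooth.spline(x, y, spar = 0.2)   # light smoothing

# Hypothetical helper: in-sample residual sum of squares of a fitted spline
rss <- function(fit) sum((y - predict(fit, x)$y)^2)

c(smooth = rss(smooth_fit), wiggly = rss(wiggly_fit))
c(smooth = smooth_fit$df,   wiggly = wiggly_fit$df)
```

In smooth.spline(), larger spar values penalize curvature more strongly, so the fit with the smaller spar follows the observations more closely and uses more effective degrees of freedom. A lower in-sample error alone therefore says little about how well the model generalizes.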

The same phenomenon can be shown in a classification example. We use the basic K-nearest neighbour model to differentiate 3 iris species among 150 flowers (50 per species) using the variables sepal length/width and petal length/width. The three classes can easily be differentiated visually into three areas. By moving the number of neighbours closer to one we increase model variance and observe that decision boundaries get fragmented.

Even if more observations can be correctly classified in-sample or, similarly, the regression error can be reduced, we should always keep in mind that model performance is only judged on out-of-sample data. Thus, decision makers should be much more aware of how the model has been selected than of how good the reported performance is. To be on the safe side, and if enough data is available, we can always keep a final test set aside (not available to model developers) to evaluate actual performance - pretty much like a Kaggle competition.
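A hedged sketch of the hold-out idea on mtcars follows. The split size, the seed and the two competing models (a straight line and a degree-5 polynomial in wt) are arbitrary choices for illustration and do not come from the post.

```r
data(mtcars)
set.seed(42)  # arbitrary seed so the split is reproducible

# Keep 10 of the 32 cars aside as a final test set
test_idx <- sample(nrow(mtcars), size = 10)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# A simple and a deliberately flexible model of mpg as a function of wt
fit_simple   <- lm(mpg ~ wt, data = train)
fit_flexible <- lm(mpg ~ poly(wt, 5), data = train)

# Root mean squared error of a fitted model on a given data set
rmse <- function(fit, data) {
  sqrt(mean((data$mpg - predict(fit, newdata = data))^2))
}

results <- data.frame(
  model      = c("simple", "flexible"),
  train_rmse = c(rmse(fit_simple, train), rmse(fit_flexible, train)),
  test_rmse  = c(rmse(fit_simple, test),  rmse(fit_flexible, test))
)
print(results)
```

Because the straight line is a special case of the polynomial, the flexible model can never look worse on the training data; only the test_rmse column shows whether the extra flexibility carries over to unseen cars.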

So my final recommendations are:

Don’t fool yourself and be honest about out-of-sample data/performance.
Manage the expectations of decision makers well - be realistic.
If results look extremely good at your first try - they are most probably wrong.

Happy (Correct) Fitting!

Let's play together: Collaborative Data Science

Wed, 19 Sep 2018 00:00:00 +0000

Why is it so hard?

From experience we’ve learned that most data science projects are not truly collaborative efforts but are driven by only a few key players. The best (public) examples are most open source R and Python packages available on Github. However, collaboration within data science teams can be the determining factor driving innovation in a sustainable way. We highlight some common problems in data science projects and give guidance on how collaboration can be improved to facilitate a data-driven transformation in organisations.

Data Science is an interdisciplinary field and requires diverse skill sets to deliver data products. On top of software and data engineering skills, a solid statistical background is needed to reveal interesting patterns and build models. However, we often see a clash of cultures between engineering and data science/modelling teams. While the former group typically cares more about code quality, testing and deployment, the latter is mostly focused on methodology and data correctness. The development process is also quite different: Agile/SCRUM vs. research/hypothesis driven.

Last but not least we see strong opinions and conflicts in data science teams. Most of them are about tools (R vs. Python), methodology (statistically rigorous vs. data mining/brute force) and project priorities. Data Science is a very new field and the answers to most of these questions depend on the specific problem and the respective institutional/company background.

Why is it so important?

Having a large and diverse group of people working in a relatively new and unstructured environment like a Data Science project can lead to great ideas and innovation - or to utter chaos. The line between the two is typically very thin, and the outcome can be positively influenced if you have:

Open team spirit and transparency generating new ideas.
Teams working efficiently together on projects, reviewing each other’s ideas which are generated on a continuous basis - with room for failure.
A well-managed code base which is maintained and reviewed, leading to increased re-usability and positive network effects.

Ingredients leading to adverse effects are just the opposite:

Team rivalries and politically motivated decision making - fear of failure.
Teams not communicating with each other, working on redundant projects.
An unmanaged and unreviewed code base consisting of a handful of undocumented scripts/notebooks, which prevents re-use.

In general the question remains what kind of environment can be created - either from the technical or human resources side - to improve long-lasting positive network effects, or in particular:

How can code be managed to have positive network effects?
How can teams efficiently communicate and collaborate together?

Case study: The CRAN package repository

To see the biggest (public) statistical code base in action let’s take a look at the CRAN package repository, which has experienced astonishing growth over the last decade. It hosts well over 10,000 R packages written by authors all over the world. A large part of its success is driven by the simple yet powerful package structure inspired by the Debian Linux package system. Each package is checked for errors by CRAN repository maintainers using R CMD check --as-cran <packagename> and released for all major platforms: Windows, Mac OS and Linux. Even compiled (C++) code within R packages is checked through Address Sanitizers (ASAN) and Undefined Behavior Sanitizers (UBSAN), see also CRAN Package Check Issue Kinds. These and many more procedures lead to a code base which is easier to re-use and maintain, see also Writing R Extensions and Hadley’s more verbose description of the R CMD check workflow.

The function tools::CRAN_package_db() has been used to extract all relevant package metadata.
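As a rough sketch of how such metadata can be turned into a dependency network, the helper functions below (parse_deps() and dependency_edges(), both hypothetical names introduced here) convert the Imports and Depends fields into an edge list. The toy table with made-up packages stands in for the real CRAN table so that the example runs offline; it is not the code used for the post's figures.

```r
# Split one comma-separated dependency field into clean package names
parse_deps <- function(x) {
  if (is.na(x) || !nzchar(x)) return(character(0))
  deps <- strsplit(x, ",", fixed = TRUE)[[1]]
  deps <- sub("\\s*\\([^)]*\\)", "", deps)  # drop version requirements such as "(>= 3.5.0)"
  deps <- trimws(deps)
  deps[nzchar(deps) & deps != "R"]          # "R" itself is not a package
}

# Build an edge list (package -> dependency) from Imports and Depends
dependency_edges <- function(db) {
  edges <- lapply(seq_len(nrow(db)), function(i) {
    deps <- unique(c(parse_deps(db$Imports[i]), parse_deps(db$Depends[i])))
    data.frame(from = rep(db$Package[i], length(deps)), to = deps,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, edges)
}

# Toy input with made-up packages; the real table would come from
# db <- tools::CRAN_package_db()   # needs internet access
toy_db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  Imports = c("pkgB (>= 1.0), pkgC", "pkgC", NA),
  Depends = c("R (>= 3.5.0)", NA, "R (>= 3.5.0), methods"),
  stringsAsFactors = FALSE
)

edges <- dependency_edges(toy_db)

# Reverse dependency counts: how many packages rely on each dependency
reverse_deps <- sort(table(edges$to), decreasing = TRUE)
print(reverse_deps)
```

The resulting edge list can be passed to any graph library for plotting, and the reverse dependency counts give a first impression of which packages many others rely on.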

CRAN Package Network


R packages can also depend on other packages as defined in the package DESCRIPTION file through Imports or Depends. This makes proper check procedures and interfaces between packages even more important since an error in one dependency can affect a large number of packages. The picture above shows the dependency graph of the most downloaded R packages on CRAN.

Interestingly, we observe that the vast majority of packages are developed by a single author. The reasons for this could be manifold, including the R package structure itself being most suitable for single writers, scientific studies being conducted by only a few scientists, and the general social behavior of R programmers.

However, the lack of communication between package authors and the absence of a clear overview of which packages actually exist can lead to redundant developments. While a wider variety of packages for different model implementations can be helpful, it does not make much sense for infrastructure packages. A good example is the package universe dealing with Excel files including xlsx, XLConnect, gdata, openxlsx and readxl. The graph below shows how these packages create different clusters of reverse dependencies. Some packages even wrap functionalities of different Excel packages and act as connectors/wrappers like DataLoader or ImportExport.

Psychological Barriers

Last but not least there is a very basic barrier that keeps authors from collaborating with each other: EGO. Not satisfied with most existing packages¹ dealing with HDF5² files to store high-frequency tick data in a high-performance, language-independent format, I decided to create a new one: h5 (deprecated but still on Github and CRAN).

After having spent quite some time developing h5, which was presented at R/Finance 2016, I received an e-mail from Holger stating that he had also developed a package to tackle the problem:

On June 21, 2016 Holger wrote:

… my name is Holger Hoefling, I have developed a new version of a wrapper library for hdf5 (R6 Classes, almost all function calls wrapped, full support for all datatypes including tables etc) …

Having overcome my own EGO barrier (which was quite hard) and after inspecting his package, we agreed to work together on one HDF5 package and merge codebases (which sounds easier than it was) to:

Maintain high-level interface and test cases from h5
Get low-level HDF5 support within R

The Joys of Collaboration

The joys of collaboration (after overcoming psychological barriers) are great and typically lead to longer-term projects, regular code reviews and, in my case, a merged package which is of higher quality than either of the previous ones.

My recommendations are thus as follows:

Q: How can code be managed to have positive network effects?

Put it into a re-usable package.
Run continuous code reviews and tests.
Use a transparent code platform to inspect source (like Github).

Q: How can teams efficiently communicate and collaborate together?

Have the right tools and mindset in place.
Incentivize collaborative efforts.
Accept unexpected hypotheses and failures.
Stay open-minded.