R for data analysis

Intro

R is an open source programming language that was developed for statistics and data analysis. It's easiest to use R through RStudio, which is an integrated development environment. RStudio offers lots of tools to make your life easier, so it's certainly worth using!

There are many great tutorials on using R and this wiki isn't aiming to better them, but to bring some of the information to the same place. For further learning consider the following resources:

Books (free online, or buy a physical copy):

Courses

Other

If you have questions, Stack Overflow probably has an answer already, and the #rstats hashtag on Twitter is well used.

R uses packages to extend its functionality; think of these as little toolboxes of ready-made scripts. Amazingly, there are more than 10,000 of these available on CRAN! I prefer to keep things simple and use as few as possible, but to help you choose which you might need, there is a page which collects together packages for different tasks.

If you're new to programming, start simple. Think about breaking the task into sections: reading your data, transforming your data (if required), and then plotting some variables. This simple approach will help you build knowledge to do more complex things.
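As a rough sketch of that three-step shape, here is a tiny script using only base R. The built-in mtcars data set stands in for your own file, and the temporary file and the l_per_100km column are made up for illustration:

```r
# Step 1: read the data.
# mtcars ships with R; we write it to a temporary CSV so there is a file to read.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)
cars <- read.csv(path)

# Step 2: transform the data (miles per US gallon to litres per 100 km).
cars$l_per_100km <- 235.215 / cars$mpg

# Step 3: plot two variables against each other.
plot(cars$wt, cars$l_per_100km,
     xlab = "Weight (1000 lbs)", ylab = "Fuel use (L/100 km)")
```

Each step can be swapped for something fancier later without changing the overall structure.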

Hadley Wickham and associates have done some great work on lowering the barrier to data analysis in R. He's grouped many of his packages together into the tidyverse. You can read more about these on the dedicated website. To install the tidyverse, run the following command in your R(Studio) console:

install.packages("tidyverse")

HELP!

To get help using R, type ?function_name in your console, or use the help pane in RStudio. This will pull up the help page for that particular function, which includes sections on what arguments the function takes, what the defaults are, and some example uses. The tidyverse website is invaluable for working with tidyverse packages.
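A few ways of asking for help from the console, sketched with base R's mean() as the example function (any function name works the same way):

```r
# Open the help page for a function (here base R's mean()).
?mean

# The same thing written as a function call.
help("mean")

# Search all installed help pages when you don't know the function name.
help.search("standard deviation")

# Show just the arguments and their defaults.
args(mean)

# Run the examples from the bottom of the help page.
example(mean)
```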

Reading data

There are many ways to read data into R. Some useful packages are:

  • readr for csv type files
  • readxl for Excel files
  • haven for Stata and SPSS files

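Only readr is part of the core tidyverse, which is the set of packages that library(tidyverse) loads. readxl and haven are installed along with it by install.packages("tidyverse"), but you have to load them yourself with library(readxl) or library(haven). If you haven't installed the tidyverse, you can install the two on their own with: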

install.packages("readxl")
install.packages("haven")

How do you read data in practice? When you run a read command in R it looks at the file you've pointed it to and extracts the information. If you have not told R to assign (put) this information to a variable it will print it to your console. For example, if you're reading a file called test.csv using the readr package, here's how you'd go about it (note anything after a # is a comment and is not run by R):

# Load the readr package
library(readr)

# print the file to the console
read_csv("test.csv")

# assign the file to x
x <- read_csv("test.csv")

In the RStudio environment pane you'll now have a new variable called x, along with a little information about it. The same method applies to the other packages/functions mentioned in this section for reading data into R. Note that R loads data into your RAM, so the size of file you can open is limited by how much RAM you have. As a rough guide, data usually takes up more room in RAM than on disk: a 2 MB file might take up approximately 4 MB once loaded into R, although the exact ratio depends on the column types.
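If you want to check the disk versus RAM difference for your own data, something like the following should work. It's a sketch only: the temporary file and its id and value columns are made up as a stand-in for test.csv, and show_col_types assumes a reasonably recent readr (2.0 or later).

```r
library(readr)

# Write a small example file to a temporary location (a stand-in for test.csv).
path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:1000, value = rnorm(1000)), path)

# Read it back and assign it to x.
x <- read_csv(path, show_col_types = FALSE)

# Compare the size on disk with the size in memory.
file.size(path)                       # bytes on disk
format(object.size(x), units = "KB")  # memory used by x
```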

Transforming data

Often when we've read data into R it isn't ready for analysis, but needs manipulating. There are a thousand and one ways to do this in R and learning them is time spent far better than copy/pasting in a spreadsheet.

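The tidyr package is really useful for converting between wide and long format. See the gather and spread functions for examples (from tidyr 1.0.0 onwards these are superseded by pivot_longer and pivot_wider, though the older functions still work).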
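Here is a minimal sketch of going from wide to long and back again. It uses pivot_longer and pivot_wider, the newer tidyr replacements for gather and spread, and the table of counts is invented for illustration:

```r
library(tidyr)

# A small wide table: one row per site, one column per year (made-up numbers).
wide <- data.frame(
  site  = c("A", "B"),
  y2019 = c(10, 20),
  y2020 = c(12, 18)
)

# Wide to long: one row per site/year combination.
# Older equivalent: gather(wide, key = "year", value = "count", -site)
long <- pivot_longer(wide, cols = -site,
                     names_to = "year", values_to = "count")

# Long back to wide.
# Older equivalent: spread(long, key = year, value = count)
wide_again <- pivot_wider(long, names_from = year, values_from = count)
```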

dplyr mainly offers database-like functionality. This means you can do operations on subsets of your data. At its simplest this is like a pivot table in a spreadsheet. There are so many tricks here that it's worth exploring the dplyr documentation to see what's possible.
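A pivot-table style summary might look something like this. It's only a sketch, using the mtcars data that ships with R; the summary column names n_cars and mean_mpg are my own choices:

```r
library(dplyr)

# mtcars ships with R: one row per car model.
summary_tbl <- mtcars %>%
  filter(hp > 100) %>%                 # keep only the more powerful cars
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(n_cars = n(),              # number of cars in each group
            mean_mpg = mean(mpg)) %>%  # average fuel economy in each group
  arrange(cyl)

summary_tbl
```

The result has one row per group, much like the rows of a spreadsheet pivot table.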

Another package to look at is stringr, which helps you work with character data.
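To give a flavour of stringr, here are a few of its functions applied to some made-up sample labels:

```r
library(stringr)

# Made-up labels for illustration.
samples <- c("site_A_2019", "site_B_2020", "control_2020")

has_site <- str_detect(samples, "site")       # TRUE TRUE FALSE
years    <- str_extract(samples, "[0-9]{4}")  # pulls out the four-digit year

str_replace(samples, "_", " ")       # replaces the first underscore only
str_replace_all(samples, "_", " ")   # replaces every underscore
str_to_upper(samples)                # converts to upper case
```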

Plotting data

There are three main plotting systems in R: base (what comes with R as standard), lattice and ggplot2. I'm not going to cover lattice as it's complicated and the times one needs it are limited. Plotting in base R is great and can give lots of control and some very quick plots. In contrast, ggplot2 lets you do complex graphics quite easily. In order to use ggplot2, you'll need your data in a tidy format, which is what the previous section on transforming data is for!
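To compare the two approaches, here is the same scatter plot sketched in base R and in ggplot2, again using the built-in mtcars data. The choice of variables and labels is mine:

```r
library(ggplot2)

# Base R: a quick scatter plot in one call.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# ggplot2: the same plot built from layers, with colour mapped to a variable.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)  # explicit print is needed inside scripts and functions
```

The base version is shorter, but adding the colour grouping and a legend by hand would take several more lines, which is where ggplot2 starts to pay off.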

Modelling

Guide to ANOVA on Twitter
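For a quick taste before reading the guide, a one-way ANOVA in base R can be sketched as below. This is not taken from the linked guide; it's just an assumed minimal example using the PlantGrowth data set that ships with R:

```r
# PlantGrowth ships with R: plant weights under a control and two treatments.
fit <- aov(weight ~ group, data = PlantGrowth)

# ANOVA table: degrees of freedom, sums of squares, F statistic and p-value.
summary(fit)

# Pairwise comparisons between groups, if the overall test suggests a difference.
TukeyHSD(fit)
```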


This work is licensed under a Creative Commons Attribution 4.0 International License.
