Intro to R

Raphael Rehms

Intro

Why R?

Why R?

  • R is open source

  • A wide range of techniques for data analysis

  • State-of-the-art graphics capabilities

  • A platform for programming new statistical methods or analysis pipelines (in form of R-packages)

Programming (in general)

“Good programmers are made, not born.” (Gerald M. Weinberg - The Psychology of Computer Programming)

  • consequence I

    train…

  • consequence II

    train…

  • consequence III

    train more

Hands-on is important. Understanding is less than 30%.

R and RStudio

Required tools for the course:

  • Programming language R

    • designed for fast prototyping of statistical analyses
    • interpreted language
  • RStudio (optional, but recommended)

    • IDE tailored for R

    • Integrates a lot more (e.g. Python, C++, etc.)

R packages

  • R comes with many useful packages by default

  • However, the strength lies in the huge collection of external packages

  • Most popular and default repository: CRAN

  • Install new packages in R using either

    • a command:

      • install.packages("<package-name>") (e.g. install.packages("mvtnorm"))
    • RStudio

      • using built-in tools from the IDE
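Installing a package is needed only once, while loading it is needed in every new session. A minimal sketch of that workflow, assuming the package name mvtnorm from the example above (the variable names pkg and is_installed are only illustrative):

```r
pkg <- "mvtnorm"  # example package name taken from the slide

# requireNamespace() returns TRUE if the package is installed and can be loaded.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Installing needs internet access, so this sketch only prints a hint.
  message("Run install.packages(\"", pkg, "\") once to install it.")
}

# Packages that ship with R, such as 'stats', can be attached right away.
library(stats)
```

After a successful installation, library(mvtnorm) would attach the package in the same way.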

Basic operations

Addition, subtraction, etc

1+2
[1] 3
1-2
[1] -1
1*2
[1] 2
1/2
[1] 0.5
1^2
[1] 1

Note

What will happen?

1/0

Special symbols and functions

Special symbols

pi
Inf

Mathematical functions

exp(1)
[1] 2.718282
log(1)
[1] 0

Special cases:

  • NaN (“Not a Number”) is a special numeric value that indicates an invalid result.
log(-1)
[1] NaN
NaN + 1
[1] NaN
  • NA is a missing value.
NA + 1
[1] NA
  • NULL means literally empty/nothing
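The three special values are easy to mix up. The following sketch uses only base R test functions to show how they differ:

```r
is.nan(NaN)     # TRUE
is.na(NaN)      # TRUE: NaN also counts as missing
is.nan(NA)      # FALSE: NA is missing, but not an invalid number
is.null(NULL)   # TRUE

length(NA)      # 1: NA still occupies a position
length(NULL)    # 0: NULL is really nothing
c(1, NA, NULL)  # NULL disappears, NA stays: 1 NA
```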

Assigning objects

Assignment is done using <-

x <- 1
y <- 2
x + y
[1] 3

Alternatively, use =

x = sqrt(2)
y = sqrt(2)
x * y
[1] 2

Look at the Environment pane in RStudio. What can you see?

Naming objects

  • Object names in R have to start with a letter (or with a dot that is not followed by a digit)

Case sensitive

a <- 2
A <- 1
a-A
[1] 1

Overwrite a variable using its old value

a <- a + 1

Combination of words

variable_name <- 1
variable.name <- 1
variableName <- 1

Comments

Sometimes it is useful to comment code. Use a # to start a comment.

Standard:

1+1
[1] 2

Comment a line (no output):

# 1+1

Comment after an expression (only 1+1 gets evaluated):

1+1 # +1
[1] 2

Function calling

So far we have used expressions like f(...). This is a function call. E.g.

exp(2)

We call the function exp with a value of 2. Or the (natural) logarithm:

log(exp(1))
[1] 1

We can specify the base as a second argument:

log(2, 2)
[1] 1

Note

What will happen?

Log(Exp(1))

Get documentation

Access the documentation using

  • <F1>

  • type ?function_name

  • use RStudio functionality

E.g. documentation for log() reveals that we calculate the natural logarithm.

?log
log(x, base = exp(1))

Function calling cont’d

You can omit the argument names when the positions are clear.

  • We have done that for exp and log

Hence, this

log(2, 2)

means that we actually call

log(x=2, base=2)

If you name the arguments, their order does not matter.

Example:

log(base=3, x=2)
log(3, 2)
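The same idea with numbers that are easier to check by hand (the values 8 and 2 are chosen only for illustration):

```r
log(x = 8, base = 2)  # 3
log(base = 2, x = 8)  # 3: named arguments, so the order does not matter
log(8, 2)             # 3: matched by position (x = 8, base = 2)
log(2, 8)             # 0.3333333: now x = 2 and base = 8
```

Unnamed arguments are matched by position, so swapping them changes the result.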

Note

What will happen?

log <- 1
log(log)

Basic (primitive) data types

numeric

A (floating point) number. We used this so far (default).

1.0, 1.34, -33, pi


logical

A binary data type.

TRUE, FALSE, T, F


integer

Can be specified using an “L”.

1L, 100L, -99L


character

Represents letters OR sentences.

'a', "abc", "May the force be with you"
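To find out which type a value has, or to convert between types, base R offers class(), typeof() and the is.*() and as.*() families. A short sketch:

```r
class(1.5)    # "numeric"
typeof(1.5)   # "double": how a numeric is stored internally
class(1L)     # "integer"
class(TRUE)   # "logical"
class("abc")  # "character"

# Test for a type with is.*() and convert with as.*()
is.numeric(1L)     # TRUE: integers also count as numeric
is.character("1")  # TRUE
as.numeric("1.5")  # 1.5
as.integer(TRUE)   # 1
as.character(42)   # "42"
```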

Exercises 1 Task 1

Vectors

Vectors

You can combine single values to a vector.

a <- c(1,2,3,4)
a
[1] 1 2 3 4
b <- c(TRUE, FALSE, TRUE)
b
[1]  TRUE FALSE  TRUE
c <- c("a", 'ab', "ab c")
c
[1] "a"    "ab"   "ab c"

Many operations in R are vectorized

a + a
[1] 2 4 6 8
a * a
[1]  1  4  9 16
exp(a)
[1]  2.718282  7.389056 20.085537 54.598150
-a
[1] -1 -2 -3 -4

Note

What will happen?

c("1",2,3)

Vectors cont’d

  • NA or NaN values can be part of a vector
a <- c(1,2,NA,4)
a + 1
[1]  2  3 NA  5
b <- c(1, -1, Inf)
log(b)
[1]   0 NaN Inf
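Missing values propagate through most computations. A small sketch of how they can be detected and skipped, using the base R function is.na() and the na.rm argument:

```r
a <- c(1, 2, NA, 4)

is.na(a)               # FALSE FALSE  TRUE FALSE
sum(a)                 # NA: one missing value makes the sum missing
sum(a, na.rm = TRUE)   # 7: drop missing values before summing
mean(a, na.rm = TRUE)  # 2.333333
a[!is.na(a)]           # 1 2 4: keep only the observed values
```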

Automatic recycling

a <- c(1,2,3,4)
a + 1
[1] 2 3 4 5
b <- c(2,2)
a + b
[1] 3 4 5 6

Warning

Note the behavior for vectors with different length! Example:

a <- c(1,2,3)
b <- c(1,2)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 2 4 4

Vector creation

There are a lot of convenience functions to create vectors.

c(1,2,3,4)
[1] 1 2 3 4
1:4
[1] 1 2 3 4
seq(4)
[1] 1 2 3 4

More complex ones:

4:-3
[1]  4  3  2  1  0 -1 -2 -3
seq(-10, 10, by = 2)
 [1] -10  -8  -6  -4  -2   0   2   4   6   8  10
seq(-10, 10, length.out = 10) # vector of length 10
 [1] -10.000000  -7.777778  -5.555556  -3.333333  -1.111111   1.111111
 [7]   3.333333   5.555556   7.777778  10.000000
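Two further base R helpers that are often used alongside seq() are rep() and seq_len()/seq_along(). The sketch below also shows a common pitfall of the colon operator with a length of zero (the variable names are illustrative):

```r
rep(c(1, 2), times = 3)  # 1 2 1 2 1 2
rep(c(1, 2), each = 3)   # 1 1 1 2 2 2

x <- c("a", "b", "c")
seq_along(x)             # 1 2 3: one index per element of x
seq_len(length(x))       # 1 2 3

# Pitfall: counting up to zero elements
n <- 0
1:n         # 1 0: the colon operator counts downwards
seq_len(n)  # integer(0): an empty vector, usually what is wanted
```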

Select elements of a vector

Access elements of a vector using positional numbers within [...]:

x <- c(2,4,2,5)
x[1]
[1] 2

Multiple elements

selection <- c(1,4)
x[selection]
[1] 2 5
x[c(1,4)]
[1] 2 5

Negative indices exclude the corresponding elements

x[-c(1,3)]
[1] 4 5

Note

What will happen?

x[1:5]
x[-(5:10)]
x[0]

Logical values for comparison

Recall the basic data type logical, i.e. TRUE and FALSE.

  • We can create such an object by comparison:
1 == 2  # lhs equal to rhs?
[1] FALSE
1 != 2  # lhs not equal to rhs?
[1] TRUE
1 > 2  # lhs larger than rhs?
[1] FALSE
1 >= 2  # lhs larger than or equal to rhs?
[1] FALSE
1 < 2  # lhs less than rhs?
[1] TRUE
1 <= 2  # lhs less than or equal to rhs?
[1] TRUE

Negate a value:

!TRUE
[1] FALSE
!FALSE
[1] TRUE

Note

What will happen?

1 == "1"
1 != NaN 
NA == NA  # we will learn the solution in a few slides

Filter elements of a vector

Comparison operators are vectorized:

c(T,F,T) == c(F,F,T)  # element-wise comparison
[1] FALSE  TRUE  TRUE

Check condition on a numeric vector

x <- c(2,4,2,5)
position_two <- x == 2  # logical vector showing where the condition holds
position_two
[1]  TRUE FALSE  TRUE FALSE

Use logical values to filter a vector.

x[position_two]
[1] 2 2
# or directly
x[x == 2]
[1] 2 2

Filter for values less than 3

x[x < 3]
[1] 2 2
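Logical vectors are also useful without subsetting. A short sketch with the base R helpers which(), any() and all(), reusing the vector x from above:

```r
x <- c(2, 4, 2, 5)

which(x == 2)  # 1 3: positions where the condition holds
sum(x == 2)    # 2: TRUE counts as 1, so this counts the matches
mean(x == 2)   # 0.5: share of elements that match
any(x > 4)     # TRUE: at least one element is larger than 4
all(x > 2)     # FALSE: not every element is larger than 2
```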

Combine filters with & and |

Combination operations…

TRUE & TRUE
[1] TRUE
FALSE & TRUE
[1] FALSE
TRUE | TRUE
[1] TRUE
FALSE | TRUE
[1] TRUE

…or vectorized

x <- c(T,F,T,F)
y <- c(T,T,F,F)
x & y
[1]  TRUE FALSE FALSE FALSE
x | y
[1]  TRUE  TRUE  TRUE FALSE

Use this to filter a vector for multiple conditions

x <- c(2,4,2,5)  # restore the numeric vector; x was overwritten with logicals above
x[(x < 5) & (x > 2)]
[1] 4

Assign new values in a vector

We can assign new values to a vector using a combination of selection and assignment

x <- 1:5
x[1] <- 2
x
[1] 2 2 3 4 5
x[x > 3] <- -99
x
[1]   2   2   3 -99 -99
x[-1] <- 100
x
[1]   2 100 100 100 100

Note

What will happen?

x[100] <- 1

Vector operations

x <- c(1,1,2,3)
length(x)
[1] 4
append(x, c(1,2,3))
[1] 1 1 2 3 1 2 3
rev(x)
[1] 3 2 1 1
sort(x)
[1] 1 1 2 3
unique(x)
[1] 1 2 3
sum(x) 
[1] 7
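A few more base R functions in the same spirit, shown as a sketch (the vector y is made up for illustration):

```r
x <- c(1, 1, 2, 3)

mean(x)                     # 1.75
range(x)                    # 1 3: minimum and maximum
cumsum(x)                   # 1 2 4 7: running total
sort(x, decreasing = TRUE)  # 3 2 1 1

y <- c(30, 10, 20)
order(y)      # 2 3 1: positions that would sort y
y[order(y)]   # 10 20 30: same result as sort(y)
which.max(y)  # 1: position of the largest value
```

order() becomes important later, because it can sort one object by the values of another.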

Exercises 1 Task 2

Complex structures

Factors

Consider a vector that represents a categorical variable, say colors.

colors <- c("blue", "red", "blue", "red", "green", "black", "green", "white")
colors
[1] "blue"  "red"   "blue"  "red"   "green" "black" "green" "white"

We cast colors into a factor now:

colors <- as.factor(colors)
colors
[1] blue  red   blue  red   green black green white
Levels: black blue green red white
levels(colors)
[1] "black" "blue"  "green" "red"   "white"
as.numeric(colors)
[1] 2 4 2 4 3 1 3 5
class(colors)
[1] "factor"
typeof(colors)
[1] "integer"

Hence, a factor is a vector of integers where each value corresponds to a character value (its level).
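To make the integer coding concrete, here is a sketch with made-up values. It sets the level order explicitly and shows a common pitfall when a factor holds numbers stored as text:

```r
# Levels are sorted alphabetically by default; 'levels' fixes a custom order.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # 1 3 2 1: position of each value within the levels
table(sizes)       # counts per level: small 2, medium 1, large 1

# Pitfall: converting a factor of number-like labels
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1: the internal codes, not the labels
as.numeric(as.character(f))  # 10 20 10: go through character first
```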

Complex data structures

from *Ceballos and Cardiel (2013). Data structure – First Steps in R. Retrieved 25-11-2018 from http://venus.ifca.unican.es/Rintro/_images/dataStructuresNew.png*


Use str(...) to inspect the structure of complex data types!

Vector, Matrix, Array

We already know vectors. Let’s combine them:

x <- 1:4
(x_cbind <- cbind(x,x)) # 4 rows, 2 columns
     x x
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
(x_rbind <- rbind(x,x)) # 2 rows, 4 columns
  [,1] [,2] [,3] [,4]
x    1    2    3    4
x    1    2    3    4
dim(x_cbind)
[1] 4 2
dim(x_rbind)
[1] 2 4
nrow(x_cbind)
[1] 4
ncol(x_cbind)
[1] 2

Vector, Matrix, Array cont’d

We can define a matrix using the matrix function:

matrix(1:6, nrow = 3, ncol = 2)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
matrix(1:6, nrow = 3, ncol = 2, byrow = T)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
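Arithmetic on matrices is element-wise, just as for vectors; the matrix product has its own operator %*%. A brief sketch using the first matrix from above:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

dim(t(m))   # 2 3: t() transposes, so rows become columns
m * 2       # every element doubled
m * m       # element-wise product, NOT the matrix product
t(m) %*% m  # matrix product, a 2 x 2 matrix:
            #      [,1] [,2]
            # [1,]   14   32
            # [2,]   32   77
```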

Arrays as a generalization with multiple dimensions

array(1:12, dim = c(3,2,2))
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12

This is also sometimes called a tensor.

Select/filter elements on Arrays

As with vectors, we can select and filter. Separate dimensions with a comma, i.e. [... , ...]

(m <- matrix(1:6, nrow = 3, ncol = 2))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
m[2,2]
[1] 5
m[nrow(m), ncol(m) ]
[1] 6

Leaving an index empty returns the full dimension:

m[2,]
[1] 2 5
m[,1]
[1] 1 2 3
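Selecting a single row or column returns a plain vector. The sketch below shows the base R argument drop = FALSE, which keeps the matrix shape, and filtering with a logical condition:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

m[2, ]                # 2 5: a plain vector, the dimensions are dropped
m[2, , drop = FALSE]  # still a matrix with 1 row and 2 columns
m[c(1, 3), 2]         # 4 6: rows 1 and 3 of column 2

m[m > 3]              # 4 5 6: all elements larger than 3, as a vector
m[m[, 1] >= 2, ]      # rows whose first column is at least 2 (rows 2 and 3)
```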

Note

What will happen?

m[1,,2]
m[10]

List

A list is a collection of elements. These elements could be any object.

(l <- list(1, "2", 1:3, list(m)))
[[1]]
[1] 1

[[2]]
[1] "2"

[[3]]
[1] 1 2 3

[[4]]
[[4]][[1]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Access elements of a list with [[...]].

l[[2]]
[1] "2"

A sub-list can be accessed with [...].

l[1:3]
[[1]]
[1] 1

[[2]]
[1] "2"

[[3]]
[1] 1 2 3
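The difference between the two bracket types is easiest to see from the class of the result. A minimal sketch with a shortened version of the list above:

```r
l <- list(1, "2", 1:3)

class(l[[3]])  # "integer": [[...]] returns the element itself
class(l[3])    # "list": [...] returns a list containing that element
length(l[3])   # 1: a list of length one

l[[3]][2]      # 2: extract the element first, then index into it
```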

List cont’d

You can define names for lists:

l <- list(slot1 = 1:3, slot2 = c("a", "b"), slot3 = l)
names(l)
[1] "slot1" "slot2" "slot3"

Access list elements using the name and a $:

l$slot3 # returns the original list l from before it was overwritten
[[1]]
[1] 1

[[2]]
[1] "2"

[[3]]
[1] 1 2 3

[[4]]
[[4]][[1]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Delete elements by assigning a NULL to a slot

l[2:3] <- NULL
l
$slot1
[1] 1 2 3

Data Frame

A data frame is basically a list in which each element is a vector of the same length. However, it implements functions that let you handle it like a matrix.

Let’s define a data set representing cars:

col <- as.factor(c("blue", "red", "blue", "red", "green", "black", "green", "white"))
pri <- c(10, 20, 9, 50, 0.4, 15, 160, 60) * 1000
is_el <- c(F,F,F,T,F,T,F,T)

car_ds <- data.frame(color = col, price = pri, is_electric = is_el)
car_ds
  color  price is_electric
1  blue  10000       FALSE
2   red  20000       FALSE
3  blue   9000       FALSE
4   red  50000        TRUE
5 green    400       FALSE
6 black  15000        TRUE
7 green 160000       FALSE
8 white  60000        TRUE
str(car_ds)
'data.frame':   8 obs. of  3 variables:
 $ color      : Factor w/ 5 levels "black","blue",..: 2 4 2 4 3 1 3 5
 $ price      : num  10000 20000 9000 50000 400 15000 160000 60000
 $ is_electric: logi  FALSE FALSE FALSE TRUE FALSE TRUE ...

Data Frame cont’d

We can work on a data set much like we work with a matrix

# All rows with red cars
car_ds[car_ds$color == "red", ]
  color price is_electric
2   red 20000       FALSE
4   red 50000        TRUE
# price of all black cars
car_ds[car_ds$color == "black", "price"]
[1] 15000
# set a new price for the last car in the ds
car_ds[8, 2] <- 600
car_ds
  color  price is_electric
1  blue  10000       FALSE
2   red  20000       FALSE
3  blue   9000       FALSE
4   red  50000        TRUE
5 green    400       FALSE
6 black  15000        TRUE
7 green 160000       FALSE
8 white    600        TRUE
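A few more typical operations, sketched on a copy of the car data as it looks after the price change above. The column name price_k and the helper objects sorted and expensive_combustion are made up for illustration:

```r
car_ds <- data.frame(
  color = factor(c("blue", "red", "blue", "red", "green", "black", "green", "white")),
  price = c(10000, 20000, 9000, 50000, 400, 15000, 160000, 600),
  is_electric = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Add a new column derived from an existing one (price in thousands)
car_ds$price_k <- car_ds$price / 1000

# Sort the rows by price, cheapest first
sorted <- car_ds[order(car_ds$price), ]
head(sorted, 3)  # rows 5, 8 and 3

# Mean price of the electric cars
mean(car_ds$price[car_ds$is_electric])  # 21866.67

# Combine conditions: non-electric cars that cost more than 10000
expensive_combustion <- car_ds[car_ds$price > 10000 & !car_ds$is_electric, ]
nrow(expensive_combustion)  # 2
```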

More on data structures

  • A data frame behaves like a matrix.
  • However, keep in mind that it is actually a list. We can easily prove that:
is.list(car_ds)
[1] TRUE

Use str(...) to check the data structure of any object:

str(car_ds)
'data.frame':   8 obs. of  3 variables:
 $ color      : Factor w/ 5 levels "black","blue",..: 2 4 2 4 3 1 3 5
 $ price      : num  10000 20000 9000 50000 400 15000 160000 600
 $ is_electric: logi  FALSE FALSE FALSE TRUE FALSE TRUE ...
m <- matrix(1:4, ncol = 2)
str(m)
 int [1:2, 1:2] 1 2 3 4

Load data

We can load a data set from a package using data(...).

data("iris", package = "datasets")  # look in the Environment pane

We can load data from files. Use read.table(...) or wrapper functions with reasonable default values. For example, we can read a file directly from the web:

d <- read.csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv")
head(d)  # show the first few lines of a data set
           rownames  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Note that we can also use this to read a data set from a local directory! To do that, we have to specify either the full path or the path relative to the working directory. Use getwd() and setwd(...) to get or set the current working directory. See next slide for an example.

Save data sets

Consider a data set you have worked with. You can save it using the write functions.

write.csv(car_ds, file = "example_data.csv")  # we save our data set in the current working directory

We can again read the data as a new object:

d_loaded <- read.csv("example_data.csv")

all.equal(car_ds,d_loaded)  # test whether two (more complex) R objects are the same
[1] "Names: 3 string mismatches"                          
[2] "Length mismatch: comparison on first 3 components"   
[3] "Component 1: 'current' is not a factor"              
[4] "Component 2: Modes: numeric, character"              
[5] "Component 2: target is numeric, current is character"
[6] "Component 3: Modes: logical, numeric"                
[7] "Component 3: target is logical, current is numeric"  
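The mismatches come from two defaults: write.csv() also writes the row names as an extra first column, and read.csv() reads text columns as character rather than factor (the default since R 4.0). A sketch of one way to get a closer round trip, assuming a temporary file instead of the working directory:

```r
car_ds <- data.frame(
  color = factor(c("blue", "red", "blue", "red", "green", "black", "green", "white")),
  price = c(10000, 20000, 9000, 50000, 400, 15000, 160000, 600),
  is_electric = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

csv_path <- tempfile(fileext = ".csv")  # temporary file, used only in this sketch

# row.names = FALSE: do not write the row names as an additional column
write.csv(car_ds, file = csv_path, row.names = FALSE)

# stringsAsFactors = TRUE: convert text columns back to factors while reading
d_loaded <- read.csv(csv_path, stringsAsFactors = TRUE)

names(d_loaded)              # "color" "price" "is_electric"
str(d_loaded)                # inspect the column types after the round trip
all.equal(car_ds, d_loaded)  # compare again with the adjusted settings
```

A CSV file stores only text, so type information such as factor levels always has to be reconstructed when reading.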

We can read other files as well, e.g. Excel, SPSS, SAS, etc.

There are a lot of packages to do that.

(I use the function import(...) from the rio package, which tries to unify a lot of different formats.)

Save and load R objects

So far, we only worked with data frames for read and write operations. We can save general R objects using save(...) and load(...) using the .RData format.

a_list <- list(a = 42, data = iris, comment = "whatever")

save(a_list, file = "example_object.RData")

load("example_object.RData")
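Note that load() restores objects under their original names and returns those names. As a sketch of an alternative, the base R pair saveRDS()/readRDS() stores a single object and lets you choose the name when reading. The temporary file paths and the name copy_of_list are assumptions made for this example:

```r
a_list <- list(a = 42, comment = "whatever")

# save()/load(): the object comes back under its original name
rdata_path <- tempfile(fileext = ".RData")
save(a_list, file = rdata_path)
rm(a_list)                    # remove it to show that load() restores it
restored <- load(rdata_path)  # load() returns the names of the restored objects
restored                      # "a_list"
a_list$a                      # 42

# saveRDS()/readRDS(): one object per file, assigned to any name you like
rds_path <- tempfile(fileext = ".rds")
saveRDS(a_list, file = rds_path)
copy_of_list <- readRDS(rds_path)
identical(a_list, copy_of_list)  # TRUE
```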

Exercises 1 Task 3