5 Vectorized Operations
In this lesson we will learn about vectorized operations, specifically element-wise operations and vector variables.
While R would make a great calculator, it is designed to help statisticians and data scientists deal with data. And data never comes with one data point. Instead, data usually comes with several observations at a time for a given variable. So we are going to learn how to deal with these R objects.
Say I am interested in the relationship between height, weight, and gender of students with respect to the number of R workshops each student has taken. So I take a sample of 10 students from the student population, measure their height and weight, and ask about their gender and how many R courses/workshops they have attended. Our data would then look like this:
Names | Age | Height | Weight | Gender | Courses |
---|---|---|---|---|---|
Alan | 23 | 170.6 | 76.9 | Male | 1 |
Brian | 31 | 179.6 | 59.6 | Male | 2 |
Carlos | 31 | 168.9 | 48.3 | Male | 0 |
Dalton | 25 | 164.9 | 78.6 | Male | 2 |
Ethan | 32 | 160.9 | 54.6 | Male | 0 |
Flora | 26 | 161.6 | 69.8 | Female | 4 |
Gaia | 35 | 194.2 | 56.0 | Female | 0 |
Helen | 26 | 171.3 | 86.5 | Female | 3 |
Ingrid | 27 | 165.1 | 62.9 | Female | 0 |
Jennifer | 20 | 165.6 | 59.4 | Female | 2 |
When analyzing a data set, we are often interested in performing operations on the whole set of values of a given variable (which we can call a vector). A vector can contain numbers, strings, logical values, or any other type. For example, our participants’ Height is a numeric vector, and our participants’ Gender is a string vector. In this way, one can build a data set from several types of vectors.
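A minimal sketch of the three vector types mentioned above, using the first three rows of the table. The object names (heights, genders, took_course) are hypothetical and chosen only for illustration:

```r
# Numeric vector: heights of the first three participants in the table
heights <- c(170.6, 179.6, 168.9)

# Character (string) vector: their genders
genders <- c("Male", "Male", "Male")

# Logical vector: has each one taken at least one course? (Courses: 1, 2, 0)
took_course <- c(TRUE, TRUE, FALSE)

class(heights)      # "numeric"
class(genders)      # "character"
class(took_course)  # "logical"
```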
Let’s focus on male students’ heights as our example. Suppose we are interested in their arithmetic mean. How can we calculate it in R? Here’s the formula: \[\frac{1}{n} \sum_{i=1}^{n} x_{i}\]
The first thing we need to do, according to the formula, is to add all the heights. So let’s do that…
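The addition can be typed directly at the console. The five values used here are the ones stored in the height vector later in this lesson (they are an assumption at this point in the text), and they add up to the result printed next:

```r
# Add the five heights one by one
157.9 + 172.8 + 180.8 + 146.5 + 174.3
```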
## [1] 832.3
… then we need to divide this sum by the number of observations, which is 5. So, what is the mean height of male students in our sample?
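A sketch of the division step, again assuming the five values that the height vector holds later in the lesson:

```r
# Divide the sum of the heights by the number of observations (n = 5)
(157.9 + 172.8 + 180.8 + 146.5 + 174.3) / 5
```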
## [1] 166.46
Now, let’s do the same operation using a vector. In the previous section we learned how to store a mathematical expression in an object. Now, we are going to store more than one piece of data in an object (i.e., a vector). First, we need to name our vector; let’s call it “height”. It will receive all five values. In R, we do this by “combining” or “concatenating” several values with the function c(): the values go inside the parentheses, separated by commas.
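The assignment described above can be sketched as follows. The five values are the ones the vector prints when it is checked in the next step:

```r
# c() combines the five values into a single numeric vector,
# and <- stores that vector in an object called height
height <- c(157.9, 172.8, 180.8, 146.5, 174.3)
```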
Let’s check if we created our vector correctly. Type height in your console. It should return the following:
## [1] 157.9 172.8 180.8 146.5 174.3
Now that we have created our vector, we can apply the same kinds of arithmetic operations to the whole height vector at once. This is one of the main advantages of R over other statistical software.
So, type in your Console or Source panel the following expressions:
Multiplication: height * 2
## [1] 315.8 345.6 361.6 293.0 348.6
Division: height / 2
## [1] 78.95 86.40 90.40 73.25 87.15
Exponentiation: height ^ 2
## [1] 24932.41 29859.84 32688.64 21462.25 30380.49
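Each of these operations is applied to every element of the vector separately. A small sketch that checks this for the multiplication, where doubled and by_hand are hypothetical names:

```r
height <- c(157.9, 172.8, 180.8, 146.5, 174.3)

# Vectorized: one expression doubles all five elements
doubled <- height * 2

# The same thing written out element by element
by_hand <- c(157.9 * 2, 172.8 * 2, 180.8 * 2, 146.5 * 2, 174.3 * 2)

identical(doubled, by_hand)  # TRUE
length(doubled)              # 5, the result has one value per element
```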
Vectorized operations are one of R’s most important strengths because they greatly simplify working with data. For example, to calculate the mean height for males, all we need is the R function that calculates it, mean(), with our vector placed inside it.
Mean
## [1] 166.46
Variance
## [1] 194.743
Standard Deviation
## [1] 13.95503
Median
## [1] 172.8
Range
## [1] 146.5 180.8
Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146.5 157.9 172.8 166.5 174.3 180.8
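The results above are what base R's summary functions return for the height vector. A sketch of the calls, in the same order as the labels:

```r
height <- c(157.9, 172.8, 180.8, 146.5, 174.3)

mean(height)     # arithmetic mean
var(height)      # sample variance (divides by n - 1)
sd(height)       # standard deviation, the square root of var()
median(height)   # middle value of the sorted vector
range(height)    # smallest and largest value
summary(height)  # minimum, quartiles, median, mean, and maximum
```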
We can also do regular operations with vectors without naming them.
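For instance, one expression that yields the output shown next, assuming an unnamed vector of the numbers 1 to 5:

```r
# The vector is created and multiplied in one step, without storing it
c(1, 2, 3, 4, 5) * 2
```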
## [1] 2 4 6 8 10
Can you guess which mathematical operation produces the output below?
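One expression consistent with that output, assuming the same unnamed vector of the numbers 1 to 5:

```r
# Applied to the unnamed vector 1, 2, 3, 4, 5
c(1, 2, 3, 4, 5) ^ 2
```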
## [1] 1 4 9 16 25
How about this last one: height - c(1, 2, 3, 4, 5) * c(5, 4, 3, 2, 1)? Can you guess the result?
## [1] 152.9 164.8 171.8 138.5 169.3
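The result follows from the usual order of operations: multiplication is evaluated before subtraction, and both are applied element by element. A step-by-step sketch, where products is a hypothetical intermediate name:

```r
height <- c(157.9, 172.8, 180.8, 146.5, 174.3)

# Step 1: element-wise product of the two unnamed vectors
products <- c(1, 2, 3, 4, 5) * c(5, 4, 3, 2, 1)
products           # 5 8 9 8 5

# Step 2: element-wise subtraction from height
height - products  # 152.9 164.8 171.8 138.5 169.3
```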
5.1 Practicals
5.1.1 Vector and R functions
Create a vector called height.female composed of the heights of all women in our sample, and another vector called n.R.courses holding the number of R courses/workshops for each participant. Then, for these two vectors, calculate the mean, median, variance, standard deviation, and range. Now, choose at least 5 (five) other functions from the list below and apply them to both vectors. See if you understand what each function returns.
R function | Description |
---|---|
max(x) | Largest Value |
min(x) | Smallest Value |
mean(x) | Arithmetic Mean |
sum(x) | Sum |
median(x) | Median |
var(x) | Variance |
sd(x) | Standard Deviation |
abs(x) | Absolute value |
range(x) | Range |
length(x) | Length |
diff(x, lag=1) | lagged differences |
scale(x) | column center or standardize a matrix. |
sqrt(x) | square root |
ceiling(x) | ceiling(3.475) is 4 |
floor(x) | floor(3.475) is 3 |
round(x, digits=n) | round(3.475, digits=2) is 3.48 |
log(x) | natural logarithm |
log10(x) | common logarithm |
exp(x) | e^x |
summary(x) | Min, 1st Quan, Median, Mean, 3rd Quan, and Max |
quantile(x) | sample quantiles |
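A few of the less obvious functions from the table, sketched on a small hypothetical vector x so that the practical itself is left for you to do:

```r
x <- c(2, 5, 9, 14)   # hypothetical values, not taken from the sample

length(x)             # 4
diff(x)               # differences between neighbours: 3 4 5
diff(x, lag = 2)      # differences two positions apart: 7 9
round(sqrt(x), digits = 2)  # 1.41 2.24 3.00 3.74

# scale() returns a one-column matrix; as.vector() turns it back into a vector
z <- as.vector(scale(x))
mean(z)               # numerically zero after centering
sd(z)                 # 1 after standardizing
```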
5.1.2 Boolean expressions in R
Boolean expressions evaluate to either TRUE or FALSE. A crucial part of computing involves conditional statements: Is this value bigger than another? Are two vectors the same size? Questions can be joined together using words like ‘and’, ‘or’, and ‘not’. In R, < means ‘less than’, > means ‘greater than’, and ! means ‘not’ (see the table below).
Symbol | Description |
---|---|
! | logical NOT |
& | logical AND |
\| | logical OR |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | logical equals (double =) |
!= | not equal |
&& | AND with IF |
\|\| | OR with IF |
xor(x,y) | exclusive OR |
isTRUE(x) | an abbreviation of identical(TRUE,x) |
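A short sketch of how the element-wise operators differ from the forms the table labels "with IF". The vectors x and y are hypothetical:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y       # element-wise AND: TRUE FALSE FALSE
x | y       # element-wise OR:  TRUE TRUE TRUE
xor(x, y)   # TRUE where exactly one side is TRUE: FALSE TRUE TRUE
!x          # element-wise NOT: FALSE TRUE FALSE

# && and || compare single values, which is what an if() condition needs
if (length(x) == 3 && any(x)) {
  print("x has three elements and at least one of them is TRUE")
}

# isTRUE() is TRUE only for a single TRUE value
isTRUE(all(x | y))  # TRUE
isTRUE(x)           # FALSE, because x has more than one element
```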
For all these logical statements, try to figure out (before running the code) the result/answer. Then run the code by pressing ctrl + enter on the desired line.
```
# Is true equal to false?
TRUE == FALSE
# T and F are shorthand for TRUE and FALSE. Try this:
T == TRUE
T == F
T != F
# Is 3 smaller than 4
3 < 4
# Is 2 + 2 equal to 5
2 + 2 == 5
# Is 2 smaller than 5
2 < 5
# Is 7 smaller than or equal to minus 2
7 <= -2
# Try to figure these out
3 > (3+1)
4 >= 4
(3/4) == (9/12)
# The symbol '!' is a negation of a logical statement
!TRUE
!F
2^4 != 4^2
!(2 < 1)
!(3 < 6)
# The ampersand symbol & means "and". A statement is TRUE only if the expressions on both sides of the operator are TRUE. One can also think of "intersection" as in set operations
3*4==12 & 6/8<1
(3 < 5) & (2 > 0)
(2 < 3) & (5 > 5)
# The symbol | means "or". The | operator is TRUE if at least one of the expressions surrounding it is TRUE. You can also think in terms of set operations as in "union" of sets.
(3 < 5) | (2 > 3)
(2 < 1) | (5 > 5)
TRUE | FALSE
FALSE | TRUE
FALSE | 2+2==4
# Can you guess?
c(1, 2, 3, 4, 5) <= 3
((5>4) & !(3<2)) | (6>7)
# Most Boolean operators act element-wise.
# %in% is a matching operator
c(1, 2, 3, 4, 5) %in% c(1, 2, 3)
height %in% c(157.9, 172.8, 180.8, 146.5, 174.3)
height.female %in% c(161.6, 194.2, 171.3, 165.1, 165.6)
# %% is the symbol for modulus. In computing, the modulo operation finds the remainder after division of one number by another (sometimes called modulus)
5 %% 2
9 %% 3
V <- c(3,2,8,6,5,6,11,0)
I <- (V %% 2 == 1)  # TRUE where an element of V is odd
# Lets try to use what we learned thus far with our vectors
height == height.female
height > height.female
height < height.female
```
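The last three comparisons need the height.female vector from Practical 5.1.1. A minimal sketch, assuming height.female holds the five female heights listed in the table and height holds the values printed earlier in the lesson:

```r
height        <- c(157.9, 172.8, 180.8, 146.5, 174.3)
height.female <- c(161.6, 194.2, 171.3, 165.1, 165.6)  # assumed from the table

# Each comparison pairs up elements by position and returns one
# TRUE or FALSE per pair
height == height.female  # FALSE FALSE FALSE FALSE FALSE
height >  height.female  # FALSE FALSE TRUE FALSE TRUE
height <  height.female  # TRUE TRUE FALSE TRUE FALSE
```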
5.2 Exercises
These exercises are slightly adapted (shamelessly copied with minor adjustments) from R-exercises, a website that offers many exercises for you to test your R skills. They also offer an R Course Finder, which catalogs R courses on MOOC (Massive Open Online Course) platforms such as Coursera and Khan Academy and on other online learning platforms (e.g. Udemy, EdX, Lynda.com).
5.2.1 Exercise 1
There are two main types of interest: simple and compound. To start, let’s create 4 variables: the initial investment (S = 100), the annual simple interest rate (i1 = .1), the annual compound interest rate (i2 = .09), and the number of years the investment will last (n = 2).
Simple Interest: define a variable called simple equal to \(S * (1 + i1 * n)\)
Compound Interest: define a variable called compound equal to \(S * (1 + i2) ^ n\)
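To see how the two formulas behave without giving away the exercise, here is a sketch with deliberately different, hypothetical values (principal, rate, and years are illustrative names, not the exercise's variables):

```r
principal <- 200    # hypothetical initial investment
rate      <- 0.05   # hypothetical annual rate, used for both formulas
years     <- 3

# Simple interest: the rate is applied to the original amount every year
simple_total <- principal * (1 + rate * years)      # 230

# Compound interest: each year's interest is itself reinvested
compound_total <- principal * (1 + rate) ^ years    # 231.525

# At the same rate, compounding over several years yields more
compound_total > simple_total  # TRUE
```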
5.2.2 Exercise 2
It’s natural to ask which type of interest yields more money for these values after 2 years (n = 2). Using the logical operators <, >, ==, check which of simple and compound is bigger.
## [1] TRUE
5.2.3 Exercise 3
Using the logical operators <, >, ==, |, &, find out if simple or compound is equal to 120.
Using the logical operators <, >, ==, |, &, find out if simple and compound are both equal to 120.
## [1] TRUE
## [1] FALSE
5.2.4 Exercise 4
Formulas can deal with vectors, so let’s define a vector and use it in one of the formulas we defined earlier. Let’s define S as a vector with the values 100 and 96. Remember that c() is the function that allows us to define vectors.
Apply to S the simple interest formula and store the value of the vector in simple
5.2.5 Exercise 5
Using the logical operators <, >, <=, ==, check whether each of the simple values is smaller than or equal to compound.
## [1] FALSE TRUE
5.2.6 Exercise 6
Using the operator %/%, find out how many $20 candies you can buy with the money stored in compound.
## [1] 5
5.2.7 Exercise 7
Using the operator %%, find out how much money is left after buying the candies.
## [1] 18.81
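Integer division and the modulus are two halves of the same calculation. A sketch with hypothetical numbers (budget and price are illustrative names) shows how they fit together:

```r
budget <- 47   # hypothetical amount of money
price  <- 5    # hypothetical price per item

items    <- budget %/% price  # whole items that can be bought: 9
leftover <- budget %%  price  # money remaining afterwards:     2

# The two results always recombine into the original amount
price * items + leftover == budget  # TRUE
```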
5.2.8 Exercise 8
Let’s create two new variables, one defined as rational = 1/3 and the other as decimal = 0.33. Using the logical operator !=, verify whether these two values are different.
## [1] TRUE
5.2.9 Exercise 9
There are other functions that can help us compare two variables.
Use the logical operator == to verify whether rational and decimal are the same.
Use the logical function isTRUE to verify whether rational and decimal are the same.
Use the logical function identical to verify whether rational and decimal are the same.
## [1] FALSE
## [1] FALSE
## [1] FALSE
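These functions matter most when two numbers look equal but are stored slightly differently. A sketch with hypothetical values a and b; pairing isTRUE() with all.equal() is one common idiom and an assumption here, not necessarily what the exercise intends:

```r
a <- 0.1 + 0.2   # hypothetical values chosen to expose rounding error
b <- 0.3

a == b                    # FALSE: the stored values differ in the last bits
identical(a, b)           # FALSE: identical() also demands an exact match
isTRUE(all.equal(a, b))   # TRUE: all.equal() allows a tiny tolerance
```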
5.3 Advanced Exercises
- Calculate square root of 729
- Create a new variable ‘b’ with value 5124
- Create a vector of numbers from 1 to 21 and find out its class
- Create a vector containing following mixed elements {2131, 24, ‘j’, 2, ‘b’} and find out its class
- Initialise a character vector of length 26
- Assign the character ‘a’ to the first element in above vector
- Create a vector with some of your classmates’ names (at least 5)
- Get the length of above vector
- Get the first two friends from above vector
- Get the 2nd and 3rd friends
- Sort your friends by names
- Reverse direction of the above sort
- Create with rep() and seq() R functions the following vector: ‘a’,‘a’,‘a’, 1,2,3,4,5,11,13,15,17,19,21
- Sample 50 random numbers between 1 to 100
- Sample 50 random numbers between 1 to 500, with replacement
- Find the class of ‘iris’ dataframe, find the class of all the columns of ‘iris’, get the summary of ‘iris’, get the top 6 rows, view it in a spreadsheet format, get row names, get column names, get number of rows and get number of columns.
- Apply the above functions and inspect results on ‘iris’ (a base R dataframe)
- Get the last 2 rows in last 2 columns from iris dataset
- Get rows with Sepal.Width > 3.5 from iris
- Get the rows with ‘versicolor’ species using subset() from iris
Below you will find the answers:
sqrt(729)
b <- 5124
one_to_21 <- 1:21
class(one_to_21)
my.vector <- c(2131, 24, 'j', 2, 'b')
class(my.vector)
charHundred <- character(26)
charHundred
charHundred[1] <- "a"
myFriends <- c("alan", "bala", "amir", "tsong", "chan")
length(myFriends)
myFriends[1:2]
myFriends[c(2,3)]
sort(myFriends)
myFriends[order(myFriends)]
sort(myFriends, decreasing=TRUE)
myFriends[rev(order(myFriends))]
out <- c(rep('a', 3), seq(1, 5), seq(11, 21, by=2))
mySample <- sample(1:100, 50)
mySample <- sample(1:500, 50, replace=TRUE)
class(iris) # get class
sapply(iris, class) # get class of all columns
str(iris) # structure
summary(iris) # summary of iris
head(iris) # view the first 6 obs
fix(iris) # view spreadsheet like grid
rownames(iris) # row names
colnames(iris) # columns names
nrow(iris) # number of rows
ncol(iris) # number of columns
numRows <- nrow(iris)
numCols <- ncol(iris)
iris[(numRows-1):numRows, (numCols-1):numCols]
iris[iris$Sepal.Width > 3.5, ]
iris[which(iris$Sepal.Width > 3.5), ]
subset(iris, Species == "versicolor")