8 For loops

We will often want to perform the same task on a number of different items, such as cleaning every column in a data set. One effective way to do this is through “for loops”. Earlier in this course we learned how to scrape a website containing information on movies. We did so for a single date. If we wanted movie data for a week or a year, typing out each date would be excessively slow, even with the function we made in Section 7.3. In this lesson we will use a for loop to scrape movie data for an entire year of dates.

8.1 Basic for loops

We’ll start with a simple example, making R print the numbers 1-10.

for (i in 1:10) {
   print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10

The basic concept of a for loop is you have some code that you need to run many times with slight changes to a value or values in the code - somewhat like a function. Like a function, all the code you want to use goes in between the { and } squiggly brackets. And you loop through all the values you specify - meaning the code runs once for each of those values.

Let’s look closer at the (i in 1:10). The i is simply a placeholder object which takes each value in 1:10 in turn, one per iteration of the loop. It’s not necessary to call it i, but doing so is a programming convention. It takes the values of whatever follows the in, which can be anything from a vector of strings or numbers to a list of data.frames. Especially when you’re an early learner of R, it can help to call the i something that tells you what value it holds.

Let’s go through a few examples with different names for i and different values it is looping through.

for (a_number in 1:10) {
   print(a_number)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10

animals <- c("cat", "dog", "gorilla", "buffalo", "lion", "snake")
for (animal in animals) {
   print(animal)
}
#> [1] "cat"
#> [1] "dog"
#> [1] "gorilla"
#> [1] "buffalo"
#> [1] "lion"
#> [1] "snake"
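
The same pattern works when the thing after in is a list. As a minimal sketch, assuming two small made-up data.frames (the names small_data, big_data, list_of_data, and row_counts are only for illustration), the loop below visits each data.frame in turn and records how many rows it has.

```r
# Two made-up data.frames, used only for illustration
small_data <- data.frame(x = 1:3)
big_data   <- data.frame(x = 1:5)
list_of_data <- list(small_data, big_data)

row_counts <- c()
for (current_data in list_of_data) {
  # current_data is one whole data.frame each iteration
  row_counts <- c(row_counts, nrow(current_data))
}
row_counts
#> [1] 3 5
```

Naming the placeholder current_data rather than i makes it easier to remember that it holds a data.frame, not a number.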

Now let’s make our code a bit more complicated, adding the number 2 every loop.

for (a_number in 1:10) {
   print(a_number + 2)
}
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
#> [1] 11
#> [1] 12

We’re keeping the results inside of print() since for loops do not print the results by default. Let’s try combining this with some subsetting using square bracket notation []. We will loop through every value in numbers, a vector we will make with the values 1:10, and replace each value with its value plus 2.

The object we’re looping through is numbers. But we’re actually looping through every index it has, hence the 1:length(numbers). That is saying that i takes the value of each index in numbers, which is useful when we want to change that element. length(numbers) finds how many elements the vector numbers has (were this a data.frame, we could use nrow() to find how many rows it has). In the code we take the value at each index, numbers[i], and add 2 to it.

numbers <- 1:10
for (i in 1:length(numbers)) {
  numbers[i] <- numbers[i] + 2
}
numbers
#>  [1]  3  4  5  6  7  8  9 10 11 12
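
One caveat with 1:length(numbers), offered as a side note rather than a correction to the code above: if the vector is empty, 1:0 counts down from 1 to 0 and the loop still runs twice. Base R's seq_along() produces the indices directly and returns nothing for an empty vector. The object empty_numbers below is made up for illustration.

```r
empty_numbers <- c()

# 1:length() on an empty vector counts down from 1 to 0
1:length(empty_numbers)
#> [1] 1 0

# seq_along() returns no indices, so a loop over it never runs
seq_along(empty_numbers)
#> integer(0)

# The same loop as above, written with seq_along()
numbers <- 1:10
for (i in seq_along(numbers)) {
  numbers[i] <- numbers[i] + 2
}
numbers
#>  [1]  3  4  5  6  7  8  9 10 11 12
```

For a vector that is never empty, both forms behave the same way.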

We can also use functions we made inside for loops. Here’s a function we made last lesson which adds 2 to each inputted number.

add_2 <- function(number) {
  number <- number + 2
  return(number)
}

Let’s put that in the loop.

for (i in 1:length(numbers)) {
  numbers[i] <- add_2(numbers[i])
}
numbers
#>  [1]  5  6  7  8  9 10 11 12 13 14

8.2 Scraping multiple days of movie data

Below is the function from Section 7.3, which takes a single date and scrapes the site The-Numbers for movie ticket sales data for that day. If we wanted to get data from multiple days, we would need to run the function multiple times. Here we will use a for loop to get data for an entire year.

library(rvest)  # supplies read_html(), html_nodes(), and html_table()

scrape_movie_data <- function(date) {
  url <- "http://www.the-numbers.com/box-office-chart/daily/"
  url_date <- paste(url, date, sep = "")
  
  movie_data <- read_html(url_date)
  movie_data <- html_nodes(movie_data, "#page_filling_chart > center:nth-child(2) > table")
  movie_data <- html_table(movie_data)
  movie_data <- movie_data[[1]]
  
  return(movie_data)
}

With any for loop you need to figure out what is going to be changing; in this case it is the date. Since we want a year’s worth of movie data, we need to make an object with an entire year of dates. We can use the function seq() together with the lubridate package to make that object.

seq() produces a vector of every value between two points (either numbers or Dates) based on the increments we specify, in this case daily.
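
To make the idea of increments concrete, here is a small sketch. It uses base R's as.Date() to create the dates so that it runs without loading any package; the specific numbers and dates, and the object name first_week, are arbitrary.

```r
# Numbers from 1 to 10 in steps of 3
seq(1, 10, by = 3)
#> [1]  1  4  7 10

# Every day between two Dates
first_week <- seq(as.Date("2018-01-01"), as.Date("2018-01-07"), by = "days")
length(first_week)
#> [1] 7
```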

We want a year of data, from January 1st, 2018 to December 31st, 2018, so those will be our start and end points. We want Dates returned, so we will use the ymd() function from lubridate to turn those values into dates.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
year_of_dates <- seq(ymd("2018-1-1"), ymd("2018-12-31"), by = "days")

Check the first 6 values to see if it did it right.

head(year_of_dates)
#> [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"
#> [6] "2018-01-06"

It worked. However, there is one important problem. We need to make sure the URL is exactly correct for the page we want to scrape. The object year_of_dates uses “-” to separate the parts of each date; the website we are scraping uses “/”. It may seem like a minor point, but if we use “-” instead of “/” we will get an error. Luckily, we know enough about gsub() to quickly replace all “-” with “/”.

year_of_dates <- gsub("-", "/", year_of_dates)
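
A side effect worth seeing for yourself: gsub() works on text, so it converts the Dates to character strings before replacing anything. A short sketch, using base R's as.Date() and three arbitrary dates in a made-up object called some_dates:

```r
some_dates <- seq(as.Date("2018-01-01"), as.Date("2018-01-03"), by = "days")
class(some_dates)
#> [1] "Date"

# gsub() turns the Dates into text, then swaps every "-" for "/"
some_dates <- gsub("-", "/", some_dates)
some_dates
#> [1] "2018/01/01" "2018/01/02" "2018/01/03"
class(some_dates)
#> [1] "character"
```

Text in this form can be pasted directly onto the end of the URL.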

Now we can write the for loop to go through every single date in year_of_dates and use the function scrape_movie_data we made to scrape data for that date.

for (date in year_of_dates) {
  movie_data <- scrape_movie_data(date)
}

Don’t run this yet because there are two issues remaining. The first is that if we run it as it is, it will scrape the website for each date, save the results into the object “movie_data”, and keep overwriting this object for each date. We need to create an object that doesn’t get overwritten every iteration of the loop. A solution is to create an object outside of the for loop and, every time the for loop iterates (in our case, runs for a single date), add the data scraped that time to this object. I prefer to call the object outside the loop something_final and the object that gets overwritten something_temp, where “something” is a descriptive word for the data. In this case we will use movie_data_final and movie_data_temp.

We start by creating the object “movie_data_final” and saying it gets the value data.frame(). That’s just a way to say it is a data.frame type but is empty (hence the () being empty). Now we need some way to add the movie_data_temp data to movie_data_final for each date. We will use the function rbind(), which allows us to combine two data.frames together. Think of it like the c() function but for data.frames. So every iteration of the loop we scrape a single date, then add those results to the movie_data_final object.


movie_data_final <- data.frame()
for (date in year_of_dates) {
  
  movie_data_temp <- scrape_movie_data(date)
  movie_data_final <- rbind(movie_data_final, movie_data_temp)

}
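
To see the pattern work without touching the website, here is a sketch that swaps in a stand-in function. fake_scrape(), its columns, and the object names three_dates, fake_data_temp, and fake_data_final are hypothetical and invented for this illustration; the function simply returns the same two-row data.frame for whatever date it is given.

```r
# Hypothetical stand-in for scrape_movie_data(); returns made-up values
fake_scrape <- function(date) {
  data.frame(movie = c("Movie A", "Movie B"),
             gross = c(100, 50))
}

three_dates <- c("2018/01/01", "2018/01/02", "2018/01/03")

# Start empty, then add two rows on every iteration
fake_data_final <- data.frame()
for (date in three_dates) {
  fake_data_temp  <- fake_scrape(date)
  fake_data_final <- rbind(fake_data_final, fake_data_temp)
}

# Two rows per date, three dates
nrow(fake_data_final)
#> [1] 6
```

Because every date returns identical rows here, nothing in the result says which date a row came from.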

The second issue is that there is no variable indicating which day was scraped. When adding many days together, we need a variable to distinguish one day from another. This can easily be fixed by making a column in the data which says the date. When we used gsub() on year_of_dates we changed it from a Date type to a character type. Let’s change it back by putting the date in ymd() before saving it to the new column.


movie_data_final <- data.frame()
for (date in year_of_dates) {
  
  movie_data_temp <- scrape_movie_data(date)
  movie_data_temp$date <- ymd(date)
  
  movie_data_final <- rbind(movie_data_final, movie_data_temp)
}

Now we are ready to run the for loop and get movie data for an entire year.