An Introduction to Web Scraping with Portland Burger Week

I take the Portland food weeks very seriously. My partner and I make a list of our must-tries, and then I make my plan of attack. We usually hit up 2-3 places a night until we’ve exhausted all the places we wanted to try. Burger Week is coming up in a couple of weeks and I wanted to streamline things a bit: I wanted a spreadsheet of all of the burgers. I knew that web scraping in R was a thing, but honestly I had always been a bit intimidated. I had some free time on my hands and figured: now is the time to learn. So I googled a few tutorials and dove straight in. It wound up being so much easier than I thought it would be. Let’s learn together.

The tutorial I wound up reading in the most detail is this one. The reason I liked this tutorial is that it turned me on to a nifty little tool for cheating a bit: a Chrome extension called SelectorGadget that makes finding the relevant CSS selectors easy peasy.

Step 1: install the SelectorGadget extension here.

Step 2: In R, install {rvest}.

install.packages("rvest")

Step 3: Have R read in your website of interest - mine is the Portland 2021 Burger Week list.

library(rvest)
library(tidyverse)
url <- 'https://www.portlandmercury.com/sponsored/burgerweek2021'
webpage <- read_html(url)

Step 4: Now go to your website and activate SelectorGadget. Click on the element you want to scrape. For me, I did the restaurant names first. Scroll through the whole page and make sure the only things highlighted green/yellow are things you want. If there are any other elements highlighted, click on them and they will turn red/be excluded.

SelectorGadget tells me that the CSS selector I want is .headline a. Back in R, let’s use that to scrape the restaurant names. Use html_nodes() to grab the elements and html_text2() to extract the text.

restaurant <- html_nodes(webpage,'.headline a') %>% html_text2()
restaurant
## character(0)
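
If a selector ever comes back empty like that, or you just want to double-check it before moving on, a quick sanity check (just a little sketch on my part, nothing fancy) is to count how many nodes the selector actually matches and peek at the first one:

# Sanity check: how many nodes does the selector match?
# 0 means the selector is off or the page has changed since SelectorGadget found it.
html_nodes(webpage, '.headline a') %>% length()

# Peek at the first match to make sure it's what you expect.
html_nodes(webpage, '.headline a') %>% html_text2() %>% head(1)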

Step 5: Do the same for the rest of the elements you want and you’re good to go! Here’s my code in its entirety:

library(tidyverse)
library(rvest)

url <- 'https://www.portlandmercury.com/sponsored/burgerweek2021'
webpage <- read_html(url)

restaurant <- html_nodes(webpage,'.headline a') %>% html_text2()
burger_name <- html_nodes(webpage, '.blog-body p:nth-child(1)') %>% html_text2()
description <- html_nodes(webpage, '.blog-body p:nth-child(2)') %>% html_text2()
address_hours <- html_nodes(webpage, 'p:nth-child(5) , p:nth-child(4)') %>% html_text2()

# ugh sometimes address is 4th...sometimes it's 5th...must filter.
address_hours <- as_tibble(address_hours) %>% filter(grepl("Address/Hours of Availability", value)) %>% pull()

burger_week <- tibble(restaurant = restaurant,
                      burger_name = burger_name,
                      description = description,
                      address_hours = address_hours)
# cleaning
burger_week <- burger_week %>% 
  separate(burger_name, into = c("trash", "burger_name"), sep = ":", extra = "merge") %>% 
  separate(description, into = c("trash", "description"), sep = ":", extra = "merge") %>%
  separate(address_hours, into = c("trash", "address_hours"), sep = ":", extra = "merge") %>%
  select(restaurant, burger_name, description, address_hours)

write_csv(x = burger_week, file = "burger_week_2021.csv")
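
One thing I might do differently next time: instead of grabbing the 4th and 5th paragraphs and filtering afterward, you could pull every paragraph within each entry and keep the one that mentions the address. This is just a sketch, and it assumes each burger gets its own .blog-body block (which is what the selectors above suggest):

# Alternative sketch: work entry-by-entry so rows stay aligned even if the
# address paragraph moves around. Assumes one .blog-body block per burger.
entries <- html_nodes(webpage, '.blog-body')

address_hours_alt <- map_chr(entries, function(entry) {
  paras <- entry %>% html_nodes('p') %>% html_text2()
  hit <- paras[grepl("Address/Hours of Availability", paras)]
  if (length(hit) == 0) NA_character_ else hit[1]
})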

You can find the resultant .csv file on my github.

Happy scraping!

Jennifer Jahncke
Postdoctoral Scholar

I am a neuroscientist studying how brain cells form and maintain connections.