How can we scrape missing values from IMDB in R?

Issue

library(rvest)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)

The code above scrapes the titles and ratings of 50 movies. I want movies without a rating to be included as NAs.

However, that doesn't happen, because IMDB only emits the rating tag for movies that actually have a rating (I used SelectorGadget to find the selectors). So I get 50 titles but only 33 ratings, which is not what I want. I have tried using html_node() together with html_nodes(), but R gives an error saying it cannot use css and xpath together. I have also tried trim=TRUE and replace(!nzchar(.), NA), but they don't work either.

Is there a way to solve this and ensure I get 50 ratings (including NAs or empty values)?

Solution

You need to perform this parsing in two steps. First, collect the parent nodes for all 50 movies with html_nodes(). Then parse this collection of nodes with html_node() (without the s), which returns exactly one result per parent node, so you get a value for all 50 movies, with NA for those that have no rating element.

library(rvest)
library(dplyr)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")

# Get the parent node of each movie
movies <- imdb_page %>% html_elements("div.lister-item")

# Now parse each movie node for the desired child node;
# html_element() returns one result per movie, NA where nothing matches
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

Note the change from html_node(s) to html_element(s), the current naming in rvest 1.0.
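
To see the difference without hitting the live site, here is a minimal offline sketch. The HTML below is made up for illustration (three hypothetical movies, the second without a rating) and only mimics the class names used above; it is not IMDb's actual markup. It assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical markup standing in for the listing page:
# "Movie B" has no rating block at all.
page <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <div class="ratings-imdb-rating"><strong>7.1</strong></div>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie C</a></h3>
    <div class="ratings-imdb-rating"><strong>5.4</strong></div>
  </div>
')

# One-step approach: only the ratings that exist are returned,
# so the vector is shorter than the number of movies.
flat_rating <- html_text(html_elements(page, ".ratings-imdb-rating strong"))
length(flat_rating)  # 2

# Two-step approach: one parent node per movie, then one child lookup per parent.
items  <- html_elements(page, "div.lister-item")
title  <- html_text(html_element(items, ".lister-item-header a"))
rating <- as.numeric(html_text(html_element(items, ".ratings-imdb-rating strong")))

# Both vectors now have one entry per movie, so they fit in one data frame.
movies_df <- data.frame(title, rating)
movies_df
```

Because title and rating have the same length, they can go into a single data frame rather than the two separate ones in the question. Converting with as.numeric() is optional; it keeps the missing rating as NA.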

Answered By – Dave2e

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
