How to properly identify a specific value to parse using rvest

Issue

Dear Collective Wisdom

I’m struggling to use rvest to parse a table from https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html

I need to loop through all nodes of the table and extract its values one by one. Then iterate to next page and repeat.

I intend to read the table values separately, because I need to add a variant loop in the code – for each row, if value in the "Data urodzenia" column equals "-" then the program should enter the webpage corresponding to that row and extract some other value (tagged "Rocznik") instead.

For now, I’m having trouble getting rvest to read individual values from the table. I think I don’t quite follow the idea of HTML selectors… I’m able to read the entire table (per page) using the ".museumTableRow" selector in the following function:

library(rvest) 
library(tidyverse)

page <- read_html("https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html")
getPage <- function(html){
  html %>% 
    html_nodes(".museumTableRow") %>%      
    html_text() %>% 
    str_trim() %>%                       
    unlist()                             
}

Append_page <- getPage(page)

…but when I try to use selectors for specific cells of the table, I get an empty ("character(0)") response. I tried to find the relevant selectors by inspecting the page manually and by using the SelectorGadget plugin, as suggested by the library’s creators. The suggested selectors seem odd (to me); e.g., for the first name in the "Nazwisko" column, SelectorGadget suggests:

".footable-even:nth-child(1) .footable-first-column .museumTableRow"

so I also tried playing with them, but with no success. I guess I don’t fully understand how it works. I would appreciate any suggestions on how to get rvest to read this table cell by cell and append values from subsequent cells to a data.table.

I hope this is specific enough.

Solution

This should work:

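library(rvest)   # read_html(), html_elements(), html_text(), html_attr(); also re-exports %>%
library(glue)    # only used by the glue() examples further down
page <- read_html("https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html")
dat <- page %>% html_elements(css = "tbody tr")            # one element per table row
txt <- dat %>% html_text()
hrefs <- dat %>% html_element("a") %>% html_attr("href")   # first link in each row
# split each row's text on newlines, trim whitespace, drop the empty pieces
s <- lapply(txt, function(x) trimws(strsplit(x, split = "\\n")[[1]]))
out_txt <- t(sapply(s, function(x) x[x != ""]))            # assumes every row yields the same number of cells
stem <- "https://www.1944.pl"
for (i in seq_len(nrow(out_txt))) {
  if (out_txt[i, 6] == "-") {                              # "Data urodzenia" is missing for this row
    u <- paste0(stem, hrefs[i])
    h <- read_html(u)
    btxt <- h %>% html_elements(css = "div.biogram--info div.tag") %>% html_text()
    ind <- grep("Rocznik", btxt)
    if (length(ind) > 0) {
      btxt2 <- h %>% html_elements(css = "div.biogram--info div.info") %>% html_text()
      out_txt[i, 6] <- btxt2[ind[1]]                       # first match, in case there are several
    } else {
      out_txt[i, 6] <- NA_character_
    }
  }
}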
head(out_txt)
#             [,1]          [,2]         [,3]      [,4]            [,5]                [,6]         [,7]        
# [1,] "Abajew"      "Aleksander" "-"       "-"             "-"                 "1916-06-06" "-"         
# [2,] "Abakanowicz" "Piotr"      "-"       "-"             "\"Grey\""          "1890-06-21" "1948-06-01"
# [3,] "Abakanowicz" "Maria"      "-"       "-"             "\"Lena\""          "1901"       "-"         
# [4,] "Abczyńska"   "Alicja"     "Henryka" "sanitariuszka" "\"Ciocia Stasia\"" "1900-02-09" "1989-04-26"
# [5,] "Abczyńska"   "Janina"     "-"       "pielęgniarka"  "\"Julia\""         "1883-06-15" "1944-08-30"
# [6,] "Abczyński"   "Stanisław"  "-"       "-"             "\"Stefan\""        NA           "-"         

The code above grabs the text of each table row and the href of the first <a> tag in that row. If the sixth column of the ith row is "-", it follows that link. If the linked page has an entry labelled "Rocznik", the code takes its value; otherwise it replaces the entry with a missing value.
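If you would rather address the table cell by cell than split the row text on newlines, one possible approach is sketched below. It runs on a small hand-written table, not the real page: the markup, the column names and the hrefs are all made up for illustration, and it assumes the cells are plain <td> elements. It relies on html_element() (singular), which returns exactly one result per row and NA where a cell is missing.

```r
library(rvest)

# Hypothetical stand-in for the real page: two rows, three columns.
html <- minimal_html('
  <table>
    <tbody>
      <tr><td><a href="/bio/1.html">Abajew</a></td><td>Aleksander</td><td>1916-06-06</td></tr>
      <tr><td><a href="/bio/2.html">Abakanowicz</a></td><td>Maria</td><td>-</td></tr>
    </tbody>
  </table>')

rows <- html_elements(html, "tbody tr")

# One selector per column; each call yields one value per row.
tab <- data.frame(
  nazwisko       = html_text2(html_element(rows, "td:nth-child(1)")),
  imie           = html_text2(html_element(rows, "td:nth-child(2)")),
  data_urodzenia = html_text2(html_element(rows, "td:nth-child(3)")),
  href           = html_attr(html_element(rows, "a"), "href")
)

# Rows that would need the follow-up lookup.
needs_lookup <- which(tab$data_urodzenia == "-")
```

Because every column comes from its own selector, a row with a missing cell produces NA in that column only, and the other columns stay in place.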


Edit: Details about CSS selectors

I’ll assume that everything up until the stem <- "https://www.1944.pl" part is pretty straightforward as it more or less follows the path you were already on. So, let’s dig into the for loop and how those things work. As you noted, originally the third row had a "-" as its entry in the sixth column and this should trigger another lookup. So, that means the third href should be followed:

hrefs[3]
# [1] "/powstancze-biogramy/maria-abakanowicz,845.html"

To do that, we can paste this onto the stem defined earlier to make a valid URL, which we read into h. If you visit that URL in your browser, you see this:

[Image: screenshot of the biography page]

You want to be able to identify the number associated with "Rocznik". If you right-click on the word "Rocznik" and choose "inspect", you’ll see this:

[Image: screenshot of the browser inspector showing the HTML around "Rocznik"]

You’ll notice that all of the entries are in a <div> with class "biogram--info". Further, the heading for each entry is in a <div> with class "tag". I edited the answer above to make the result a bit cleaner by using the "tag" div as well. The results of btxt look as follows:

btxt <- h %>% 
  html_elements(css="div.biogram--info div.tag") %>% 
  html_text()
btxt
# [1] "Pseudonim:"                      
# [2] "Data urodzenia:"                 
# [3] "Data śmierci:"                   
# [4] "Funkcja:"                        
# [5] "Rocznik:"                        
# [6] "Stopień:"                        
# [7] "Pseudonimy:"                     
# [8] "Udział w konspiracji 1939-1944:" 
# [9] "Oddział:"                        
# [10] "Szlak bojowy:"                   
# [11] "Miejsce (okoliczności) śmierci :"
# [12] "Uwagi:"                          
# [13] "Publikacje :"   

Then, you can figure out which one is "Rocznik" – in this case, the fifth one. In the original answer (which I’ve edited to make a bit cleaner), I had

btxt2 <-   h %>% 
  html_elements(css=glue("div.biogram--info:nth-child({ind-1})")) %>%   
  html_text()

First, the mechanics: the glue() function is like paste() or paste0(), but sometimes a bit clearer in what it’s doing. In the statement above:

glue("div.biogram--info:nth-child({ind-1})")

would be equivalent to either of these:

paste0("div.biogram--info:nth-child(", ind-1, ")")
paste("div.biogram--info:nth-child(", ind-1, ")", sep="")

In glue(), anything that shows up within the braces {} gets evaluated in R and the result is then placed in the text. From the code above, ind identifies the position of the "Rocznik" heading among the "tag" divs (the fifth one here). But :nth-child(n) matches every div.biogram--info that is the nth child of its own parent, so when we look at the children one position at a time, we see that some positions match several entries and others match none. Here it is the fourth child (ind-1) that holds the "Rocznik" entry.

h %>% 
  html_elements(css="div.biogram--info:nth-child(1)") %>% 
  html_text()
# [1] "\n                        Pseudonim:\n                        \"Lena\"\n                    "                      
# [2] "\n                                Data urodzenia:\n                                -\n                            "
# [3] "\n                                Data śmierci:\n                                -\n                            "  

h %>% 
  html_elements(css="div.biogram--info:nth-child(2)") %>% 
  html_text()
# character(0)

h %>% 
  html_elements(css="div.biogram--info:nth-child(3)") %>% 
  html_text()
# [1] "\n                        Funkcja:\n                        -\n                    "

h %>% 
  html_elements(css="div.biogram--info:nth-child(4)") %>% 
  html_text()
# [1] "\n                                Rocznik:\n                                1901\n                            "
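The same :nth-child() behaviour can be reproduced on a small hand-written document. The markup below is invented for illustration (the "group" and "spacer" classes are not from the real page); it only mimics the situation where the "biogram--info" divs sit inside different parent elements.

```r
library(rvest)

# Hypothetical markup: "biogram--info" divs spread over two parents.
html <- minimal_html('
  <div class="group">
    <div class="biogram--info">A</div>
  </div>
  <div class="group">
    <div class="biogram--info">B</div>
    <div class="spacer"></div>
    <div class="biogram--info">C</div>
  </div>')

# :nth-child(n) counts position within each parent, not within the match list.
first  <- html_text(html_elements(html, "div.biogram--info:nth-child(1)"))
second <- html_text(html_elements(html, "div.biogram--info:nth-child(2)"))
third  <- html_text(html_elements(html, "div.biogram--info:nth-child(3)"))
```

Here first holds both "A" and "B" because each is the first child of its own parent, second is empty because the second child is the "spacer" div, and third holds "C". This is why the position has to be worked out by inspecting the page and can change from one page to the next.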

Then the code just extracts whatever sequential digits it finds. I’ve edited the answer above to something that is probably a bit more robust. If you look back at the picture of the html source, you’ll see that for each "biogram--info" div, there is a "tag" class div that holds the heading and an "info" class div that holds the value associated with that heading. Using the following code, you can retrieve all of the "info" entries that correspond with each "tag" entry:

btxt2 <- h %>% 
  html_elements(css="div.biogram--info div.info") %>% 
  html_text()
btxt2
# [1] "\"Lena\""                                                                                                                                                                                                                                         
# [2] "-"                                                                                                                                                                                                                                                
# [3] "-"                                                                                                                                                                                                                                                
# [4] "-"                                                                                                                                                                                                                                                
# [5] "1901"                                                                                                                                                                                                                                             
# [6] "starszy strzelec"                                                                                                                                                                                                                                 
# [7] "\"Lenarska\", \"Lena\""                                                                                                                                                                                                                           
# [8] "Narodowe Siły Zbrojne - Okręg I A Warszawa-Miasto"                                                                                                                                                                                                
# [9] "Armia Krajowa - Grupa \"Północ\" - zgrupowanie \"Sienkiewicz\" - następnie odcinek bojowy \"Kuba\" - \"Sosna\" - kompania P-20, następnie w 1. batalionie szturmowym KB \"Nałęcz\" . W Śródmieściu w batalionie KB \"Sokół\"  - patrol sanitarny."
# [10] "Stare Miasto - kanały - Śródmieście Północ"                                                                                                                                                                                                       
# [11] " Poległa na ul. Kopernika (VIII-IX 1944). Inne wersja - zmarła 1992-10-10\n"                                                                                                                                                                      
# [12] "Inny spotykany przydział  - Narodowe Siły Zbrojne - Grupa \"Topór\""                                                                                                                                                                              
# [13] "Wielka Ilustrowana Encyklopedia Powstania Warszawskiego Suplement, Warszawa, 2009"  

To get the value associated with "Rocznik", you can take the ind entry of btxt2 as is now done in the answer above.
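Taking the ind entry works as long as the "tag" and "info" divs line up one to one. A variation that makes the pairing explicit is to read both from the same "biogram--info" div and look the value up by name. The sketch below uses invented markup and assumes each "biogram--info" div contains exactly one "tag" and one "info" div; the names lookup, rocznik and stopien are only illustrative.

```r
library(rvest)

# Hypothetical markup with one tag/info pair per "biogram--info" div.
html <- minimal_html('
  <div class="biogram--info"><div class="tag">Pseudonim:</div><div class="info">"Lena"</div></div>
  <div class="biogram--info"><div class="tag">Rocznik:</div><div class="info">1901</div></div>')

entries <- html_elements(html, "div.biogram--info")

# html_element() returns one result per entry, so tags and infos stay aligned.
tags  <- trimws(html_text(html_element(entries, "div.tag")))
infos <- trimws(html_text(html_element(entries, "div.info")))

# Named vector keyed by the heading, with the trailing colon removed.
lookup <- setNames(infos, sub(":\\s*$", "", tags))

rocznik <- unname(lookup["Rocznik"])
stopien <- unname(lookup["Stopien"])   # heading not present, so this is NA
```

Looking up a heading that is absent gives NA, so the separate length(ind) > 0 check is not needed in this variant.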

Answered By – DaveArmstrong

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
