How to scrape a paginated table with BeautifulSoup and store the results in a CSV?


Issue

I want to scrape https://www.airport-data.com/manuf/Reims.html, iterate through all pages, and extract the results into AircraftListing.csv

The code runs without error, but the results are populated incorrectly and not all the records are extracted from the webpage into the .csv file

How can I extract all Reims aviation records into AircraftListing.csv?

import requests
from bs4 import BeautifulSoup
import csv

root_url = "https://www.airport-data.com/manuf/Reims.html"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')

paging = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")

start_page = paging[1].text
last_page = paging[len(paging)-2].text


outfile = open('AircraftListing.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])


pages = list(range(1,int(last_page)+1))
for page in pages:
    url = 'https://www.airport-data.com/manuf/Reims:%s.html' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    print ('https://www.airport-data.com/manuf/Reims:%s' %(page))

    product_name_list = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")

    # Each row has 6 elements in it.
    # Loop through every sixth element. (The first element of each row)
    # Get all the other elements in the row by adding to index of the first.
    for i in range(int(len(product_name_list)/6)):
        Tail_Number = product_name_list[(i*6)].get_text('td')
        Year_Maker_Model = product_name_list[(i*6)+1].get_text('td')
        C_N = product_name_list[(i*6)+2].get_text('td')
        Engines = product_name_list[(i*6)+3].get_text('td')
        Seats = product_name_list[(i*6)+4].get_text('td')
        Location = product_name_list[(i*6)+5].get_text('td')

        writer.writerow([Tail_Number, Year_Maker_Model, C_N, Engines, Seats, Location])

outfile.close()
print ('Done')

Solution

To improve your code, especially the part with the for loop, select more specifically. Instead of the individual <td> cells, select the <tr> rows; this removes the index arithmetic and is more generic.

for row in soup.select('table tbody tr'):
    writer.writerow([c.text if c.text else '' for c in row.select('td')])
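To see why row-wise selection is simpler, here is a minimal, self-contained sketch; the HTML fragment is made up for illustration and only mimics the table structure on the page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table fragment (illustrative data, not from the site)
html = """
<table class="table table-bordered table-condensed">
  <tbody>
    <tr><td>F-GABC</td><td>1980 Reims F172</td><td>1234</td><td>1</td><td>4</td><td>France</td></tr>
    <tr><td>F-GXYZ</td><td>1987 Reims F406</td><td>F406-0001</td><td>2</td><td>12</td><td>France</td></tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html, 'html.parser')

# One list of cell texts per <tr> - no need to count cells in sixes
rows = [[td.text for td in tr.select('td')] for tr in soup.select('table tbody tr')]
print(rows)
```

Each `tr` yields one CSV row regardless of how many cells it contains, so a row with a missing cell cannot shift all following records as it does with the flat `find_all("td")` approach.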

Example

import requests, csv
from bs4 import BeautifulSoup

url = 'https://www.airport-data.com/manuf/Reims.html'

with open('AircraftListing.csv', "w", newline='', encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])

    while True:
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        for row in soup.select('table tbody tr'):
            writer.writerow([c.text if c.text else '' for c in row.select('td')])


        next_link = soup.select_one('li.active + li a')
        if next_link:
            url = next_link['href']
        else:
            break
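The pagination step relies on the CSS adjacent-sibling selector `li.active + li a`: it matches the link inside the <li> that immediately follows the currently active one. A minimal sketch with a made-up pager (the markup is an assumption for illustration, not copied from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical Bootstrap-style pagination markup
pager = """
<ul class="pagination">
  <li><a href="Reims:1.html">1</a></li>
  <li class="active"><a href="Reims:2.html">2</a></li>
  <li><a href="Reims:3.html">3</a></li>
</ul>"""

soup = BeautifulSoup(pager, 'html.parser')

# The <li> after the active one holds the "next page" link
next_link = soup.select_one('li.active + li a')
print(next_link['href'])  # -> Reims:3.html
```

On the last page there is no <li> after the active one, `select_one` returns None, and the loop breaks.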

Output

Tail Number,Year Maker Model,C/N,Engines,Seats,Location
0008,1987 Reims F406 Caravan II,F406-0008,2,14.0,France
0010,1987 Reims F406 Caravan II,F406-0010,2,12.0,France
13701,0000 Reims FTB337G,0002,2,4.0,Portugal
13705,0000 Reims FTB337G,0016,2,4.0,Portugal
13710,0000 Reims FTB337G,0011,2,4.0,Portugal
...,...,...,...,...,...
ZS-OHP,0000 Reims FR172J Reims Rocket,0496,1,4.0,South Africa
ZS-OTT,1989 Reims F406 Caravan II,F406-0040,2,12.0,South Africa
ZS-OXS,0000 Reims FR172J Reims Rocket,0418,1,4.0,South Africa
ZS-SSC,1988 Reims BPSW,F406-0032,2,12.0,South Africa
ZS-SSE,1990 Reims F406 Caravan II,F406-0043,2,12.0,South Africa

Alternative with pandas

An alternative approach for iterating over all 51 pages is to use pandas.read_html to parse the tables, append them to a list, concat() the dataframes from all pages, and save the result as a CSV file containing all 5085 records.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.airport-data.com/manuf/Reims.html'

data = []

while True:
    #print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table').prettify())[0])

    next_link = soup.select_one('li.active + li a[href]')
    if next_link:
        url = next_link['href']
    else:
        break
df = pd.concat(data)
df.to_csv('AircraftListing.csv',index=False)
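Note that pd.concat keeps each page's original row index by default, so the combined frame would contain repeating index values; passing ignore_index=True gives one continuous index. A small sketch with made-up per-page frames standing in for the pd.read_html results:

```python
import pandas as pd

# Hypothetical per-page frames (illustrative data only)
page1 = pd.DataFrame({'Tail Number': ['0008', '0010'], 'Engines': [2, 2]})
page2 = pd.DataFrame({'Tail Number': ['13701'], 'Engines': [2]})

# ignore_index=True renumbers rows 0..n-1 across all pages
df = pd.concat([page1, page2], ignore_index=True)
print(len(df))  # -> 3
```

With `index=False` in to_csv, the index is not written to the file either way, so this mainly matters if you keep working with the dataframe afterwards.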

Answered By – HedgeHog

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
