How Web Scraping Is Used To Extract Amazon Prime Data Using Selenium And Beautifulsoup?

webscreenscraping

We will use BeautifulSoup and Selenium to scrape movie details from Amazon Prime Video, namely each movie's name, description, and rating, and then filter the movies by their IMDb ratings.

Selenium is a capable web scraping tool, but it has some shortcomings, which is understandable because it was designed primarily for testing web applications. BeautifulSoup, by contrast, was created specifically for extracting data from web pages and is also an excellent tool.

BeautifulSoup has its own limitation, though: it cannot help when the data to be scraped sits behind a wall that requires user authentication or some other action from the user.

This is where Selenium comes in. It automates the user interactions with the website, and BeautifulSoup scrapes the data once we are past the wall.

Combined, BeautifulSoup and Selenium make a very effective web scraping setup. Selenium can also extract data on its own, but BeautifulSoup is more convenient for parsing the resulting HTML.
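As a rough illustration of this hand-off, the sketch below parses an HTML string with BeautifulSoup. The string stands in for what Selenium's `driver.page_source` would return after login. The markup, the `movie-tile` class name, and the `extract_titles` helper are made up for the example and do not reflect Amazon's real page structure.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the markup below is invented for illustration.
SAMPLE_HTML = """
<div class="movie-tile"><h1>First Movie</h1></div>
<div class="movie-tile"><h1>Second Movie</h1></div>
<div class="banner"><h1>Not a movie</h1></div>
"""


def extract_titles(html):
    """Return the <h1> text of every (hypothetical) movie tile in the page."""
    page = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml required
    tiles = page.find_all("div", attrs={"class": "movie-tile"})
    return [tile.find("h1").text.strip() for tile in tiles]


print(extract_titles(SAMPLE_HTML))
```

In the real script, Selenium handles everything up to obtaining the HTML, and only the parsing step belongs to BeautifulSoup.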

Let’s discuss the process of scraping Amazon Prime data.

First, import the necessary modules.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from time import sleep
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

Make three empty lists to keep track of the movie information.

movie_names = []
movie_descriptions = []
movie_ratings = []

ChromeDriver must be installed for this program to work properly. Make sure you install the driver version that matches your version of the Chrome browser.

Now, define a function called open_site() that opens the sign-in page of Amazon Prime.

def open_site():
    options = webdriver.ChromeOptions()
    # Suppress browser notification pop-ups
    options.add_argument("--disable-notifications")
    # Note: executable_path and find_element_by_id belong to the Selenium 3 API
    # and were removed in later Selenium 4 releases
    driver = webdriver.Chrome(executable_path='PATH/TO/YOUR/CHROME/DRIVER',options=options)
    driver.get(r'https://www.amazon.com/ap/signin?accountStatusPolicy=P1&clientContext=261-1149697-3210253&language=en_US&openid.assoc_handle=amzn_prime_video_desktop_us&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.primevideo.com%2Fauth%2Freturn%2Fref%3Dav_auth_ap%3F_encoding%3DUTF8%26location%3D%252Fref%253Ddv_auth_ret')
    # Give the sign-in page time to load
    sleep(5)
    driver.find_element_by_id('ap_email').send_keys('ENTER YOUR EMAIL ID')
    driver.find_element_by_id('ap_password').send_keys('ENTER YOUR PASSWORD',Keys.ENTER)
    sleep(2)
    search(driver)

Let's create a search() function that looks for the genre specified.

def search(driver):
    # Type the genre into the search box and submit
    driver.find_element_by_id('pv-search-nav').send_keys('Comedy Movies',Keys.ENTER)

    # Keep scrolling until the page height stops growing (no more results load)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("scrollTo(0, document.body.scrollHeight);")
        sleep(5)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Hand the fully loaded page over to BeautifulSoup
    html = driver.page_source
    Soup = soup(html,'lxml')
    tiles = Soup.find_all('div',attrs={"class" : "av-hover-wrapper"})

    for tile in tiles:
        movie_name = tile.find('h1',attrs={"class" : "_1l3nhs tst-hover-title"})
        movie_description = tile.find('p',attrs={"class" : "_36qUej _1TesgD tst-hover-synopsis"})
        movie_rating = tile.find('span',attrs={"class" : "dv-grid-beard-info"})
        try:
            name = movie_name.text
            description = movie_description.text
            rating = movie_rating.span.text
            score = float(rating[-3:])
        except (AttributeError, ValueError):
            # Skip tiles with a missing field or a non-numeric rating
            continue
        # Keep only movies rated above 8.0 and below 10.0
        if 8.0 < score < 10.0:
            movie_descriptions.append(description)
            movie_ratings.append(rating)
            movie_names.append(name)
            print(name, rating)

The function searches for the genre and then scrolls to the bottom of the page. Because Amazon Prime Video loads more results as you scroll, the function uses Selenium's JavaScript executor to scroll repeatedly until the page height stops changing. It then reads the driver's page_source and feeds that HTML to BeautifulSoup.
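The scroll-until-stable loop can also be pulled into its own helper, which makes it easy to cap the number of rounds so the script cannot loop forever. This is a sketch rather than the code used above: `scroll_to_bottom`, `pause`, and `max_rounds` are names introduced here, and `driver` is any object with an `execute_script` method, such as a Selenium WebDriver.

```python
from time import sleep


def scroll_to_bottom(driver, pause=5.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give newly requested results time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new loaded, so the bottom was reached
        last_height = new_height
    return max_rounds  # gave up while the page was still growing
```

A call such as `scroll_to_bottom(driver, pause=5)` would replace the `while True` loop, leaving `search()` to deal only with searching and parsing.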

Note that the if statement keeps only movies with a rating above 8.0 and below 10.0.
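Slicing the last three characters works only if the rating text always ends in a value like `8.3`. A slightly more forgiving check could pull the number out with a regular expression, as in this sketch. The `parse_rating` and `is_highly_rated` helpers and the sample strings are assumptions made for illustration; the real text on the page may differ.

```python
import re


def parse_rating(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = re.findall(r"\d+(?:\.\d+)?", text or "")
    return float(matches[-1]) if matches else None


def is_highly_rated(text, low=8.0, high=10.0):
    """True when the parsed rating lies strictly between low and high."""
    rating = parse_rating(text)
    return rating is not None and low < rating < high


print(is_highly_rated("IMDb 8.3"))
print(is_highly_rated("IMDb 7.9"))
```

Returning `None` for unparseable text keeps the filter from raising on tiles that carry no rating at all.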

Let's make a pandas data frame to hold all our movie information.

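def dataFrame():
    details = {
        'Movie Name' : movie_names,
        'Description' : movie_descriptions,
        'Rating' : movie_ratings
    }
    # Build one row per key, then transpose so each key becomes a column
    data = pd.DataFrame.from_dict(details,orient='index')
    data = data.transpose()
    # Write the table to a CSV file in the working directory
    data.to_csv('Comedy.csv')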
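Building the frame with `orient='index'` and then transposing has a practical side effect: the lists do not have to be the same length, because shorter ones are padded with missing values. The sketch below shows this on made-up data; the movie names and ratings are invented.

```python
import pandas as pd

details = {
    "Movie Name": ["Movie A", "Movie B", "Movie C"],
    "Rating": ["IMDb 8.1", "IMDb 8.4"],  # one entry short on purpose
}

# Each key becomes a row first; transposing turns the keys into columns.
data = pd.DataFrame.from_dict(details, orient="index").transpose()

print(data.shape)
print(data["Rating"].isna().tolist())
```

Passing the same dictionary straight to `pd.DataFrame(details)` would raise an error here, since that constructor expects all lists to have equal length.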

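Now let’s run the functions we defined. open_site() signs in and calls search(), and dataFrame() then writes the collected movies to Comedy.csv.

open_site()
dataFrame()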

Result

The CSV file you get will not look exactly like a formatted version of it (with adjusted column widths, text wrapping, etc.), but the content will be almost the same.

Conclusion

In short, BeautifulSoup and Selenium work well together and give good results for Amazon Prime Video. Python also offers other scraping tools, such as Scrapy, which are equally capable.
