
Data Fields to Be Scraped
The example shown is artwork drawn by hand in pencil by one artist and then inked over by another. Typically, 11 × 17-inch panels are used. The vitality of the drawing style, as well as the obvious skill, appeals to everyone.
Suppose you bought two panels of original art for interior pages from Spider-Man comics of the 1980s around 2010. You might have paid $200 or $300 for them and made slightly more than twice that much when you sold them a year later.
Now suppose you are interested in purchasing several pieces at the $200 level and want additional information before doing so.
The full code is given below; its main output is two CSV files.
The first CSV file holds the leading 800 listings of original comic art from Marvel comics (interior pages, covers, or splash pages), ordered by price. The following fields are scraped from eBay into the CSV:
- Title (which usually includes a 20-word description of the item, character, and type of page)
- Price
- Link to the item's full eBay sales page
- Complete list of all characters in the artwork
Just after the first eBay search, the software cycles through the page-number links for further matches at the bottom of the page. eBay flags the application as a bot and prevents it from scraping pages numbered higher than four. This is acceptable because those later pages only contain goods that normally sell for less than $75, and almost none of them are original comic art; they are largely copies or fan art.
The second file does the same thing for items that have already been sold, using the same search criteria. This part requires Selenium.
If you run Selenium more than two or three times in an hour, eBay will block it and you will have to download the HTML of sold comic art manually.
Expected Result
You can check the result by running the code once a day and looking through the CSV file for listings currently for sale in the $100 to $300 range, which mostly feature lesser-known characters.
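The daily check described above can be sketched with the standard csv module. This is a minimal sketch, assuming the output file has the title, price, and link columns used later in this walkthrough; the function name listings_in_price_range and the sample rows are made up for illustration.

```python
import csv
import io

def listings_in_price_range(csv_file, low=100.0, high=300.0):
    """Return the rows whose 'price' column falls within [low, high]."""
    matches = []
    for row in csv.DictReader(csv_file):
        try:
            price = float(row["price"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or malformed price
        if low <= price <= high:
            matches.append(row)
    return matches

# Made-up sample data standing in for a real output file.
sample = io.StringIO(
    "title,price,link\n"
    "EXAMPLE INTERIOR PAGE A,250.0,https://example.com/a\n"
    "EXAMPLE COVER B,5000.0,https://example.com/b\n"
    "EXAMPLE SPLASH C,120.0,https://example.com/c\n"
)
picks = listings_in_price_range(sample)
for row in picks:
    print(row["price"], row["title"])
```

With a real file, the same function can be given a handle opened with open(name, newline='').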

Tools that are used: Python, requests, BeautifulSoup, pandas
Here are the steps that we will follow:
We will scrape eBay using “original comic art” as the search string, with the following filters:
- only cover, interior pages, or splash pages
- only comic art from Marvel or DC
- comics above the price of $50
- sorted by price + shipping and highest to lowest
- 200 results per page
We'll build a comprehensive list of the available original comic art that matches the search parameters. For each listing, we'll retrieve the title / brief description (as a single string), the URL of the actual listing page, and the price.
For each listing, we'll also get the main comic book character's name in one field and the names of all the characters in the image in a second field.
We'll write a CSV file with the eBay product data in the following format: title, price, link, character, and a multi-character field.
Installing all the Packages for the Project
!pip install requests --upgrade --quiet
!pip install bs4 --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install selenium --upgrade --quiet
!pip install selenium_stealth --upgrade --quiet
# datetime is part of the Python standard library, so it needs no install
First, import the time and date packages so that you can track the program's progress and later use the date and time in the CSV file name.
import time
from datetime import date
from datetime import datetime

now = datetime.now()
today = date.today()
today = today.strftime("%b-%d-%Y")
date_time = now.strftime("%H-%M-%S")
today = today + "-" + date_time
print("date and time:", today)

date and time: Jul-17-2021-15-14-55
Create a Function to Print the Date and Time
def update_datetime():
    # refresh the module-level timestamp values and print the result
    global now
    global today
    global date_time
    now = datetime.now()
    today = date.today()
    today = today.strftime("%b-%d-%Y")
    date_time = now.strftime("%H-%M-%S")
    today = today + "-" + date_time
    print("date and time:", today)
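The function above keeps its result in global variables. As an alternative, here is a minimal sketch that returns the timestamp instead, which makes it easier to reuse when building file names. The names timestamp and stamped_name are hypothetical and not part of the original code.

```python
from datetime import datetime

def timestamp(now=None):
    """Return a string such as 'Jul-17-2021-15-14-55' for use in file names."""
    if now is None:
        now = datetime.now()
    return now.strftime("%b-%d-%Y-%H-%M-%S")

def stamped_name(prefix, extension, now=None):
    """Build '<prefix>-<timestamp>.<extension>'."""
    return "{}-{}.{}".format(prefix, timestamp(now), extension)

# A fixed datetime is passed here only to make the example repeatable.
print(stamped_name("comic_art_marvel_dc", "html", datetime(2021, 7, 17, 15, 14, 55)))
```

Passing the time in as an argument also keeps the date and the clock time consistent, since both come from a single datetime value.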
Next, Scrape the Search URL
- To download the page, use the requests package.
- Use Beautiful Soup (BS4) to parse the page and find the appropriate HTML tags.
- Transform the artwork information to a Pandas dataframe.
import requests
from bs4 import BeautifulSoup

# original comic art, marvel or dc only, buy it now, over 50,
# interior splash or cover, sorted by price high to low
orig_comicart_marv_dc_50plus_200perpage = ''
orig_comicart_marv_dc_50plus_200perpage_sold = ''
search_url = orig_comicart_marv_dc_50plus_200perpage

# there is a way to use headers in this function call to change the
# user agent so the site thinks the request is coming from
# different computers with different browsers but I could not get this working
# response = requests.get(url, headers=headers)
response = requests.get(search_url)
if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(search_url))
page_contents = response.text
doc = BeautifulSoup(page_contents, 'html.parser')
If the request succeeds, response.status_code is 200. If it is anything else, the code raises an exception naming the page that failed to load; otherwise it continues. doc is a BeautifulSoup (BS4) object that makes searching for HTML tags and navigating the Document Object Model (DOM) a breeze.
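The comment in the code above mentions sending custom headers to change the user agent. A minimal sketch of how such a headers dictionary is built follows. To stay within the standard library it attaches the headers to a urllib.request.Request; with requests, the same dictionary would be passed as requests.get(url, headers=HEADERS). The user-agent string is the one used in the Selenium section of this walkthrough, the example.com URL and the helper name build_request are placeholders, and whether this changes how eBay treats the request is untested.

```python
import urllib.request

# Headers a desktop browser might send; the values are illustrative.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url, headers=HEADERS):
    """Return a request object that carries the given headers."""
    return urllib.request.Request(url, headers=headers)

# Placeholder URL; nothing is downloaded until the request is opened.
req = build_request("https://example.com/search")
print(req.get_header("User-agent"))
```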
Now Save the HTML Files
# first use the date and time in the file name
filename = 'comic_art_marvel_dc-' + today + '.html'
with open(filename, 'w') as f:
    f.write(page_contents)
We can use h3 tags with the class 's-item__title' to acquire the listing's title/description.
title_class = 's-item__title'
title_tags = doc.find_all('h3', {'class': title_class})
This locates all of the matching h3 tags in the BS4 document.
# make a list for all the titles
title_list = []
Loop through the tags and obtain only the contents of each one.
for i in range(len(title_tags)):
    # make sure there are contents first
    if title_tags[i].contents:
        title_contents = title_tags[i].contents[0]
        title_list.append(title_contents)

len(title_list)
202
print(title_list[:5])
['WHAT IF ASTONISHING X-MEN #1 ORIGINAL J. SCOTT CAMPBELL COMIC COVER ART MARVEL', 'CHAMBER OF DARKNESS #7 COVER ART (VERY FIRST BERNIE WRIGHTSON MARVEL COVER) 1970', 'MANEELY, JOE - WILD WESTERN #46 GOLDEN AGE MARVEL COMICS COVER (LARGE ART) 1955', 'Superman vs Captain Marvel Double page splash art by Rich Buckler DC 1978 YOWZA!', 'SIMON BISLEY 1990 DOOM PATROL #39 ORIGINAL COMIC COVER ART PAINTING DC COMICS']
Because the price is in the same section of the HTML page as the title, we'll use the findNext function, this time looking for a 'span' element with the class 's-item__price'. When separate functions were run to get the titles and then the prices, there were occasionally duplicate title tags, so the lengths of the lists didn't match: a 202-item title list and a 200-item price list, which couldn't be combined in a data frame.
In addition, using findNext() and findPrevious() may make the whole search a little faster.
price_class = 's-item__price'
price_list = []
for i in range(len(title_tags)):
    # make sure there are contents first
    if title_tags[i].contents:
        title_contents = title_tags[i].contents[0]
        title_list.append(title_contents)
        price = title_tags[i].findNext('span', {'class': price_class})
        if i == 1:
            print(price)
The following displays the price information for the last item listed on the first search page, out of a total of 200.
print(price.contents)
['$60.00']
Now check whether you are getting a string rather than a tag, and strip the dollar sign.
from re import sub

price_string = price.contents[0]
if isinstance(price_string, str):
    price_string = sub(r'[^\d.]', '', price_string)
else:
    # the price text is nested one tag deeper
    price_string = price.contents[0].contents[0]
    price_string = sub(r'[^\d.]', '', price_string)
print(price_string)
60.00
Converting the Price into a Floating-Point Decimal
price_num = float(price_string)
print(price_num)
60.0
for i in range(len(title_tags)):
    if title_tags[i].contents:
        title_contents = title_tags[i].contents[0]
        title_list.append(title_contents)
        price = title_tags[i].findNext('span', {'class': price_class})
        if price.contents:
            price_string = price.contents[0]
            if isinstance(price_string, str):
                price_string = sub(r'[^\d.]', '', price_string)
            else:
                price_string = price.contents[0].contents[0]
                price_string = sub(r'[^\d.]', '', price_string)
            price_num = float(price_string)
            price_list.append(price_num)

print(len(price_list))
202
print(price_list[:5])
[50000.0, 45000.0, 18000.0, 16000.0, 14999.99]
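The pattern [^\d.] used above removes everything except digits and dots. That works for a single amount, but a price text containing two amounts (for example a hypothetical "$10.00 to $20.00", which the output above does not show) would collapse to "10.0020.00" and make float() fail. Here is a minimal sketch of a more defensive parser that takes the first amount only; the function name parse_price is an assumption for illustration.

```python
import re

# One amount: digits with optional thousands separators and decimals.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(text):
    """Return the first dollar amount in text as a float, or None."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))

print(parse_price("$60.00"))
print(parse_price("$14,999.99"))
print(parse_price("$10.00 to $20.00"))
print(parse_price("no price"))
```

Returning None instead of raising lets the caller decide whether to skip the listing or record a blank, which helps keep the title, price, and link lists the same length.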
Now find the anchor tag with an href for each listing and collect the links to the individual art listings.
item_page_link = title_tags[i].findPrevious('a', href=True)
link_list = []
Clearing the Other Lists
title_list.clear()
price_list.clear()
for i in range(len(title_tags)):
    if title_tags[i].contents:
        title_contents = title_tags[i].contents[0]
        title_list.append(title_contents)
        price = title_tags[i].findNext('span', {'class': price_class})
        if price.contents:
            price_string = price.contents[0]
            if isinstance(price_string, str):
                price_string = sub(r'[^\d.]', '', price_string)
            else:
                price_string = price.contents[0].contents[0]
                price_string = sub(r'[^\d.]', '', price_string)
            price_num = float(price_string)
            price_list.append(price_num)
        item_page_link = title_tags[i].findPrevious('a', href=True)
        # {'class': 's-item__link'})
        if item_page_link.text:
            href_text = item_page_link['href']
            link_list.append(item_page_link['href'])

len(link_list)
202
print(link_list[:5])
Creating a DataFrame using the Dictionary
import pandas as pd

# collect the three lists into one dictionary, one key per column
title_and_price_dict = {
    'title': title_list,
    'price': price_list,
    'link': link_list
}
title_price_link_df = pd.DataFrame(title_and_price_dict)
len(title_price_link_df)
202
print(title_price_link_df[:5])
                                               title  ...  link
0  WHAT IF ASTONISHING X-MEN #1 ORIGINAL J. SCOTT...  ...
1  CHAMBER OF DARKNESS #7 COVER ART (VERY FIRST B...  ...
2  MANEELY, JOE - WILD WESTERN #46 GOLDEN AGE MAR...  ...
3  Superman vs Captain Marvel Double page splash ...  ...
4  SIMON BISLEY 1990 DOOM PATROL #39 ORIGINAL COM...  ...
[5 rows x 3 columns]
For now, we are only interested in the first six pages of results returned by our search URL. With 200 listings per page, that would potentially give 1200 listings ordered by price. Unfortunately, eBay stops processing requests after the fourth page, which leaves 800 listings. Given the current listings on eBay, this should be enough to capture every item over $75. The listings below this amount are almost entirely fan art rather than actual comic art.
The quick and simple method is to find the page-number links in the lower-left corner of the results page and collect the link for each page.
links_with_pgn_text = []
for a in doc.find_all('a', href=True):
    if a.text:
        href_text = a['href']
        if href_text.find('pgn=') != -1:
            links_with_pgn_text.append(a['href'])

len(links_with_pgn_text)
7
print(links_with_pgn_text[:3])
Converting This into a Function
def build_pagelink_list(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    for a in doc.find_all('a', href=True):
        if a.text:
            href_text = a['href']
            if href_text.find('pgn=') != -1:
                links_with_pgn_text.append(a['href'])
    # below gets run if there is only 1 page of listings
    if len(links_with_pgn_text) < 1:
        links_with_pgn_text.append(url)

links_with_pgn_text.clear()
build_pagelink_list(orig_comicart_marv_dc_50plus_200perpage)
len(links_with_pgn_text)
7
print(links_with_pgn_text)
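The substring test for 'pgn=' collects every anchor that mentions a page number, so the same page may appear more than once (the count of 7 above suggests this, though the output does not confirm it). A minimal sketch of keeping one link per page number, using urllib.parse from the standard library, is shown below. The helper names and the example.com URLs are hypothetical, and the exact name of eBay's page parameter is not assumed beyond it ending in "pgn".

```python
from urllib.parse import urlparse, parse_qs

def page_number(href):
    """Return the page number in a link's query string, or None."""
    query = parse_qs(urlparse(href).query)
    for key, values in query.items():
        if key.endswith("pgn") and values[0].isdigit():
            return int(values[0])
    return None

def unique_page_links(hrefs):
    """Keep the first link seen for each page number, sorted by page."""
    by_page = {}
    for href in hrefs:
        page = page_number(href)
        if page is not None and page not in by_page:
            by_page[page] = href
    return [by_page[page] for page in sorted(by_page)]

# Made-up links standing in for the anchors found on a results page.
sample_links = [
    "https://example.com/sch?q=art&pgn=2",
    "https://example.com/sch?q=art&pgn=1",
    "https://example.com/sch?q=art&pgn=2",
    "https://example.com/help",
]
print(unique_page_links(sample_links))
```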
Extracting the Sold Items
Now we'll scrape eBay for sold listings and their prices. The long-term aim is to compare the prices of items currently listed for sale with those of recently sold items, to determine whether current listings are reasonably priced or underpriced and worth considering.
According to eBay, this second link only returns results for items that have already been sold. However, this search yields fewer than 200 results, and we may have to download the file for this notebook manually. The procedure can also be automated with Selenium, and the code for that is given below.
orig_comicart_marv_dc_50plus_200perpage_sold = '
If you need to save the page manually, choose File -> Save Page As in Chrome and select "Webpage, HTML Only".
Name the file "sold_listings.html".
!apt update
!apt install chromium-chromedriver --quiet

from selenium import webdriver
from selenium_stealth import stealth

def selenium_run(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # open it, go to a website, and get results
    # chromedriver is looked up on PATH (assumes the apt install above put it there)
    driver = webdriver.Chrome(options=options)
    # uncomment below and change paths if running locally (and comment the line above)
    #PATH = '/Users/jmartin/Downloads/chromedriver'
    #driver = webdriver.Chrome(options=options, executable_path=r"/Users/jmartin/Downloads/chromedriver")
    stealth(
        driver,
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
        languages = "en",
        vendor = "Google Inc.",
        platform = "Win32",
        webgl_vendor = "Intel Inc.",
        renderer = "Intel Iris OpenGL Engine",
        fix_hairline = False,
        run_on_insecure_origins = False
    )
    driver.delete_all_cookies()
    driver.get(url)
    update_datetime()
    #html_file_name = "sold_page_source-" + today + ".html"
    html_file_name = "sold_listings.html"
    with open(html_file_name, "w") as f:
        f.write(driver.page_source)
    # close the browser once the page source has been saved
    driver.quit()
    return html_file_name

fname = selenium_run(orig_comicart_marv_dc_50plus_200perpage_sold)
date and time: Jul-17-2021-15-17-25
print(fname)
sold_listings.html
with open(fname) as fp:
    doc = BeautifulSoup(fp, 'html.parser')
For the sold products page, the classes for the title, link, and price tags are a little different.
title_class = 'lvtitle'
price_class = 'bold bidsold'
link_class = 'vip'
Create a requests session so that its cookies can be cleared before each scrape, to avoid being blocked by the website.
s = requests.session()
Place it all into one function which will scrape for current or sold listings based on the function arguments.
def scrape_titles_and_prices(url, document):
    # clear the session cookies before each scrape
    s.cookies.clear()
    update_datetime()
    if document:
        # sold listings: parse the locally saved page
        using_local_doc = True
        doc = document
        title_class = 'lvtitle'
        price_class = 'bold bidsold'
        link_class = 'vip'
    else:
        # current listings: download the search page
        print('processing a link: ', url)
        using_local_doc = False
        # use the session whose cookies were just cleared
        response = s.get(url)
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(url))
        page_contents = response.text
        doc = BeautifulSoup(page_contents, 'html.parser')
        filename = 'comic_art_marvel_dc' + today + '.html'
        with open(filename, 'w') as f:
            f.write(page_contents)
        title_class = 's-item__title'
        price_class = 's-item__price'
        link_class = 's-item__link'
    title_tags = doc.find_all('h3', {'class': title_class})
    title_list = []
    price_list = []
    link_list = []
    for i in range(len(title_tags)):
        if title_tags[i].contents:
            if using_local_doc:
                title_contents = title_tags[i].contents[0].contents[0]
            else:
                title_contents = title_tags[i].contents[0]
            title_list.append(title_contents)
            price = title_tags[i].findNext('span', {'class': price_class})
            if price.contents:
                if len(price.contents) > 1 and using_local_doc:
                    price_string = price.contents[1].contents[0]
                else:
                    price_string = price.contents[0]
                if isinstance(price_string, str):
                    price_string = sub(r'[^\d.]', '', price_string)
                else:
                    price_string = price.contents[0].contents[0]
                    price_string = sub(r'[^\d.]', '', price_string)
                price_num = float(price_string)
                price_list.append(price_num)
            item_page_link = title_tags[i].findPrevious('a', href=True)
            # {'class': 's-item__link'})
            if item_page_link.text:
                href_text = item_page_link['href']
                link_list.append(item_page_link['href'])
    title_and_price_dict = {
        'title': title_list,
        'price': price_list,
        'link': link_list
    }
    title_price_link_df = pd.DataFrame(title_and_price_dict)
    # returns a data frame
    return title_price_link_df

result = scrape_titles_and_prices("", doc)
date and time: Jul-17-2021-15-18-43
print(result[:10])
Empty DataFrame
Columns: [title, price, link]
Index: []
Exporting the Result to a .csv File
If writing the CSV fails in later runs of this project, you may have to downgrade pandas to get it to work.
!pip uninstall pandas -y
!pip install pandas==1.1.5

update_datetime()
fname = "origcomicart" + "-sold-" + today + ".csv"
result.to_csv(fname, index=False)
print(fname)
date and time: Sep-27-2021-13-25-01
origcomicart-sold-Sep-27-2021-13-25-01.csv

Now that we have a .csv file with all the sold listings (the same goes for a CSV file with all the current listings), visit the individual listing page behind each link to collect the identity of the main character, as well as all the characters in the art.
import csv

def indiv_page_link_cycler(csv_name):
    with open(csv_name, newline='') as f:
        reader = csv.reader(f)
        data = list(reader)
    # go through each link and add character to each list
    # skip header row
    for i in range(1, len(data)):
        if i % 50 == 0:
            update_datetime()
            print(i, ' :links processed')
        link = data[i][2]
        response = requests.get(link)
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(link))
        page_contents = response.text
        doc = BeautifulSoup(page_contents, 'html.parser')
        selection_class = 'attrLabels'
        character_tags = doc.find_all('td', {'class': selection_class})
        for j in range(len(character_tags)):
            if character_tags[j].contents:
                fullstring = str(character_tags[j].contents[0])
                # match either "Character" or "character"
                if 'character' in fullstring.lower():
                    character = character_tags[j].findNext('span')
                    data[i].append(character.text)
    data[0].append('characters')
    data[0].append('multi-characters')
    fname = csv_name[:-4]
    fname = fname + "_chars.csv"
    with open(fname, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data)
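The label check in the loop above has to match the word whether or not it is capitalized. One pitfall is worth spelling out: in Python, an expression such as ("Character" or "character") in text evaluates the or first, so only the first spelling is ever tested. The sketch below shows the difference with a case-insensitive helper; the helper name mentions_character and the sample labels are made up for illustration.

```python
def mentions_character(label):
    """Case-insensitive test for the word 'character' in a label."""
    return isinstance(label, str) and "character" in label.lower()

# ("Character" or "character") evaluates to "Character" before `in` runs,
# so the lowercase spelling is never tested.
print(("Character" or "character") in "Main character:")  # False
print(mentions_character("Main character:"))              # True
print(mentions_character("Character:"))                   # True
print(mentions_character("Publisher:"))                   # False
```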
Pass in the name of the CSV file from the previous output.
indiv_page_link_cycler(fname)
date and time: Sep-27-2021-13-26-48
50 :links processed
date and time: Sep-27-2021-13-27-27
100 :links processed
date and time: Sep-27-2021-13-28-08
150 :links processed
date and time: Sep-27-2021-13-28-47
200 :links processed
A new CSV file is written with the characters' identities added to each entry. Its name is identical to the one above, with "_chars" added at the end.
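Because the loop above requests a couple of hundred listing pages in a row and eBay is known to flag the scraper as a bot, it may help to pause between requests. The following is a minimal sketch under that assumption; the function name fetch_all and the two-second delay are illustrative choices, and whether a delay actually prevents blocking is untested.

```python
import time

def fetch_all(links, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(link) for each link, pausing between consecutive calls."""
    pages = []
    for index, link in enumerate(links):
        if index > 0:
            sleep(delay_seconds)
        pages.append(fetch(link))
    return pages

# Demonstration with a stand-in fetch function and a recorded sleep,
# so nothing is downloaded and nothing actually waits.
pauses = []
pages = fetch_all(
    ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    fetch=lambda link: "<html>{}</html>".format(link),
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In the real loop, fetch would be a small function that wraps requests.get and returns response.text.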