Web Scraping in 2024

Why Selenium?

Selenium is an amazing tool that stands out among modern web scraping tools and libraries. Its defining feature is the ability to automate a real browser: while that makes it the primary candidate for website testing, don't overlook its capability to crawl the web and collect data.

How do I use it?

Here's a step-by-step guide on how to use Selenium WebDriver for web scraping:

  1. Setting Up Selenium WebDriver: First, you need to set up Selenium WebDriver. You can install it using pip with the command pip install selenium. You also need the WebDriver binary for the browser you plan to use (like ChromeDriver for Google Chrome); the webdriver-manager package used in the snippet below can download and manage it for you (pip install webdriver-manager).
  2. Loading a Web Page: After setting up Selenium WebDriver, you can start loading a web page. For example, to load Google's homepage, you would use the following code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.google.com/")

In this code, driver.get() is used to navigate to the specified URL.

  3. Interacting with Web Page Elements: Selenium allows you to interact with web page elements. For instance, you can fill a search box and submit a form:
search = driver.find_element(by=By.NAME, value="q")
search.send_keys("Selenium")
search.send_keys(Keys.ENTER)

In this code, driver.find_element() is used to locate the search box on the page, and send_keys() is used to type into the search box.

  4. Parsing HTML Content: After interacting with the web page, you can retrieve the HTML content of the page using driver.page_source. This content can then be parsed using a library like BeautifulSoup to extract the data you need, as sketched below.
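
For instance, here is a minimal sketch of this hand-off from Selenium to BeautifulSoup, continuing the Google example above (the h3 selector is illustrative and may change as Google updates its markup):

from bs4 import BeautifulSoup

# Hand the rendered HTML from Selenium over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
# Google renders result titles as <h3> elements (illustrative selector)
titles = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
print(titles)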
  5. Closing the Browser: Finally, after you're done with your operations, you should close the browser using driver.quit().
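
A common pattern (a small sketch, not part of the steps above) is to wrap the work in try/finally so the browser is released even when an exception is raised:

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
try:
    driver.get("https://www.google.com/")
    # ... locate elements and extract data here ...
finally:
    # Always close the browser, even if something above failed
    driver.quit()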

Selenium WebDriver is especially useful for scraping dynamic web pages that rely heavily on JavaScript. It can interact with elements on the page and simulate user interactions, making it easier to scrape data from these types of websites.
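
When a page builds its content with JavaScript, prefer Selenium's explicit waits over fixed sleeps. A minimal sketch (the element name "q" comes from the Google example above):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Block for up to 10 seconds until the search box exists in the DOM,
# then continue immediately once it appears
wait = WebDriverWait(driver, 10)
search = wait.until(EC.presence_of_element_located((By.NAME, "q")))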

Tips and tricks

Check out a few snippets from a working scraper bot that parses sneaker sites to collect their sizes and prices.

Let's break down each function in the provided Selenium script:

  1. get_random_user_agent(): This function generates a random user agent string from a predefined list. User agents are used to identify the client software originating the request, and in this case, it helps mimic a regular browser visit.

     # Assumes `import random` at the top of the script
     def get_random_user_agent():
         user_agents = [
             'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_15) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5392.175 Safari/537.36',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.4.263.6 Safari/537.36',
             # Add more user-agents as needed
         ]
         return random.choice(user_agents)
    
  2. get_stockx_sneakerlinks(codeid): This function takes a product code id as an argument, sets up a headless Chrome browser with a random user agent, navigates to the corresponding StockX product page, and extracts all product links on that page. The extracted links are returned as a list.

     def get_stockx_sneakerlinks(codeid):
         # Configure Chrome options
         options = Options()
         # Set a randomized user-agent
         options.add_argument(f"user-agent={get_random_user_agent()}")
         options.add_argument("--headless")
         # Initialize WebDriver
         driver = webdriver.Chrome(options=options)

         driver.get(f"https://stockx.com/search?s={codeid}")
         # driver.get() returns None, so wait on the driver itself until the
         # product tiles have been rendered by the page's JavaScript
         # (requires: from selenium.webdriver.support import expected_conditions as EC)
         WebDriverWait(driver, 20).until(
             EC.presence_of_element_located(
                 (By.CSS_SELECTOR, "a[data-testid='productTile-ProductSwitcherLink']")
             )
         )

         page_source = driver.page_source
         driver.quit()
         soup = BeautifulSoup(page_source, 'html.parser')

         product_tile_links = soup.find_all('a', {'data-testid': 'productTile-ProductSwitcherLink'})
         href_values = [link.get('href') for link in product_tile_links]

         print(href_values)
         return href_values
    
  3. perform_sneakersize_scraping(link): This function takes a product link as an argument, sets up another headless Chrome browser with a random user agent and a spoofed geolocation (Rome, Italy), navigates to the corresponding product page, waits for it to load, and then extracts the available sizes and prices. The extracted data is returned as a dictionary keyed by the product code.

     def perform_sneakersize_scraping(link):
         # Configure Chrome options
         options = Options()
         # Set a randomized user-agent
         options.add_argument(f"user-agent={get_random_user_agent()}")
         options.add_argument("--headless")
         # Initialize WebDriver
         driver = webdriver.Chrome(options=options)
         # Spoof the browser's geolocation through the Chrome DevTools Protocol.
         # (Chrome has no --geolocation command-line switch, and a plain
         # execute_script override would be lost on the next navigation.)
         # Coordinates for Rome, Italy.
         driver.execute_cdp_cmd("Emulation.setGeolocationOverride", {
             "latitude": 41.9028,
             "longitude": 12.4964,
             "accuracy": 100,
         })

         size_data = {}

         driver.get(f"https://stockx.com{link}")
         # Wait until the product-code element has been rendered
         # (adjust the timeout as needed)
         WebDriverWait(driver, 20).until(
             EC.presence_of_element_located(
                 (By.CSS_SELECTOR, "p.chakra-text.css-wgsjnl")
             )
         )
         time.sleep(10)

         page_source = driver.page_source
         soup = BeautifulSoup(page_source, 'html.parser')
         sneaker_code = soup.find('p', class_='chakra-text css-wgsjnl')

         # `saved_codes` is a collection of tracked product codes,
         # defined elsewhere in the script
         if sneaker_code is None or sneaker_code.text not in saved_codes:
             print(f"ignore {sneaker_code.text if sneaker_code else link}")
             driver.quit()
             return

         # Open the size-selector drop-down
         size_button = driver.find_element(By.CSS_SELECTOR, "#menu-button-pdp-size-selector")
         size_button.click()
         time.sleep(8)

         # Each tile pairs a size (<dt>) with its price (<dd>)
         size_tiles = driver.find_elements(By.CSS_SELECTOR, "#menu-list-pdp-size-selector > div.css-1kgaafq > div > button")
         for tile in size_tiles:
             sneaker_size = tile.find_element(By.CSS_SELECTOR, "dt").text
             sneaker_price = tile.find_element(By.CSS_SELECTOR, "dd").text
             size_data[sneaker_size] = sneaker_price
         print(size_data)
         driver.quit()

         return {sneaker_code.text: size_data}
    

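To tie the snippets together, a hypothetical driver loop might look like this (the product codes are placeholders, and the polite delay between requests is an assumption, not part of the original script):

saved_codes = ["CODE-1", "CODE-2"]  # placeholder product codes to track

results = []
for code in saved_codes:
    for link in get_stockx_sneakerlinks(code):
        data = perform_sneakersize_scraping(link)
        if data:
            results.append(data)
        time.sleep(5)  # polite delay between product pages

print(results)
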
These functions demonstrate how Selenium can be used to interact with a website much like a human user would. Patterns like these (randomized user agents, headless browsing, explicit waits) can help your next project avoid basic bot detection, though no technique makes scraping undetectable.
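
To go one step further, a few Chrome options are commonly layered on top of randomized user agents to make headless Selenium less conspicuous. This is a minimal sketch assuming Selenium 4 and a recent Chrome; the options below only remove the most obvious automation fingerprints and are no guarantee against detection:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument(f"user-agent={get_random_user_agent()}")  # from the snippet above
# The "new" headless mode renders much closer to a regular Chrome window
options.add_argument("--headless=new")
# Drop the "controlled by automated software" infobar and automation switch
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Hide the navigator.webdriver flag that simple bot checks inspect
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)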