Web Scraping for Data Science with Python

DodaTech

In this tutorial, you will learn web scraping for Data Science with Python using Beautiful Soup, Requests, and Selenium to extract structured data from websites, handle dynamic content, and build ethical scrapers.

What You'll Learn

Send HTTP requests, parse HTML with Beautiful Soup, extract data with CSS selectors, handle JavaScript-rendered pages with Selenium, implement polite scraping with delays and retries, and save results to CSV.

Why It Matters

Not all data comes in clean CSV files. Web scraping lets you collect data from public websites for analysis, monitoring, and research, which opens up thousands of data sources that would otherwise be inaccessible.

Real-World Use

Durga Antivirus Pro scrapes public threat intelligence feeds, CVE databases, and security advisory pages daily to update its threat signatures. Automated scrapers monitor 50+ sources and extract structured threat data into a centralized analysis pipeline.

Web Scraping Pipeline

flowchart LR
  A[Target URL] --> B[HTTP Request]
  B --> C{Response OK?}
  C -->|Yes| D[Parse HTML]
  C -->|No| E[Retry / Log Error]
  D --> F[Extract with CSS Selectors]
  F --> G[Clean & Structure]
  G --> H[Save to CSV / Database]

Static Page Scraping with Beautiful Soup

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://quotes.toscrape.com"
# A timeout keeps the script from hanging if the server never responds.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

quotes = []
# Each quote on the page sits in its own <div class="quote">.
for quote_div in soup.select("div.quote"):
    text = quote_div.select_one("span.text").get_text(strip=True)
    author = quote_div.select_one("small.author").get_text(strip=True)
    tags = [tag.get_text(strip=True) for tag in quote_div.select("a.tag")]
    quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})

df = pd.DataFrame(quotes)
print(df.head(2))  # show the first two rows
print(f"Total quotes scraped: {len(df)}")

Output:

                                                text        author  \
0  The world as we have created it is a process ...  Albert Einstein
1  It is our choices, Harry, that show what we t...   J.K. Rowling

                                               tags
0  change, deep-thoughts, thinking, world
1  abilities, choices

Total quotes scraped: 10
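
The example above collects the quotes into a DataFrame but stops before the "Clean & Structure" and "Save to CSV" stages of the pipeline. Here is a minimal sketch of those two stages, assuming records shaped like the `quotes` list. The sample rows, the `save_quotes` helper, and the `quotes.csv` filename are placeholders for illustration, not scraped data.

```python
import pandas as pd

# Placeholder records shaped like the scraped `quotes` list (not real scraped data).
quotes = [
    {"text": "  Example quote one  ", "author": "Author A", "tags": "tag1, tag2"},
    {"text": "Example quote two", "author": "Author B", "tags": ""},
    {"text": "Example quote two", "author": "Author B", "tags": ""},  # duplicate row
]

def save_quotes(records, path):
    """Clean the records, write them to a CSV file, and return the cleaned DataFrame."""
    df = pd.DataFrame(records)
    df["text"] = df["text"].str.strip()               # trim stray whitespace
    df = df.drop_duplicates().reset_index(drop=True)  # drop repeated rows
    df.to_csv(path, index=False, encoding="utf-8")    # index=False omits the row-number column
    return df

cleaned = save_quotes(quotes, "quotes.csv")
print(f"Saved {len(cleaned)} rows")
```

Passing `index=False` keeps the row numbers out of the file, so the CSV loads back with `pd.read_csv` into the same three columns.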

Handling JavaScript-Rendered Pages with Selenium

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

data = []
try:
    driver.get("https://quotes.toscrape.com/js/")
    # Wait up to 10 seconds for JavaScript to insert the quote elements.
    wait = WebDriverWait(driver, 10)
    quote_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.quote"))
    )

    for el in quote_elements[:5]:
        text = el.find_element(By.CSS_SELECTOR, "span.text").text
        author = el.find_element(By.CSS_SELECTOR, "small.author").text
        data.append({"text": text, "author": author})
finally:
    driver.quit()  # always close the browser, even if the wait times out

df = pd.DataFrame(data)
print(df)

Output:

                                                text        author
0  The world as we have created it is a process ...  Albert Einstein
1  It is our choices, Harry, that show what we t...   J.K. Rowling
2  There are only two ways to live your life.  O...  Albert Einstein
3  The person, be it gentleman or lady, who has ...     Jane Austen
4  Imperfection is beauty, madness is genius and ...    Marilyn Monroe

Polite Scraping with Rate Limiting

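import time
from random import uniform

import pandas as pd
import requests
from bs4 import BeautifulSoup

def polite_scrape(urls, delay_range=(1, 3)):
    """Fetch each URL in turn, pausing a random interval between requests."""
    data = []
    for url in urls:
        try:
            response = requests.get(
                url,
                headers={"User-Agent": "DataScienceTutorial/1.0"},
                timeout=10,
            )
            response.raise_for_status()  # treat 4xx/5xx responses as errors
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else "No title"
            data.append({"url": url, "title": title, "status": response.status_code})
            outcome = "Scraped"
        except requests.RequestException as e:
            # Record the failure and keep going with the remaining URLs.
            data.append({"url": url, "error": str(e), "status": None})
            outcome = "Failed"
        delay = uniform(*delay_range)  # random pause so requests are not evenly spaced
        print(f"{outcome} {url} -- waiting {delay:.1f}s")
        time.sleep(delay)
    return pd.DataFrame(data)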

urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
df = polite_scrape(urls)
print(df)

Output:

Scraped https://quotes.toscrape.com/page/1/ -- waiting 2.3s
Scraped https://quotes.toscrape.com/page/2/ -- waiting 1.7s
                                          url            title  status
0  https://quotes.toscrape.com/page/1/  Quotes to Scrape     200
1  https://quotes.toscrape.com/page/2/  Quotes to Scrape     200
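
`polite_scrape` records a failed request and moves on; it does not retry. One way to add the "Retry" branch of the pipeline is exponential backoff, sketched below using only the standard library. The `with_retries` helper and its parameters are illustrative names, not part of Requests, and the flaky function is a stand-in for a real network call.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0,
                 exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    `sleep` is injectable so the backoff can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

waits = []  # collects the delays instead of actually sleeping
html = with_retries(flaky_fetch, exceptions=(ConnectionError,), sleep=waits.append)
print(html, waits)
```

With Requests, `func` could be a small function that calls `requests.get(url, timeout=10)` followed by `raise_for_status()`, with `exceptions=(requests.RequestException,)`, so that both network failures and HTTP error codes trigger a retry.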

Practice Questions

  1. What is the purpose of the User-Agent header in HTTP requests, and why should you set it?
  2. When would you choose Selenium over Requests + Beautiful Soup for a scraping project?
  3. What are the ethical and legal considerations before scraping a website?

Answers:

  1. User-Agent identifies your client to the server. Some sites reject requests whose User-Agent is missing or is a library default (such as python-requests), so setting one explicitly reduces the chance of being blocked by basic bot detection. A descriptive value also tells site operators who is making the requests.
  2. Use Selenium when the target website loads content dynamically with JavaScript that modifies the DOM after the initial HTML loads. Selenium runs a real browser that executes JavaScript, making the full page content available.
  3. Check robots.txt for scraping permissions, review the website's terms of service, implement rate limiting to avoid overloading servers, identify your scraper with a descriptive User-Agent, and only scrape publicly accessible data without bypassing authentication.
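
The robots.txt check from answer 3 can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses rules from a string so it runs offline. The sample rules and the `build_parser` helper are made up for illustration; a real site publishes its own rules at `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Made-up rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(SAMPLE_ROBOTS)
agent = "DataScienceTutorial/1.0"

print(parser.can_fetch(agent, "https://example.com/page/1/"))       # True
print(parser.can_fetch(agent, "https://example.com/private/data"))  # False
print(parser.crawl_delay(agent))                                    # 2
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the file instead. The value returned by `crawl_delay` could then serve as the lower bound of the delay range in a scraper.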

Challenge

Scrape a multi-page product listing from a public e-commerce site (or a demo site). Extract product name, price, rating, and availability from at least 50 products across multiple pages. Implement polite scraping with delays, handle pagination automatically, save the results to a CSV file, and add error handling for network failures and missing elements.
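
As a possible starting point for the pagination and missing-element parts of the challenge, the sketch below follows a "next" link until none remains and substitutes a default when an element is absent. The `fetch` callable, the `li.next a` selector (assumed to match the Next button on quotes.toscrape.com), the `max_pages` cap, and the fake two-page site are all assumptions for illustration. A product listing would need its own selectors.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def safe_text(parent, selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else default

def scrape_all_pages(fetch, start_url, max_pages=50):
    """Follow 'next' links from start_url, collecting one record per quote.

    `fetch` is any callable that takes a URL and returns HTML text.
    """
    records, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against a next link that loops back
        soup = BeautifulSoup(fetch(url), "html.parser")
        for item in soup.select("div.quote"):
            records.append({
                "text": safe_text(item, "span.text"),
                "author": safe_text(item, "small.author", default="Unknown"),
            })
        next_link = soup.select_one("li.next a")  # assumed selector for the Next button
        # Resolve relative hrefs such as "/page/2/" against the current URL.
        url = urljoin(url, next_link["href"]) if next_link and next_link.get("href") else None
    return records

# Fake two-page site so the sketch runs offline; the second quote has no author.
FAKE_PAGES = {
    "https://example.com/": (
        '<div class="quote"><span class="text">First</span>'
        '<small class="author">A. Writer</small></div>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "https://example.com/page/2/": (
        '<div class="quote"><span class="text">Second</span></div>'
    ),
}

rows = scrape_all_pages(FAKE_PAGES.__getitem__, "https://example.com/")
print(rows)
```

To run this against a live site, pass a `fetch` function that wraps `requests.get` with a timeout and a delay between calls, in the spirit of `polite_scrape`.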

FAQs

Is web scraping legal?

Scraping publicly available data is generally legal in most jurisdictions, but always check the website's terms of service and robots.txt first. Scraping behind login walls, deliberately bypassing rate limits, or scraping copyrighted content for commercial use may violate laws or the site's terms.

How do I avoid getting blocked while scraping?

Use realistic User-Agent headers, implement random delays between requests (1-5 seconds), rotate IP addresses via proxies if needed, scrape during off-peak hours, limit request rate to a human-like pace, and respect robots.txt directives.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
