Web Scraping for Data Science with Python
In this tutorial, you will learn web scraping for data science with Python. You will use Requests, Beautiful Soup, and Selenium to extract structured data from websites, handle dynamic content, and build ethical scrapers.
What You'll Learn
Send HTTP requests, parse HTML with Beautiful Soup, extract data with CSS selectors, handle JavaScript-rendered pages with Selenium, implement polite scraping with delays and retries, and save results to CSV.
Why It Matters
Not all data comes in clean CSV files. Web scraping lets you collect data from public websites for analysis, monitoring, and research -- opening up thousands of data sources that would otherwise be inaccessible.
Real-World Use
Durga Antivirus Pro scrapes public threat intelligence feeds, CVE databases, and security advisory pages daily to update its threat signatures. Automated scrapers monitor 50+ sources and extract structured threat data into a centralized analysis pipeline.
Web Scraping Pipeline
flowchart LR
A[Target URL] --> B[HTTP Request]
B --> C{Response OK?}
C -->|Yes| D[Parse HTML]
C -->|No| E[Retry / Log Error]
D --> F[Extract with CSS Selectors]
F --> G[Clean & Structure]
G --> H[Save to CSV / Database]
Static Page Scraping with Beautiful Soup
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://quotes.toscrape.com"
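# Send the request with a timeout so a stalled server cannot hang the script
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")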
quotes = []
for quote_div in soup.select("div.quote"):
    text = quote_div.select_one("span.text").get_text(strip=True)
    author = quote_div.select_one("small.author").get_text(strip=True)
    tags = [tag.get_text(strip=True) for tag in quote_div.select("a.tag")]
    quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})
df = pd.DataFrame(quotes)
print(df.head())
print(f"Total quotes scraped: {len(df)}")
Output:
   text                                              author           \
0  The world as we have created it is a process ...  Albert Einstein
1  It is our choices, Harry, that show what we t...  J.K. Rowling

   tags
0  change, deep-thoughts, thinking, world
1  abilities, choices
Total quotes scraped: 10
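The final pipeline step, saving to CSV, is not shown above. The following is a minimal sketch using only the standard library. The helper name `save_rows_to_csv`, the file name `quotes.csv`, and the sample rows are assumptions made for illustration; the rows have the same keys as the `quotes` list built above.

```python
import csv


def save_rows_to_csv(rows, path, fieldnames=("text", "author", "tags")):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings on every platform
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows shaped like the scraped quotes (not real scraped data)
sample_rows = [
    {"text": "First example quote", "author": "Author A", "tags": "change, world"},
    {"text": "Second example quote", "author": "Author B", "tags": "choices"},
]

save_rows_to_csv(sample_rows, "quotes.csv")
```

Values that contain commas, such as the joined tags, are quoted automatically by the csv module. If the data is already in a pandas DataFrame, `df.to_csv("quotes.csv", index=False)` produces an equivalent file.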
Handling JavaScript-Rendered Pages with Selenium
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://quotes.toscrape.com/js/")
    # Wait up to 10 seconds for JavaScript to insert the quote elements
    wait = WebDriverWait(driver, 10)
    quote_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.quote"))
    )
    data = []
    for el in quote_elements[:5]:
        text = el.find_element(By.CSS_SELECTOR, "span.text").text
        author = el.find_element(By.CSS_SELECTOR, "small.author").text
        data.append({"text": text, "author": author})
finally:
    driver.quit()  # always close the browser, even if the wait times out
df = pd.DataFrame(data)
print(df)
Output:
   text                                               author
0  The world as we have created it is a process ...   Albert Einstein
1  It is our choices, Harry, that show what we t...   J.K. Rowling
2  There are only two ways to live your life. O...    Albert Einstein
3  The person, be it gentleman or lady, who has ...   Jane Austen
4  Imperfection is beauty, madness is genius and ...  Marilyn Monroe
Polite Scraping with Rate Limiting
import time
from random import uniform
import pandas as pd
import requests
from bs4 import BeautifulSoup
def polite_scrape(urls, delay_range=(1, 3)):
    """Fetch each URL in turn, pausing a random interval between requests."""
    data = []
    for url in urls:
        try:
            response = requests.get(
                url,
                headers={"User-Agent": "DataScienceTutorial/1.0"},
                timeout=10,
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else "No title"
            data.append({"url": url, "title": title, "status": response.status_code})
        except requests.RequestException as e:
            # Record the failure and move on instead of aborting the whole run
            data.append({"url": url, "error": str(e), "status": None})
        # Random delay so requests do not arrive at a fixed, rapid rate
        delay = uniform(*delay_range)
        print(f"Scraped {url} -- waiting {delay:.1f}s")
        time.sleep(delay)
    return pd.DataFrame(data)
urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
df = polite_scrape(urls)
print(df)
Output:
Scraped https://quotes.toscrape.com/page/1/ -- waiting 2.3s
Scraped https://quotes.toscrape.com/page/2/ -- waiting 1.7s
   url                                  title             status
0  https://quotes.toscrape.com/page/1/  Quotes to Scrape  200
1  https://quotes.toscrape.com/page/2/  Quotes to Scrape  200
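The function above records a failed request and moves on, but it does not retry. One possible way to add retries is sketched below, using exponential backoff around any fetch function. The names `fetch_with_retries` and `flaky_fetch` are hypothetical, and the demo uses a simulated failure so it runs without network access.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller log or handle the error
            # Wait base_delay, then twice that, then four times that, and so on
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in fetcher (an assumption for the demo): fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"<html>content of {url}</html>"


waits = []  # collect the delays instead of actually sleeping
html = fetch_with_retries(flaky_fetch, "https://quotes.toscrape.com/page/1/",
                          sleep=waits.append)
print(html, waits)
```

With Requests, the `fetch` argument could be a small function that calls `requests.get(url, timeout=10)`, then `raise_for_status()`, and returns the response. `requests.RequestException` derives from `OSError`, so the default `retry_on` covers it. In a real scraper you would usually retry only transient failures such as timeouts or 5xx responses, not 404s.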
Practice Questions
- What is the purpose of the User-Agent header in HTTP requests, and why should you set it?
- When would you choose Selenium over Requests + Beautiful Soup for a scraping project?
- What are the ethical and legal considerations before scraping a website?
Answers:
- The User-Agent header identifies your client to the server. Some sites apply basic bot detection that rejects requests with a default or missing User-Agent, so setting one explicitly can help avoid being blocked. A descriptive value also tells site operators who is making the requests.
- Use Selenium when the target website loads content dynamically with JavaScript that modifies the DOM after the initial HTML loads. Selenium runs a real browser that executes JavaScript, making the full page content available.
- Check robots.txt for scraping permissions, review the website's terms of service, implement rate limiting to avoid overloading servers, identify your scraper with a descriptive User-Agent, and only scrape publicly accessible data without bypassing authentication.
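The robots.txt check from the last answer can be automated with the standard library's `urllib.robotparser`. This is a minimal sketch: the robots.txt rules and the example.com URLs are hypothetical and are inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "DataScienceTutorial/1.0"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
# For a live site, fetch the real file instead:
# parser.set_url("https://quotes.toscrape.com/robots.txt")
# parser.read()

# Ask whether this user agent may fetch each URL
allowed = parser.can_fetch(USER_AGENT, "https://example.com/page/1/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
# Crawl-delay, if the site declares one, suggests a minimum pause in seconds
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

A scraper could call `can_fetch` before each request and use the crawl delay, when present, as the lower bound for its pause between requests. Passing robots.txt does not replace reading the site's terms of service.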
Challenge
Scrape a multi-page product listing from a public e-commerce site (or a demo site). Extract product name, price, rating, and availability from at least 50 products across multiple pages. Implement polite scraping with delays, handle pagination automatically, save the results to a CSV file, and add error handling for network failures and missing elements.
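As a starting point for the pagination and missing-element parts of the challenge, here is a hedged sketch that follows "next" links. It is written against the quotes demo site rather than a product listing. The names `scrape_all_pages` and `fetch_html` are hypothetical, the `li.next > a` selector is an assumption about the demo site's markup, and the in-memory `pages` dict stands in for real HTTP responses.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch_html, max_pages=20):
    """Follow 'next' links from start_url and collect one row per quote."""
    rows, url, seen = [], start_url, set()
    # max_pages and the seen set guard against endless pagination loops
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        for quote_div in soup.select("div.quote"):
            text_el = quote_div.select_one("span.text")
            author_el = quote_div.select_one("small.author")
            rows.append({
                # select_one returns None when an element is missing
                "text": text_el.get_text(strip=True) if text_el else None,
                "author": author_el.get_text(strip=True) if author_el else None,
            })
        next_link = soup.select_one("li.next > a")
        href = next_link.get("href") if next_link else None
        # Resolve relative links such as /page/2/ against the current URL
        url = urljoin(url, href) if href else None
    return rows


# Fake two-page site (an assumption for the demo), so no network is needed
pages = {
    "https://example.test/": (
        '<div class="quote"><span class="text">A</span>'
        '<small class="author">X</small></div>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.test/page/2/": (
        '<div class="quote"><span class="text">B</span></div>'
    ),
}

rows = scrape_all_pages("https://example.test/", pages.__getitem__)
print(rows)
```

In a real run, `fetch_html` would wrap `requests.get` with a timeout, a descriptive User-Agent, and a delay between calls. The selectors and field names would also need to match the product pages you choose.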
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.