Data Science Projects for Beginners — Build Your Portfolio

DodaTech

This tutorial walks through five beginner-friendly data science projects. Each one comes with working Python code, the skills it demonstrates, and suggested datasets, followed by tips for turning the projects into a portfolio.

What You'll Learn

Build a portfolio of Data Science projects — from exploratory analysis to predictive modeling and interactive dashboards — using real-world datasets.

Why It Matters

Employers want to see what you can do, not just what you know. Projects demonstrate your skills with real data, tools, and workflows.

Real-World Use

A portfolio of 3-5 well-documented projects often tells an employer more about your abilities than a certification alone.

Project 1: Exploratory Data Analysis

Analyze a dataset and generate insights.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset("titanic")

# EDA report
print(f"Shape: {df.shape}")
print(f"Missing:\n{df.isnull().sum()}")

# Survival rate by class (observed=True because "class" is a categorical column)
survival_by_class = df.groupby("class", observed=True)["survived"].mean()
print(f"\nSurvival by class:\n{survival_by_class}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.barplot(x="class", y="survived", data=df, ax=axes[0])
axes[0].set_title("Survival Rate by Class")

sns.histplot(data=df, x="age", hue="survived", kde=True, ax=axes[1])
axes[1].set_title("Age Distribution by Survival")

plt.tight_layout()
plt.show()

Skills: pandas, data cleaning, matplotlib, seaborn, storytelling

Datasets: Titanic, Iris, Tips, any dataset from Kaggle
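
The skills list mentions data cleaning, which the EDA snippet above does not show. Here is a minimal sketch of one common approach. It uses a small hand-made DataFrame with Titanic-like columns (hypothetical values, not the real dataset) so that it runs offline. The 50% drop threshold and the median/mode fills are illustrative assumptions, not the only reasonable strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with Titanic-like columns (values are made up for illustration)
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
    "deck": [None, "C", None, None, "E"],
    "survived": [0, 1, 1, 0, 1],
})

# Share of missing values per column, largest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Drop columns that are mostly empty (the 50% threshold is an arbitrary choice)
mostly_empty = missing_share[missing_share > 0.5].index
clean = df.drop(columns=mostly_empty)

# Fill numeric gaps with the median and categorical gaps with the most frequent value
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["embarked"] = clean["embarked"].fillna(clean["embarked"].mode()[0])

print(clean.isnull().sum())
```

Whatever strategy you pick, state it in the project README so readers can judge how it affects the findings.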

Project 2: Customer Segmentation

Cluster customers based on spending behavior.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load customer data
df = pd.read_csv("customer_data.csv")

# Features for clustering
features = ["annual_income", "spending_score"]
X = df[features]

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method for optimal k
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.title("Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Cluster with optimal k (e.g., 5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)

# Visualize
sns.scatterplot(
    data=df, x="annual_income", y="spending_score",
    hue="cluster", palette="viridis"
)
plt.title("Customer Segments")
plt.show()

# Profile clusters
print(df.groupby("cluster")[features].mean())

Skills: KMeans clustering, scaling, the elbow method, customer profiling

Datasets: Mall Customer Segmentation from Kaggle
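
The elbow in an inertia plot can be hard to read by eye. A complementary check is the silhouette score, which ranges from -1 to 1 and is higher when clusters are compact and well separated. The sketch below uses synthetic data from `make_blobs` as a stand-in for customer features (an assumption, so the example runs without a CSV). The range of k values is also an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: four well-separated groups (an assumption)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# The silhouette score needs at least 2 clusters; higher is better
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Real customer data is rarely this clean, so treat the score as one input alongside the elbow plot and whether the resulting segments make business sense.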

Project 3: Sales Forecasting

Predict future sales using time series analysis.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Load sales data
df = pd.read_csv("sales.csv", parse_dates=["date"], index_col="date")

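# Create lag features (every feature must use only past values)
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
# Shift before rolling so the 7-day mean excludes the current day's sales (avoids target leakage)
df["rolling_7"] = df["sales"].shift(1).rolling(7).mean()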

# Add date features
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

# Drop NaN rows from lag features
df = df.dropna()

# Train/test split (chronological: hold out the last 30 rows, never shuffle a time series)
train = df.iloc[:-30]
test = df.iloc[-30:]

features = ["lag_1", "lag_7", "rolling_7", "dayofweek", "month"]
X_train, y_train = train[features], train["sales"]
X_test, y_test = test[features], test["sales"]

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"MAE: ${mean_absolute_error(y_test, predictions):.2f}")

# Plot
plt.figure(figsize=(12, 5))
plt.plot(test.index, y_test.values, label="Actual", marker="o")
plt.plot(test.index, predictions, label="Predicted", marker="x")
plt.title("Sales Forecast")
plt.legend()
plt.show()

Skills: Time series, feature engineering, regression, model evaluation

Datasets: Store sales data from Kaggle, or any CSV with dates and values
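
An MAE figure is easier to interpret next to a baseline. The sketch below computes the MAE of a seasonal-naive forecast, which simply repeats the value observed 7 days earlier. It uses synthetic daily sales with a weekly pattern (an assumption, generated so the example runs without a CSV). A regression model is worth keeping only if it beats this kind of baseline on the same test window.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Synthetic daily sales (an assumption for illustration): weekly pattern plus noise
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
weekly = 20 * np.sin(2 * np.pi * dates.dayofweek.to_numpy() / 7)
sales = pd.Series(200 + weekly + rng.normal(0, 5, len(dates)), index=dates, name="sales")

# Same hold-out window as a 30-row chronological split
test = sales.iloc[-30:]

# Seasonal-naive baseline: predict each day with the value from 7 days earlier
baseline = sales.shift(7).iloc[-30:]
baseline_mae = mean_absolute_error(test, baseline)
print(f"Seasonal-naive MAE: {baseline_mae:.2f}")
```

Reporting both numbers in the README (baseline MAE and model MAE) makes the "Model performance reported" item on a project checklist much more convincing.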

Project 4: Sentiment Analysis

Analyze text data for sentiment.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Sample data (toy example; use a real labeled dataset for meaningful metrics)
df = pd.DataFrame({
    "text": [
        "This product is amazing!",
        "Terrible service, very disappointed",
        "Pretty good, would recommend",
        "Waste of money, don't buy",
        "Average quality, nothing special",
    ],
    "sentiment": ["positive", "negative", "positive", "negative", "neutral"],
})

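# Train/test split on the raw text, before any vectorizing
texts_train, texts_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42
)

# Vectorize: fit TF-IDF on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)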

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate (zero_division=0 avoids warnings for classes absent from the tiny test set)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))

# Predict new text
new_texts = ["I love this!", "This is awful"]
new_vectors = vectorizer.transform(new_texts)
print(model.predict(new_vectors))

Skills: NLP, TF-IDF, classification, model evaluation

Datasets: Twitter sentiment, movie reviews, product reviews
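
One way to keep the vectorizer and the classifier in sync is to wrap both in a scikit-learn pipeline, so the TF-IDF vocabulary is always learned from the training text only and raw strings can be passed straight to `predict`. The sketch below uses a tiny hypothetical training set. The name `sentiment_model` is illustrative, and default `TfidfVectorizer` settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; real projects need far more labeled text
train_texts = [
    "This product is amazing!",
    "Pretty good, would recommend",
    "Terrible service, very disappointed",
    "Waste of money, don't buy",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits TF-IDF and the classifier together, on the training text only
sentiment_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
sentiment_model.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline applies the fitted vectorizer first
new_texts = ["Amazing product, would recommend", "Terrible, a waste of money"]
predicted = sentiment_model.predict(new_texts)
print(list(predicted))
```

A pipeline is also a single object to save and load, which helps if you later put the model behind a dashboard.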

Project 5: Interactive Dashboard

Build a dashboard using ipywidgets or Streamlit. The example here uses Streamlit.

# Save as app.py — run with `streamlit run app.py`
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

st.set_page_config(page_title="Data Dashboard", layout="wide")
st.title("📊 Data Analysis Dashboard")

# File upload
uploaded = st.sidebar.file_uploader("Upload CSV", type=["csv"])
if uploaded is not None:
    df = pd.read_csv(uploaded)

    # Sidebar filters
    st.sidebar.header("Filters")
    numeric_cols = df.select_dtypes(include="number").columns

    if len(numeric_cols) > 0:
        x_col = st.sidebar.selectbox("X-axis", numeric_cols)
        y_col = st.sidebar.selectbox("Y-axis", numeric_cols)

        # Main content
        col1, col2 = st.columns(2)

        with col1:
            st.subheader("Data Preview")
            st.dataframe(df.head(100))

        with col2:
            st.subheader("Summary Statistics")
            st.dataframe(df.describe())

        st.subheader("Scatter Plot")
        fig, ax = plt.subplots(figsize=(10, 6))
        sns.scatterplot(data=df, x=x_col, y=y_col, ax=ax)
        st.pyplot(fig)

        st.subheader("Correlation Heatmap")
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.heatmap(df.select_dtypes(include="number").corr(),
                    annot=True, cmap="coolwarm", ax=ax)
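        st.pyplot(fig)
else:
    # Nothing uploaded yet: tell the user what to do instead of showing a blank page
    st.info("Upload a CSV file from the sidebar to get started.")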

Skills: Streamlit, interactive visualization, dashboard design
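
The sidebar in the app has a "Filters" header but only offers axis selectors. One way to add an actual filter is to keep the filtering logic in a plain function that can be tested without Streamlit. The sketch below is an assumption about how that might look; `filter_by_range` and the sample columns are hypothetical names, not part of the app above.

```python
import pandas as pd


def filter_by_range(df: pd.DataFrame, column: str, low: float, high: float) -> pd.DataFrame:
    """Return the rows where `column` lies within [low, high], inclusive."""
    mask = df[column].between(low, high)
    return df.loc[mask]


# Hypothetical data standing in for an uploaded CSV
df = pd.DataFrame({"price": [5, 12, 20, 31], "units": [100, 80, 40, 10]})
filtered = filter_by_range(df, "price", 10, 25)
print(filtered)
```

In the Streamlit app, the bounds could come from a range slider, for example `low, high = st.sidebar.slider("Price", 0, 50, (10, 25))`, with the filtered frame then feeding the preview and plots.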

Portfolio Tips

Your portfolio should include:

1. 3-5 projects showing different skills
2. Clean, documented code on GitHub
3. README with problem statement and findings
4. Visualizations that tell a story
5. At least one deployed app (Streamlit)

Project checklist:
□ Clear problem statement
□ Data source documented
□ EDA with visualizations
□ Features explained
□ Model performance reported
□ Insights and recommendations
□ Code on GitHub
□ README.md written

Project Ideas by Skill Level

Level	Project	Skills
Beginner	Titanic EDA	pandas, matplotlib
Beginner	Housing price analysis	pandas, visualization
Intermediate	Customer segmentation	KMeans, clustering
Intermediate	Sales forecasting	Time series, regression
Intermediate	Sentiment analysis	NLP, classification
Advanced	Recommendation system	Collaborative filtering
Advanced	Real-time dashboard	Streamlit, APIs
Advanced	End-to-end ML pipeline	MLOps, automation
