Feature Engineering Techniques — Practical Guide
This tutorial covers core feature engineering techniques, with key concepts, practical examples, and best practices you can apply directly.
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to machine learning models. It often makes the difference between a mediocre model and a state-of-the-art one.
What You'll Learn
You'll learn how to create numeric features from text and dates, handle missing values, encode categorical variables, scale features, and generate interaction terms using pandas and scikit-learn.
Why It Matters
Feature engineering is where many of the practical gains in machine learning come from. On tabular problems, better features often improve model performance more than switching algorithms or tuning hyperparameters, and competition winners frequently credit feature work for much of their result.
Real-World Use
Doda Browser's tab prediction feature uses engineered features like time of day, day of week, recent tab usage frequency, and browsing session length to predict which tab you will switch to next.
Feature Engineering Workflow
flowchart LR
A[Raw Data] --> B[Cleaning]
B --> C[Missing Values]
B --> D[Outliers]
C --> E[Transformation]
D --> E
E --> F[Encoding]
E --> G[Scaling]
E --> H[Creation]
F --> I[Feature Set]
G --> I
H --> I
I --> J[Model Training]
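The workflow above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not a prescribed setup: the toy data, the column names, and the choice of LogisticRegression as the final model are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are made up for illustration.
raw = pd.DataFrame({
    "age": [25, np.nan, 35, 42, 51, 28],
    "income": [40000, 60000, np.nan, 75000, 90000, 55000],
    "color": ["red", "blue", "green", "blue", "red", "green"],
})
target = np.array([0, 0, 1, 1, 1, 0])

# Numeric branch: fill missing values, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own transformation.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

# Feature set feeds straight into model training.
model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression()),
])

model.fit(raw, target)
feature_matrix = model.named_steps["features"].transform(raw)
print(feature_matrix.shape)  # 2 numeric columns + 3 one-hot columns
```

Wrapping the steps in one pipeline means the imputation, scaling, and encoding statistics are learned only from the data passed to fit, which helps avoid leaking information from validation data.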
Handling Missing Values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
data = pd.DataFrame({
'age': [25, np.nan, 35, 42, np.nan, 28],
'salary': [50000, 60000, np.nan, 75000, 65000, 55000],
'experience': [2, 5, 7, np.nan, 4, 3]
})
print("Before imputation:")
print(data.isnull().sum())
imputer = SimpleImputer(strategy='median')
data_imputed = pd.DataFrame(
imputer.fit_transform(data),
columns=data.columns
)
print("\nAfter imputation:")
print(data_imputed)
Expected output:
Before imputation:
age 2
salary 1
experience 1
dtype: int64
After imputation:
age salary experience
0 25.0 50000.0 2.0
1 31.5 60000.0 5.0
2 35.0 60000.0 7.0
3 42.0 75000.0 4.0
4 31.5 65000.0 4.0
5 28.0 55000.0 3.0
Median imputation preserves the central tendency of each column and, unlike mean imputation, is not pulled toward outliers.
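To make the outlier point concrete, here is a small sketch comparing the fill values the two strategies would learn. The salary figures are hypothetical and include one deliberately extreme value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salaries with one extreme value and one missing entry.
salaries = pd.DataFrame(
    {"salary": [50000, 52000, 55000, 58000, 1000000, np.nan]}
)

# statistics_ holds the fill value learned for each column.
mean_fill = SimpleImputer(strategy="mean").fit(salaries).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(salaries).statistics_[0]

print(f"mean fill:   {mean_fill:.0f}")    # 243000
print(f"median fill: {median_fill:.0f}")  # 55000
```

The single outlier drags the mean far above every typical salary, while the median stays at a representative value.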
Encoding Categorical Variables
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({
'color': ['red', 'blue', 'green', 'blue', 'red'],
'size': ['S', 'M', 'L', 'M', 'S'],
'price': [10, 15, 20, 15, 10]
})
# sparse_output requires scikit-learn 1.2 or newer (older releases name it sparse)
one_hot = OneHotEncoder(sparse_output=False)
encoded = one_hot.fit_transform(data[['color']])
encoded_df = pd.DataFrame(encoded, columns=one_hot.get_feature_names_out(['color']))
result = pd.concat([encoded_df, data[['size', 'price']]], axis=1)
print(result)
Expected output:
color_blue color_green color_red size price
0 0.0 0.0 1.0 S 10
1 1.0 0.0 0.0 M 15
2 0.0 1.0 0.0 L 20
3 1.0 0.0 0.0 M 15
4 0.0 0.0 1.0 S 10
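One-hot encoding avoids implying an order among nominal categories such as color. LabelEncoder is designed for target labels, not input features. For genuinely ordinal inputs such as size, use OrdinalEncoder with an explicit category order.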
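For a feature that does have a natural order, such as size, an ordinal encoding may be more appropriate than one-hot. The sketch below assumes the order S, M, L, which is an assumption about the data rather than something the encoder can infer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["S", "M", "L", "M", "S"]})

# Pass the order explicitly; the default is alphabetical (L, M, S),
# which would not match the real ordering of sizes.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
sizes["size_encoded"] = encoder.fit_transform(sizes[["size"]]).ravel()
print(sizes)
```

With this order, S maps to 0.0, M to 1.0, and L to 2.0, so the numeric values reflect the real ranking.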
Feature Scaling
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
'age': [25, 35, 42, 28, 55],
'income': [40000, 60000, 120000, 35000, 200000],
'score': [0.1, 0.4, 0.9, 0.2, 0.7]
})
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
standardized_df = pd.DataFrame(standardized, columns=data.columns)
print("Standardized (mean=0, std=1):")
print(standardized_df.round(3))
print(f"\nMeans: {standardized_df.mean().round(3).values}")
# ddof=0 matches StandardScaler, which divides by the population standard deviation
print(f"Stds: {standardized_df.std(ddof=0).round(3).values}")
Expected output:
Standardized (mean=0, std=1):
age income score
0 -1.116 -0.818 -1.197
1 -0.186 -0.497 -0.200
2 0.465 0.465 1.463
3 -0.837 -0.899 -0.865
4 1.674 1.749 0.798
Means: [ 0. -0. 0.]
Stds: [1. 1. 1.]
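StandardScaler suits algorithms that are sensitive to feature scale or that work best with zero-centered inputs, such as regularized linear models and SVMs. MinMaxScaler maps each feature to a fixed range (0 to 1 by default), which is often convenient for neural networks but is sensitive to outliers. Tree-based models generally do not need scaling.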
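As a companion to the StandardScaler example, here is a minimal MinMaxScaler sketch. The train and test split is hypothetical; it is included to show that the scaler should learn its minimum and maximum from training data only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"age": [25, 35, 42, 28, 55]})
test = pd.DataFrame({"age": [30, 60]})  # hypothetical unseen rows

scaler = MinMaxScaler()                     # default feature_range=(0, 1)
train_scaled = scaler.fit_transform(train)  # learn min and max from train only
test_scaled = scaler.transform(test)        # reuse the same min and max

print(train_scaled.ravel().round(3))
print(test_scaled.ravel().round(3))
```

The training values land exactly in the range 0 to 1. The test value 60 exceeds the training maximum of 55, so it scales to a value above 1, which is expected behavior rather than a bug.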
Creating Date-Based Features
import pandas as pd
# freq='W' anchors on Sundays, so every generated date is a Sunday
dates = pd.Series(pd.date_range('2025-01-01', periods=5, freq='W'))
features = pd.DataFrame({
'date': dates,
'year': dates.dt.year,
'month': dates.dt.month,
'day_of_week': dates.dt.dayofweek,
'is_weekend': (dates.dt.dayofweek >= 5).astype(int),
'quarter': dates.dt.quarter
})
print(features)
Expected output:
date year month day_of_week is_weekend quarter
0 2025-01-05 2025 1 6 1 1
1 2025-01-12 2025 1 6 1 1
2 2025-01-19 2025 1 6 1 1
3 2025-01-26 2025 1 6 1 1
4 2025-02-02 2025 2 6 1 1
Date features capture seasonality, trends, and cyclical patterns that raw timestamps hide from models.
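One common way to expose the cyclical patterns mentioned above is to map a periodic value onto a circle with sine and cosine. This is a sketch of that idea, not part of the original example; the column names dow_sin and dow_cos are made up for illustration.

```python
import numpy as np
import pandas as pd

# Seven consecutive days, Monday 2025-01-06 through Sunday 2025-01-12.
dates = pd.Series(pd.date_range("2025-01-06", periods=7, freq="D"))
day = dates.dt.dayofweek  # 0 = Monday, 6 = Sunday

# Map the day index onto a circle with period 7.
cyclical = pd.DataFrame({
    "day_of_week": day,
    "dow_sin": np.sin(2 * np.pi * day / 7),
    "dow_cos": np.cos(2 * np.pi * day / 7),
})
print(cyclical.round(3))
```

With the raw integer, Sunday (6) and Monday (0) look far apart. In the sine and cosine space they are as close as any other pair of adjacent days, which is what a model needs to see a weekly cycle.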
Practice Questions
- Why is one-hot encoding preferred over label encoding for nominal categorical features?
- When would you use StandardScaler versus MinMaxScaler?
- What is the purpose of creating interaction features?
Related Topics
- Python — all examples use Python
- Pandas Guide — data manipulation library
- scikit-learn Guide — preprocessing tools
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.