Feature Engineering Techniques — Practical Guide

DodaTech

In this tutorial, you'll learn core feature engineering techniques, with practical examples and best practices for applying them.

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to machine learning models. It is often the difference between a mediocre model and a state-of-the-art one.

What You'll Learn

You'll learn how to create numeric features from text and dates, handle missing values, encode categorical variables, scale features, and generate interaction terms using pandas and scikit-learn.

Why It Matters

Feature engineering is where many of the practical gains in machine learning come from. Better features often improve model performance more than algorithm selection or hyperparameter tuning, and Kaggle competition winners commonly report spending a large share of their time on feature engineering.

Real-World Use

Doda Browser's tab prediction feature uses engineered features like time of day, day of week, recent tab usage frequency, and browsing session length to predict which tab you will switch to next.

Feature Engineering Workflow

flowchart LR
    A[Raw Data] --> B[Cleaning]
    B --> C[Missing Values]
    B --> D[Outliers]
    C --> E[Transformation]
    D --> E
    E --> F[Encoding]
    E --> G[Scaling]
    E --> H[Creation]
    F --> I[Feature Set]
    G --> I
    H --> I
    I --> J[Model Training]

Handling Missing Values

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    'age': [25, np.nan, 35, 42, np.nan, 28],
    'salary': [50000, 60000, np.nan, 75000, 65000, 55000],
    'experience': [2, 5, 7, np.nan, 4, 3]
})

print("Before imputation:")
print(data.isnull().sum())

imputer = SimpleImputer(strategy='median')
data_imputed = pd.DataFrame(
    imputer.fit_transform(data),
    columns=data.columns
)

print("\nAfter imputation:")
print(data_imputed)

Expected output:

Before imputation:
age           2
salary        1
experience    1
dtype: int64

After imputation:
    age   salary  experience
0  25.0  50000.0         2.0
1  31.5  60000.0         5.0
2  35.0  60000.0         7.0
3  42.0  75000.0         4.0
4  31.5  65000.0         4.0
5  28.0  55000.0         3.0

Median imputation preserves the central tendency of each column and, unlike mean imputation, is not pulled toward outliers.
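To make the outlier point concrete, here is a minimal sketch that fills the same missing value with both strategies. The salary figures are hypothetical and chosen only so that one extreme value dominates the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salaries: one missing entry and one extreme outlier.
salaries = pd.DataFrame({'salary': [50000, 52000, 54000, np.nan, 1000000]})

# Fill the missing row (index 3) with each strategy and compare.
mean_fill = SimpleImputer(strategy='mean').fit_transform(salaries)[3, 0]
median_fill = SimpleImputer(strategy='median').fit_transform(salaries)[3, 0]

print(f"mean fill:   {mean_fill:.0f}")
print(f"median fill: {median_fill:.0f}")
```

The mean fill lands at 289000, far above every typical salary in the column, while the median fill stays at 53000.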

Encoding Categorical Variables

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 15, 20, 15, 10]
})

# sparse_output requires scikit-learn 1.2+; older releases call this parameter sparse
one_hot = OneHotEncoder(sparse_output=False)
encoded = one_hot.fit_transform(data[['color']])
encoded_df = pd.DataFrame(encoded, columns=one_hot.get_feature_names_out(['color']))

result = pd.concat([encoded_df, data[['size', 'price']]], axis=1)
print(result)

Expected output:

   color_blue  color_green  color_red size  price
0         0.0          0.0        1.0    S     10
1         1.0          0.0        0.0    M     15
2         0.0          1.0        0.0    L     20
3         1.0          0.0        0.0    M     15
4         0.0          0.0        1.0    S     10

One-hot encoding avoids implying an order among nominal categories. LabelEncoder is intended for target labels, not input features. For an input feature with a genuine order, use OrdinalEncoder instead.
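For a feature that does have a natural order, such as the size column in the data above, an ordinal encoding can be appropriate. The sketch below assumes the order S < M < L and passes it explicitly, because the default category order is alphabetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'S']})

# Spell out the order; left to the default, categories are sorted (L, M, S).
ordinal = OrdinalEncoder(categories=[['S', 'M', 'L']])
data['size_encoded'] = ordinal.fit_transform(data[['size']]).ravel()

print(data)
```

With this order, S maps to 0, M to 1, and L to 2, so larger sizes get larger numbers.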

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = pd.DataFrame({
    'age': [25, 35, 42, 28, 55],
    'income': [40000, 60000, 120000, 35000, 200000],
    'score': [0.1, 0.4, 0.9, 0.2, 0.7]
})

scaler = StandardScaler()
standardized = scaler.fit_transform(data)
standardized_df = pd.DataFrame(standardized, columns=data.columns)

print("Standardized (mean=0, std=1):")
print(standardized_df.round(3))
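print(f"\nMeans: {standardized_df.mean().round(3).values}")
# StandardScaler divides by the population standard deviation (ddof=0),
# while pandas' .std() defaults to the sample version (ddof=1).
print(f"Stds:  {standardized_df.std(ddof=0).round(3).values}")

Expected output (a rounded zero in the means may print as 0. or -0.):

Standardized (mean=0, std=1):
     age  income  score
0 -1.116  -0.818 -1.197
1 -0.186  -0.497 -0.200
2  0.465   0.465  1.463
3 -0.837  -0.899 -0.865
4  1.674   1.749  0.798

Means: [ 0. -0.  0.]
Stds:  [1. 1. 1.]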

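StandardScaler centers each feature at zero with unit variance and is a common default for linear models and SVMs, which are sensitive to feature scale. It neither requires nor produces normally distributed data. MinMaxScaler maps each feature to a fixed range (0 to 1 by default), which can suit neural networks and bounded inputs, but it is more sensitive to outliers.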
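MinMaxScaler is imported above but not demonstrated, so here is a minimal sketch. It also shows a common practice: fit the scaler on training data only, then reuse it on new rows. The `test` rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'age': [25, 35, 42, 28, 55]})
test = pd.DataFrame({'age': [30, 60]})  # hypothetical unseen rows

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=25 and max=55 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel().round(3))
print(test_scaled.ravel().round(3))
```

Training values land in the range 0 to 1, but an unseen age of 60 scales to about 1.167, because it exceeds the training maximum. Models that expect strictly bounded inputs may need clipping in that case.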

Creating Date-Based Features

dates = pd.Series(pd.date_range('2025-01-01', periods=5, freq='W'))
features = pd.DataFrame({
    'date': dates,
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day_of_week': dates.dt.dayofweek,
    'is_weekend': (dates.dt.dayofweek >= 5).astype(int),
    'quarter': dates.dt.quarter
})
print(features)

Expected output:

        date  year  month  day_of_week  is_weekend  quarter
0 2025-01-05  2025      1            6           1        1
1 2025-01-12  2025      1            6           1        1
2 2025-01-19  2025      1            6           1        1
3 2025-01-26  2025      1            6           1        1
4 2025-02-02  2025      2            6           1        1

Date features capture seasonality, trends, and cyclical patterns that raw timestamps hide from models.

Practice Questions

  1. Why is one-hot encoding preferred over label encoding for nominal categorical features?
  2. When would you use StandardScaler versus MinMaxScaler?
  3. What is the purpose of creating interaction features?

Frequently Asked Questions

How many features is too many?

There is no hard limit, but the number of features should generally be smaller than the number of training samples. A common rule of thumb is at least 10 samples per feature. With regularization (L1/L2), models can often tolerate feature-to-sample ratios closer to 1:1.

Should I scale features for tree-based models?

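Generally no. Decision trees and random forests split on thresholds, so rescaling a feature does not change the splits they find. Scaling matters for distance-based models such as KNN and SVM, and for models trained with regularization or gradient descent, such as regularized linear regression and neural networks.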
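One way to check the tree claim yourself is to train the same tree on raw and standardized inputs and compare predictions. This is a minimal sketch, assuming the iris dataset bundled with scikit-learn and an arbitrary depth and seed.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same depth and seed, so scaling is the only difference between the two trees.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions row by row on the training data.
same = bool((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())
print(f"Identical predictions: {same}")
```

The two trees should agree on every row, since standardization preserves the ordering of values within each feature and the splits depend only on that ordering.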

  • Python — all examples use Python
  • Pandas Guide — data manipulation library
  • scikit-learn Guide — preprocessing tools

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
