Machine Learning Basics — Complete Beginner's Guide

DodaTech Updated 2026-06-22 7 min read

Machine Learning is a branch of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed for every outcome.

What You'll Learn

In this tutorial, you'll learn the fundamental concepts of Machine Learning, the difference between supervised and unsupervised learning, key terminology every practitioner needs, and the complete ML workflow from data collection to deployment.

Why It Matters

Machine Learning powers recommendation systems, fraud detection, medical diagnosis, self-driving cars, and language models like GPT. Understanding ML basics is the first step toward building intelligent systems that can automate decisions and uncover patterns hidden in data. Python and scikit-learn are among the most widely used tools for implementing ML solutions, and the examples in this tutorial use both.

Real-World Use

When you use a spam filter, ML classifies your emails. When Netflix suggests a movie, ML analyzes your viewing history. When Durga Antivirus Pro detects a new threat, ML models identify malicious patterns without waiting for a signature update.

What is Machine Learning?

Machine Learning is the practice of teaching computers to learn from data. Instead of writing explicit rules like "if email contains 'free money' then mark as spam," you show the computer thousands of examples and let it discover the patterns itself. The computer builds a mathematical model that generalizes from these examples, enabling it to make predictions on data it has never seen before. This ability to generalize — to apply learned patterns to new situations — is what makes ML so powerful and fundamentally different from traditional rule-based programming.
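
As a rough illustration of the contrast above, the sketch below compares a hand-written rule with a model that learns from examples. The eight emails are made up for this example, and the choice of `CountVectorizer` with `MultinomialNB` is only one simple option, not the only way to build a spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


# Hand-written rule: it only catches the exact phrase the programmer thought of.
def rule_based_is_spam(text):
    return "free money" in text.lower()


# Tiny made-up training set (illustrative only): 1 = spam, 0 = not spam.
emails = [
    "free money now",
    "win cash prize today",
    "claim your free prize",
    "cash bonus click now",
    "meeting moved to monday",
    "lunch at noon tomorrow",
    "project report attached",
    "see you at the meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn each email into word counts, then fit a Naive Bayes classifier on them.
vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(emails), labels)

# A new email that never uses the phrase "free money".
new_email = "claim your cash prize"
rule_prediction = rule_based_is_spam(new_email)
learned_prediction = int(model.predict(vectorizer.transform([new_email]))[0])

print(f"Rule-based says spam: {rule_prediction}")
print(f"Learned model says spam: {bool(learned_prediction)}")
```

The rule misses the new email because the exact phrase is absent, while the learned model flags it because its words appeared only in the spam examples. A real filter would need far more data than this toy set.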

{{< faq "What is the difference between AI, ML, and Deep Learning?">}} Artificial Intelligence is the broad field of creating intelligent machines. Machine Learning is a subset of AI in which systems learn from data. Deep Learning is a subset of ML that uses multi-layered neural networks. Think of them as nested sets: AI contains ML, and ML contains Deep Learning. {{< /faq >}}

Types of Machine Learning

Machine Learning algorithms fall into three main categories based on the type of data and feedback available. The choice depends on your problem: do you have labeled examples (supervised), unlabeled data (unsupervised), or an environment where the model can take actions and receive feedback (reinforcement)? Each category has distinct algorithms, evaluation methods, and use cases. Understanding which category your problem falls into is the first step in choosing the right approach.

Supervised Learning

In supervised learning, you have labeled data — each example comes with the correct answer. The model learns to map inputs to outputs.

flowchart LR
  A[Training Data with Labels] --> B[Model Training]
  B --> C[Trained Model]
  D[New Input] --> C
  C --> E[Predicted Label]

Common tasks:

Classification: Predict a category (spam or not spam, cat or dog)
Regression: Predict a continuous value (house price, temperature)

Unsupervised Learning

In unsupervised learning, the data has no labels. The model finds hidden patterns or groupings on its own.

flowchart LR
  A[Unlabeled Data] --> B[Pattern Discovery]
  B --> C[Clusters or Groups]
  B --> D[Reduced Dimensions]

Common tasks:

Clustering: Group similar customers by behavior
Dimensionality Reduction: Compress data while preserving structure
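
Dimensionality reduction can be sketched with Principal Component Analysis (PCA). The example below assumes scikit-learn's built-in iris measurements as the data and compresses four features down to two; PCA is one common technique among several, chosen here only for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use only the measurements; the species labels are ignored (unsupervised).
X = load_iris().data  # shape (150, 4)

# Standardize first so no single feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Project the four features onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print(f"Shape before: {X.shape}, after: {X_2d.shape}")
print(f"Variance retained by two components: {retained:.1%}")
```

The retained-variance figure indicates how much of the original structure survives the compression, which is a quick way to judge whether two dimensions are enough for plotting or further modeling.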

Reinforcement Learning

In reinforcement learning, an agent learns by interacting with an environment, receiving rewards for good actions and penalties for bad ones. It is used in game-playing AIs and robotics.
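
A minimal sketch of this reward-and-penalty loop is the multi-armed bandit below. The three actions, their hidden success probabilities, and the epsilon-greedy strategy are all assumptions made for illustration; real reinforcement learning problems also involve states and delayed rewards, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: each action succeeds with a probability
# that is hidden from the agent.
true_success_prob = np.array([0.2, 0.5, 0.8])
n_actions = len(true_success_prob)

estimates = np.zeros(n_actions)        # agent's running estimate of each action's value
counts = np.zeros(n_actions, dtype=int)
epsilon = 0.1                          # fraction of steps spent exploring at random

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))   # explore
    else:
        action = int(np.argmax(estimates))      # exploit the best-looking action

    # Reward of +1 for success, penalty of -1 for failure.
    reward = 1.0 if rng.random() < true_success_prob[action] else -1.0

    # Incremental average: nudge the estimate toward the observed reward.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(f"Estimated values: {np.round(estimates, 2)}")
print(f"Times each action was chosen: {counts}")
print(f"Action the agent prefers: {best_action}")
```

Over time the agent chooses the highest-paying action most often, even though it was never told which one that is.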

Key Terminology

Term	Definition	Example
Feature	An input variable used by the model	Age, income, pixel value
Label	The output value to predict	"Spam" or "Not Spam"
Training Set	Data used to teach the model	80% of your dataset
Test Set	Data used to evaluate the model	20% of your dataset
Overfitting	Model memorizes training data, fails on new data	100% train accuracy, 60% test accuracy
Underfitting	Model is too simple to capture patterns	Poor performance on both train and test
Hyperparameter	A setting chosen before training	Learning rate, tree depth
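
Several of these terms can be seen together in one short experiment. The sketch below assumes a synthetic dataset from `make_classification` with 20% of the labels deliberately randomized (`flip_y=0.2`), and varies one hyperparameter, `max_depth`, to show underfitting and overfitting; the specific depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and labels; flip_y adds label noise that cannot be learned.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)

# Training set (80%) teaches the model; test set (20%) evaluates it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until it fits every training point
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth!s:>4}  train={scores[depth][0]:.2f}  test={scores[depth][1]:.2f}")
```

A depth of 1 is too simple and scores poorly on the training set itself (underfitting), while the unlimited tree memorizes the training set, noise included, and loses accuracy on the test set (overfitting).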

The Machine Learning Workflow

Most ML projects follow a similar pipeline:

Data Collection: Gather raw data from databases, APIs, or files
Data Preprocessing: Clean missing values, encode categories, scale features
Exploratory Data Analysis: Visualize distributions, find correlations
Model Selection: Choose an algorithm based on the problem type
Training: Feed data to the algorithm so it learns patterns
Evaluation: Measure performance on unseen data
Hyperparameter Tuning: Adjust settings to improve results
Deployment: Put the model into production

flowchart LR
  A[Data Collection] --> B[Preprocessing]
  B --> C[EDA]
  C --> D[Model Selection]
  D --> E[Training]
  E --> F[Evaluation]
  F --> G[Tuning]
  G --> H[Deployment]
  F -.-> I[Bad Performance]
  I -.-> D
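
The workflow above can be sketched end to end in a few lines. This is a compressed illustration under several assumptions: scikit-learn's built-in wine dataset stands in for collected data, exploratory analysis is skipped, logistic regression is an arbitrary model choice, and serializing with `pickle` stands in for a real deployment step.

```python
import pickle

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: a built-in dataset stands in for a database, API, or file.
X, y = load_wine(return_X_y=True)

# Hold out a test set before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model selection, bundled so they are always applied together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Training and hyperparameter tuning: try three regularization strengths
# with 5-fold cross-validation on the training set only.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluation: measure accuracy on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Best C: {search.best_params_['clf__C']}")
print(f"Test accuracy: {test_accuracy:.2f}")

# Deployment stand-in: serialize the fitted pipeline and load it back.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
print(f"Restored model predicts: {restored.predict(X_test[:3])}")
```

Keeping the scaler and the classifier in one pipeline means the same preprocessing is applied during tuning, evaluation, and after the model is loaded elsewhere.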

Python Code Examples

Example 1: Simple Classification with scikit-learn

Let's train a scikit-learn classifier to identify iris flower species.

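from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the flower measurements (X) and species labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the data for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A shallow tree; random_state makes tie-breaking between splits repeatable
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on flowers the model has not seen
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")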

Expected output:

Accuracy: 1.00

Example 2: Simple Regression

Predict housing prices using linear regression.

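from sklearn.linear_model import LinearRegression
import numpy as np

# Feature: house size in square feet (one column per feature)
X = np.array([[1000], [1500], [2000], [2500], [3000]])
# Label: sale price in dollars
y = np.array([200000, 280000, 350000, 430000, 500000])

# Fit a straight line through the five points
model = LinearRegression()
model.fit(X, y)

# Predict the price of a house the model has not seen
new_house = np.array([[1800]])
predicted = model.predict(new_house)
print(f"Predicted price for 1800 sqft: ${predicted[0]:,.0f}")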

Expected output:

Predicted price for 1800 sqft: $322,000
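
To see where a linear regression prediction comes from, the fitted slope and intercept can be inspected directly. The sketch below refits the same five data points and reproduces the prediction by hand; the variable names `slope`, `intercept`, and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1000], [1500], [2000], [2500], [3000]])
y = np.array([200000, 280000, 350000, 430000, 500000])
model = LinearRegression().fit(X, y)

# The model is a straight line: price = slope * sqft + intercept
slope = model.coef_[0]
intercept = model.intercept_

# Apply the line by hand and compare with the library's prediction.
manual = slope * 1800 + intercept
library = model.predict(np.array([[1800]]))[0]

print(f"Slope: {slope:.2f} dollars per sqft")
print(f"Intercept: {intercept:,.0f} dollars")
print(f"Manual prediction:  {manual:,.0f}")
print(f"Library prediction: {library:,.0f}")
```

Both predictions agree because `predict` does nothing more than evaluate the fitted line at the new input.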

Example 3: Clustering Unlabeled Data

Group customers by their purchasing behavior using K-Means.

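from sklearn.cluster import KMeans
import numpy as np

# Two features per customer, each stored as a column
annual_income = np.array([[15], [20], [25], [60], [70], [75],
                          [100], [110], [120]])
spending_score = np.array([[39], [42], [45], [60], [65], [62],
                           [80], [85], [88]])

# Combine the columns into one (9, 2) feature matrix; there are no labels
X = np.column_stack([annual_income, spending_score])

# Ask for three groups; n_init=10 reruns K-Means and keeps the best result
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

# Each center is the mean income and spending score of one group
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Labels: {kmeans.labels_}")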

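Expected output (the order of the clusters, and therefore the label numbers, can differ depending on the scikit-learn version):

Cluster centers:
[[ 20.          42.        ]
 [ 68.33333333  62.33333333]
 [110.          84.33333333]]
Labels: [0 0 0 1 1 1 2 2 2]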

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Data leakage	Information from the test data influences training	Split first, then fit scalers and encoders on the training set only
Ignoring class imbalance	Model predicts majority class only	Use stratified sampling or resampling
Not scaling features	Large values dominate smaller ones	Use StandardScaler or MinMaxScaler
Overfitting	Model too complex for data size	Reduce complexity, add regularization
Wrong evaluation metric	Accuracy on imbalanced data misleads	Use precision, recall, or F1-score
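
Several of these fixes can be combined in one short sketch. The example below assumes a synthetic dataset in which roughly 90% of the samples belong to one class; the always-predict-the-majority baseline and the `class_weight="balanced"` setting are illustrative choices, not the only remedies.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: about 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratified split keeps the class ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_f1 = f1_score(y_test, baseline_pred, zero_division=0)

# The pipeline fits the scaler on the training data only, which avoids leakage.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
model_f1 = f1_score(y_test, model.predict(X_test))

print(f"Baseline accuracy: {baseline_accuracy:.2f}, F1: {baseline_f1:.2f}")
print(f"Model F1: {model_f1:.2f}")
```

The baseline reaches high accuracy without ever detecting the minority class, which is why its F1-score is zero and why accuracy alone is misleading on imbalanced data.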

Practice Questions

What is the key difference between supervised and unsupervised learning?

Answer: Supervised learning uses labeled data with known outputs, while unsupervised learning finds patterns in unlabeled data without predefined answers.

Name three real-world applications of classification.

Answer: Spam detection (spam vs not spam), medical diagnosis (disease vs no disease), and credit risk assessment (default vs no default).

What does overfitting mean and how can you prevent it?

Answer: Overfitting occurs when a model memorizes the training data instead of learning general patterns. You can prevent it by simplifying the model, using more training data, or applying regularization, and you can detect it early by using cross-validation.

Why do we split data into training and test sets?

Answer: To evaluate how well the model generalizes to unseen data. A model that performs well on training data but poorly on test data is overfitting.

What is the difference between a feature and a label?

Answer: A feature is an input variable used to make predictions (like square footage), while a label is the target value being predicted (like house price).

Challenge

Build a complete ML pipeline that predicts whether a patient has diabetes using the Pima Indians Diabetes dataset. Use at least two algorithms, compare their accuracy, and create a confusion matrix for the best one.

Real-World Task

Design a system that uses unsupervised learning to segment customers into groups based on their purchase history, website behavior, and demographic data. Explain what features you would use, which clustering algorithm you would choose, and how the business could use these segments.

Next Steps

Now that you understand ML basics, continue with Pandas for data preprocessing and Python for implementation, then explore the Data Preprocessing and Model Evaluation tutorials. scikit-learn is the main library you will use throughout.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
