
Cloud Data Loss Prevention — DLP Policies & Sensitive Data Scanning Guide

DodaTech Updated 2026-06-24 5 min read

This tutorial covers cloud data loss prevention (DLP): the key concepts, CLI examples for AWS, Azure, and GCP, and the mistakes to avoid when you apply them.

Cloud data loss prevention discovers, classifies, and protects sensitive data across cloud storage, databases, and data pipelines. Services such as AWS Macie, Azure Purview, and the GCP DLP API do the scanning, and automated redaction and policy enforcement are built on their findings.

What You Will Learn

How to scan cloud storage for sensitive data, classify PII and financial information, and apply automated remediation like redaction or blocking public access.

Why It Matters

Data loss is among the most expensive cloud security failures. Sensitive data sitting in unencrypted S3 buckets or unprotected BigQuery tables can be exfiltrated within minutes once someone finds it. DLP tools find and protect that data before attackers do.

Real-World Use

A fintech company uses AWS Macie to scan all S3 buckets weekly. When Macie reports a CSV file containing credit card numbers in a publicly accessible bucket, an EventBridge rule triggers a Lambda function that blocks public access on the bucket and alerts the security team.

DLP Architecture

flowchart LR
  Sources["Data Sources\nS3 / Blob / GCS\nRDS / BigQuery"] --> Scanner["DLP Scanner\nClassification Engine"]
  Scanner --> Findings["Findings & Alerts"]
  Findings --> Dashboard[Compliance Dashboard]
  Findings --> Remediation["Auto-Remediation\nBlock Access / Redact"]
  
  subgraph Classification
    PII["PII Detection\nSSN, Email, Phone"]
    Financial["Financial Data\nCC Numbers, Bank Accounts"]
    Custom["Custom Patterns\nInternal IDs"]
  end
  
  Scanner --> Classification
  
  style Scanner fill:#f90,color:#fff
  style Remediation fill:#e00,color:#fff
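
The classification stage in the diagram can be approximated with a few regular expressions. The sketch below is an illustration only, not how Macie, Purview, or the DLP API work internally. The pattern set, the `EMP-` internal ID format, and the function names are all assumptions made for this example. Credit card candidates are additionally filtered with the Luhn checksum, a common way to discard digit runs that merely look like card numbers.

```python
import re

# Hypothetical detectors; managed DLP services ship far more robust ones.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "INTERNAL_ID": re.compile(r"\bEMP-\d{6}\b"),  # made-up custom pattern
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify(text: str) -> dict[str, list[str]]:
    """Map each detected info type to the matching substrings."""
    findings: dict[str, list[str]] = {}
    for info_type, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if info_type == "CREDIT_CARD":
            matches = [m for m in matches if luhn_valid(m)]
        if matches:
            findings[info_type] = matches
    return findings
```

Calling `classify("card 4111 1111 1111 1111, order 1234 5678 9012 3456")` reports only the first number, because the second fails the checksum. Validation steps like this are one way to cut false positives before findings reach an alert queue.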

AWS Macie

Macie uses machine learning and pattern matching to discover sensitive data in S3 buckets. You run it through classification jobs (one-time or scheduled), and it reports what it detects as findings.

# Enable Macie
aws macie2 enable-macie \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Create a classification job for a specific bucket
aws macie2 create-classification-job \
  --name "PII Scan - prod-data" \
  --job-type SCHEDULED \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "123456789012",
      "buckets": ["prod-data-bucket"]
    }]
  }' \
  --sampling-percentage 50 \
  --schedule-frequency '{"dailySchedule": {}}'

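# List sensitive data findings
# Macie's "category" field holds CLASSIFICATION or POLICY, so filter on the
# finding type to select financial data such as credit card numbers.
aws macie2 list-findings \
  --finding-criteria '{"criterion": {"type": {"eq": ["SensitiveData:S3Object/Financial"]}}}'
# Illustrative output:
# {
#   "findingIds": [
#     "arn:aws:macie2:us-east-1:...:finding/pii-credit-card-001"
#   ]
# }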

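# Get finding details (the command takes one or more IDs)
aws macie2 get-findings \
  --finding-ids "arn:aws:macie2:us-east-1:...:finding/pii-credit-card-001"
# Simplified, illustrative output (the real response wraps each finding
# in a "findings" array and includes many more fields):
# {
#   "category": "CLASSIFICATION",
#   "type": "SensitiveData:S3Object/Financial",
#   "severity": {"description": "High", "score": 3},
#   "description": "Credit card numbers found in prod-data-bucket/data/transactions.csv"
# }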
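
The JSON arguments passed to the Macie commands are easy to unbalance when written by hand inside shell quotes. As a hedged sketch, the payloads can be generated with Python's standard library instead. The helper names here are hypothetical, and the field and value passed to `finding_criteria` are only examples of the criterion shape used in this section.

```python
import json
import shlex


def s3_job_definition(account_id: str, buckets: list[str]) -> str:
    """Serialize the bucketDefinitions structure used by the classification job."""
    return json.dumps(
        {"bucketDefinitions": [{"accountId": account_id, "buckets": buckets}]}
    )


def finding_criteria(field: str, values: list[str]) -> str:
    """Serialize an equality criterion for `aws macie2 list-findings`."""
    return json.dumps({"criterion": {field: {"eq": values}}})


def cli_argument(flag: str, payload: str) -> str:
    """Return `flag 'payload'` with the payload quoted safely for a POSIX shell."""
    return f"{flag} {shlex.quote(payload)}"


if __name__ == "__main__":
    print(cli_argument("--s3-job-definition",
                       s3_job_definition("123456789012", ["prod-data-bucket"])))
    print(cli_argument("--finding-criteria",
                       finding_criteria("category", ["CLASSIFICATION"])))
```

Because `json.dumps` always emits balanced braces and `shlex.quote` handles the shell quoting, the printed fragments can be pasted into the commands without manual escaping.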

Azure Purview

Azure Purview (now branded Microsoft Purview) scans data sources across Azure, on-premises, and multi-cloud environments. It classifies sensitive data and tracks data lineage.

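# Create a scan for an Azure SQL Database source
# (illustrative: confirm with `az purview --help` that your CLI extension
# version exposes scan commands; if it does not, create the scan in the
# Purview portal or through the Purview scanning REST API)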
az purview scan create \
  --account-name prod-purview \
  --name prod-sql-scan \
  --kind AzureSqlDatabase \
  --resource-group prod-rg \
  --credential '{"kind": "ManagedIdentity"}' \
  --collection '{"referenceName": "prod-collection", "type": "CollectionReference"}'

# Start a scan run
az purview scan run start \
  --account-name prod-purview \
  --scan-name prod-sql-scan

# List the classification rules that scans apply
az purview classification-rule list \
  --account-name prod-purview \
  --query "[].{Name:name, Action:properties.classificationAction}" \
  --output table
# Output:
# Name                Action
# CreditCard          ApplyClassification
# SocialSecurityNum   ApplyClassification
# Email               ApplyClassification
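
For database sources, classification is usually decided per column: a column receives a label when enough of its sampled values match a rule. The sketch below illustrates that idea under stated assumptions. It is not Purview's implementation; the 60% threshold, the two regex rules, and the function name are hypothetical, and the rule names are borrowed from the output table above.

```python
import re
from typing import Iterable, Optional

# Hypothetical rules keyed by classification name.
RULES = {
    "SocialSecurityNum": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def classify_column(
    values: Iterable[object],
    rules: dict[str, re.Pattern],
    min_match_ratio: float = 0.6,  # assumed threshold, tune per data set
) -> Optional[str]:
    """Return the best-matching classification for a sampled column, or None."""
    # Ignore NULLs and blank cells so sparse columns are judged on real values.
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return None
    best_name, best_ratio = None, 0.0
    for name, pattern in rules.items():
        ratio = sum(1 for v in sample if pattern.fullmatch(v)) / len(sample)
        if ratio >= min_match_ratio and ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name
```

A threshold keeps a single stray value (for example, one SSN typed into a notes field) from labeling the whole column, at the cost of missing sparse sensitive data, so the ratio is a tuning decision rather than a fixed constant.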

GCP DLP API

The GCP DLP API (now part of Google Cloud's Sensitive Data Protection) inspects data for sensitive patterns and can de-identify it through redaction, masking, or tokenization.

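# Create a DLP inspection template
# (illustrative: check `gcloud dlp --help` on your gcloud release; if the
# command group is missing, the same template can be created through the
# DLP REST API method projects.inspectTemplates.create)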
gcloud dlp inspect-templates create \
  --project my-project \
  --template-id pii-inspect \
  --display-name "PII Inspection Template" \
  --info-types "CREDIT_CARD_NUMBER" "EMAIL_ADDRESS" "US_SOCIAL_SECURITY_NUMBER" \
  --min-likelihood LIKELY

# Inspect a Cloud Storage bucket
gcloud dlp jobs create \
  --project my-project \
  --inspect-job '{
    "storageConfig": {
      "cloudStorageOptions": {
        "fileSet": {"url": "gs://my-data-bucket/**/*.csv"}
      }
    },
    "inspectTemplateName": "projects/my-project/inspectTemplates/pii-inspect"
  }'

# List DLP job results
gcloud dlp jobs list \
  --project my-project \
  --filter "state:DONE" \
  --format="table(name, inspectDetails.result.processedBytes)"
# Output:
# Name                                                    ProcessedBytes
# projects/my-project/dlpJobs/job-abc123                 536870912
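
The three de-identification approaches named in this section differ in what survives the transformation. The sketch below shows the idea with the standard library only. It is a simplified stand-in, not the DLP API's behavior: the function names are hypothetical, and `tokenize` uses a keyed hash (HMAC-SHA256), which is deterministic but not reversible, unlike some tokenization schemes.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str, pattern: re.Pattern, label: str) -> str:
    """Redaction: replace every match with a placeholder; the value is lost."""
    return pattern.sub(f"[{label}]", text)


def mask(value: str, visible: int = 4, char: str = "*") -> str:
    """Masking: hide all but the last `visible` characters."""
    keep = value[-visible:] if visible > 0 else ""
    return char * (len(value) - len(keep)) + keep


def tokenize(value: str, key: bytes) -> str:
    """Tokenization stand-in: the same input and key always give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


if __name__ == "__main__":
    print(redact("Mail jane@example.com now", EMAIL, "EMAIL_ADDRESS"))
    print(mask("4111111111111111"))
    print(tokenize("jane@example.com", key=b"demo-key-not-for-production"))
```

Redaction suits logs and exports where the value is never needed again, masking keeps enough of the value for a human to recognize a record, and deterministic tokens let analysts join data sets on a field without seeing the original value.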

Automated Remediation

When DLP discovers sensitive data in an insecure location, automatically apply protections.

# AWS: block public access on a bucket flagged by a Macie finding
# An EventBridge rule that matches Macie findings triggers a Lambda function that:
# - reads the finding details
# - blocks public access on the affected bucket, equivalent to:
aws s3api put-public-access-block \
  --bucket prod-data-bucket \
  --public-access-block-configuration \
    BlockPublicAcls=true,BlockPublicPolicy=true,IgnorePublicAcls=true,RestrictPublicBuckets=true
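
The Lambda step in that flow can be sketched as follows. This is a minimal illustration under assumptions, not a production handler: the event field paths (`detail.resourcesAffected.s3Bucket...`) reflect the shape of Macie finding events as the author of this sketch understands it and should be checked against a real event, and the S3 client is passed in as a parameter (a real Lambda handler receives `event` and `context` and would create a boto3 client itself). `put_public_access_block` is the boto3 S3 client call that corresponds to the CLI command above.

```python
from typing import Any, Optional

# Same four settings as the CLI command above.
BLOCK_ALL = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def bucket_to_lock_down(event: dict) -> Optional[str]:
    """Return the bucket name if the finding concerns a publicly accessible bucket."""
    detail = event.get("detail", {})
    bucket = detail.get("resourcesAffected", {}).get("s3Bucket", {})
    permission = bucket.get("publicAccess", {}).get("effectivePermission")
    if event.get("source") == "aws.macie" and permission == "PUBLIC":
        return bucket.get("name")
    return None


def handler(event: dict, s3_client: Any) -> Optional[str]:
    """Block public access when needed; return the remediated bucket or None."""
    bucket = bucket_to_lock_down(event)
    if bucket is not None:
        s3_client.put_public_access_block(
            Bucket=bucket,
            PublicAccessBlockConfiguration=BLOCK_ALL,
        )
    return bucket
```

Keeping the decision (`bucket_to_lock_down`) separate from the side effect makes the logic testable with a stub client, which matters for code that changes access controls on production buckets.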

Common Mistakes

  1. Scanning only once: Data changes daily. Schedule recurring scans for continuous discovery.
  2. No remediation automation: Finding sensitive data without acting on it leaves it exposed. Connect DLP findings to automated remediation workflows.
  3. Ignoring false positives: DLP tools sometimes flag benign data, such as test card numbers or order IDs that resemble SSNs. Tune classification rules to reduce noise and focus on real risks.
  4. Not scanning all data sources: DLP often covers S3 and Blob storage but misses databases, data warehouses, and data pipelines. Extend coverage to all data stores.
  5. No data classification policy: Without defined sensitivity labels, DLP cannot prioritize findings. Implement a data classification policy first.
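
To make the last point concrete, a classification policy can be as simple as a table that maps detected info types to sensitivity labels, which then drives the order in which findings are handled. The sketch below is one possible shape for that. The labels, the scoring formula, and the `Finding` fields are hypothetical choices for illustration, and the info type names are the ones used in the GCP inspection template earlier.

```python
from dataclasses import dataclass

# Hypothetical policy: info type -> sensitivity label.
SENSITIVITY = {
    "CREDIT_CARD_NUMBER": "restricted",
    "US_SOCIAL_SECURITY_NUMBER": "restricted",
    "EMAIL_ADDRESS": "confidential",
}
RANK = {"restricted": 3, "confidential": 2, "internal": 1}


@dataclass(frozen=True)
class Finding:
    location: str
    info_type: str
    publicly_accessible: bool


def priority(finding: Finding) -> int:
    """Score by sensitivity label first, with public exposure as a tie-breaker."""
    label = SENSITIVITY.get(finding.info_type, "internal")
    return RANK[label] * 2 + (1 if finding.publicly_accessible else 0)


def triage(findings: list[Finding]) -> list[Finding]:
    """Return findings ordered from most to least urgent."""
    return sorted(findings, key=priority, reverse=True)
```

With this ordering, a credit card number in a public bucket is handled before an email address in a private table, and unknown info types fall back to the lowest label instead of being dropped.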

Practice Questions

  1. What is the difference between AWS Macie and AWS GuardDuty for data protection?
  2. How does Azure Purview classify sensitive data across hybrid sources?
  3. What de-identification techniques does GCP DLP API support?
  4. Why should DLP scanning be scheduled rather than one-time?
  5. How can DLP findings trigger automated security remediation?

Challenge

Implement DLP across a multi-cloud data environment. Configure AWS Macie to scan S3 buckets for PII weekly. Set up Azure Purview to scan an Azure SQL Database for credit card numbers. Configure GCP DLP to inspect BigQuery tables for email addresses. For each platform, write the CLI commands to create the scan, view findings, and automate a remediation action (block public access in AWS, apply masking in Azure, redact in GCP).

FAQ

What is cloud DLP?

Data Loss Prevention in the cloud discovers, classifies, and protects sensitive data through scanning, policy enforcement, and automated remediation.

How does AWS Macie detect sensitive data?

Macie uses machine learning and pattern matching to identify PII, financial data, and custom sensitive data types in S3 buckets.

Can Azure Purview scan on-premises databases?

Yes. Purview supports hybrid scanning across Azure, on-premises SQL Server, Power BI, and third-party sources.

What is GCP DLP API?

A service that inspects text, images, and structured data for sensitive content and applies de-identification techniques like redaction and masking.

How do I automate DLP response?

Use an event-driven architecture: DLP findings trigger cloud functions or workflows that apply protective measures such as blocking public access or alerting security teams.
