Skip to content

Data Warehousing for Analytics -- Snowflake, BigQuery & Redshift

DodaTech Updated 2026-06-22 6 min read

Data Warehousing for analytics centralizes data from multiple sources into a single queryable Repository optimized for analytical workloads using platforms like Snowflake, BigQuery, and Redshift.

What You'll Learn

In this tutorial, you will learn how to design a modern data warehouse for analytics, including schema design (star schema, Snowflake schema), ETL pipeline construction with dbt, query optimization techniques, and cost management across Snowflake, BigQuery, and Redshift.

Why It Matters

Raw data in production databases is not suitable for analytics. OLTP databases are optimized for writes, not for the complex aggregations and large scans that analytics queries require. A dedicated data warehouse decouples analytics from production, improves query performance by 10-100x, and enables self-serve analytics for the entire organization.

Real-World Use

Durga Antivirus Pro ingests 2TB of daily scan data into Snowflake from 50+ regional servers. The data warehouse runs Compliance reports, threat trend analysis, and customer-facing dashboards without impacting scan performance. The team uses dbt for transformations and queries return in under 5 seconds for 99% of dashboard queries.

Data Warehouse Architecture

flowchart TD
    A[Source Systems] -->|Extract| B["ETL / ELT Pipeline"]
    B --> C[Staging Layer]
    C --> D[Raw Data Lake]
    D --> E[Transform (dbt)]
    E --> F[Star Schema]
    F --> G[Fact Tables]
    F --> H[Dimension Tables]
    G --> I[Analytics Queries]
    H --> I
    I --> J[Dashboards]
    I --> K[ML Models]
    I --> L[Ad-Hoc Analysis]

Star Schema Design

The most common analytics schema: one fact table surrounded by dimension tables:

-- Fact table: page views
CREATE TABLE analytics.fact_page_views (
    page_view_id BIGINT IDENTITY(1,1),
    date_id INT NOT NULL,
    user_id INT NOT NULL,
    page_id INT NOT NULL,
    session_id INT NOT NULL,
    device_id INT NOT NULL,
    load_time_ms INT,
    view_duration_seconds INT,
    CONSTRAINT pk_fact_page_views PRIMARY KEY (page_view_id)
);

-- Dimension tables
CREATE TABLE analytics.dim_date (
    date_id INT PRIMARY KEY,
    full_date DATE NOT NULL,
    year INT NOT NULL,
    quarter INT NOT NULL,
    month INT NOT NULL,
    week INT NOT NULL,
    day_of_week INT NOT NULL,
    is_weekend BOOLEAN NOT NULL
);

CREATE TABLE analytics.dim_page (
    page_id INT PRIMARY KEY,
    page_url VARCHAR(500) NOT NULL,
    page_title VARCHAR(200),
    page_category VARCHAR(100),
    page_type VARCHAR(50)
);

CREATE TABLE analytics.dim_device (
    device_id INT PRIMARY KEY,
    device_type VARCHAR(50),
    browser VARCHAR(50),
    browser_version VARCHAR(20),
    os VARCHAR(50),
    screen_resolution VARCHAR(20)
);

Expected behavior: The fact table stores event data with foreign keys to dimension tables. Queries join through dimensions for filtering and grouping. This design supports fast aggregation queries on billions of rows.

ETL Pipeline with dbt

dbt (data build tool) transforms raw data into analytics-ready models:

# dbt_project.yml
name: 'analytics'
version: '1.0'
profile: 'snowflake'

models:
  analytics:
    staging:
      materialized: view
    marts:
      materialized: table
      schema: analytics
-- models/marts/fct_page_views.sql
WITH source AS (
    SELECT * FROM {{ ref('stg_web_events') }}
    WHERE event_type = 'page_view'
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['event_id']) }} AS page_view_id,
    {{ dbt_utils.date_id('event_time') }} AS date_id,
    u.user_id,
    p.page_id,
    d.device_id,
    s.session_id,
    e.load_time_ms,
    e.view_duration_seconds
FROM source e
LEFT JOIN {{ ref('dim_users') }} u ON e.user_id = u.user_id
LEFT JOIN {{ ref('dim_pages') }} p ON e.page_url = p.page_url
LEFT JOIN {{ ref('dim_devices') }} d ON e.device_id = d.device_id
LEFT JOIN {{ ref('dim_sessions') }} s ON e.session_id = s.session_id

Expected behavior: dbt run executes the model, creating or replacing the analytics.fct_page_views table in Snowflake with surrogate keys and dimension foreign keys populated from staging views.

Query Optimization

Optimize queries for warehouse execution:

-- Poor query: full scan on string comparison
SELECT
    page_url,
    count(*) AS views
FROM fact_page_views f
JOIN dim_page p ON f.page_id = p.page_id
WHERE p.page_url LIKE '/docs/%'  -- String scan on join column
GROUP BY page_url;

-- Optimized query: uses dimension filter before join
SELECT
    p.page_url,
    count(*) AS views
FROM dim_page p
JOIN fact_page_views f ON p.page_id = f.page_id
WHERE p.page_category = 'documentation'  -- Uses dimension index
GROUP BY p.page_url;

Expected behavior: The optimized query filters the dimension table first (reducing rows early) and uses the integer foreign key for the join instead of a string match. On Snowflake, this can reduce query time from minutes to seconds.

Tool Comparison

Feature Snowflake BigQuery Amazon Redshift ClickHouse
Architecture Cloud-native Serverless Cluster-based Columnar
Compute/storage separation Yes Yes No (RA3: partial) No
Auto-scaling Yes Yes Manual Manual
Query optimizer Automatic Automatic Manual (sort keys) In-index
Concurrency scaling Yes Yes Yes (elastic) Limited
Pricing Per second compute Per TB scanned Per hour cluster Per server
Free tier Trial credits $300 free credits 12 months free No

Common Errors

1. Star Schema Joining Low-Cardinality Dimensions Directly

Dimensions like dim_device with hundreds of rows do not benefit from star schema. Denormalize low-cardinality dimensions directly into the fact table to avoid unnecessary joins.

2. Full Table Scans on Partitioned Tables

BigQuery charges per TB scanned. Without partition filtering, every query scans the entire table. Always filter on the partition column (usually date) in WHERE clauses.

3. Materializing All dbt Models as Tables

Views are free; tables cost storage. Materialize only frequently queried marts as tables. Keep staging models as views to avoid duplicating raw data.

4. Ignoring Distribution Keys on Redshift

Redshift distributes data across nodes. Without a distribution key matching the join column, data must shuffle between nodes, causing slow queries on large joins.

5. Snowflake Auto-Suspend Misconfiguration

Setting the auto-suspend timeout too short (60 seconds) causes frequent spin-up/spin-down cycles. Set it to 5-10 minutes for interactive workloads to avoid latency spikes.

Practice Questions

1. What is the difference between a fact table and a dimension table? A fact table stores measurable events (page views, purchases) with foreign keys to dimensions. Dimension tables store descriptive attributes (page URL, user name, date) that provide context for the facts.

2. Why is star schema preferred over normalized schemas for analytics? Star schema uses denormalized dimensions and a single fact table, reducing the number of joins in analytical queries. Normalized schemas require many joins, slowing down aggregation queries.

3. How does dbt help with data warehouse transformations? dbt enables SQL-based transformations with dependency management, testing, documentation generation, and incremental model building. It follows the ELT pattern: load raw data first, then transform in the warehouse.

4. What is compute-storage separation in Snowflake? Compute and storage scale independently. Storage stays in S3, compute clusters spin up per query. This allows concurrent workloads without resource contention and lets you pause compute when not in use.

5. Challenge: Design a star schema for an e-commerce analytics warehouse with fact tables for orders, page views, and support tickets. Implement the schema in Snowflake or BigQuery, load 1M rows of sample data, and write three analytical queries: monthly revenue by product category, funnel conversion from page view to purchase, and customer lifetime value by acquisition channel.

Mini Project

Build an end-to-end analytics warehouse using Snowflake (or BigQuery free tier) and dbt. Ingest web events, user data, and product catalog from CSV files into a raw staging schema. Build a star schema with dbt including fact_page_views, fact_orders, dim_users, dim_products, and dim_date. Create dbt tests for uniqueness and referential integrity. Write three dashboard queries: daily active users, revenue by product category, and 7-day user retention. Deploy the dbt docs site to document the schema.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro