SQL — Modeling, Queries, Data Quality & Pipeline

Educational examples on a fictitious e-commerce dataset (CH/EU) covering 2019–2024. The point is not to “show a stack” but to highlight what you can actually do with SQL: build a simple model, produce readable KPIs, read an execution plan to optimize a query (make it sargable), validate data quality, and keep a lightweight production pipeline observable.

Tags: SQL Server · SSMS · ETL · Data Quality


P1 — Modeling & Business Queries

SQL relational schema (customers, orders, order_items, products, payments)
Logical schema: customers → orders → order_items, with products and payments.

Context & objectives

We start from a minimal relational model to illustrate what SQL can do: clear queries, monthly indicators, and reusable results for a dashboard. We rely on explicit types, primary/foreign keys, and simple constraints (NOT NULL, CHECK).
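
As a minimal sketch of that model (T-SQL; column names such as amount_chf are illustrative assumptions, not the exact demo schema):

```sql
-- Minimal sketch of two core tables; names like amount_chf are assumptions.
CREATE TABLE customers (
    customer_id  INT           NOT NULL PRIMARY KEY,
    email        NVARCHAR(255) NOT NULL UNIQUE,
    country_code CHAR(2)       NOT NULL
        CHECK (country_code IN ('CH', 'DE', 'FR', 'IT', 'AT'))
);

CREATE TABLE orders (
    order_id    INT           NOT NULL PRIMARY KEY,
    customer_id INT           NOT NULL REFERENCES customers(customer_id),
    order_date  DATETIME2(0)  NOT NULL,
    amount_chf  DECIMAL(10,2) NOT NULL CHECK (amount_chf >= 0)
);
```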

On the query side, we combine CTEs (WITH) and a few window functions (e.g. SUM() OVER) to aggregate properly. Target metrics: Revenue (CHF), number of orders, average basket, and active customers per month.
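
For instance, a small sketch of that pattern (assuming the tables above): monthly revenue in a CTE, then a running total via SUM() OVER.

```sql
-- CTE + window function: monthly revenue and its running total.
WITH monthly AS (
    SELECT DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS month_start,
           SUM(amount_chf) AS revenue_chf
    FROM   orders
    GROUP  BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
)
SELECT month_start,
       revenue_chf,
       SUM(revenue_chf) OVER (ORDER BY month_start
                              ROWS UNBOUNDED PRECEDING) AS revenue_running_chf
FROM   monthly
ORDER  BY month_start;
```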

Monthly KPIs (excerpt)

Monthly revenue, orders, average basket, active customers
“sales_monthly” view: revenue_chf, orders_cnt, avg_order_value, active_customers.
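
One plausible definition of that view, matching the columns in the caption (built on the P1 sketch above, so the column names remain assumptions):

```sql
-- sales_monthly: one row per month with the four KPIs from the caption.
CREATE VIEW sales_monthly AS
SELECT DATEFROMPARTS(YEAR(o.order_date), MONTH(o.order_date), 1) AS month_start,
       SUM(o.amount_chf)             AS revenue_chf,
       COUNT(*)                      AS orders_cnt,
       AVG(o.amount_chf)             AS avg_order_value,
       COUNT(DISTINCT o.customer_id) AS active_customers
FROM   orders AS o
GROUP  BY DATEFROMPARTS(YEAR(o.order_date), MONTH(o.order_date), 1);
```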

P2 — Performance & Indexing (Execution Plans)

The goal: rewrite a query to make it sargable (search-ARGument-able) and leverage a proper index. Here, the “Top customers by period” query applied a function to the date column, preventing an index seek and forcing the optimizer to scan more rows than necessary.

Execution plan before — non-sargable query
Before: non-sargable filter (CONVERT(date, o.order_date)); the index cannot be used effectively → wide scan.
Execution plan after — sargable query with index
After: direct bounds on indexed column (o.order_date >= '2021-01-01' AND o.order_date < '2023-01-01') + composite index (order_date, customer_id) → lighter plan, cheaper sort.
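
The exact “Top customers by period” query is not reproduced here, so the shape below is an assumption, but it captures the rewrite described in the captions:

```sql
-- BEFORE (non-sargable): the CONVERT() on order_date blocks an index seek.
SELECT o.customer_id, SUM(o.amount_chf) AS revenue_chf
FROM   orders AS o
WHERE  CONVERT(date, o.order_date) BETWEEN '2021-01-01' AND '2022-12-31'
GROUP  BY o.customer_id;

-- AFTER (sargable): direct bounds on the indexed column.
SELECT o.customer_id, SUM(o.amount_chf) AS revenue_chf
FROM   orders AS o
WHERE  o.order_date >= '2021-01-01'
  AND  o.order_date <  '2023-01-01'
GROUP  BY o.customer_id;

-- Supporting composite index, as named in the caption.
CREATE INDEX ix_orders_date_customer ON orders (order_date, customer_id);
```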

Best practices: limit selected columns, prefer EXISTS over IN when appropriate, pre-aggregate if needed, and read the plan (costs, operators).
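
As one illustration of the EXISTS point, a sketch of “customers with at least one payment” (linking payments to orders via order_id is an assumption):

```sql
-- EXISTS stops at the first match instead of materializing an IN list.
SELECT c.customer_id, c.email
FROM   customers AS c
WHERE  EXISTS (SELECT 1
               FROM   payments AS p
               JOIN   orders   AS o ON o.order_id = p.order_id
               WHERE  o.customer_id = c.customer_id);
```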

P3 — Data Quality & Reliability

Data quality summary view: duplicates, nulls, negative payments, missing references
Summary view: rule, severity (high/medium/low), affected rows — sorted by priority.

This illustrates what SQL can do to make datasets more reliable: data-quality rules (duplicates, NULLs, orphan references, negative values), consistent formatting, and prioritization of fixes. A view consolidates the results for dashboards or alerting (a sketch follows the list below).

  • Simple rules (composite key uniqueness, CHECK constraints).
  • Severity-based prioritization to handle highest-impact issues first.
  • Example corrections (safe UPDATE/DELETE).
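
A sketch of such a consolidated view (the dq_summary name, rule names, and severities are illustrative assumptions):

```sql
-- dq_summary: one row per rule with severity and affected rows.
CREATE VIEW dq_summary AS
SELECT 'duplicate_order_items' AS rule_name, 'high' AS severity,
       COUNT(*) AS affected_rows  -- duplicate (order_id, product_id) groups
FROM   (SELECT order_id, product_id
        FROM   order_items
        GROUP  BY order_id, product_id
        HAVING COUNT(*) > 1) AS dup
UNION ALL
SELECT 'orphan_order_customer', 'high', COUNT(*)
FROM   orders AS o
WHERE  NOT EXISTS (SELECT 1 FROM customers AS c
                   WHERE  c.customer_id = o.customer_id)
UNION ALL
SELECT 'negative_payment_amount', 'medium', COUNT(*)
FROM   payments
WHERE  amount_chf < 0;  -- amount_chf in payments is an assumption
```

Query it with ORDER BY severity to get the prioritized list from the caption. A safe correction can then target one rule at a time, here de-duplicating order_items (order_item_id as tiebreaker is an assumption; run the CTE as a SELECT first to inspect what would be removed):

```sql
-- Keep one row per (order_id, product_id), delete the extra copies.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id, product_id
                              ORDER BY order_item_id) AS rn
    FROM   order_items
)
DELETE FROM ranked WHERE rn > 1;  -- deleting via the CTE targets order_items
```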

P4 — Lightweight ETL & Execution Log

Pipeline diagram: staging → clean → mart with audit
Logical pipeline: staging (raw) → clean (normalized) → mart (BI-ready), plus audit area (jobs & quality).

A “lightweight” SQL pipeline stays readable and idempotent: incremental loads, MERGE/UPSERT, technical keys, timestamps. Scheduling can be handled via cron or Azure DevOps (e.g., a daily run at 08:30 CET), with retries on failure.
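
A sketch of the idempotent upsert step (the staging/clean schema names and the updated_at column are assumptions):

```sql
-- Idempotent load: re-running the MERGE leaves clean.orders unchanged.
MERGE clean.orders AS tgt
USING staging.stg_orders AS src
      ON tgt.order_id = src.order_id
WHEN MATCHED AND (tgt.amount_chf <> src.amount_chf
               OR tgt.order_date <> src.order_date) THEN
    UPDATE SET tgt.amount_chf = src.amount_chf,
               tgt.order_date = src.order_date,
               tgt.updated_at = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (order_id, customer_id, order_date, amount_chf, updated_at)
    VALUES (src.order_id, src.customer_id, src.order_date,
            src.amount_chf, SYSUTCDATETIME());
```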

Execution log: jobs, timestamps, status, row counts, message
Execution log (job_runs): start/end timestamps, status, rows_in/out, message. Demo output: rerun after fix, constraint violation, delta import, monthly aggregation, etc.
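
A possible shape for that table and its logging pattern (the types, the audit schema, and the placeholder row counts are assumptions):

```sql
CREATE TABLE audit.job_runs (
    run_id     INT IDENTITY(1,1) PRIMARY KEY,
    job_name   NVARCHAR(100) NOT NULL,
    started_at DATETIME2(0)  NOT NULL DEFAULT SYSUTCDATETIME(),
    ended_at   DATETIME2(0)  NULL,
    status     NVARCHAR(20)  NOT NULL DEFAULT N'running',  -- running/ok/failed
    rows_in    INT           NULL,
    rows_out   INT           NULL,
    message    NVARCHAR(400) NULL
);

-- Typical usage around one load step:
DECLARE @run_id INT;
INSERT INTO audit.job_runs (job_name) VALUES (N'load_orders_delta');
SET @run_id = SCOPE_IDENTITY();

-- ... run the step, capture row counts ...

UPDATE audit.job_runs
SET    ended_at = SYSUTCDATETIME(), status = N'ok',
       rows_in  = 1250, rows_out = 1248,  -- placeholder counts
       message  = N'delta import'
WHERE  run_id = @run_id;
```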

Expected outcome: a simple, observable, and traceable pipeline, ready to plug into an operational dashboard.