Stacks on Stacks: Navigating the Modern Data Engineering Landscape
“The goal is to turn data into information, and information into insight.”
— Carly Fiorina
“Without data, you’re just another person with an opinion.”
— W. Edwards Deming
Chapter One: From Storage to Strategy
In the early 2000s, companies started stockpiling data like they were building digital Fort Knoxes. Hadoop brought batch processing to the enterprise. Data lakes promised infinite scalability. Yet too often, the outcome was a swamp—murky, unusable, and undocumented.
Fast forward to today. The modern data stack isn’t just about storing data; it’s about activating it—making data discoverable, trustworthy, observable, and actionable across every corner of the organization.
Chapter Two: A Tour of the Modern Data Stack
The “stack” is no longer monolithic—it’s modular, composable, and increasingly decentralized. Here’s a bird’s-eye view:
Core Stack Layers
| Layer | Common Tools | Purpose |
| --- | --- | --- |
| Ingestion | Kafka, Fivetran, Airbyte, Debezium | Extract & stream data |
| Storage | S3, Delta Lake, BigQuery, Snowflake | Persist raw and curated data |
| Processing | Spark, dbt, Flink, Beam | Clean, model, and transform data |
| Serving & ML | Looker, Tableau, Hex, MLflow | Analytics, dashboards, ML ops |
| Orchestration | Airflow, Dagster, Prefect | Schedule and manage data workflows |
| Lineage & Cataloging | Atlan, DataHub, Collibra, Amundsen | Track and document metadata |
| Observability & Quality | Monte Carlo, Soda, Great Expectations | Monitor pipeline health and accuracy |
Chapter Three: The “-ilities” That Make or Break Your Data Strategy
Today’s data systems are judged not just by what they do, but by how well they do it—and how easily they enable others to do the same. Here’s a breakdown of the key non-functional attributes, or “-ilities”:
- Observability: Knowing if your pipeline is broken before your exec dashboard lights up in red.
- Lineage: Tracing a number back to its source (and its last 5 transformations).
- Cataloging & Discoverability: Helping analysts, engineers, and ML practitioners know what exists and how to use it.
- Data Quality: Validating freshness, accuracy, completeness, and consistency (a minimal check sketch follows this list).
- Governance: Ensuring data access policies, usage standards, and compliance controls are in place (think GDPR, HIPAA).
- Scalability & Elasticity: Handling bursts of activity—like Black Friday or a viral campaign—without breaking a sweat.
- Resiliency: What happens when a job fails? Can the system retry and recover automatically?
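To make the quality checks concrete, here’s a minimal sketch in plain Python with pandas. The `orders` table and its `order_id`, `amount`, and `updated_at` columns are illustrative assumptions; in practice teams usually reach for a tool like Great Expectations or Soda rather than hand-rolling checks.

```python
import pandas as pd

def check_quality(orders: pd.DataFrame) -> dict:
    """Basic completeness, accuracy, consistency, and freshness checks."""
    now = pd.Timestamp.now(tz="UTC")
    return {
        "no_null_ids": orders["order_id"].notna().all(),       # completeness
        "no_negative_amounts": (orders["amount"] >= 0).all(),  # accuracy
        "unique_ids": orders["order_id"].is_unique,            # consistency
        "fresh_within_24h": now - orders["updated_at"].max()
        < pd.Timedelta(hours=24),                              # freshness
    }

# Illustrative data: three orders, the newest loaded an hour ago.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
    "updated_at": [pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=h) for h in (30, 10, 1)],
})
print(check_quality(orders))  # all four checks pass on this sample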
Chapter Four: Batch, Streaming, Hybrid—When and Why
Batch: The Reliable Workhorse
Use Cases:
- Payroll, financial reconciliation, billing
- ML training jobs
- Daily/weekly business reports
Tech Examples:
- Apache Spark, dbt, Airflow with Snowflake or S3
Pros: Efficient for large volumes, easy to debug
Cons: Not real-time, higher latency
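As a sketch of what batch orchestration looks like, here’s a minimal Airflow DAG (assuming Airflow 2.4+) that extracts data and then runs dbt models nightly. The script name, bucket, and dbt project path are placeholders, not a prescribed setup.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Nightly batch pipeline: land raw data, then run dbt models on top of it.
with DAG(
    dag_id="nightly_batch",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # run at 02:00 every day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(
        task_id="extract_to_s3",
        bash_command="python extract.py --target s3://my-bucket/raw/",  # hypothetical script
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/analytics",  # hypothetical project path
    )
    extract >> transform  # transform only after extraction succeeds
```

Note the `retries` in `default_args`: the resiliency “-ility” from Chapter Three shows up as a one-line config here.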
Streaming: The Real-Time Engine
Use Cases:
- Fraud detection
- IoT telemetry (smart homes, connected cars)
- User behavior tracking
Tech Examples:
- Apache Kafka, Flink, Redpanda, Materialize
Pros: Low-latency, immediate reactions
Cons: Operationally complex, harder to backfill
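For flavor, here’s a minimal streaming consumer sketch using the kafka-python client. The `payments` topic, message schema, and $5,000 threshold are illustrative assumptions, not a real fraud model.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume payment events and flag suspiciously large transactions in real time.
consumer = KafkaConsumer(
    "payments",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

FRAUD_THRESHOLD = 5_000  # illustrative cutoff in dollars

for message in consumer:
    event = message.value  # e.g. {"user_id": "u1", "amount": 7200.0}
    if event.get("amount", 0) > FRAUD_THRESHOLD:
        # In production this would publish to an alerts topic or call a model.
        print(f"Possible fraud: user={event['user_id']} amount={event['amount']}")
```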
Hybrid: The Pragmatic Blend
Use Cases:
- Real-time inventory paired with daily forecasting
- Live dashboards powered by aggregates
- ML pipelines that drift-detect in real time and retrain overnight
Tech Examples:
- Kafka for ingest → Flink + dbt for modeling → Snowflake + Looker for serving
Pros: Combines speed with completeness
Cons: Requires careful data modeling to avoid inconsistency
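One way to picture the hybrid blend is a serving query that stitches a real-time aggregate onto a batch baseline. A minimal sketch, assuming hypothetical `batch_totals` (nightly snapshot) and `stream_totals` (running counts since midnight):

```python
# Hybrid serving: stitch today's streaming deltas onto yesterday's batch snapshot.
batch_totals = {"sku_1": 120, "sku_2": 75}   # hypothetical nightly batch aggregate
stream_totals = {"sku_1": 8, "sku_3": 2}     # hypothetical counts since midnight

def units_sold(sku: str) -> int:
    """Complete-but-fresh view: batch history plus the streaming tail."""
    return batch_totals.get(sku, 0) + stream_totals.get(sku, 0)

print(units_sold("sku_1"))  # 128 = 120 from batch + 8 from the stream
```

The consistency caveat above lives at exactly this seam: if the batch snapshot and the streaming window overlap or leave a gap, the two sources double-count or drop events.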
Chapter Five: Architecture in Practice
A modern reference architecture threads the layers from Chapter Two together: streaming and batch sources land via Kafka or Fivetran into S3, Delta Lake, or a warehouse like Snowflake; Spark, Flink, and dbt clean and model the data; Looker, Hex, and MLflow serve it. Airflow or Dagster orchestrates the flow end to end, while cataloging (DataHub, Amundsen) and observability (Monte Carlo, Great Expectations) run as cross-cutting layers over everything.
Chapter Six: What Good Looks Like
Modern Success Stories
- Airbnb: Built Minerva to offer a centralized semantic layer across teams, with dbt + Presto + Amundsen powering discoverability and trust.
- Shopify: Combines Kafka for real-time events and Spark/Delta Lake for historical trends to support over 1M merchants.
- Monzo Bank: Uses real-time stream processing with Apache Kafka and event sourcing to enable financial reconciliation within minutes.
Chapter Seven: Where It Goes Wrong
Common Failure Modes
- Broken Lineage: A metric changes, and no one knows why.
- Opaque Pipelines: Data arrives late or not at all, but no alerts fire (a freshness-alert sketch follows this list).
- No Catalog: New employees spend weeks learning tribal knowledge just to find tables.
- Too Many Silos: Marketing, product, and finance run separate pipelines, and each team’s “truth” comes out slightly different.
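The opaque-pipelines failure is often the cheapest to fix. A minimal freshness alert sketch, where the table name, SLA, and last-load timestamp are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=6)  # illustrative freshness SLA

def check_freshness(table: str, last_updated: datetime) -> None:
    """Fire an alert when a table has not been loaded within its SLA."""
    lag = datetime.now(timezone.utc) - last_updated
    if lag > SLA:
        # In production this would page via Slack/PagerDuty rather than print.
        print(f"ALERT: {table} is {lag} stale (SLA {SLA})")

# Hypothetical usage: the last load finished eight hours ago, so this alerts.
check_freshness("analytics.orders", datetime.now(timezone.utc) - timedelta(hours=8))
```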
Chapter Eight: Strategy, Not Stack
Don’t start with “What tools should we use?” Start with “What questions do we need to answer?”
- Do you need real-time actions or just accurate insights?
- Do your teams trust the data?
- Can you trace every dashboard tile to its source?
- Can you confidently answer: “What broke, and why?”
The right data stack is an enabler—not just of insights, but of alignment, innovation, and trust.
Wrapping up…
The best data stack in the world is meaningless without good people and culture. A stack should empower cross-functional teams: data engineers, analysts, product managers, data scientists, and business leaders.
Invest in:
- Documentation as a feature
- Data contracts and expectations (see the contract sketch after this list)
- Observability as a core platform concern
- Internal training and enablement
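To show what a data contract can look like in code, here’s a minimal sketch using pydantic; the event name and fields are illustrative assumptions, not a prescribed schema.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError  # pip install pydantic

class OrderCreated(BaseModel):
    """Contract for a hypothetical order event: producers must honor this shape."""
    order_id: str
    amount: float
    created_at: datetime

try:
    # A producer emitting a bad payload fails fast, before it pollutes downstream tables.
    OrderCreated(order_id="o-123", amount="not-a-number", created_at=datetime.now())
except ValidationError as err:
    print(err)  # pinpoints the offending field instead of a silent bad row
```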
Because in the end, the best stack isn’t the flashiest—it’s the one everyone can use, trust, and build on.