Data Architectures and Their Economics

From warehouses and lakehouses to mesh and fabric, explore modern data architectures and see how each reshapes data costs, complexity, and vendor lock-in.


Summary

Seven modern data architectures offer distinct patterns, each with its own economic tradeoffs, coordination costs, and platform lock-in considerations.


This blog on data architecture economics originally appeared on Andrew Sillifant’s blog. It has been republished with the author’s credit and consent. 

The modern data stack (MDS) was never an architecture. It was a philosophy: best-of-breed tools, loosely coupled via APIs, separation of concerns. Fivetran for ingestion. Snowflake for storage. dbt for transformation. Looker for visualization. Buy the pieces; assemble yourself.

It worked when capital was cheap. Between 2020 and 2021, venture capital poured $5.24 billion into data infrastructure. The MAD landscape exploded to 1,416 companies. Every function got its own startup.

Then interest rates started to rise in 2022, and suddenly integration costs mattered.

Figure 1: Companies of Matt Turck’s MAD landscape, 2012 to 2025. From 139 logos to 2,011 at peak, then a deliberate editorial cut. Approximate values are estimated from report context where exact counts were not published.

Today, 70% of data teams juggle five to seven different tools just to manage daily workflows. About 40% cite maintaining integrations as their highest cost center. The “best of breed” philosophy created a tax that most organizations can no longer afford.

The response has been market consolidation. Fivetran and dbt merged in late 2025, combining into a $600M ARR entity. IBM acquired Confluent for $11.1 billion. Starting in 2023, Microsoft bundled everything into Fabric (an interesting preview of what next-gen platform architecture could look like). Databricks and Snowflake keep expanding scope.

The MDS era is over. What replaces it isn’t a single architecture but several competing patterns, each with different economic tradeoffs. Understanding those tradeoffs matters more than understanding the technology.

7 Data Architectures and Their Economics

Every architectural pattern is really an economic thesis in disguise. Here’s what each one is betting on.

The data warehouse is the oldest pattern, now reborn in the cloud as a scale-out managed service. Snowflake, BigQuery, Redshift, Azure Synapse. A single managed system where storage and compute are bundled under one vendor, even when they scale independently. You load data in; you query it out.

Figure 2: A cloud-native data warehouse centralizes control, compute, and storage behind a unified query engine and metadata layer.

The economics: Centralization reduces coordination cost. One system, one team, one bill. The vendor handles infrastructure so you don’t need an army of DBAs or platform admins.

The tradeoff: The vendor controls the economics. Snowflake’s pricing is simple, but 90% of your bill is query compute, and they set the rates. Switching means replatforming everything.
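If you want to watch that lever move, the metering data is queryable. Below is a minimal sketch, assuming the snowflake-connector-python package and read access to the ACCOUNT_USAGE share; the account details are placeholders. It pulls per-warehouse credit burn, the line item that typically dominates the bill.

```python
# A minimal sketch, assuming snowflake-connector-python and a role that
# can read SNOWFLAKE.ACCOUNT_USAGE. Credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",  # hypothetical account identifier
    user="analyst",
    password="...",
)

# Credits burned per virtual warehouse over the last 30 days --
# in a warehouse architecture, this one number is most of the bill.
cur = conn.cursor()
cur.execute("""
    SELECT warehouse_name, SUM(credits_used) AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits DESC
""")
for warehouse, credits in cur.fetchall():
    print(f"{warehouse}: {credits:.1f} credits")
```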

Adoption: According to a survey by BARC, 79% of organizations use a data warehouse in their analytics environment. It’s the default.

The lakehouse emerged in early 2020, primarily from Databricks, as a response to a specific inefficiency: organizations were paying twice, once to store data in a cheap lake (S3, ADLS) and again to copy it into an expensive warehouse for querying. The ETL between them was a tax on every insight.

Figure 3: A lakehouse architecture keeps data in open object storage while open table formats and shared catalogs power SQL, ML, and streaming engines from a single copy.

The economics: Store data once in open formats (Iceberg, Delta Lake, Hudi). Bring any compute engine to it. Eliminate the duplication. Commoditize storage, compete on compute.
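Here is a minimal sketch of the “one copy, many engines” idea, assuming the deltalake and duckdb Python packages and an illustrative table path; real use would also need cloud credentials.

```python
# A minimal sketch of "store once, bring any engine," assuming a Delta Lake
# table at an illustrative S3 path. Requires: pip install deltalake duckdb
from deltalake import DeltaTable
import duckdb

# One copy of the data, in an open table format on object storage.
events = DeltaTable("s3://analytics-lake/events")  # hypothetical path

# Engine 1: materialize to Arrow for Python / ML workloads.
arrow_events = events.to_pyarrow_table()

# Engine 2: run SQL over the very same data with DuckDB
# (DuckDB resolves the Python variable name via a replacement scan).
daily_counts = duckdb.sql(
    "SELECT event_date, COUNT(*) AS n FROM arrow_events GROUP BY event_date"
).fetchall()
```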

The tradeoff: More operational complexity. Databricks’ “two-bill problem” means you pay them for software and your cloud provider for infrastructure separately. You can optimize aggressively (spot instances can cut compute costs up to 90%), but you need engineering capacity to do it.

Adoption: Somewhere between 8% and 65%, depending on how it’s measured. A 2023 BARC survey found 8%-12% with distinct lakehouse implementations. A Dremio survey found 65% of participants running the majority of analytics on lakehouse platforms. The gap reflects definition differences. Either way, adoption is accelerating.

Zhamak Dehghani coined the term “data mesh” in 2019 while at Thoughtworks. It’s not a technology choice. It’s an organizational design. Where fabric virtualizes access from the top down through metadata and automation, mesh federates ownership from the bottom up through domain teams. Instead of a central data team owning everything, domain teams (marketing, finance, product) own their own data as products. A thin platform layer provides self-serve infrastructure. Federated governance keeps things consistent.

Figure 4: Data mesh shifts ownership to domain teams that publish data products on a shared platform, coordinated by federated governance.

The economics: Central data teams become bottlenecks at scale. The marginal cost of adding a new data product keeps rising because everything flows through the same people. Mesh flattens that curve by distributing ownership to the people with the most context.
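What a “data product” means in practice is easiest to see as a contract. The sketch below is hypothetical, not any specific mesh framework: a domain team publishes a descriptor, and the thin platform layer enforces the federated rules.

```python
# A hypothetical sketch of a "data product" contract a domain team might
# publish to the self-serve platform. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                  # e.g. "finance.invoices"
    owner: str                 # the domain team accountable for it
    output_port: str           # where consumers read it (table, topic, API)
    schema: dict               # column name -> type, the published contract
    freshness_sla_hours: int   # federated governance sets the floor
    pii_columns: list = field(default_factory=list)

def platform_checks(p: DataProduct) -> list[str]:
    """Thin platform layer: enforce the federated rules, nothing more."""
    issues = []
    if not p.owner:
        issues.append("every product needs an accountable owner")
    if p.freshness_sla_hours > 24:
        issues.append("governance caps freshness SLAs at 24h")
    if p.pii_columns and not p.name.startswith(("finance.", "hr.")):
        issues.append("PII allowed only in approved domains")
    return issues

invoices = DataProduct(
    name="finance.invoices", owner="finance-data",
    output_port="s3://mesh/finance/invoices",  # hypothetical location
    schema={"invoice_id": "string", "amount": "decimal"},
    freshness_sla_hours=6,
)
assert platform_checks(invoices) == []
```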

The tradeoff: Higher total headcount. Each domain needs data talent embedded. Implementation takes 12-24 months for full enterprise rollout. Gartner’s 2024 Hype Cycle placed mesh in the “Trough of Disillusionment” and questioned whether it would ever reach mainstream adoption.

Adoption: Still niche. Only 18% of organizations had the governance maturity to attempt it as of 2021, and combined mesh/fabric influence reached just 18% of data programs by 2024. The pattern makes sense for companies with genuinely distinct domains. For everyone else, the coordination overhead may exceed the bottleneck it solves.

Data fabric is Gartner’s answer to mesh. Instead of reorganizing people, automate the problem away. A fabric uses active metadata, knowledge graphs, and ML to discover data across disparate sources and automate integration. The data stays where it is. The fabric virtualizes access.

Figure 5: Data fabric uses active metadata and automation to discover, connect, and virtualize access to data spread across many systems.

The economics: Labor is the most expensive part of the data stack. Schema mapping, pipeline building, and lineage tracking. All high-cost, low-value work. Automate it.

The tradeoff: Query performance. Virtualizing across sources means federated queries, which are slower than local ones. You also need the governance (metadata layer) to actually work, which requires maturity most organizations don’t have yet.
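To make “the data stays where it is” concrete, here is a minimal sketch of a federated query using DuckDB’s postgres extension; connection details, paths, and table names are illustrative. Every Postgres row crosses the network at query time, which is exactly the performance tradeoff above.

```python
# A minimal sketch of virtualized access, assuming DuckDB's postgres
# extension; connection string and table names are illustrative.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Leave the data where it lives: an operational Postgres and a file lake.
con.execute(
    "ATTACH 'dbname=crm host=10.0.0.5 user=reader' AS crm (TYPE postgres)"
)

# One federated query across both sources -- convenient, but rows are
# pulled over the network at query time, which is why fabrics trade
# speed for the integration labor they save.
result = con.execute("""
    SELECT c.segment, SUM(o.amount) AS revenue
    FROM crm.public.customers AS c
    JOIN read_parquet('s3://lake/orders/*.parquet') AS o
      ON o.customer_id = c.id
    GROUP BY c.segment
""").fetchall()
```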

Adoption: According to a 2025 study, more than 35% of companies were researching or considering data fabric. Gartner claims properly implemented fabrics can reduce data management effort by 50%. The market is projected to grow from $3.25B (2024) to $19.54B (2033). Vendors like Denodo and the cloud-native catalog tools are chasing this.

The composable stack is the MDS philosophy refined. Instead of just “best of breed,” it emphasizes open standards and swappable components. Metrics defined once in a semantic layer and served to any tool. Open table formats so you’re not locked to one warehouse. APIs everywhere.

Figure 6: A composable (headless) stack defines metrics once in a semantic layer and serves interchangeable tools through open formats and APIs.

The economics: Avoid lock-in. Keep optionality. Reduce the “data copy tax” where every downstream tool needs its own replica.
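The “define once, serve anywhere” mechanics look roughly like this. The sketch below is a hypothetical toy registry, not any specific product’s API (dbt’s semantic layer and Cube expose similar concepts).

```python
# A hypothetical sketch of a headless semantic layer: one canonical
# metric definition, compiled on demand for any downstream tool.
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str     # the single, canonical definition
    grain: str   # the dimension it is sliced by

REGISTRY = {
    "revenue": Metric("revenue", "SUM(order_amount)", "order_date"),
}

def compile_query(metric_name: str, table: str) -> str:
    """Every consumer (BI tool, notebook, API) calls this instead of
    hand-writing its own copy of the metric."""
    m = REGISTRY[metric_name]
    return (
        f"SELECT {m.grain}, {m.sql} AS {m.name} "
        f"FROM {table} GROUP BY {m.grain}"
    )

# A dashboard and a notebook now get byte-identical logic:
print(compile_query("revenue", "analytics.orders"))
```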

The tradeoff: The integration tax is still real. The 40% who cite integration as their highest cost are often running composable stacks. The theory is cleaner than the practice.

Adoption: Hard to measure because it’s more mindset than product. The growth of open table formats (Iceberg adoption, Parquet standardization) suggests the principles are spreading even if the label isn’t.

In a streaming-first architecture, events are the backbone. Instead of batch ETL that runs overnight, data flows continuously through a stream (Kafka, typically). Processing happens in real time. The lake or warehouse becomes a consumer of the stream, not the source of truth.

Figure 7: In a streaming-first architecture, event pipelines continuously feed real-time processing, operational systems, and downstream analytics.

The economics: Data has time value. Batch processing creates latency between when something happens and when you can respond. For operational use cases, and especially for AI agents that need current context, that latency is unacceptable.
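Mechanically, the pattern looks like the sketch below, using the kafka-python package; the broker address, topic, and sink are illustrative. The point is the shape: producers append events as they happen, and analytics becomes just another consumer of the log.

```python
# A minimal sketch of the stream-as-backbone idea, assuming the
# kafka-python package and an illustrative broker and topic.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producers append business events the moment they happen...
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("orders", {"order_id": 42, "amount": 99.5})
producer.flush()

# ...and the warehouse or lake is just one more consumer of the log,
# reacting within seconds instead of waiting for the nightly batch.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda b: json.loads(b),
)
for event in consumer:  # blocks, consuming the stream indefinitely
    print(event.value)  # in a real pipeline, write to the lakehouse sink here
```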

The tradeoff: Operational complexity. Streaming systems are harder to debug, harder to reason about, and require different skills than batch. Not every workload needs real time.

Adoption: 72% of organizations now use data streaming for mission-critical systems, according to Confluent’s 2023 survey of 2,250 IT leaders. Kafka runs at over 80% of the Fortune 100. The infrastructure is mainstream. The question is how central it is to your architecture versus a bolt-on for specific use cases.

I deliberately did not use the term “AI” here because I wanted to name the underlying technology, not the marketing nomenclature.

The newest pattern, and the least settled. LLM-ready architectures add a retrieval and orchestration layer on top of existing data platforms to serve large language models at inference time. The core problem: An LLM needs the right context, from the right sources, in the right format, fast enough to be useful. That means embedding pipelines, vector indexes, prompt orchestration, and evaluation frameworks layered over whatever warehouse or lakehouse you already run.

Figure 8: An LLM-ready stack layers embeddings, vector search, and orchestration on top of existing data platforms to deliver relevant context to large language models.

This is not a replacement architecture. It is an additional cost layer. Your warehouse still serves BI. Your lakehouse still runs batch analytics. The LLM stack sits alongside them, consuming their data through retrieval pipelines. The infrastructure question is not “rebuild for LLMs” but “what do I add, and what does that addition cost?”
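The retrieval layer’s job can be shown in a few lines. The sketch below uses a stand-in embed() function; a real system would call an embedding model or a platform feature such as pgvector, Snowflake Cortex, or Databricks Vector Search.

```python
# A minimal sketch of the retrieval layer. embed() is a stand-in:
# real systems call an embedding model, not a seeded random vector.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call, here just a deterministic stub."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# 1. Index: chunk enterprise documents and store their vectors.
chunks = ["Q3 revenue was $4.2M", "Refund policy: 30 days", "On-call rota"]
index = np.stack([embed(c) for c in chunks])

# 2. Retrieve: find the chunks closest to the user's question.
question = "What was revenue last quarter?"
scores = index @ embed(question)           # cosine similarity (unit vectors)
top = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 3. Orchestrate: assemble the context the LLM sees at inference time.
prompt = f"Context:\n{chr(10).join(top)}\n\nQuestion: {question}"
```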

The economics: LLMs are becoming the primary interface between people and enterprise data. Enterprise generative AI spending tripled from $11.5B to $37B in a single year. The organizations that can feed their models accurate, current context will outperform those that cannot. The retrieval layer is where that advantage is built.

The tradeoff: Maturity and cost transparency. RAG dominates at 51% adoption among enterprises deploying LLMs, up from 31% the prior year. But the patterns are still evolving rapidly. Prompt engineering remains the most common customization technique. Fine-tuning is rare. Only 16% of enterprise deployments qualify as true agents. Most production systems are simpler than the marketing suggests. Best practices for chunking, retrieval, evaluation, and orchestration shift quarterly. You are building on ground that has not finished moving.

Adoption: Early but accelerating. The infrastructure layer (storage, retrieval, orchestration connecting LLMs to enterprise systems) reached $1.5B in spend in 2025, while model API spending doubled to $8.4B. Every major platform is adding retrieval capabilities natively: Snowflake Cortex, Databricks Vector Search, Postgres with pgvector. The dedicated vector database market (Pinecone, Weaviate, Milvus) is growing, but the bigger story is existing platforms absorbing the functionality. The “LLM-ready” layer is being folded into incumbent ecosystems, not replacing them.

Here’s the uncomfortable truth: These aren’t seven independent choices.

The market is converging. Databricks and Snowflake both adopted open table formats. Both added streaming capabilities. Both are building AI features. Microsoft Fabric bundles warehouse, lake, streaming, and BI into one offering. IBM buying Confluent signals streaming folding into enterprise platforms.

The “architecture” question is increasingly becoming “which ecosystem?” The patterns matter less than understanding that we’re heading toward three to four major platform ecosystems, with open formats providing a (limited) escape valve for portability.

What differs is the economic model underneath: who bears the coordination cost, where lock-in lives, what skills you need, and who captures value.

| Architecture | Economic Thesis | Key Tradeoff | Choose If… |
| --- | --- | --- | --- |
| Warehouse | Centralization reduces coordination | Vendor controls economics | BI-first, want simplicity, can absorb vendor pricing |
| Lakehouse | Eliminate ETL tax, commoditize storage | Operational complexity | ML workloads, engineering capacity to optimize |
| Mesh | Distribute ownership, flatten cost curve | Higher headcount, long implementation | Large org, distinct domains, 12+ month runway |
| Fabric | Automate integration labor | Query performance, metadata maturity | Data sprawl problem, strong metadata foundation |
| Composable | Avoid lock-in, maintain optionality | Integration tax still real | Multi-cloud mandate, strategic fear of lock-in |
| Streaming | Time value of data | Operational complexity | Operational use cases, real-time requirements |
| LLM-Ready | Optimize for what’s next | Immature, shifting ground | GenAI central to product, greenfield build |

FAQ

How does a cloud data warehouse differ from a lakehouse?

A cloud data warehouse centralizes storage and compute under one vendor, which simplifies operations but concentrates pricing power and lock‑in. A lakehouse stores data once in open formats and lets you choose compute engines, which can lower storage and ETL costs but demands more engineering effort to run well.

When does data mesh make sense, and when is data fabric the better fit?

Data mesh makes sense when you have genuinely distinct domains (like finance, product, and marketing) that each need their own embedded data teams and can support the extra headcount and governance. Data fabric fits better when your main problem is data sprawled across systems and you want to automate discovery and integration without reorganizing your org chart.

Do LLM‑ready architectures replace the warehouse or lakehouse?

LLM‑ready patterns don’t replace your warehouse or lakehouse; they sit beside them, adding retrieval, embeddings, vector search, and orchestration so large language models can pull the right data at inference time. Think of it as an additional cost layer optimized for AI interfaces, not a whole new data platform.

How should an organization choose among these architectures?

Start from your constraints: team size and skills, regulatory and governance requirements, tolerance for vendor lock‑in, and how quickly you need to move. Then map those to the economic thesis of each pattern—centralization vs. distribution, automation vs. labor, real‑time vs. batch—and pick the smallest architecture that actually solves your current problems.