Open Data Infrastructure - To Infinity and Beyond!
The Evolution of Data Systems In the Modern AI Era
Your data is yours. Or is it? For most of data platform history, that hasn’t been true in practical terms. Your data lived in proprietary formats that only one vendor could read, accessed through connectors that only one vendor controlled, queried by engines that only one vendor operated. Moving it cost a small fortune in egress fees. Landing it somewhere else meant a migration project measured in quarters, if not years (plus the extravagant water bill from crying in the shower…).
That’s changing. Data infrastructure has gradually become more open, flexible, and interoperable. What started as a series of pragmatic engineering trade-offs has crystallized into a core design principle: open data infrastructure (ODI). The idea of ODI is that you maintain access and control over your data through open standards, whether that’s APIs for ingest, open table formats for storage, or interoperable engines for consumption. No vendor lock-in.
This principle didn’t arrive out of nowhere. It was enabled incrementally, through phases of architectural innovation. The result is an architecture with four essential properties: flexibility, interoperability, robustness, and most importantly, ownership. ODI turns out to be exactly what the AI era demands.
The Monolith
The relational database was a single, tightly integrated machine, and for good reason. ACID transactions require sub-millisecond coordination between the buffer pool, lock manager, write-ahead log (WAL), and storage engine. Under WAL/ARIES-style recovery, the log has to be forced to stable storage at commit before the transaction gets acknowledged. Adding even modest network latency to that path degrades throughput substantially. Tight coupling wasn’t a design preference. It was a physical constraint imposed by the mechanics of durable writes. Databases didn’t talk with each other.
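The commit path described above can be sketched in a few lines. This is a toy illustration of the "force the log before acknowledging" rule, not any real database's implementation; the class and method names are made up for the example:

```python
import os
import tempfile

class MiniWAL:
    """Toy write-ahead log: a commit is acknowledged only after its
    log record has been forced (fsync'd) to stable storage."""
    def __init__(self, path):
        self.f = open(path, "ab")

    def commit(self, txn_id, payload):
        record = f"{txn_id}:{payload}\n".encode()
        self.f.write(record)        # buffered write
        self.f.flush()              # push to the OS
        os.fsync(self.f.fileno())   # force to disk: the latency-critical step
        return "ACK"                # only now may the client see "committed"

fd, path = tempfile.mkstemp()
os.close(fd)
wal = MiniWAL(path)
print(wal.commit(1, "UPDATE accounts ..."))  # ACK, after the fsync
```

That `os.fsync` is the physical constraint in code: every commit waits on durable media, which is why adding network hops to this path is so costly.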
Within this model, two storage engine families emerged. B+ trees organize data in sorted, balanced structures optimized for point lookups, completing reads in one to three disk accesses when internal nodes are cache-resident. LSM-trees invert this trade-off by converting random writes into sequential I/O through in-memory memtables and immutable sorted string tables, accepting read amplification (mitigated by Bloom filters) for superior sustained write throughput.
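The LSM trade-off is easy to see in miniature. Here is a hedged toy sketch (no compaction, no Bloom filters, invented class names) showing the core idea: writes buffer in a sorted memtable, flushes produce immutable sorted runs, and reads may have to consult several runs:

```python
import bisect

class MiniLSM:
    """Toy LSM tree: random writes land in an in-memory memtable, which
    flushes as an immutable sorted run (an 'SSTable'). Reads check the
    memtable first, then runs newest-to-oldest (read amplification)."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []            # list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # sequential flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = MiniLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"))  # 3
```

Every `put` is cheap and sequential; every `get` of a cold key pays for it by scanning runs, which is exactly the read amplification Bloom filters exist to mitigate.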
Both power fast transactions. Neither was designed for analytics. Scanning a row-oriented table for a single column still reads entire rows and discards most of what it fetched. Compression is modest because pages mix heterogeneous types. Horizontal scaling via sharding forces application-level partitioning, makes cross-shard joins painful, and introduces two-phase commit overhead.
At this point, storage, compute, metadata, and data were one unit. Getting data in meant the vendor’s connectors. Getting data out meant the vendor’s query interface. Every layer was locked together.
Decoupling Storage from Compute
Cloud data warehouses made two bets that transformed analytics: columnar storage (hooray!) and disaggregated compute (double hooray!!). (Earlier data warehouses didn’t have all of these characteristics.) Storing column values contiguously (one of my favorite words that I learned from Matt Topol - shoutout Columnar) enables dictionary, run-length, and delta encoding, routinely hitting 5-10x compression on typical analytical datasets. Column pruning means a query touching 3 columns out of 50 reads roughly 94% less data. Vectorized execution processes column batches using CPU SIMD instructions, enabling throughput impossible in row-at-a-time models.
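Run-length encoding shows why contiguous column values compress so well. A toy sketch (real formats combine this with dictionary and delta encoding, and operate on binary pages, not Python lists):

```python
def run_length_encode(values):
    """Collapse runs of repeated adjacent values into (value, count)
    pairs -- one of the encodings behind columnar compression ratios."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A low-cardinality column (e.g. country codes) stored contiguously:
column = ["US"] * 500 + ["DE"] * 300 + ["US"] * 200
encoded = run_length_encode(column)
print(encoded)                          # [['US', 500], ['DE', 300], ['US', 200]]
print(len(column), "values ->", len(encoded), "runs")
```

Interleave those same values row-by-row with 49 other heterogeneous columns and the runs disappear, which is why row-oriented pages compress so modestly.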
Snowflake’s architecture, described in its 2016 SIGMOD paper, became the template: data in cloud object storage as immutable micro-partitions in a proprietary columnar format, virtual warehouses for elastic compute, and a cloud services layer for metadata and query optimization. Google BigQuery pushed further with a fully serverless model on Google’s Colossus file system and proprietary Capacitor format.
This was a genuine leap. Organizations could scale storage and compute independently, paying for compute only when queries ran. But your data still wasn’t yours. Snowflake’s micro-partitions, BigQuery’s Capacitor files, Redshift’s internal format: none of them are directly readable by external engines. You couldn’t bring your own compute to your own data without the vendor’s permission. Egress costs made portability economically impractical. And the source side was just as constrained. Ingesting data meant the vendor’s proprietary connectors and loading tools.
Cloud data warehouses solved analytics at cloud scale. They didn’t solve ownership.
Decoupling Data from Storage and Compute
The data lake flipped the trade-off. Instead of coupling data to a proprietary format, data lakes store everything in open file formats on commodity object storage, and let any engine read it.
Apache Parquet became the dominant analytical format. Its hierarchical structure (row groups containing column chunks containing pages) enables column pruning and predicate pushdown. A query engine reads only the footer, checks per-row-group statistics against filter predicates, and fetches just the needed column chunks from relevant row groups. The reduction in data read versus row formats is often an order of magnitude or more. The compute layer became a menu: Spark for batch ETL, Trino for interactive SQL, Athena for serverless pay-per-query, all reading the same Parquet files.
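The row-group pruning mechanic is worth seeing concretely. A minimal sketch of the footer-statistics idea, using plain dicts rather than Parquet’s actual binary metadata:

```python
# Each "row group" carries min/max stats for a column,
# as recorded in a Parquet footer.
row_groups = [
    {"min": 0,   "max": 99,  "chunk": "chunk-0"},
    {"min": 100, "max": 199, "chunk": "chunk-1"},
    {"min": 200, "max": 299, "chunk": "chunk-2"},
]

def prune(groups, lo, hi):
    """Keep only row groups whose [min, max] range could satisfy the
    predicate lo <= value <= hi; the rest are skipped without any I/O."""
    return [g for g in groups if g["max"] >= lo and g["min"] <= hi]

survivors = prune(row_groups, 150, 160)
print(survivors)   # only chunk-1 survives; two-thirds of the file is never read
```

Scale the same check to thousands of row groups across thousands of files and you get the order-of-magnitude read reduction versus row formats.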
For the first time, data ownership shifted to the data owner. I’ll write that again because it’s important. For the first time, data ownership shifted to the data owner. Multiple engines could query the same files. Flexibility meant choosing your format, your engine, your storage.
But lakes lacked robustness and guarantees. An ETL job that crashes halfway through writing 1,000 Parquet files leaves 500 orphans with no rollback. Concurrent writers produce inconsistent results. S3 doesn’t support atomic directory renames. There’s no schema enforcement, so source systems silently change column types and files accumulate incompatible schemas. Streaming micro-batches create millions of tiny files, degrading query planning to minutes of file listing before a single byte gets read. The industry called it a “data swamp.” I like “data bog”, but I don’t make the rules.
The source side was still fragile, too. Getting data into the lake meant custom connectors or brittle ETL scripts for every source system, each with its own authentication model, rate limits, and schema quirks. Open formats freed storage, but without standardized APIs for source access, teams burned enormous effort just wiring up ingestion.
Warehouses had guarantees without ownership. Lakes had ownership without guarantees. Neither alone was enough. (cue Shark Tank music…) There has to be a better way!
Decoupling Metadata
The data lakehouse resolves this conundrum by decoupling metadata from everything else. The metadata layer tracks which files constitute a table, what schema they conform to, and how concurrent operations are serialized. You don’t replace Parquet, S3, or Spark. You add a thin, open metadata layer that delivers ACID transactions, schema enforcement, and time travel on top of existing lake infrastructure.
This is where the principle becomes architecturally complete. All four properties are simultaneously achievable: ownership (your data stays in your object storage in open formats), flexibility (swap any compute engine), interoperability (metadata is readable by all conforming engines), and robustness (ACID guarantees, schema enforcement, snapshot isolation).
All three major open table formats (Apache Iceberg, Delta Lake, and Apache Hudi) provide ACID transactions on object storage through optimistic concurrency control with atomic metadata commits. Writers stage data files without locks, then attempt an atomic metadata swap at commit. If a conflicting change landed first, the operation fails gracefully rather than corrupting the table. Readers always see a consistent, point-in-time snapshot.
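The optimistic commit protocol fits in a few lines. A toy sketch with invented names; the lock below stands in for whatever atomic compare-and-swap the catalog or storage layer provides:

```python
import threading

class MiniTable:
    """Toy optimistic-concurrency table: writers stage files freely, then
    atomically swap the metadata pointer only if nobody else committed
    first -- the compare-and-swap pattern used by open table formats."""
    def __init__(self):
        self.version = 0
        self.files = []
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def commit(self, expected_version, new_files):
        with self._lock:
            if self.version != expected_version:
                return False   # conflicting commit landed first: fail cleanly
            self.files = self.files + new_files   # new immutable file list
            self.version += 1
            return True

t = MiniTable()
snapshot = t.version                          # both writers read version 0
ok_first = t.commit(snapshot, ["a.parquet"])  # first writer wins
ok_second = t.commit(snapshot, ["b.parquet"]) # second sees a stale version
print(ok_first, ok_second)                    # True False -- retry, don't corrupt
```

The losing writer simply re-reads the current version and retries; at no point can a reader observe a half-applied commit.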
Apache Iceberg has achieved the broadest cross-vendor adoption by 2026 (that’s right, I said it). Snowflake donated its Polaris metadata catalog to the Apache Software Foundation. AWS launched S3 Tables with built-in Iceberg support. Google shipped BigLake Iceberg tables. Databricks acquired Tabular (founded by Iceberg’s creators) and supports Iceberg alongside Delta Lake through UniForm and Unity Catalog. The pattern matters more than the specific format: the industry is converging on open metadata because the principle demands it.
Iceberg’s architecture shows why metadata decoupling works so well. Its hierarchical metadata tree lets engines prune billions of files via min/max statistics in milliseconds, completely bypassing the expensive list operations that crippled data lakes. Each write produces an immutable snapshot. Hidden partitioning derives partition values from data columns via transforms, eliminating user-specified partition columns in queries. Partition evolution lets you change schemes without rewriting data. Schema evolution is metadata-only, since columns are tracked by unique integer field IDs rather than names, making renames transparent with zero data rewriting.
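Hidden partitioning is easier to grasp with a transform in hand. A hedged sketch of two transform shapes (a `day` transform and a hash `bucket` transform); these are illustrative functions, not Iceberg’s implementation, which specifies its own stable hash:

```python
import zlib
from datetime import date

def day_transform(ts: date) -> str:
    """Derive a partition value from a data column, so queries filter on
    the source column and never reference partition columns directly."""
    return ts.isoformat()

def bucket_transform(value: str, n: int = 4) -> int:
    """Hash-bucket transform: spread high-cardinality values across n
    buckets. crc32 is used here only because it's deterministic."""
    return zlib.crc32(value.encode()) % n

# Writers derive partitions at write time; readers just filter on event_ts
# or user_id and the engine maps predicates onto partitions automatically.
print(day_transform(date(2026, 1, 15)))   # '2026-01-15'
```

Because the transform lives in metadata, changing it later (partition evolution) touches no data files: new writes use the new scheme, and old files keep their old one.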
The battle line has moved to the catalog layer. The Iceberg REST Catalog specification, an OpenAPI-based HTTP interface, turns catalog integration from an N×M problem into an N+M problem. Apache Polaris, Project Nessie, Lakekeeper, and AWS Glue all implement it. The architecture is federated: organizations connect multiple catalogs via a standard REST API, choosing each based on governance or operational needs. The open spec keeps control with the data owner.
Open Standards at Every Layer
Unbundling has extended into memory (Arrow), transformation (dbt), and source connectivity (APIs).
Apache Arrow is a great example. Arrow defines a language-independent columnar memory format that eliminates serialization overhead between engines. When a Python app needs data from a C++ engine, systems pass a memory pointer instead of serializing the payload. Arrow Flight SQL extends this to the network layer, with published benchmarks showing 20x+ throughput improvements over ODBC (honestly, treat yourself to some ADBC) in evaluated workloads (though the multiplier varies with data types and conditions). Substrait provides a cross-language serialization format for relational algebra, letting query plans compile once and run on any conforming engine.
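The zero-copy idea can be demonstrated with the standard library alone. A toy contrast (Arrow itself adds a shared columnar layout and cross-language bindings on top of this basic mechanism):

```python
import array

# A contiguous columnar buffer owned by one "engine":
column = array.array("q", range(1_000_000))   # 8-byte ints, ~8 MB contiguous

# Serialization-style hand-off: allocate and copy every byte.
copied = bytes(column)                # a full ~8 MB copy

# Arrow-style hand-off: share a view of the same memory, zero copies.
view = memoryview(column)             # no allocation, no copy
print(view[42], len(view))            # reads go straight to the original buffer
```

Multiply that avoided copy by every hop between engine, dataframe library, and ML framework, and the throughput gap over row-oriented serialization protocols like ODBC follows directly.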
On the source side, the same philosophy is finally taking hold. Standardized APIs and open connector protocols mean teams aren’t writing bespoke integrations for every SaaS platform, database, or event stream. Even MCP has emerged as an open standard. Adding a new data feed becomes a configuration, not a bespoke (fancy word for hard and expensive) engineering project. The same interoperability and vendor independence that open table formats brought to storage also applies to ingestion.
Data transformation has moved towards modular, open standards. Tools like dbt (shoutout dbt Labs) apply software engineering discipline to the pipelines to turn raw data into trusted assets. Because storage is open and compute is modular, teams apply version control, automated testing, and dependency management to their SQL transformations. Multiple dbt adapters support Iceberg materialization across Spark, Trino, Athena, and Databricks, enabling Medallion Architectures (if you’re into that sort of thing) where each layer benefits from the table format’s ACID guarantees. dbt’s data contracts enforce structural guarantees through preflight validation: column names, types, and constraints must match the declared schema or the build fails.
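The preflight idea behind data contracts is simple to sketch. This is a generic illustration of contract checking, not dbt’s actual code; the schemas and names are invented for the example:

```python
# Declared contract: column name -> expected type.
contract = {"order_id": "int", "amount": "float", "status": "str"}

def validate_contract(schema: dict, contract: dict) -> list:
    """Compare a model's actual columns/types against its declared
    contract and return violations; a build fails if any exist."""
    errors = []
    for col, typ in contract.items():
        if col not in schema:
            errors.append(f"missing column: {col}")
        elif schema[col] != typ:
            errors.append(f"type mismatch on {col}: {schema[col]} != {typ}")
    for col in schema:
        if col not in contract:
            errors.append(f"undeclared column: {col}")
    return errors

built = {"order_id": "int", "amount": "str"}   # amount drifted, status missing
violations = validate_contract(built, contract)
print(violations)   # two violations -> the build fails before anything ships
```

The point is the direction of enforcement: downstream consumers depend on the declared shape, so drift is caught at build time instead of in a broken dashboard or a silently skewed feature.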
What matters isn’t any specific tool. It’s the capability: open source access, open storage, open metadata, open interchange, governed transformation. Any tool that delivers these while preserving flexibility, interoperability, and robustness is contributing to the principle. Anything that locks a layer behind a proprietary interface is working against it.
Why This Matters for AI
Here’s where thirty years of decoupling meets the most consequential technology shift of the decade (again, fight me). The open, composable architecture that emerged from this evolution is precisely what AI systems need for batch training, feature engineering, and offline evaluation. That’s not coincidence. It’s structural.
Ownership and direct access. ML training streams millions of micro-batches per epoch at sustained throughput, a pattern SQL-optimized query engines were never designed for. Proprietary formats force extraction through JDBC or similar interfaces that can’t handle petabyte-scale movement. With open data infrastructure, ML clusters read Parquet files directly from S3 into Arrow memory arrays, bypassing database overhead entirely. PyIceberg gives native Python access to Iceberg tables without a JVM. Ray Data offers distributed ML preprocessing with snapshot pinning. Delta Lake and Hudi provide analogous (big word so you think I’m smart) patterns.
The key: open metadata makes data reachable without gatekeepers, and open APIs make sources reachable without vendor-specific connectors.
Robustness and reproducibility. Snapshot-based time travel addresses AI’s most critical data requirement: reproducibility. Every write creates an immutable snapshot with a unique ID and timestamp. Tag a snapshot before training, pin reads to that tag in the training script, log the snapshot ID with model metrics. Months later, reproduce the exact same run. Unlike manual dataset versioning, which requires full copies, snapshot tags are metadata references that don’t duplicate data files. Feature stores like Feast already support open table formats as offline sources, bringing transactional writes and snapshot-based reads into feature engineering workflows.
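The tag-and-pin workflow can be sketched end to end. A toy model of snapshot tagging (PyIceberg and the table formats expose the real thing; every name below is illustrative):

```python
class SnapshotLog:
    """Toy immutable-snapshot table: every write yields a new snapshot
    ID, and tags are metadata-only pointers to a chosen snapshot."""
    def __init__(self):
        self.snapshots = {}   # snapshot_id -> list of data files
        self.tags = {}        # tag name -> snapshot_id
        self._next = 0

    def write(self, files):
        self._next += 1
        prev = self.snapshots.get(self._next - 1, [])
        self.snapshots[self._next] = prev + files   # new snapshot per write
        return self._next

    def tag(self, name, snapshot_id):
        self.tags[name] = snapshot_id               # no data files duplicated

    def read(self, tag):
        return self.snapshots[self.tags[tag]]       # pinned, point-in-time view

log = SnapshotLog()
sid = log.write(["part-0.parquet"])
log.tag("model-v1-training", sid)      # log this ID with the model's metrics
log.write(["part-1.parquet"])          # later writes don't disturb the tag
print(log.read("model-v1-training"))   # ['part-0.parquet'], months later too
```

The training script reads through the tag, the experiment tracker records the snapshot ID, and reproducing the run is a lookup rather than a restore from a full dataset copy.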
Interoperability across the toolchain. AI pipelines span more tools than traditional BI ever did: orchestrators, feature stores, training frameworks, experiment trackers, evaluation harnesses, serving systems. A proprietary storage layer that forces every tool through one vendor’s API is a bottleneck. Open infrastructure lets every tool access data directly through the same Parquet files, the same metadata catalog, the same Arrow interchange format.
Flexibility to evolve. The AI landscape moves faster than any technology domain in recent memory. Today’s dominant training framework might be gone tomorrow. The SaaS tools your org relies on might be acquired or deprecated next year. An infrastructure built on open standards can swap any component, compute engine, training framework, catalog, or source connector, without migrating data or rewriting pipelines. That’s the difference between adopting new AI capabilities in weeks and being locked into a migration that takes quarters.
The trust layer completes it. Governed transformation pipelines provide the contracts, quality tests, and lineage required to trust feature data feeding models. Open table formats deliver engine-agnostic storage with guaranteed consistency. The tools vary. The equation holds: open standards for access plus open formats for storage plus governed transformation equals trustworthy AI data.
The Principle Endures
Open is a design principle. Your data is yours. Access it through open APIs. Store it in open formats. Govern it with open metadata. Transform it with code you control. Consume it with any engine you choose.
The tools that best express this today (Iceberg, dbt, Arrow, open connector protocols) will evolve. New ones will emerge. The principle endures because its value is structural. Flexibility, interoperability, and robustness aren’t features on a vendor’s roadmap. They’re properties of the infrastructure design itself.
AI requires access, flexibility, and interoperability. Consumers require ownership without lock-in.
This is the moment when those needs have finally converged: Open Data Infrastructure.
Andrew Madson - Head of Developer Relations, Fivetran
Ready to get started with Iceberg without the brainfreeze? Check out Fivetran Managed Data Lakes!


