Iceberg File Format API Release
What It Means for Data Engineers!
Reference: Introducing the Apache Iceberg File Format API, published February 20, 2026 by the Apache Iceberg PMC.
Iceberg has always known about file formats. It tracks Parquet, Avro, and ORC files inside snapshots and manifests. But the code that actually reads and writes those files? That was scattered across engine-specific integrations with no shared contract. Every engine did its own thing.
The File Format API, shipping in Iceberg 1.11.0, fixes that. It’s a registry-based plugin system that decouples file format implementations from both Iceberg’s core and its engine integrations (Spark, Flink, Arrow). New columnar formats can now plug into Iceberg without touching engine code.
This post covers what the API is, why it was built, how the architecture works, what it means for Vortex, and how you can start using it today. There’s also an honest comparison of three deployment scenarios so you can figure out what makes sense for your roadmap.
Why a new abstraction was needed
Iceberg has supported Parquet, Avro, and ORC since its early days, but format handling grew organically into a structure with four well-known problems.
Fragmented and duplicated logic. Each engine (Spark, Flink, and the generic Java data module) maintained its own format-specific readers, writers, and feature handling. No shared interface defined what a “format implementation” needs to provide. Adding a new format meant touching iceberg-core, iceberg-spark, iceberg-flink, and the data module independently.
Large branching code paths. Multi-format support lived in switch statements and conditional branches. Extending it was difficult, and inconsistencies easily crept in across engines.
Uneven feature support. Capabilities like projection pushdown, filter pushdown, and delete file handling required custom per-format, per-engine work. Some features worked for Parquet in Spark but not in Flink, or for ORC but not Avro. Closing those gaps was slow because each combination needed its own implementation.
Accelerating format innovation. Newer formats like Vortex and Lance bring adaptive encodings, integrated indexes, GPU-optimized layouts, and file structures that don’t fit traditional row-group designs. None of these could slot in cleanly under the old architecture without significant per-engine investment.
The bottom line: integrating a new file format with Iceberg meant a multi-engine, multi-module effort with no standardized contract. The File Format API provides that contract.
Architecture: FormatModel, FormatModelRegistry, and builders
The API introduces three core concepts that work together as a pluggable, registry-based system.
FormatModel<D, S>
FormatModel<D, S> is the central interface every file format must implement. It’s parameterized on two types:
D, the data record type (e.g., Record for generic Java, Spark’s InternalRow, Flink’s RowData)
S, the schema type used by that object model
A FormatModel implementation provides five things:
format(), returns the FileFormat enum value (e.g., PARQUET, ORC, AVRO)
type(), returns the data record class
schemaType(), returns the schema class
writeBuilder(EncryptedOutputFile), returns a builder for producing file appenders
readBuilder(InputFile), returns a builder for consuming data files
Concrete implementations exist for all three current formats: ParquetFormatModel, AvroFormatModel, and ORCFormatModel, plus an ArrowFormatModel for vectorized batch reading.
FormatModelRegistry
FormatModelRegistry is a static registry that stores FormatModel instances keyed by (FileFormat, Class<?>) pairs, the combination of file format and object model type. Engines interact exclusively with the registry rather than with format-specific code directly.
Registration happens through dynamic class loading at static initialization time, with graceful degradation when a format’s classes aren’t on the classpath. These are the auto-registration targets:
org.apache.iceberg.data.GenericFormatModels
org.apache.iceberg.arrow.vectorized.ArrowFormatModels
org.apache.iceberg.flink.data.FlinkFormatModels
org.apache.iceberg.spark.source.SparkFormatModels
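The "load if present, skip if absent" pattern behind this can be sketched with plain reflection. This is a simplified stand-in for what static auto-registration does, not Iceberg's actual loader code; the missing class name below is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of registration-by-classloading with graceful degradation:
// try to load each well-known registration class, and silently skip any
// whose module isn't on the classpath.
public class OptionalRegistration {
  private static final List<String> LOADED = new ArrayList<>();

  // Attempt to load and initialize a class by name. Returns true if the
  // class was present; false (no exception) if it was not.
  public static boolean tryRegister(String className) {
    try {
      Class.forName(className); // a real target's static initializer would self-register
      LOADED.add(className);
      return true;
    } catch (ClassNotFoundException | NoClassDefFoundError e) {
      return false; // module not on classpath -- degrade gracefully
    }
  }

  public static List<String> loaded() {
    return LOADED;
  }

  public static void main(String[] args) {
    // A class that exists on every JVM...
    tryRegister("java.util.concurrent.ConcurrentHashMap");
    // ...and one that (presumably) does not.
    tryRegister("org.example.missing.NoSuchFormatModels");
    System.out.println(loaded());
  }
}
```

The payoff of this pattern is that adding, say, iceberg-flink to the classpath is enough to light up its format models; nothing else changes.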
All read and write operations route through registry methods:
// Reading
FormatModelRegistry.readBuilder(fileFormat, clazz, inputFile)
// Writing — three distinct paths for different file content types
FormatModelRegistry.dataWriteBuilder(fileFormat, clazz, outputFile)
FormatModelRegistry.equalityDeleteWriteBuilder(fileFormat, clazz, outputFile)
FormatModelRegistry.positionDeleteWriteBuilder(fileFormat, clazz, outputFile)
One important design choice: registrations must be unique per (FileFormat, Class<?>) pair. If you try to register a second FormatModel for the same key, it throws IllegalArgumentException. This prevents ambiguous dispatch.
Read and write builders
ReadBuilder<D, S> supports fluent configuration of schema, filters, projection, split ranges (byte-range or row-range), batch size, container reuse, and name mappings. It ultimately produces a CloseableIterable<D>. An important detail: filter pushdown is best-effort. The builder’s contract explicitly allows format implementations to skip full filter application, meaning the calling engine may need to re-apply them. This is consistent with how Iceberg already works. Avro, for example, ignores pushed-down filters.
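Because pushdown is best-effort, careful callers treat the returned iterable as possibly under-filtered and re-apply the residual predicate. A minimal illustration with plain Java collections (the types and method names here are placeholders, not Iceberg classes):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Minimal illustration of residual filter re-application: the format
// may return rows that do not satisfy the pushed-down filter, so the
// engine wraps the result and filters again.
public class ResidualFilter {
  public static <T> List<T> applyResidual(Iterable<T> scanned, Predicate<T> residual) {
    return StreamSupport.stream(scanned.spliterator(), false)
        .filter(residual)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Pretend the format ignored the pushed-down filter "x > 10"
    // (as Avro does) and returned everything.
    List<Integer> scanned = List.of(3, 12, 7, 42);
    System.out.println(applyResidual(scanned, x -> x > 10)); // [12, 42]
  }
}
```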
The write builders return FileWriterBuilder<W, S> instances that wrap the lower-level WriteBuilder with content-aware writer construction. The type parameter W is one of DataWriter<D>, EqualityDeleteWriter<D>, or PositionDeleteWriter<D>, produced via static factory methods like forDataFile(), forEqualityDelete(), and forPositionDelete().
Here’s the control flow at a high level:
┌─────────────────────┐
│ Query Engine │ (Spark, Flink, generic Java, ...)
│ (reads or writes) │
└────────┬────────────┘
│ requests builder by (FileFormat, object model class)
▼
┌─────────────────────┐
│ FormatModelRegistry │ static registry, keyed by (FileFormat, Class<?>)
└────────┬────────────┘
│ dispatches to registered FormatModel
▼
┌─────────────────────┐
│ FormatModel │ e.g., ParquetFormatModel, AvroFormatModel
│ <D, S> │
└────────┬────────────┘
│ produces ReadBuilder or WriteBuilder
▼
┌─────────────────────┐
│ Format-specific │ readers, writers, appenders
│ implementation │
└────────┬────────────┘
│
▼
Data files / Delete files → committed to Iceberg table metadata
The big shift: engines no longer contain format-specific code. They request a builder from the registry for their object model type and file format, configure it, and execute.
Vortex: the format driving this change
The File Format API is format-agnostic by design, but the primary motivator and validation case is Vortex. It’s an extensible columnar file format created by SpiralDB and donated to the LF AI & Data Foundation (Linux Foundation) as an Incubation-stage project in August 2025. It’s backed by Microsoft, Snowflake, Palantir, and NVIDIA, with Wes McKinney and CMU’s Andy Pavlo on its Technical Steering Committee. Vortex is written in Rust.
Vortex’s core design separates logical types (“DTypes”) from physical layouts and uses a cascading compression system. Instead of applying a single codec per column like Parquet does, Vortex recursively selects optimal encodings per data chunk: Frame-of-Reference for sequential integers, Run-Length Encoding for sparse data, ALP for floating-point, FSST for strings, and FastLanes for SIMD-friendly integer decoding. This enables its headline capability: running filter expressions directly on encoded/compressed storage segments without full decompression.
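To make per-chunk encoding selection concrete, here's a toy selector that picks an encoding for each integer chunk using simple heuristics. The encoding names mirror techniques mentioned above, but the thresholds and logic are invented for this sketch; Vortex's real selection machinery (in Rust) is far more sophisticated:

```java
// Toy illustration of per-chunk encoding selection. The heuristics
// (run-count threshold, strict +1 sequences) are invented for this example.
public class EncodingSelector {
  public enum Encoding { RLE, FRAME_OF_REFERENCE, PLAIN }

  // Pick an encoding for one chunk of integers.
  public static Encoding choose(int[] chunk) {
    if (chunk.length == 0) return Encoding.PLAIN;
    int runs = 1;
    boolean sequential = true;
    for (int i = 1; i < chunk.length; i++) {
      if (chunk[i] != chunk[i - 1]) runs++;
      if (chunk[i] != chunk[i - 1] + 1) sequential = false;
    }
    if (runs * 4 <= chunk.length) return Encoding.RLE;  // long runs of repeats
    if (sequential) return Encoding.FRAME_OF_REFERENCE; // values = base + index
    return Encoding.PLAIN;                              // no clear win
  }

  public static void main(String[] args) {
    System.out.println(choose(new int[] {7, 7, 7, 7, 7, 7, 7, 7})); // RLE
    System.out.println(choose(new int[] {100, 101, 102, 103}));     // FRAME_OF_REFERENCE
    System.out.println(choose(new int[] {9, 2, 14, 3}));            // PLAIN
  }
}
```

The "cascading" part is that a real system applies this kind of decision recursively: an RLE-encoded chunk's run values might themselves be Frame-of-Reference encoded, and so on.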
Reported performance claims against Parquet (from Vortex documentation and independent benchmarks):
| Metric | Vortex vs. Parquet |
| --- | --- |
| Random access reads | ~100x faster (1.5 ms vs. ~200 ms) |
| Scan performance | 10–20x faster |
| Write performance | ~5x faster |
| Compression ratio | Comparable |
The file format has been stable since v0.36.0, with backward-compatible guarantees for all subsequent releases. DuckDB added Vortex as a core extension in January 2026 (requires DuckDB 1.4.2+), and independent TPC-H SF=100 benchmarks showed Vortex 18% faster than Parquet V2 and 35% faster than Parquet V1 by geometric mean.
One caveat worth noting: Vortex’s forward compatibility (reading files written by newer Vortex versions) isn’t implemented yet. The Vortex project documents this explicitly.
Three deployment scenarios compared
To ground this, here are three real scenarios as they exist today. This comparison is based on what’s documented and released, not what’s projected.
Scenario A: Vortex with Iceberg (via the File Format API)
This is the integration tracked as GitHub Issue #15416. SpiralDB and Microsoft Gray Systems Lab built a proof-of-concept (“Vortex on Ice”) presented at the Iceberg Summit in April 2025. The integration uses JNI (Java Native Interface) for the Rust-to-Java bridge, which turned out to be roughly 3x faster than JNA. The team built a vectorized Spark reader and introduced “row-splittability” so Vortex can report desired row-splits in Iceberg metadata (instead of byte-range splits) for Spark parallelism.
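The row-splittability idea is easy to see in miniature: instead of dividing a file by byte offsets, the format reports splits as row ranges, which suit formats whose internal chunks don't align with byte boundaries. A toy planner (the class, record, and target size are all invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of row-range split planning: divide a file's rows
// into [start, end) ranges of at most targetRows each, instead of
// slicing the file by byte offsets.
public class RowSplitPlanner {
  public record RowRange(long startRow, long endRow) {}

  public static List<RowRange> plan(long totalRows, long targetRows) {
    List<RowRange> splits = new ArrayList<>();
    for (long start = 0; start < totalRows; start += targetRows) {
      splits.add(new RowRange(start, Math.min(start + targetRows, totalRows)));
    }
    return splits;
  }

  public static void main(String[] args) {
    // 10 rows in splits of at most 4 -> [0,4), [4,8), [8,10)
    System.out.println(plan(10, 4));
  }
}
```

Each resulting range can then be handed to a separate Spark task, with the format deciding internally how row numbers map to storage segments.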
Proof-of-concept benchmark results reported by the Iceberg blog:
| Benchmark | Engine | Scale | Result |
| --- | --- | --- | --- |
| TPC-H | Spark + Iceberg | SF=100 | ~30% overall speedup; 2–4x on the slowest queries |
| TPC-DS | Spark + Iceberg | SF=1000 | 30% runtime reduction, 20% storage reduction |
Where things stand as of February 2026: The File Format API provides the mechanism for this integration, but the Vortex FormatModel isn’t merged into Iceberg yet. It’s tracked as an open issue. How Vortex file-level statistics map into Iceberg’s DataFile metrics isn’t documented yet either. This is an engineering roadmap item, not a production-ready stack.
Scenario B: Vortex standalone (without Iceberg)
Vortex works as a file format with its own scan API. It includes a self-describing file format with a postscript pointing to dtype, layout, file-level statistics, and footer segments. Language bindings exist for Python, Java, and C/C++.
You get the raw performance characteristics (fast scans, fast random access, compute-on-compressed-data) without Iceberg’s table management. What you lose: snapshot isolation, atomic commits, schema evolution, time travel, multi-engine concurrent access guarantees, and manifest-based partition pruning. Vortex is a file format, not a table format.
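The "postscript at the end of the file" layout can be sketched with a few bytes in memory: the writer appends a fixed-size trailer holding segment offsets, and the reader starts from the end of the file to locate everything else. All sizes and segment names below are invented for this sketch, not Vortex's actual on-disk layout:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Toy self-describing file layout: [data segment][stats segment][postscript].
// The postscript is a fixed-size trailer recording where each segment starts,
// so a reader can begin at the end of the file and seek backwards.
public class PostscriptSketch {
  static final int POSTSCRIPT_SIZE = 16; // two long offsets

  public static byte[] writeFile(byte[] data, byte[] stats) {
    ByteBuffer buf = ByteBuffer.allocate(data.length + stats.length + POSTSCRIPT_SIZE);
    long dataOffset = 0;
    long statsOffset = data.length;
    buf.put(data).put(stats);
    buf.putLong(dataOffset).putLong(statsOffset); // trailer written last
    return buf.array();
  }

  // Read the trailer from the end of the file, then use the recorded
  // offset to slice out the stats segment -- no need to scan the data.
  public static byte[] readStats(byte[] file) {
    ByteBuffer buf = ByteBuffer.wrap(file);
    buf.position(file.length - POSTSCRIPT_SIZE);
    buf.getLong(); // dataOffset, unused here
    long statsOffset = buf.getLong();
    int statsLen = file.length - POSTSCRIPT_SIZE - (int) statsOffset;
    return Arrays.copyOfRange(file, (int) statsOffset, (int) statsOffset + statsLen);
  }

  public static void main(String[] args) {
    byte[] file = writeFile(new byte[] {1, 2, 3, 4}, new byte[] {9, 9});
    System.out.println(Arrays.toString(readStats(file))); // [9, 9]
  }
}
```

This footer-last design is why a reader can fetch file-level statistics with one small read at the tail, a property Parquet's footer shares and Vortex extends with richer segment types.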
Scenario C: Iceberg with Parquet (the status quo)
This is what most production lakehouse deployments run today. Parquet is the default file format in Iceberg, with mature reader/writer implementations across all supported engines, well-documented table properties for tuning (vectorization, compression, split sizes, column metrics), and the deepest ecosystem support. Iceberg provides snapshot isolation, schema evolution, partition evolution, and atomic commits on top of Parquet files.
Side-by-side comparison
| Attribute | Vortex + Iceberg (Scenario A) | Vortex standalone (Scenario B) | Iceberg + Parquet (Scenario C) |
| --- | --- | --- | --- |
| Maturity | Proof-of-concept; not released | File format stable since v0.36.0; no forward compat yet | Production-grade; widely deployed |
| Table management | Full Iceberg semantics (snapshots, manifests, schema evolution) | None; file format only | Full Iceberg semantics |
| Scan performance | PoC shows 30% speedup over Parquet at TPC-DS SF=1000 | Claims 10–20x over Parquet for scan workloads | Mature; well-optimized with vectorized readers |
| Random access | Expected to inherit Vortex characteristics | ~100x faster than Parquet per Vortex docs | Slow (~200 ms per Vortex comparison) |
| Multi-engine support | Intended via File Format API; not available yet | Via Vortex Scan API and language bindings | Spark, Flink, Trino, DuckDB, Athena, BigQuery, and more |
| Transactionality | Would inherit Iceberg ACID semantics | Not applicable; no table-level commit model | Serializable isolation, atomic commits |
| Schema evolution | Would inherit Iceberg schema evolution | Struct DTypes for column schemas; no evolution semantics | Full Iceberg schema evolution |
| Operational readiness | Pre-release; requires building from Iceberg source | Usable today in supported engines (e.g., DuckDB 1.4.2+) | Strongest operational playbook in the lakehouse ecosystem |
Here’s the takeaway: Scenario C is the lowest-risk production path today. Scenario B is compelling for workloads where Vortex’s performance matters and you can work within its ecosystem constraints. Scenario A is the convergence point, getting Vortex’s performance with Iceberg’s table management, but it’s not available as a released, tested integration yet.
How a new format integrates with Iceberg through the API
The File Format API defines a four-step process for adding any new format (Vortex, Lance, or something that doesn’t exist yet) to Iceberg:
1. Add the format to the FileFormat enum in the Iceberg API module. This is the metadata-level identifier that shows up in manifest entries and DataFile records.
2. Implement FormatModel<D, S> for each supported engine/data model combination. A Vortex integration, for instance, might need VortexGenericFormatModel (for the generic Java Record type), VortexSparkFormatModel (for Spark’s InternalRow), and VortexFlinkFormatModel (for Flink’s RowData).
3. Implement the corresponding ReadBuilder and WriteBuilder that handle format-specific I/O: encoding, decoding, filter pushdown, projection, and split planning.
4. Register the FormatModel with FormatModelRegistry.register(). The registry handles dispatch from that point forward, with no changes to core Spark or Flink integration code needed.
For Vortex specifically, the Rust-to-Java boundary adds a fifth consideration: the JNI bridge layer that lets Java-based engines call into Vortex’s Rust implementation. The proof-of-concept showed this is viable and performant.
Example: working with the File Format API today
A note on availability: The File Format API ships in Iceberg 1.11.0, which hasn’t been released as of February 24, 2026 (the latest published release is 1.10.1). The interfaces and implementations below are merged into the Iceberg main branch and can be used by building from source, but they’re not in a stable Maven release yet.
This code example shows two things: how the FormatModelRegistry works as the central dispatch mechanism, and how to use the read and write builders the registry provides.
Registry behavior: registration and lookup
This test, based on the pattern from Iceberg’s own TestFormatModelRegistry, shows the core mechanics of the registry. You register a FormatModel, look it up by (FileFormat, Class<?>) key, and the uniqueness constraint prevents ambiguous dispatch:
package org.apache.iceberg.formats;
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.encryption.EncryptedOutputFile;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.util.Pair;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
class FormatModelRegistryExample {
@BeforeEach
void clearRegistry() {
// Clear registrations between tests to isolate behavior
FormatModelRegistry.models().clear();
}
@Test
void registrationAndLookup() {
// Create a FormatModel for Parquet with a specific object model type
FormatModel<?, ?> parquetModel =
new ExampleFormatModel(FileFormat.PARQUET, Object.class, Object.class);
// Register it — keyed by (FileFormat.PARQUET, Object.class)
FormatModelRegistry.register(parquetModel);
// The registry now maps (PARQUET, Object.class) -> parquetModel
assertThat(FormatModelRegistry.models())
.containsEntry(Pair.of(FileFormat.PARQUET, Object.class), parquetModel);
}
@Test
void separateRegistrationsForDifferentObjectModels() {
// Two FormatModels for the same file format but different object model types.
// This is the normal case: Parquet needs one model for generic Records,
// another for Spark's InternalRow, etc.
FormatModel<?, ?> genericModel =
new ExampleFormatModel(FileFormat.PARQUET, Object.class, Object.class);
FormatModel<?, ?> customModel =
new ExampleFormatModel(FileFormat.PARQUET, Long.class, Object.class);
FormatModelRegistry.register(genericModel);
FormatModelRegistry.register(customModel);
// Both coexist — different keys due to different object model types
assertThat(FormatModelRegistry.models())
.containsEntry(Pair.of(FileFormat.PARQUET, Object.class), genericModel);
assertThat(FormatModelRegistry.models())
.containsEntry(Pair.of(FileFormat.PARQUET, Long.class), customModel);
}
@Test
void duplicateRegistrationIsRejected() {
// Registering two models for the same (FileFormat, Class<?>) key
// is an error — the registry enforces uniqueness to prevent
// ambiguous dispatch
FormatModel<?, ?> first =
new ExampleFormatModel(FileFormat.PARQUET, Object.class, Object.class);
FormatModelRegistry.register(first);
FormatModel<?, ?> second =
new ExampleFormatModel(FileFormat.PARQUET, Object.class, String.class);
assertThatThrownBy(() -> FormatModelRegistry.register(second))
.isInstanceOf(IllegalArgumentException.class)
.hasMessageContaining("Cannot register class");
}
// ------------------------------------------------------------------
// Minimal FormatModel implementation for demonstration purposes.
// In production, ParquetFormatModel, AvroFormatModel, etc. provide
// real reader/writer implementations.
// ------------------------------------------------------------------
private static class ExampleFormatModel implements FormatModel<Object, Object> {
private final FileFormat format;
private final Class<?> type;
private final Class<?> schemaType;
ExampleFormatModel(FileFormat format, Class<?> type, Class<?> schemaType) {
this.format = format;
this.type = type;
this.schemaType = schemaType;
}
@Override
public FileFormat format() {
return format;
}
@Override
@SuppressWarnings("unchecked")
public Class<Object> type() {
return (Class<Object>) type;
}
@Override
@SuppressWarnings("unchecked")
public Class<Object> schemaType() {
return (Class<Object>) schemaType;
}
@Override
public ModelWriteBuilder<Object, Object> writeBuilder(EncryptedOutputFile outputFile) {
return null; // Stub — real implementations wrap format-specific writers
}
@Override
public ReadBuilder<Object, Object> readBuilder(InputFile inputFile) {
return null; // Stub — real implementations wrap format-specific readers
}
}
}
How engines use the registry for reads and writes
Once format models are registered (which happens automatically via static class loading in the standard Iceberg modules), engine code uses the registry to get builders. Here’s a sketch of how the read and write paths work through the registry, using the actual API signatures:
// ============================================================
// READ PATH — how an engine reads a data file through the API
// ============================================================
// The engine knows the file format (from Iceberg manifest metadata),
// the object model class it wants, and the InputFile to read.
CloseableIterable<Record> records = FormatModelRegistry
.readBuilder(FileFormat.PARQUET, Record.class, inputFile)
.schema(tableSchema) // Iceberg schema for the table
.projection(projectedSchema) // Only the columns needed
.filter(residualFilter) // Pushed-down filter expression (best-effort)
.split(splitStart, splitLength) // Byte-range split for parallelism
.reuseContainers() // Optimize GC by reusing record containers
.recordsPerBatch(4096) // Batch size for vectorized readers
.build();
// ============================================================
// WRITE PATH — how an engine writes a data file through the API
// ============================================================
// For data files:
DataWriter<Record> dataWriter = FormatModelRegistry
.dataWriteBuilder(FileFormat.PARQUET, Record.class, encryptedOutputFile)
.schema(tableSchema)
.partitionSpec(partitionSpec)
.partition(partitionData)
.sortOrder(sortOrder)
.set("parquet.compression.codec", "zstd") // Format-specific config
.build();
// For equality delete files:
EqualityDeleteWriter<Record> eqDeleteWriter = FormatModelRegistry
.equalityDeleteWriteBuilder(FileFormat.PARQUET, Record.class, encryptedOutputFile)
.schema(deleteSchema)
.partitionSpec(partitionSpec)
.partition(partitionData)
.build();
// For position delete files:
PositionDeleteWriter<Record> posDeleteWriter = FormatModelRegistry
.positionDeleteWriteBuilder(FileFormat.PARQUET, Record.class, encryptedOutputFile)
.schema(deleteSchema)
.partitionSpec(partitionSpec)
.partition(partitionData)
.build();
The thing to notice: nothing in this code is Parquet-specific beyond the FileFormat.PARQUET enum value. Swap FileFormat.PARQUET for FileFormat.VORTEX (once the Vortex FormatModel is registered) and it routes to the Vortex implementation with no other code changes. That’s the entire value proposition.
Running this today
To work with these APIs before the 1.11.0 release:
1. Clone the Apache Iceberg repository: git clone https://github.com/apache/iceberg.git
2. Build from source (requires JDK 17 or 21): ./gradlew build -x test
3. The File Format API interfaces live in iceberg-core; format model implementations are in iceberg-data (generic), iceberg-spark (Spark), iceberg-flink (Flink), and iceberg-arrow (Arrow/vectorized).
4. Run the existing tests: ./gradlew :iceberg-core:test --tests "*TestFormatModelRegistry*"
What the API unlocks beyond pluggable formats
The blog post calls out two additional capabilities the File Format API enables.
Column families. These are vertically split storage layouts where a single Iceberg table’s columns get stored across separate file groups. This supports partial column updates without rewriting entire files, higher write parallelism, smaller metadata footers, and more efficient selective reads. Two design documents are in progress (a Column Families proposal and an Efficient Column Updates proposal tracked as Issue #15146). This work was explicitly waiting for the File Format API to land.
Comet integration. PR #13786 would migrate the Spark Comet reader to the FormatModel API, letting Comet-related classes be cleanly separated and potentially moved to the Comet repository.
What to watch and when to act
If you’re running Iceberg + Parquet in production today: Nothing changes right away. Parquet stays the default, and the File Format API is a non-breaking refactor of internal plumbing. Your existing tables, configurations, and engine integrations keep working through the 1.11.0 upgrade.
If you’re evaluating next-generation file formats: The API gives you a clear integration path. Watch Issue #15416 (Vortex integration) and the TCK development (Issue #15415). When the Vortex FormatModel lands and the TCK validates it, you’ll have a tested path to swap file formats without rewriting your pipeline logic.
If you’re building a custom file format: The four-step integration process (enum addition, FormatModel implementation, builder implementation, registry registration) is your contract. The TCK, once completed, will be the compliance test you need to pass.
The 1.11.0 release marks the start of a multi-format era for lakehouse storage. It doesn’t change what you do today, but it fundamentally changes what becomes possible tomorrow.
Source: Introducing the Apache Iceberg File Format API, Apache Iceberg blog, February 20, 2026
Head of Developer Relations, Fivetran


