
Why Iceberg Feels Like a Gift to Real Data Engineers

And why I no longer feel bad for paying Netflix

July 10, 2025, by Amar
5 min read

For those who haven’t looked into Iceberg yet, let me just say this up front: Apache Iceberg is probably one of the best things that happened to the open source community in big data. And maybe the worst thing that ever happened to some vendors.

You know the type. The ones who want your data locked in, nicely tucked away behind their dashboards, wrapped in APIs, and billed per scan. Iceberg breaks that.

A small thank you to Netflix

If I ever felt bad about paying for my Netflix subscription, Iceberg made it easier to live with. Because yes, Netflix created Apache Iceberg, and honestly, they nailed it. They made a format that finally understands what engineers actually want:

  • Schema evolution that just works
  • Partition evolution without rewriting old data
  • Real time travel and branching
  • Compatibility with every major engine

And you can store your data on something as simple as S3. No weird file structures. No storage tax. Just plain objects, versioned and queryable.
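
To make that concrete, here's a minimal Spark SQL sketch. The catalog, table, and column names are made up for illustration, not a prescription:

-- An Iceberg table stored as plain objects on S3 (all names hypothetical).
CREATE TABLE my_catalog.analytics.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  STRING,
  occurred_at TIMESTAMP
)
USING iceberg
LOCATION 's3://my-bucket/warehouse/analytics/events';

-- Schema evolution is a metadata-only change; no data files are rewritten.
-- Renames and safe type widening work the same way.
ALTER TABLE my_catalog.analytics.events ADD COLUMN device_id STRING;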

One table, all engines

Iceberg doesn’t care if you're querying it from Snowflake, Trino, Flink, Spark, Athena, or even ClickHouse (yes, ClickHouse is getting support too). That’s the whole point. Write once, query everywhere. Finally, we’re decoupling the physical storage from the analytical engine.

You want to write optimized Parquet for Athena? Great. You want a streaming pipeline in Flink to read it minute by minute? Also fine. You want Snowflake to query the same dataset without duplicating it? Absolutely possible.
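
As a rough illustration, the same physical table from the sketch above can be hit from different engines. The catalog and schema names here are assumptions; they depend entirely on how each engine is wired up:

-- Trino, through an Iceberg catalog:
SELECT event_type, count(*) AS events
FROM iceberg.analytics.events
GROUP BY event_type;

-- Athena, through the Glue catalog, reading the very same S3 objects:
SELECT event_type, count(*) AS events
FROM analytics.events
GROUP BY event_type;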

Vendor lock-in who?

Iceberg is your exit plan. It’s your strategy to not rebuild everything when your contract ends. To finally stop doing ETL between systems that already have the same data.

It’s also your way out of the “just scale it up” mindset. Because with Iceberg, you start thinking like a real data engineer again.

Not just “how much RAM does this cluster need?”, but:

  • How should I partition my table?
  • What’s the query pattern of my users?
  • Can I bucket by customer ID to reduce small file scans?
  • Do I optimize for reads or for frequent updates?

This is the stuff that matters. But here’s the thing: Iceberg isn’t magic. It gives you all this power, but it doesn’t run itself, and you still need to take care of your tables.
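
What “taking care” looks like in practice: Iceberg ships its maintenance routines as stored procedures. Here's a sketch in Spark SQL, with placeholder catalog and table names and an arbitrary retention cutoff:

-- Compact small files into fewer, larger ones.
CALL my_catalog.system.rewrite_data_files(table => 'analytics.events');

-- Expire old snapshots so metadata and storage don't grow forever.
CALL my_catalog.system.expire_snapshots(
  table => 'analytics.events',
  older_than => TIMESTAMP '2024-06-01 00:00:00'
);

-- Clean up files that no snapshot references anymore.
CALL my_catalog.system.remove_orphan_files(table => 'analytics.events');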

Partitioning: the heart of performance

Let’s say your users mostly query by event_type, but your partitioning is still based on event_date. You’ll quickly end up with massive scans across wide partitions. Flink or Spark might show great throughput, but the query latency sucks, and you’re burning money.

With Iceberg, you can:

  • Use hidden partitioning so users don’t need to know the structure
  • Apply bucketing for high-cardinality fields like user_id or device_id
  • Change partitions over time without rewriting old data
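
Here's what that can look like on the hypothetical events table from earlier, assuming Spark with the Iceberg SQL extensions enabled. The bucket count of 16 is an arbitrary example, not a recommendation:

-- Hidden partitioning: users filter on occurred_at, Iceberg prunes the matching partitions.
ALTER TABLE my_catalog.analytics.events ADD PARTITION FIELD days(occurred_at);

-- Bucketing for a high-cardinality column, to keep small files under control.
ALTER TABLE my_catalog.analytics.events ADD PARTITION FIELD bucket(16, user_id);

-- Partition evolution: drop a field later; files already written stay as they are.
ALTER TABLE my_catalog.analytics.events DROP PARTITION FIELD days(occurred_at);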

That’s partition evolution, not just configuration.

And the cherry on top: time travel

Maybe you want to debug something. Maybe your data science team wants to reproduce a model from last week. Maybe a write job went wrong, and you want to roll back.

With Iceberg, you can just query:

SELECT * FROM my_table TIMESTAMP AS OF '2024-07-01 00:00:00'
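
That’s Spark SQL syntax; Trino spells it FOR TIMESTAMP AS OF. And if you’d rather pin an exact snapshot than a point in time, VERSION AS OF takes a snapshot ID (the one below is made up):

SELECT * FROM my_table VERSION AS OF 4163428076881731825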

Or even branch the table and test changes in isolation. No data duplication. No dev cluster. Just built-in Git for your data.
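
Here's a rough sketch of how that can look in Spark SQL, again with made-up names and assuming the Iceberg SQL extensions are enabled:

-- Create a branch of the table; no data is copied.
ALTER TABLE my_catalog.analytics.events CREATE BRANCH experiment;

-- Write to the branch in isolation; main stays untouched.
INSERT INTO my_catalog.analytics.events.branch_experiment
VALUES (1, 42, 'test_event', TIMESTAMP '2024-07-01 12:00:00', 'device-123');

-- Read from the branch only.
SELECT * FROM my_catalog.analytics.events.branch_experiment;

-- If the experiment works out, fast-forward main to the branch.
CALL my_catalog.system.fast_forward('analytics.events', 'main', 'experiment');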

Final thoughts

Iceberg isn’t a magic tool. But it’s a format that finally respects your work as an engineer. It gives you the freedom to architect systems based on needs, not tool limitations. It invites you to care again. To design data for how it’s used. To make things faster, cleaner, and more open. And most of all, it gives you control back.

If you want to see how I manage Iceberg tables at scale, from Flink and Spark pipelines to Trino and Athena access layers, I’d be happy to share some setups.

Just promise me one thing: Don’t treat Iceberg like a data lake. Treat it like a data product. Because it really can be that good.