The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality.

Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.

Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.
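
Roughly what I have in mind, as a minimal sketch. It assumes each branch's lineage can be exported as a plain mapping from a column or model to its direct downstream dependents. The `downstream` and `lineage_diff` names and the graph shape are hypothetical, not Rocky's actual API:

```python
from collections import deque


def downstream(graph, column):
    """Return every node reachable from `column` in a lineage graph.

    `graph` maps a node to its direct downstream dependents
    (hypothetical export format, assumed for illustration).
    """
    seen = set()
    queue = deque(graph.get(column, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def lineage_diff(base_graph, head_graph, column):
    """Compare the downstream impact of `column` between two branches."""
    before = downstream(base_graph, column)
    after = downstream(head_graph, column)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }


# On main, customer.email reaches A, B, C; on the PR branch it reaches A, B, D.
base = {"customer.email": ["A", "B"], "B": ["C"]}
head = {"customer.email": ["A", "B"], "B": ["D"]}

diff = lineage_diff(base, head, "customer.email")
print(diff)
```

A reviewer would then only need to look at the `added` and `removed` lists, rather than compare two full lineage graphs by eye.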

Data contracts as types and compile-time checks (even across languages) are not new. Here is a recent paper laying out the idea of correctness-by-design pipelines, which is obviously a superset of this particular issue (disclaimer: I'm one of the authors of the paper): https://arxiv.org/pdf/2602.02335

Hey jtagliabuetooso! Absolutely, the idea isn't new. Rocky's bet is on the shipped implementation. I'll read your paper, thank you for sharing.

“Why it’s distinctive” is misleading (perhaps LLM-generated?).

Imo, we cite other work because it puts our work in context for experts and beginners alike; because it makes clear that we all stand on someone else’s shoulders (progress is, most of the time, a collective endeavor, not a lone-genius affair); because it is intellectually honest to acknowledge our debts.

Especially today, when putting research ideas out there almost guarantees they will be plagiarized by someone vibe-coding or vibe-writing, recognizing that our contributions come from somewhere is more important than ever. The implementation may or may not be novel, but the fact that it depends on LLMs even in the README should make you even more aware of why proper attribution is crucial: what's the incentive for open innovation if we all behave like this?

I hadn't come across Bauplan or your work before today's thread. Looks like a few of us are landing on branches/replay/lineage from different angles (yours Iceberg-native, mine warehouse-delegated). I will spend proper time with the paper and Bauplan.

Same. I worked on an in-house product many years ago now where lineage and provenance were the entire point. Really cool to see this!

Thank you!

Hey Xiaoher-C! Hmm, I don't have a lineage diff command yet. For now, it would be a small lift to wire together two commands that already exist: "rocky ci --diff --base main", which runs a diff between main and HEAD, and "rocky lineage --column columnname --downstream". I'll add this to my backlog!