I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.

It's very neat for some types of data to have columns contiguous in memory.

>> some performance comparisons vs sqlite.

That's not really the purpose; it's a language-independent format, so you don't need to convert the data when moving between, say, a Python dataframe library and R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is way more performant; the data is intentionally laid out so the target columns are contiguous in memory. You probably already know, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.

> Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.

Not sure if I misunderstood, but what are the chances those different consumers / tools / operations are running in your memory space?

Not an expert, so I could be wrong, but my understanding is that you could copy those bytes directly from the wire into your memory and treat them as the Arrow payload you're expecting.

You still have to transfer the data, but you remove the need for a transformation before writing to the wire, and a transformation when reading from the wire.

If you are in control of two processes on a single machine instance, you could share the memory between a writer and a read-only consumer.

The key phrase, though, would seem to be "memory representation", not "same memory". You can spit the in-memory representation out to an Arrow file or an Arrow stream, take it in, and it's in the same memory layout in the other program. That's kind of the point of Arrow: it's a standard memory layout available across applications and even across languages, which can be really convenient.

Arrow supports zero-copy data sharing - check out the Arrow IPC format and Arrow Flight.

Thanks! This is all probably me using the familiar sqlite hammer where I really shouldn't.

If I recall, Arrow is more or less a standardized in-memory representation of columnar data. I believe it tends not to be used directly, but rather as the foundation for higher-level libraries (like Polars, etc.). That said, I'm not an expert here so I might not have the full picture.

You can absolutely use it directly, but it is painful. The USP of Arrow is that you can pass bits of memory between Polars, Datafusion, DuckDB, etc. without copying. It's Parquet but for memory.

This is true, and as a result IME the problem space is much smaller than Parquet, but it can be really powerful. The reality is most of us don't work in environments where Arrow is needed.

Take a look at parquet.

You can also store Arrow on disk, but it is mainly used as an in-memory representation.

yeah, not necessarily compute (though it does ship compute kernels)!

it's actually many things: an IPC protocol, a wire protocol, a database connectivity spec, etc. etc.

in reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations between languages and engines.

and, imho, it all really comes down to standard data types for columns!