The data scientists I work with use this. Why do they use it? I don't really know much about it, but I've noticed they use it quite often. I mainly use MySQL or PostgreSQL. What are the advantages of DuckDB? It seems like they usually use it as an alternative to Pandas.

DuckDB has been probably my most used tool in 2026 - if you're comfortable with SQL it's incredible at quickly prototyping and slicing / dicing data.

I do a lot of experiments with regexes, and once you get used to the RE2 syntax that DuckDB uses, you can see a 10-100x speedup over Postgres on things like regexp_matches(), regexp_extract(), etc. (depending on query/table/machine specifics). It also has quite powerful scripting with custom macros, and it fixes a lot of the annoyances I have with SQL in Postgres.

I think if you have access to a machine with a lot of RAM / cores and a beefy data set, then it's basically like a RAMdisk version of Snowflake running locally on your machine.

(and of course the fact that it makes it convenient to read CSV/parquet, read/write from S3, etc) - it's a very ergonomic tool.

Thank you for your kind reply. I should look into it too. In my case, knowing various libraries is directly related to my livelihood. Have a good day.

Here's the thing: it's a single-file format that only one process can write to at a time. It's optimized for reading, so if you need to run analytical queries you just open a file and query the parts you want. If you have multiple clients that read and write data to the database, you should use PostgreSQL.

It's not really a database in the traditional client/server sense (it does support ACID transactions, but there is no server to run or manage). It's a library that lets you write SQL to query a tabular data file.

Primarily the ability to work directly with data in its native format (CSV for example) without needing ETL.

How does this work in a production setup? Can this be set up like a server, or is it mostly for individual users to play around with data?

The idea is that you treat data storage and data processing as two distinct tasks. You have your data in S3 or HDFS or a local directory and you run DuckDB on whatever single-node compute you have: a local machine or a container in a cluster.

There are companies that write cluster computing engines with duckdb as the byte-cruncher at their heart, but usually it's more like NumPy, Pandas or Polars on steroids. Or SQLite, but for running OLAP queries.

In my previous job (working with electric vehicles) we had an AWS Batch job that pulled all data from S3[1] into containers (one container per vehicle), pushed that data into DuckDB, and then ran some basic queries and data analysis.

The key thing is that this scaled horizontally pretty much forever, since each vehicle had a fixed amount of data per year we could tightly control the performance characteristics of the analysis. Adding more vehicles didn't make things slower, just linearly more expensive.

I vaguely remember the data from those containers also being used for some aggregate analysis (each vehicle container would output some data that was then consumed by another job that computed the aggregates). But I don't remember the specifics.

[1]: I believe we used JSONL or parquet format, but I didn't work in that part of the stack directly

It is an OLAP DB. So you can have a pipeline storing data in Parquet files in S3, and then use DuckDB to query them directly.

Then it definitely makes sense. Scientists usually handle a lot of CSV files. Thank you