Why is DuckDB so popular when one can use Python + Pandas?

Better perf + SQL is that mostly it?

The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:

  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;
vs

  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )

Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).

Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.

How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)

Polars and duckdb interoperate so it's not really one or the other.

Other dataframe libraries work directly on duckdb also, like Ibis.

FWIW, I think SQLAlchemy does also.

In any case though, I find both the SQL and the code there equally readable.

I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.

I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.

"Languages come and go, but SQL is forever"

That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)

The other thing DuckDB does quite well, and which is out of scope for polars, is its keeping of an absolute zoo of external data sources you can query with SQL from the same database client. It's an excellent data warehousing tool.

Why would you prefer Python and Pandas over good old SQL? Pandas is so verbose and hard to debug, most of the times struggle to be performant on small datasets.

SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.

Pandas has lots and lots of problems.

Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.

It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.

I wrote a blog post a while back to address this question here: https://www.robinlinacre.com/recommend_duckdb/