It depends on the function and the SQL implementation. You can see in this simulator that `where rand() > rand()` evaluates row by row in MySQL but only once in SQL Server, so it's easy to get this stuff messed up: even if the code looks "equivalent", it really isn't.
https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24
Thanks, that's a bit upsetting :-)
Indeed.
On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:
https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:
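The DuckDB code itself didn't make it into this thread. As a hedged sketch of the same idea (not the article's exact function), here's the technique in Python: hash a seed together with the row's primary key and scale the top 53 bits of the digest into [0, 1), so every row gets its own stable "random" value no matter how the engine treats `RAND()`. The function name, seed format, and 10% threshold are all illustrative choices.

```python
import hashlib

def pseudorandom_uniform(key, seed=""):
    """Deterministically map (seed, key) to a float in [0, 1).

    Hashing the key gives each row its own stable pseudorandom value,
    sidestepping engines that evaluate RAND() only once per query.
    """
    h = hashlib.sha256(f"{seed}/{key}".encode()).digest()
    # Keep the top 53 bits so the result fits exactly in a double.
    n = int.from_bytes(h[:8], "big") >> 11
    return n / (1 << 53)

# Sample roughly 10% of 1000 "rows" reproducibly: keep a row's key
# whenever its pseudorandom value falls below the threshold.
sample = [k for k in range(1000) if pseudorandom_uniform(k, seed="demo") < 0.10]
```

Because the value depends only on `(seed, key)`, the same query over the same table yields the same sample; changing the seed draws a fresh one.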
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiently, which can make alternatives like the deterministic function I shared advantageous.
Thanks. I think my trick was to generate (externally) a table of random numbers, then `JOIN` that randtable to the table of interest by rowid. It was only a few million numbers, so it didn't take long to generate, and it was pretty much a one-off job.
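The randtable trick above can be sketched with SQLite from Python's standard library (table and column names here are invented for illustration): precompute one random number per id, then join it onto the data table, so each row gets exactly one stable random value regardless of how the engine evaluates `RAND()`.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1000)])

# Externally generated table of random numbers, one per rowid.
rng = random.Random(42)  # fixed seed makes the sample reproducible
conn.execute("CREATE TABLE randtable (id INTEGER PRIMARY KEY, r REAL)")
conn.executemany("INSERT INTO randtable VALUES (?, ?)",
                 [(i, rng.random()) for i in range(1000)])

# Keep ~10% of rows by joining on id and filtering on the precomputed value.
sample = conn.execute("""
    SELECT d.id, d.val
    FROM data AS d
    JOIN randtable AS r ON d.id = r.id
    WHERE r.r < 0.10
""").fetchall()
```

Since the random values live in a materialized table, the same query always selects the same rows; regenerating randtable draws a new sample.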
Had a good laugh; in my experience, this is the normal response to differences between SQL implementations.