The graphs for OrioleDB looks very impressive. Can someone give a counter argument to switching to this?

Oreole is not a plug-and-play yet. From their docs ( https://www.orioledb.com/docs ) > OrioleDB currently requires a set of patches to PostgreSQL to enhance the pluggable storage API and other PostgreSQL subsystems. All of these patches have been submitted to the PostgreSQL community and are under review.

(Opinions are my own, not of my employer)

I think that "under review" claim is doing some very heavy lifting, especially when it relates to their changes to index tuple lifecycle management. The patches that have been submitted are unlikely to get committed in full anytime soon, even after substantial changes to the patches' designs.

PostgreSQL just has not been designed for what OrioleDB is doing, and forcing OrioleDB's designs into PostgreSQL upstream would a lot of (very) sharp edges that the community can't properly test without at least a baseline implementation - which critically hasn't been submitted to upstream. Examples of these sharp edges are varsized TIDs, MVCC-owning indexes, and table AM signalled index inserts.

There are certainly ideas in OrioleDB's designs that PostgreSQL can benefit from (retail index tuple deletion! self-clustering tables!), but these will need careful consideration in how this can be brought into the project without duplicating implementations at nearly every level. A wholesale graft of a downstream fork and then hoping it'll work out well enough is just not how the PostgreSQL project works.

Oh this dampens my expectations, I was under the wrong impression that it will stabilise outside of Postgres before it is merged.

You get basically most of the advantages of a B-tree-oriented table, but also most of the disadvantages AFAIK. In particular, any index lookup/scan is going to take twice as long (index blocks don't point to the place on disk, they just contain the primary key and then you need to go lookup _that_ in the actual table).

This is generally true, but there are some additional aspects.

1. With PostgreSQL heap, you need to access the heap page itself. And it's not for free. It goes all through the buffer manager and other related components.

2. In OrioleDB, we have a lightweight protocol to read from pages. In-memory pages are connected using direct links (https://www.orioledb.com/docs/architecture/overview#dual-poi...), and pages are read lock-less (https://www.orioledb.com/docs/architecture/overview#page-str...). Additionally, tree navigation for simple data types skips both copying and tuple deforming (https://www.orioledb.com/blog/orioledb-fastpath-search).

According to all of the above, I believe OrioleDB still wins in the case of secondary key lookup. I think this is indirectly confirmed by the results of the TPC-C benchmark, which contains quite a lot of log of secondary key lookups. However, this subject is worth dedicated benchmarking in the future.

It would be interesting to see how OrioleDB does with more OLAP-like loads. From when I spent a lot of time benchmarking this, the indirect index design was _the_ main reason why MySQL+InnoDB was losing significantly to Postgres on TPC-H (well, DBT-3).[1] There was a lot of working around it with covering indexes etc..

Of course, the flip side of the coin is that if you do an UPDATE of a row in the presence of a secondary index, and the UPDATE doesn't touch the key, then you don't need to update the index(es) at all. So it really depends on how much you update rows versus how often you index-scan them IME.

[1] TPC-H doesn't have difficult enough queries to really stress the planner, so it mattered comparatively less there than in other OLAP work.

Thank you, that would be on the TODO list.

That’s how regular Postgres b-tree indexes work too.

I'll take a [citation needed] on that one.

https://www.postgresql.org/docs/current/indexes-index-only-s...

This is why Postgres b-tree indexes offer CREATE INDEX (indexCol1, indexCol2, ...) INCLUDE (includeCol1, includeCol2, ...). With INCLUDE, the index will directly store the listed additional columns, so if your query does `SELECT includeCol1 WHERE indexCol1 = X AND indexCol2 > Y`, you avoid needing to look up the entire row in the heap, because includeCol1 is stored in the index already. This is called a "covering index" because the index itself covers all the data necessary to answer the query, and you get an "index only scan" in your query plan.

The downside to creating covering indexes is that it's more work for Postgres to go update all the INCLUDE values in all your covering indexes at write time, so you are trading write speed for increased read speed.

I think it's quite typical to see this in SQL databases. SQLite behaves the same way for indexes; the exception is that if you create a WITHOUT ROWID table, then the table itself is sorted by primary key instead of by ROWID, so you get at most 1 index that maps directly to the row value. (sqlite docs: https://sqlite.org/withoutrowid.html)

That link directly contradicts what you are saying.

> This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.

Note that it says _index and the heap_. Not _index and the primary index and the heap_. (For a B-tree-organized table, the leaf heap nodes are essentially the bottom of the primary index, so it means that to find anything, you need to follow the primary index from the top, which may or may not entail extra disk accesses. For a normal Postgres heap, this does not happen, you can just go directly to the right block.)

Index-only scans (and by extension, INCLUDE) are to avoid reaching into the heap at all.

> The downside to creating covering indexes is that it's more work for Postgres to go update all the INCLUDE values in all your covering indexes at write time, so you are trading write speed for increased read speed.

For updates, even those that don't touch INCLUDE values, Postgres generally needs to go update the index anyway (this the main weakness of such a scheme). HOT is an exception, where you can avoid that update if there's room in the same heap block, and the index scans will follow the marker(s) you left to “here's the new row!” instead of fetching it directly.

Ah gotcha. My understanding of the heap reference was that heap lookups involve traversing a heap-like tree data structure, not a direct link to a specific leaf page.

Based on the limited description of OrioleDB I understand it works like SQLite WITHOUT ROWID, actually storing the row tuple in the primary key b-tree, but I didn’t go read the code

It's confusing nomenclature, but Postgres' “heap” is not a heap (B-tree) data structure. It is just a bunch of blocks all on the same level, similar to how the “heap” in an operating system (as opposed to the stack) is.

Notably, you can have a Postgres table without a primary key at all, not even an implicit one.

> Based on the limited description of OrioleDB I understand it works like SQLite WITHOUT ROWID, actually storing the row tuple in the primary key b-tree, but I didn’t go read the code

This is my understanding of OrioleDB as well.

Yep, regular PostgreSQL indexes point to a heap location (block number + offset). And it is the same for primary and secondary indexes.