Column Store Plans
Over at pgsql-general, Bráulio Bhavamitra asks:
I wonder if there is any plans to move postgresql entirely to a columnar store (or at least make it an option), maybe for version 10?
This is a pretty interesting question. Completely replacing the current row-based store wouldn’t be a good idea: it has served us extremely well and I’m pretty sure that replacing it entirely with a columnar store would be disastrous performance-wise for OLTP use cases.
That doesn’t mean columnar stores are a bad idea in general — because they aren’t. They just have a more limited use case than “the whole database”. For analytical queries on append-mostly data, a columnar store is a much more appropriate representation than the regular row-based store, but not all databases are analytical.
However, in order to attain interesting performance gains you need to do a lot more than just change the underlying storage: you need to ensure that the rest of the system can take advantage of the changed representation, so that it can execute queries optimally; for instance, you may want aggregates that operate in a SIMD mode rather than one-value-at-a-time as it is today. This, in itself, is a large undertaking, and there are other challenges too.
As it turns out, there’s a team at 2ndQuadrant working precisely on these matters. We posted a patch last year, but it wasn’t terribly interesting — it only made a single-digit percentage improvement in TPC-H scores; not enough to bother the development community with (it was a fairly invasive patch). We want more than that.
In our design, columnar or not is going to be an option: you’re going to be able to say Dear server, for this table kindly set up columnar storage for me, would you? Thank you very much. And then you’re going to get a table which may be slower for regular usage but which will rock for analytics. For most of your tables the current row-based store will still likely be the best option, because row-based storage is much better suited to the more general cases.
We don’t have a timescale yet. Stay tuned.
Isn’t that essentially what cstore_fdw already provides ? Or does its fdw nature make it hard to optimize ? What about a table with some fields stored in colums and the other fields stored in rows ?
Oh man, what a tease. I’ll just sit here daydreaming about an optimized columnar store integrated with parallel query…
So I watched Mark Wong youtube clip summarising patched and CS comparisons..
From what I could tell, CS1 ans CS2 were 3-4x faster for queries. XL was 160 faster than PG. so therefore XL about half the speed of CS DBs?
I also assume that XL was based on 9.5? and doesnt have any of the 9.6 perf improvements which looks like 100% read improvement approx and 2 CPU parallel query also looks like a fairly linear 100% gain as well (not sure if XL would improve the stats if it did have 9.6 perf code in it?)
If the 9.6 code would improve XL.. and with similar perf gain e.g. 100% gain, then it would be similar to CS for queries would it not?
So a Column store only benefit would be the write benefits..
So is there much benefit of a column store? wouldnt focus on improvement PG as it is be a better option?
Also, not sure I understand the full PG history, but if Cstore_FDW already exists.. why not incorporate that and just improve that rather than reinvent the wheel? I assume it may not be easy to utilise SIMD but surely tweaking cstore is better than from scratch, especially from a time to market – well get it into a release perspective..
I could be well off, and since im not technical, any insight, corrections and thoughts would be appreciated 🙂
I’ve worked with several of the columnar analystics RDBMS’s…namely, Vertica (the leader in this space, and an off-shoot of the academic work done on the CSTORE project…also very Postgres-like on the surface…*hint hint*), Infobright (MySQL variant), Amazon Redshift (aka ParAccel, a fork of Postgres), and MariaDB ColumnStore (now called MariaDB AX and formerly Calpont InfiniDB). In doing analytical benchmarks with various row-oriented and column-oriented SQL RDBMS’s, it becomes quickly apparently that the folks at Vertica have the equation right in terms of performance, high-availability, and manageability. I’d pit Vertica against Oracle Enterprise Edition any time and know that it would come out the winner for large-scale analytics. I would highly, highly suggest that, instead of re-inventing the wheel, you dig deep into the CSTORE project to gain an understanding of a foundation that works in practice…the open-source code for that academic project is still readily available.
One additional note: Column-oriented technology comes with a penalty, and that penalty (as you’ve already noted) is difficulty handling updates to existing rows. I’ve looked at column-oriented tech such as the the cstore_fdw, however a 20% gain in performance just isn’t worth that penalty in practice. When you look at gains from Vertica by comparison, a 50X faster improvement is unquestionably worth handling the penalty around updates and other row-centric operations.