JSONB type performance in PostgreSQL 9.4
The 9.4 version of PostgreSQL introduces the JSONB
data type, a specialised representation of the JSON
data, allowing PostgreSQL to be competitive in managing the “lingua franca” of the moment for the exchange of data via web services. It is useful to perform a number of tests to verify its actual performance.
Test data
We will run our tests using the customer reviews data from Amazon for the year 1998 in JSON
format. The file customer_reviews_nested_1998.json.gz can be downloaded from the website of Citus Data.
The file, once unzipped, takes up 209MB and contains approximately 600k of records in JSON
format with a structure similar to the following:
{ "customer_id": "ATVPDKIKX0DER", "product": { "category": "Arts & Photography", "group": "Book", "id": "1854103040", "sales_rank": 72019, "similar_ids": [ "1854102664", "0893815381", "0893816493", "3037664959", "089381296X" ], "subcategory": "Art", "title": "The Age of Innocence" }, "review": { "date": "1995-08-10", "helpful_votes": 5, "rating": 5, "votes": 12 } } |
Dimensions
The data can be loaded into a PostgreSQL database using the JSONB
data type with the following commands:
CREATE TABLE reviews(review jsonb); \copy reviews FROM 'customer_reviews_nested_1998.json' VACUUM ANALYZE reviews; |
The resulting table will take up approximately 268MB, with an additional cost of disk storage of around 28%. If we try to load the same data using the JSON
type, which stores it as text, the result will be a table of 233MB, with an increase in space of roughly 11%. The reason for this difference is that the internal structures of JSONB
, which are used to access the data without analysing the entire document each time, have a cost in terms of space.
Data access
Once the data is stored in the database, it is necessary to create an index in order to access it efficiently. Before the 9.4 version of PostgreSQL, the only method of indexing the contents of a JSON
field was to use a B-tree
index on a specific search expression. For example, if we want to perform a search by product category we will use:
CREATE INDEX on reviews ((review #>> '{product,category}')); |
The newly created index takes up 21MB, or approximately 10% of the original data, and will be used for queries that contain within the WHERE clause the exact search term “review #>> {product,category}
”, such as:
SELECT review #>> '{product,title}' AS title, avg((review #>> '{review,rating}')::int) FROM reviews WHERE review #>> '{product,category}' = 'Fitness & Yoga' GROUP BY 1 ORDER BY 2; title | avg ---------------------------------------------------+-------------------- Kathy Smith - New Yoga Challenge | 1.6666666666666667 Pumping Iron 2 | 2.0000000000000000 Kathy Smith - New Yoga Basics | 3.0000000000000000 Men Are from Mars, Women Are from Venus | 4.0000000000000000 Kathy Smith - Functionally Fit - Peak Fat Burning | 4.5000000000000000 Kathy Smith - Pregnancy Workout | 5.0000000000000000 (6 rows) |
The query takes approximately 0.180ms to be performed on the test machine, but the index that has been created is highly specific and cannot be used for different searches.
Starting with version 9.4, the JSONB
data type supports the use of inverted indexes (GIN, or General inverted Indexes), which allow indexing of the components of a complex object.
Let’s create a GIN
index on our table reviews
with the following command:
CREATE INDEX on reviews USING GIN (review); |
The resulting index takes up 64MB of disk, which is approximately 30% of the size of the original table. This index can be used to speed up searches using the following operators:
JSON
@>
JSON
is a subsetJSON
?
TEXT
contains a valueJSON
?&
TEXT[]
contains all the valuesJSON
?|
TEXT[]
contains at least one value
The above query shall thus be rewritten using the operator @>
to search for rows that contain '{"product": {"category": "Fitness & Yoga"}}'
:
SELECT review #>> '{product,title}' AS title, avg((review #>> '{review,rating}')::int) FROM reviews WHERE review @> '{"product": {"category": "Fitness & Yoga"}}' GROUP BY 1 ORDER BY 2; |
The query takes approximately 1.100ms to be performed on the test machine, and the index that has been created is flexible and can be used for any search within the JSON
data.
In fact, it is often the case that the only operation used in the applications is the search for a subset. In this case it is possible to use a different GIN
index which only supports the @>
operator and is therefore considerably smaller. The syntax for creating this type of “optimised” index is as follows:
CREATE INDEX on reviews USING GIN (review jsonb_path_ops); |
The resulting index occupies 46MB – only 22% of the size of the original data – and, thanks to its smaller size, it is used more efficiently by PostgreSQL.
This allows the above query to be run in just 0.167ms, with a performance increase of 650% compared to the original GIN
index and 8% compared to the specific B-tree
index initially used: all this without loss of generality with regards to the possible search operation.
Conclusions
With the introduction of the JSONB
type and the GIN
indexes
using jsonb_path_ops
operator class, PostgreSQL combines the elasticity of the JSON
format at an amazing data access speed.
Today it is thus possible to store and process data in JSON
format with high performance while enjoying the robustness and flexibility that PostgreSQL has habitually provided us with over the years.
Thanks a lot for this article 😉
This was a very interesting article. I have a couple of questions:
1. What does “\copy” exactly do? How did it know that you want to store the data in the “review” column? Was Postgres able to detect that it was the only column in the table?
2. What does “VACUUM ANALYZE” do in your case? I read the Postgres documentation, and I understood from there that it cleans the deleted rows, but you didn’t delete anything. Does the “ANALYZE” part make future queries faster?
3. If the “json_path_ops” index is focused on the @> operator only, does it mean that your SELECT statement is slower, since it still uses the #>> operator? Or it’s that fast because it’s actually doing the SELECT only after the WHERE filtering?
1) \copy is a psql meta command which invokes the SQL commands “COPY … FROM STDIN” or “COPY .. TO STDOUT”, and then fetches/stores the data in a file accessible to the psql client.
If you don’t specify any column list, PostgreSQL will load/dump all the columns of the table.
2) It’s always a good idea to “VACUUM ANALYZE” a table after a bulk load of data. It ensures that all the internal structures used by PostgreSQL are up to date (hint bits, visibility map, etc…) and collects statistics about the content of the table, so the query planner can use these statistics to help determine the most efficient execution plans for queries.
3) The “json_path_ops” GIN index is used to quickly find the data during the WHERE filtering, then the #>> operator is applied in the output phase to display only the required portion of data. It makes the access very fast.
Just a small correction. After bulk inserting of data, VACUUM is not needed, only ANALYZE, so that it updates the query planner for that table/database. Running VACUUM ANALYZE won’t hurt, but it will just take longer, which makes a big difference on big tables/databases.
Thanks for the article. I didn’t know that you can make a specialized GIN index which will only work for some operators.
Marco,
Thanks for putting this information together. I’ve used the json/jsonb functionality a bit and it’s really nice. Accessing the `similar_ids` array is a little more challenging. I started out doing this:
SELECT review #>> ‘{product,title}’ AS title
FROM reviews
WHERE exists(select 1 from jsonb_array_elements(“reviews”.review#>'{product,similar_ids}’) where value = ‘”0910627312″‘);
which works for both json / jsonb by changing the name of the function to `json_array_elements`
I’m going to adopt your syntax (and better use of the index) with this as soon as I can get my production db upgraded to 9.4.x:
SELECT review #>> ‘{product,title}’ AS title
FROM reviews
WHERE “reviews”.review#>'{product,similar_ids}’ @> ‘”0910627312″‘;
Thanks again!
dwilkins
Hi Dwilkins,
the query you pasted will not use the jsonb_path_ops index. To take advantage of it you need to write the query as follows:
SELECT review #>> ‘{product,title}’ AS title
FROM reviews
WHERE review @> ‘{“product”: {“similar_ids”: [“0910627312”]}}’;
Thanks Marco – I’m gradually getting more familiar with Postgres – It’s obvious now that I explain each query that mine wasn’t using the index.
What is the size limit of JSONB data type?
It has the same limitations of other TOASTable fields, so it should work with maximum one GB of data. However, in my experience, it’s better to not exceed few hundred megabytes of data.
Thanks Marco. Very helpful this data.
really useful article. Thanks a ton marco.
Great Article thanks! Have you any suggestions how I could speed up a jsonb query, when I would like to compare floats which are saved as values in a jsonb column?
Such as:
SELECT “products”.* FROM “products”
WHERE (COALESCE(CAST( nullif(technical_details#>>’{weight,value}’,”) AS FLOAT),0) BETWEEN 500 AND 1500)
This query takes 30ms in my db. I added a GIN index on the column technical_details, but this did not really help.
Thanks for your help!
In my experience
review #>> ‘{product,title}’ is slower than review->product->>title
I want to use use jsonb but I want to find out if using gin index or jsonb_path_ops index described in your article reduces the poor performance that occurs when querying jsonb since PostgreSQL doesn’t know how to keep statistics on the values of fields within JSONB columns as described in the article below.
http://blog.heapanalytics.com/when-to-avoid-jsonb-in-a-postgresql-schema/
Thanks for your response
Is there a way to benchmark Postgres JSONb implementation as directly as possible? I mean measuring the serialization/deserialization time for some json documents, as well as measuring the time to access specific values?