Percona Database Performance Blog (mysqlperf) wrote,

Storing and Using JSON Within PostgreSQL Part Two

https://www.percona.com/blog/storing-and-using-json-within-postgresql-part-two


We explored the basic functionality of JSON within PostgreSQL in Part One of this series. Now we will look a little deeper at some features you may want to use regularly.  Most of the functionality we covered in Part One works well when your JSON is simple.  In real life, however, our documents can get a bit messy.  Let’s look at part of the JSON we are using for our tests:

Example JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"
    },
    ...

You can see here that we have some nested arrays and a bit of multi-dimensional flair.  If we wanted to get all the characters or actors in this movie, we would have a challenge using the basic functions.  Thankfully, PostgreSQL has a deep set of functions for interacting with JSON.

First, let’s look at how to get all the movies starring Robert Downey Jr. The easiest way is to use one of the following:

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @> '{ "cast": [{ "name" : "Robert Downey Jr." }]}';

select jsonb_column->>'title', jsonb_column->>'imdb_rating' from movies_jsonb where jsonb_column @@ '$.cast.name == "Robert Downey Jr."';
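Both operators can be served by a GIN index on the whole document; the jsonb_path_ops operator class builds a smaller index that supports exactly the @>, @@, and @? operators. A sketch using the table and column from the queries above (the index name is mine):

```sql
-- A GIN index over the whole document lets the planner satisfy
-- @> containment and @@ jsonpath predicates without a full scan.
CREATE INDEX idx_movies_jsonb_path
    ON movies_jsonb USING gin (jsonb_column jsonb_path_ops);
```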

However, what if we also need to pull out the character each actor played?  To get a full list of the actors and characters in this particular movie, we can use jsonb_to_recordset (similar to MySQL’s json_table function, which we covered in the MySQL part of this series).

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where imdb_id = 'tt2395427' limit 15;
  imdb_id  |             title              | imdb_rating |    id     |         name         |    character     
-----------+--------------------------------+-------------+-----------+----------------------+------------------
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000375 | Robert Downey Jr.    | Tony Stark
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1165110 | Chris Hemsworth      | Thor
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0749263 | Mark Ruffalo         | Bruce Banner
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0262635 | Chris Evans          | Steve Rogers
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0424060 | Scarlett Johansson   | Natasha Romanoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0719637 | Jeremy Renner        | Clint Barton
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000652 | James Spader         | Ultron
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000168 | Samuel L. Jackson    | Nick Fury
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0000332 | Don Cheadle          | James Rhodes
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1093951 | Aaron Taylor-Johnson | Pietro Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0647634 | Elizabeth Olsen      | Wanda Maximoff
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm0079273 | Paul Bettany         | Jarvis
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1130627 | Cobie Smulders       | Maria Hill
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm1107001 | Anthony Mackie       | Sam Wilson
 tt2395427 | Avengers: Age of Ultron (2015) |        7.50 | nm2017943 | Hayley Atwell        | Peggy Carter
(15 rows)

This works fine – until it doesn’t.  If I do a similar search for all movies starring Robert Downey Jr., I get:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name text,character text)  where name like 'Robert Downey%' limit 10; 
ERROR:  cannot call jsonb_to_recordset on a non-array

Looking at the results, you can see that the function expects an array, and several of our movies have no cast (a JSON null in the column).

movie_json_test=> select imdb_id, jsonb_typeof((jsonb_column->>'cast')::jsonb) from movies_json_generated limit 10;
  imdb_id  | jsonb_typeof 
-----------+--------------
 tt7620156 | 
 tt0109524 | array
 tt0534466 | array
 tt0111091 | 
 tt4167726 | array
 tt0638383 | array
 tt6346314 | array
 tt5877808 | array
 tt4098782 | 
 tt0365100 | array
(10 rows)

You can work around this null issue in a couple of different ways.  The easiest is to convert to text and then back into JSONB.  For example:

select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset((jsonb_column->>'cast'::text)::jsonb) as t(id text,name varchar(100),character text) where name like 'Robert Downey%';

Remember in Part One how I harped on types and on casting when you need to?  This is another example.  Here the ->> operator returns the JSON null as a SQL NULL of type text; casting that NULL back to jsonb leaves it NULL, and jsonb_to_recordset simply produces no rows for it instead of raising an error.
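A minimal, self-contained illustration of the difference, runnable on any PostgreSQL without our movie data:

```sql
-- A JSON null is not an array, so this raises
-- "cannot call jsonb_to_recordset on a non-array":
--   SELECT * FROM jsonb_to_recordset('null'::jsonb) AS t(id text);

-- A SQL NULL, by contrast, makes the (strict) function return
-- zero rows, which is why the text round-trip works:
SELECT * FROM jsonb_to_recordset(NULL::jsonb) AS t(id text);
```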

While this is the easiest way, let’s look at some other ways to work around this, to highlight a few of the functions, indexes, etc., we learned about in Part One.  Because we know that some cast entries are null, we can check for and filter out those rows.  For example, here is a simple check that jsonb_column->'cast' is not equal to the JSON null:

movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
  imdb_id  |                   title                   | imdb_rating |    id     |       name        |               character                
-----------+-------------------------------------------+-------------+-----------+-------------------+----------------------------------------
 tt0266220 | The 65th Annual Academy Awards (1993)     |             | nm0000375 | Robert Downey Jr. | Himself - Presenter
 tt1515091 | Sherlock Holmes: A Game of Shadows (2011) |        7.50 | nm0000375 | Robert Downey Jr. | Sherlock Holmes
 tt1231583 | Due Date (2010)                           |        6.60 | nm0000375 | Robert Downey Jr. | Peter Highman
 tt0343663 | Eros (2004)                               |        6.00 | nm0000375 | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")
 tt4420868 | The EE British Academy Film Awards (2015) |        7.40 | nm0000375 | Robert Downey Jr. | Himself - Tribute to Lord Attenborough
(5 rows)
 
movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and jsonb_column->'cast'::text != 'null' limit 5;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..7.30 rows=5 width=332) (actual time=0.586..84.666 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..545198.71 rows=373482 width=332) (actual time=0.585..84.664 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..74611.38 rows=373482 width=272) (actual time=0.023..30.257 rows=3786 loops=1)
               Filter: ((jsonb_column -> 'cast'::text) <> 'null'::jsonb)
               Rows Removed by Filter: 258
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.014..0.014 rows=0 loops=3786)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 24
 Planning Time: 0.064 ms
 Execution Time: 84.692 ms
(10 rows)

This is not terribly fast, but it does work, although it essentially sidesteps the behavior of the built-in JSON functions.  To speed the query up, we can, of course, index this column; however, some of our cast lists are simply too large for a btree index row:

movie_json_test=> create index gen_func_index_cast on movies_json_generated (((jsonb_column->>'cast')::text));
ERROR:  index row requires 10704 bytes, maximum size is 8191
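One common way around the btree row-size limit is to index a hash of the oversized text. This is a sketch of my own, not something from the tests here, and it only helps equality lookups against the whole cast blob, not the pattern matching we actually need:

```sql
-- md5() output always fits in an index row; useful only for
-- equality comparisons against md5(...) of a search value.
CREATE INDEX gen_func_index_cast_md5
    ON movies_json_generated (md5(jsonb_column->>'cast'));
```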

Of course, we can work around this as well, but sometimes it’s best to pivot.  Let’s look at another alternative to simply searching for a NULL value.  As we learned in the first post, we can use generated columns or expression indexes to do some nice things.  Here we could either create an expression index that checks for a null or add a new column that uses a case statement to flag null casts:

movie_json_test=> alter table movies_json_generated add column cast_is_null boolean generated always as  ((case when ((jsonb_column->>'cast')::text) is null then true else false end)) stored; 
ALTER TABLE
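The expression-index alternative mentioned above could look like the sketch below. Note that the planner will generally only consider it when a query repeats the exact same expression:

```sql
-- Boolean expression index flagging documents with no cast entry.
CREATE INDEX gen_expr_index_cast_is_null
    ON movies_json_generated ((((jsonb_column->>'cast') IS NULL)));
```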


movie_json_test=> select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is false limit 5;
  imdb_id  |                          title                          | imdb_rating |    id     |       name        |                   character                    
-----------+---------------------------------------------------------+-------------+-----------+-------------------+------------------------------------------------
 tt3473134 | Off Camera with Sam Jones (TV Series 2014– )            |             | nm0000375 | Robert Downey Jr. | Himself                  2 episodes, 2014-2019
 tt0092851 | Dear America: Letters Home from Vietnam (TV Movie 1987) |        7.90 | nm0000375 | Robert Downey Jr. | (voice)
 tt0426841 | The 1994 Billboard Music Awards (1994)                  |             | nm0000375 | Robert Downey Jr. | Himself
 tt1228705 | Iron Man 2 (2010)                                       |        7.00 | nm0000375 | Robert Downey Jr. | Tony Stark
 tt0821642 | The Soloist (2009)                                      |        6.70 | nm0000375 | Robert Downey Jr. | Steve Lopez
(5 rows)


movie_json_test=> explain analyze select imdb_id, title, imdb_rating, t.* from movies_json_generated, jsonb_to_recordset(jsonb_column->'cast') as t(id text,name varchar(100),character text)  where name like 'Robert Downey%' and cast_is_null is not true limit 5;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..8.24 rows=5 width=332) (actual time=0.912..30.550 rows=5 loops=1)
   ->  Nested Loop  (cost=0.01..309227.39 rows=187680 width=332) (actual time=0.912..30.548 rows=5 loops=1)
         ->  Seq Scan on movies_json_generated  (cost=0.00..72750.59 rows=187680 width=272) (actual time=0.007..1.069 rows=1789 loops=1)
               Filter: (cast_is_null IS NOT TRUE)
               Rows Removed by Filter: 106
         ->  Function Scan on jsonb_to_recordset t  (cost=0.01..1.25 rows=1 width=282) (actual time=0.016..0.016 rows=0 loops=1789)
               Filter: ((name)::text ~~ 'Robert Downey%'::text)
               Rows Removed by Filter: 23
 Planning Time: 0.068 ms
 Execution Time: 30.572 ms
(10 rows)

You can see there are several options here for dealing with the nulls, some far easier (and cleaner) than others.  I want to highlight some of the challenges that come up when using unstructured data within a structured system.

Evolution Upsets the Balance (or Breaks Stuff)

All of the above solutions work for the existing data, but the wonderful thing about JSON is that what you store can evolve over time.  Let’s use the above example.  Say that for years, every movie fed into your system had a full cast listing of characters and actors.  Then, one day, the feed you get your data from starts allowing movies without a cast listing.  Your application will still work, and your load scripts will still work.  But every once in a while, a user will hit a weird error, or you will see brief errors flash by in your logs (if you are logging these).  99.9% of queries are fine, and it is these transient issues that drive people bonkers.
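One defensive habit that avoids these transient failures is to guard the expansion with jsonb_typeof, which we used earlier to inspect the column. A sketch against the table from the examples; as in the EXPLAIN output above, PostgreSQL applies this single-table filter at the scan, before the lateral function call, so non-array rows never reach the function:

```sql
-- Defensive guard: only expand rows whose cast entry really is an array.
SELECT imdb_id, title, t.*
  FROM movies_json_generated,
       jsonb_to_recordset(jsonb_column->'cast')
         AS t(id text, name varchar(100), character text)
 WHERE jsonb_typeof(jsonb_column->'cast') = 'array';
```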

Looking at this problem slightly differently, what if you start adding data to, or reshaping, the cast records?  Back to our original JSON:

{
  "imdb_id": "tt2395427",
  "tmdb_id": "99861",
  "douban_id": "10741834",
  "title": "Avengers: Age of Ultron (2015)",
  "genres": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "country": "USA",
  "version": [
    {
      "runtime": "141 min",
      "description": ""
    }
  ],
  "imdb_rating": 7.5,
  "director": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon"
    }
  ],
  "writer": [
    {
      "id": "nm0923736",
      "name": "Joss Whedon",
      "description": "written by"
    },
    {
      "id": "nm0498278",
      "name": "Stan Lee",
      "description": "based on the Marvel comics by and"
    },
    {
      "id": "nm0456158",
      "name": "Jack Kirby",
      "description": "based on the Marvel comics by"
    }
  ],
  "cast": [
    {
      "id": "nm0000375",
      "name": "Robert Downey Jr.",
      "character": "Tony Stark"
    },
    {
      "id": "nm1165110",
      "name": "Chris Hemsworth",
      "character": "Thor"
    },
    {
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "character": "Bruce Banner"
    },
    {
      "id": "nm0262635",
      "name": "Chris Evans",
      "character": "Steve Rogers"
    },
    {
      "id": "nm0424060",
      "name": "Scarlett Johansson",
      "character": "Natasha Romanoff"
    },
    {
      "id": "nm0719637",
      "name": "Jeremy Renner",
      "character": "Clint Barton"
    },
    ...

What if we wanted to add another piece of data to the cast records?  Say we want to add a thumbs-up or thumbs-down vote for each character, something like this:

{
      "id": "nm0749263",
      "name": "Mark Ruffalo",
      "actor_thumbs_up": 10000,
      "actor_thumbs_down": 100,
      "character": "Bruce Banner"
    },

Now we run into another question.  If the new fields land in the middle of each object instead of at the end, does our call to jsonb_to_recordset(jsonb_column->'cast') as t(id text, name varchar(100), character text) still work?  Here PostgreSQL is forgiving: the recordset definition is matched against the JSON keys by name, not by position, so the character column still returns Bruce Banner.  What does bite is renaming or removing a key: a column that no longer matches any key is silently filled with NULL rather than raising an error, and a value whose type changes can fail the cast.  So while field order is safe, any change to key names or value types ripples into every query that expands this document.
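Before relying on any assumption about field handling, it is worth probing jsonb_to_recordset directly. This stands alone on any PostgreSQL, using the hypothetical vote fields from above:

```sql
-- Cast entry with the new vote fields inserted in the middle;
-- run it and check which value lands in the "character" column.
SELECT *
  FROM jsonb_to_recordset('[
         {"id": "nm0749263", "name": "Mark Ruffalo",
          "actor_thumbs_up": 10000, "actor_thumbs_down": 100,
          "character": "Bruce Banner"}
       ]'::jsonb) AS t(id text, name varchar(100), character text);
```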

MySQL’s equivalent function makes this mapping explicit by requiring a path for each column: json_table(json_column, '$.cast[*]' columns( v_name varchar(200) path '$.name', v_character varchar(200) path '$.character')).  This lets you define exactly which fields you want, though MySQL’s json_table has its own limitations.  Either way, the trade-off is the same: the structure and key names of your JSON matter greatly if you are going to query this data with built-in database functions.
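On the PostgreSQL side, jsonpath (PostgreSQL 12+) offers similarly explicit, name-based extraction. A sketch against our generated table:

```sql
-- Each cast element comes back as its own jsonb value; the ->>
-- lookups are by key name, so element order never matters.
SELECT elem->>'name'      AS name,
       elem->>'character' AS character
  FROM movies_json_generated,
       jsonb_path_query(jsonb_column, '$.cast[*]') AS elem
 WHERE imdb_id = 'tt2395427';
```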

To JSON or Not?  Database Design

While PostgreSQL’s JSON features are pretty deep (there are dozens of other functions I did not show), I see similar challenges when using them compared to a more classic database design.  JSON functionality is NOT a replacement for actually designing your database and thinking about schema and design.  It is a great tool for adding flexibility.  If you treat PostgreSQL, or any database, simply as a dumping ground for data, bad things happen.  Think about how JSON columns can augment your current database designs, not replace them.

Let me give you a simple example of the dangers of forgoing database design in favor of just dumping in JSON.  We have highlighted some of the complexity and setup challenges of the different methods of storing JSON in PostgreSQL; I wanted to take this a step further and compare how different database designs for the same data affect performance and scalability.  For this, I built three separate designs and schemas housing the same data:

A.)  Minimal database design -> a single table with a single JSONB column.  “The dump.”  Tested with and without a GIN index.

B.)  Single table -> generated columns and expression indexes where needed.

C.)  Old-school normalized tables, with the JSON stored alongside for easy use.
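For concreteness, design C might be sketched like this (hypothetical table and column names; the actual schema is not shown here):

```sql
-- Normalized core tables, with the original document kept
-- alongside for cheap whole-document retrieval.
CREATE TABLE movies (
    imdb_id      text PRIMARY KEY,
    title        text NOT NULL,
    imdb_rating  numeric(3,1),
    jsonb_column jsonb
);

CREATE TABLE movie_cast (
    imdb_id   text REFERENCES movies (imdb_id),
    actor_id  text,
    name      text,
    character text
);

-- An index like this is what makes actor searches cheap.
CREATE INDEX ON movie_cast (name);
```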

The idea here is to illustrate a few of the trade-offs around performance when picking a specific design(1):

 Query                                           | A.) No GIN idx | A.) GIN idx | B.) Generated columns | B.) Expression indexes | C.) Normalized
-------------------------------------------------+----------------+-------------+-----------------------+------------------------+----------------
 Simple query for movie title (random 10 titles) | 800ms          | 0.3ms       | 0.2ms                 | 0.2ms                  | 0.2ms
 Select top 10 movies                            | 841ms          | 831ms       | 0.9ms                 | 0.9ms                  | 0.3ms
 Select all movies for an actor                  | 1009.0ms       | 228ms       | 5588.0ms(2)           | 5588.0ms(2)            | 0.8ms
 Count movies for a given actor                  | 5071ms         | 5075ms      | 5544ms                | NA                     | 1ms
  1. Each test was repeated 100 times, and the average results are listed.  Min/max values are available as well. 
  2. The nested JSON for our “cast info” was not easily indexed; there are some things we could do to mitigate this, but it is a bit more complicated.

Interestingly, there are a few cases where generated columns and expression indexes alone run into problems, generally because of nested data within the JSON.  We could do some tricks, like adding a GIN index and querying the JSON to find the element, or creating some funky nested generated column with an expression index.  I left these out for now.  I could also go deeper into the tests (inserts, updates, and, of course, assembling JSON from the normalized tables), but I didn’t.  My goal here is merely to make you think about your design.  Choosing one way to implement and use JSON may be fine under certain use cases, but you may later find yourself querying or aggregating the data in ways that break.
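The GIN trick just mentioned could reuse the containment search from earlier in this post. A sketch, not one of the tested configurations:

```sql
-- Containment via a GIN index avoids expanding every cast array
-- just to find one actor.
CREATE INDEX idx_generated_cast_gin
    ON movies_json_generated USING gin (jsonb_column jsonb_path_ops);

SELECT jsonb_column->>'title'
  FROM movies_json_generated
 WHERE jsonb_column @> '{"cast": [{"name": "Robert Downey Jr."}]}';
```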

Recap & What’s Next

A few quick takeaways:

Recapping part 1:

  • Using JSONB is probably going to be your best option in most use cases.
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits.

What we learned in part 2:

  • The more straightforward the JSON, the easier it will be to pull out and interact with the data you need.
  • Nested JSON data can be pulled out in a few different ways; jsonb_to_recordset is the easiest way I found to pull out the data I need.  However, this function and others like it are very sensitive to the structure of your document (data types matter!).
  • JSON data whose format changes over time (elements added, renamed, or removed) may make some functions difficult, if not impossible, to use.
  • JSON within a well-built, well-designed application and database can be wonderful and offer many benefits.  JSON simply dumped into a database won’t scale.  Database design still matters.

Now that we have covered the basics and a few nuances of JSON with PostgreSQL, next up, it is time to look at the same functionality and data within MongoDB before finally comparing the three databases. Stay tuned!
