The production instance of tiled.nsls2.bnl.gov now holds almost 70 million nodes.
tiled_catalog_prod=# select count(*) from nodes;
count
----------
66946050
(1 row)
We have followed the mantra, "Make it work, make it right, make it fast" (in that order), and we have seen that queries such as, "Fetch the single most recent scan for this beamline," are slow. There is some quite low-hanging fruit to make Tiled faster. The index created by this expression is not having the intended effect.
Index(
    "top_level_metadata",
    "parent",
    # include the keys of the default sorting ('time_created', 'id'),
    # used to avoid creating a temp sort index
    "time_created",
    "id",
    "metadata",
    "access_blob",
    postgresql_using="gin",
),
PostgreSQL accepts this (a multi-column GIN index over scalar columns like parent requires the btree_gin extension), unfortunately, but it's a bit of a foot-gun: GIN indexes cannot return rows in sorted order, so the included time_created and id columns can never serve the default ORDER BY, and a plain equality filter on parent is served far better by a B-tree. We can see that in production, the index is not effective.
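The problem can be reproduced locally in a scratch database. The table and extension setup below are my sketch, not Tiled's actual DDL:

```sql
-- btree_gin is what allows scalar columns in a GIN index in the first place
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE TABLE nodes_repro (
    id           bigint,
    parent       bigint,
    time_created timestamptz,
    metadata     jsonb,
    access_blob  jsonb
);

CREATE INDEX top_level_metadata_repro ON nodes_repro
    USING gin (parent, time_created, id, metadata, access_blob);

-- GIN cannot return rows in index order, so the ORDER BY ... LIMIT 1
-- pattern gets no help from this index; the plan still needs a Sort:
EXPLAIN SELECT * FROM nodes_repro
WHERE parent = 0 ORDER BY time_created, id LIMIT 1;
```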
tiled_catalog_prod=# SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'nodes'
ORDER BY indexrelname;
indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
------------------------------+------------+--------------+---------------
ix_nodes_parent | 22364609 | 75956623183 | 65078973378
key_parent_unique_constraint | 366223859 | 1609316142 | 1608928295
metadata_tsvector_search | 71 | 3320247 | 0
nodes_pkey | 1363555876 | 1363934256 | 1217639562
top_level_metadata | 4504 | 83046299 | 0
(5 rows)
The key here is the 0 in idx_tup_fetch, the count of heap rows actually fetched via this index:
top_level_metadata | 4504 | 83046299 | 0
The planner occasionally tries the index, gets nothing useful from it, and fetches the rows through other paths instead.
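Before dropping the index outright, one way to double-check that it cannot serve the hot query even when the planner is pushed toward it. The session settings here are a standard planner-debugging trick, not something Tiled itself does:

```sql
BEGIN;
-- Discourage the sequential-scan plan for this transaction only:
SET LOCAL enable_seqscan = off;
EXPLAIN SELECT * FROM nodes
WHERE parent = 0 ORDER BY time_created, id LIMIT 1;
ROLLBACK;  -- SET LOCAL is scoped to the transaction anyway
```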
On a local test database with just 100 nodes, we can see that replacing this useless index with separately useful ones is effective, changing a (slow) Seq Scan on nodes into a (fast) Index Scan:
postgres=# EXPLAIN ANALYZE SELECT * FROM nodes WHERE parent=0 ORDER BY time_created, id LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=3.76..3.77 rows=1 width=114) (actual time=0.097..0.098 rows=1 loops=1)
   ->  Sort  (cost=3.76..4.01 rows=100 width=114) (actual time=0.094..0.095 rows=1 loops=1)
         Sort Key: time_created, id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on nodes  (cost=0.00..3.26 rows=100 width=114) (actual time=0.014..0.035 rows=100 loops=1)
               Filter: (parent = 0)
               Rows Removed by Filter: 1
 Planning Time: 0.244 ms
 Execution Time: 0.127 ms
(9 rows)
postgres=# -- Drop useless index
DROP INDEX CONCURRENTLY top_level_metadata;
-- Create separate GIN indexes of JSON columns and a B-tree index on the others.
CREATE INDEX CONCURRENTLY ix_nodes_access_blob
ON nodes USING GIN (access_blob);
CREATE INDEX CONCURRENTLY ix_nodes_metadata
ON nodes USING GIN (metadata jsonb_path_ops);
CREATE INDEX CONCURRENTLY ix_nodes_parent_time_id
ON nodes (parent, time_created, id);
DROP INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
postgres=# EXPLAIN ANALYZE SELECT * FROM nodes WHERE parent=0 ORDER BY time_created, id LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.14..0.30 rows=1 width=114) (actual time=0.017..0.018 rows=1 loops=1)
   ->  Index Scan using ix_nodes_parent_time_id on nodes  (cost=0.14..16.20 rows=100 width=114) (actual time=0.014..0.015 rows=1 loops=1)
         Index Cond: (parent = 0)
 Planning Time: 0.508 ms
 Execution Time: 0.046 ms
(5 rows)
Further testing, incorporating queries on the GIN indexes on metadata and access_blob, will require larger sample data sizes. More to follow.
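For reference when designing those tests: a GIN index with jsonb_path_ops only accelerates the containment and jsonpath operators (@>, @?, @@), while the default jsonb operator class (used above for access_blob) additionally supports the key-existence operators (?, ?|, ?&). The candidate queries would look like this; the key names are hypothetical:

```sql
-- Should be able to use ix_nodes_metadata (jsonb_path_ops):
EXPLAIN ANALYZE SELECT * FROM nodes
WHERE metadata @> '{"scan_id": 1234}';

-- Should be able to use ix_nodes_access_blob (default jsonb opclass):
EXPLAIN ANALYZE SELECT * FROM nodes
WHERE access_blob ? 'user';
```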
(The index expression above is defined in tiled/tiled/catalog/orm.py, lines 101 to 111 at b98e34b.)