I have two tables in PostgreSQL 12: a dataset has many cfiles, and a cfile belongs to one dataset.
SELECT * FROM datasets;
id | name
----+----------
1 | dataset1
2 | dataset2
SELECT * FROM cfiles;
id | dataset_id | property_values (jsonb)
----+------------+-----------------------------------------------
1 | 1 | {"Sample Names": ["SampA", "SampB", "SampC"]}
2 | 1 | {"Sample Names": ["SampA", "SampB", "SampD"]}
3 | 1 | {"Sample Names": ["SampE"]}
4 | 2 | {"Sample Names": ["SampA, SampF"]}
5 | 2 | {"Sample Names": ["SampG"]}
I am trying to get this result:
id | name | sample_names
----+----------+-----------------------------------
1 | dataset1 | SampA; SampB; SampC; SampD; SampE
2 | dataset2 | SampA, SampF; SampG
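For anyone who wants to reproduce this, the tables can be approximated as follows (column types are assumed; the real tables have more columns, as the query plans below show):

```sql
CREATE TABLE datasets (
    id   bigint PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE cfiles (
    id              bigint PRIMARY KEY,
    dataset_id      bigint NOT NULL REFERENCES datasets(id),
    property_values jsonb
);

INSERT INTO datasets VALUES (1, 'dataset1'), (2, 'dataset2');

INSERT INTO cfiles VALUES
    (1, 1, '{"Sample Names": ["SampA", "SampB", "SampC"]}'),
    (2, 1, '{"Sample Names": ["SampA", "SampB", "SampD"]}'),
    (3, 1, '{"Sample Names": ["SampE"]}'),
    -- note: a single array element containing a comma, not two elements
    (4, 2, '{"Sample Names": ["SampA, SampF"]}'),
    (5, 2, '{"Sample Names": ["SampG"]}');
```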
Following this SO question and its excellent answer, I now have the query below:
SELECT datasets.id, datasets.name,
string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) as sample_names
FROM cfiles
CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') as sn(sample_names)
JOIN datasets on cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
-- Problematic line:
-- ORDER BY datasets.name
LIMIT 20;
This works very well until I need to order the result.
For ~45K cfiles rows: without ORDER BY it takes ~12 ms, with ORDER BY ~58881 ms.
Below is my original query (from the SO question above), which is far less elegant and relies on string manipulation, yet outperforms the cross-join version by roughly 10x at ~5150 ms.
SELECT datasets.id,
datasets.name,
ARRAY_TO_STRING(
ARRAY(
SELECT DISTINCT * FROM unnest(
STRING_TO_ARRAY(
STRING_AGG(
DISTINCT REPLACE(
REPLACE(
REPLACE(
REPLACE(
cfiles.property_values ->> 'Sample Names', '",' || chr(32) || '"', ';'
), '[' , ''
), '"' , ''
), ']' , ''
), ';'
), ';'
)
) ORDER BY 1 ASC
), '; '
) as sample_names
FROM datasets
JOIN cfiles ON cfiles.dataset_id=datasets.id
GROUP BY datasets.id, datasets.name
ORDER BY datasets.name
LIMIT 20;
Is there any way to improve the performance of the cross-join query above (including the ORDER BY) so that it becomes faster than the string-manipulation alternative?
Query plan for the cross-join query without ORDER BY:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.69..4351.18 rows=20 width=106) (actual time=0.409..11.706 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
-> GroupAggregate (cost=0.69..132907.88 rows=611 width=106) (actual time=0.407..11.694 rows=20 loops=1)
Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
Group Key: datasets.id
-> Nested Loop (cost=0.69..109992.24 rows=4581600 width=106) (actual time=0.065..10.742 rows=207 loops=1)
Output: datasets.id, datasets.name, sn.sample_names
-> Merge Join (cost=0.69..18360.24 rows=45816 width=527) (actual time=0.042..5.155 rows=1697 loops=1)
Output: cfiles.property_values, datasets.id, datasets.name
Inner Unique: true
Merge Cond: (cfiles.dataset_id = datasets.id)
-> Index Scan using index_cfiles_dataset_id_path on public.cfiles (cost=0.41..17682.45 rows=45816 width=461) (actual time=0.016..2.665 rows=1697 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
-> Index Scan using datasets_pkey on public.datasets (cost=0.28..103.56 rows=611 width=74) (actual time=0.016..0.066 rows=27 loops=1)
Output: datasets.id, datasets.tid, datasets.bucket_path_id, datasets.path, datasets.name, datasets.last_modified, datasets.file_count, datasets.size, datasets.content_types, datasets.locked, datasets.created_at, datasets.updated_at
-> Function Scan on pg_catalog.jsonb_array_elements_text sn (cost=0.01..1.00 rows=100 width=32) (actual time=0.002..0.002 rows=0 loops=1697)
Output: sn.sample_names
Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
Planning Time: 0.926 ms
Execution Time: 11.845 ms
(20 rows)
Query plan for the cross-join query with ORDER BY:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=130727.27..130727.32 rows=20 width=106) (actual time=60970.131..60970.140 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
-> Sort (cost=130727.27..130728.79 rows=611 width=106) (actual time=60970.128..60970.132 rows=20 loops=1)
Output: datasets.id, datasets.name, (string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names))
Sort Key: datasets.name
Sort Method: top-N heapsort Memory: 27kB
-> GroupAggregate (cost=10585.66..130711.01 rows=611 width=106) (actual time=112.152..60965.350 rows=598 loops=1)
Output: datasets.id, datasets.name, string_agg(DISTINCT sn.sample_names, '; '::text ORDER BY sn.sample_names)
Group Key: datasets.id
-> Nested Loop (cost=10585.66..107795.37 rows=4581600 width=106) (actual time=111.856..4959.284 rows=3289130 loops=1)
Output: datasets.id, datasets.name, sn.sample_names
-> Gather Merge (cost=10585.66..16163.37 rows=45816 width=527) (actual time=111.828..207.605 rows=45816 loops=1)
Output: cfiles.property_values, datasets.id, datasets.name
Workers Planned: 2
Workers Launched: 2
-> Merge Join (cost=9585.63..9875.04 rows=19090 width=527) (actual time=100.410..132.173 rows=15272 loops=3)
Output: cfiles.property_values, datasets.id, datasets.name
Inner Unique: true
Merge Cond: (cfiles.dataset_id = datasets.id)
Worker 0: actual time=94.756..119.875 rows=12140 loops=1
Worker 1: actual time=95.064..120.437 rows=12454 loops=1
-> Sort (cost=9529.25..9576.97 rows=19090 width=461) (actual time=99.259..114.951 rows=15272 loops=3)
Output: cfiles.property_values, cfiles.dataset_id
Sort Key: cfiles.dataset_id
Sort Method: external merge Disk: 10192kB
Worker 0: Sort Method: external merge Disk: 5568kB
Worker 1: Sort Method: external merge Disk: 5592kB
Worker 0: actual time=93.461..105.574 rows=12140 loops=1
Worker 1: actual time=93.784..105.796 rows=12454 loops=1
-> Parallel Seq Scan on public.cfiles (cost=0.00..4188.90 rows=19090 width=461) (actual time=0.028..21.442 rows=15272 loops=3)
Output: cfiles.property_values, cfiles.dataset_id
Worker 0: actual time=0.036..22.118 rows=12140 loops=1
Worker 1: actual time=0.035..22.162 rows=12454 loops=1
-> Sort (cost=56.38..57.91 rows=611 width=74) (actual time=1.133..1.334 rows=603 loops=3)
Output: datasets.id, datasets.name
Sort Key: datasets.id
Sort Method: quicksort Memory: 110kB
Worker 0: Sort Method: quicksort Memory: 110kB
Worker 1: Sort Method: quicksort Memory: 110kB
Worker 0: actual time=1.272..1.471 rows=611 loops=1
Worker 1: actual time=1.259..1.474 rows=611 loops=1
-> Seq Scan on public.datasets (cost=0.00..28.11 rows=611 width=74) (actual time=0.100..0.584 rows=611 loops=3)
Output: datasets.id, datasets.name
Worker 0: actual time=0.155..0.719 rows=611 loops=1
Worker 1: actual time=0.121..0.667 rows=611 loops=1
-> Function Scan on pg_catalog.jsonb_array_elements_text sn (cost=0.01..1.00 rows=100 width=32) (actual time=0.051..0.067 rows=72 loops=45816)
Output: sn.sample_names
Function Call: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
Planning Time: 0.894 ms
Execution Time: 60972.185 ms
(50 rows)
UPDATE 2: the query plan for @bobflux's query, shown below, got it down to ~9 ms!
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=25228.68..25228.73 rows=20 width=72) (actual time=8.166..8.177 rows=14 loops=1)
Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
Sort Key: ds.dataset_name
Sort Method: quicksort Memory: 26kB
CTE ds
-> Limit (cost=0.69..16.76 rows=20 width=74) (actual time=0.059..0.313 rows=20 loops=1)
Output: datasets.id, datasets.name
-> Nested Loop Semi Join (cost=0.69..488.56 rows=607 width=74) (actual time=0.057..0.302 rows=20 loops=1)
Output: datasets.id, datasets.name
-> Index Only Scan using datasets_name_id on public.datasets (cost=0.28..137.44 rows=611 width=74) (actual time=0.028..0.062 rows=20 loops=1)
Output: datasets.name, datasets.id
Heap Fetches: 20
-> Index Only Scan using index_cfiles_dataset_id_path on public.cfiles cfiles_1 (cost=0.41..5.79 rows=75 width=8) (actual time=0.010..0.010 rows=1 loops=20)
Output: cfiles_1.dataset_id, cfiles_1.path
Index Cond: (cfiles_1.dataset_id = datasets.id)
Heap Fetches: 0
-> Hash Join (cost=24073.53..25211.48 rows=20 width=72) (actual time=7.261..8.025 rows=14 loops=1)
Output: ds_1.dataset_id, ds.dataset_name, (string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))))
Hash Cond: (ds_1.dataset_id = ds.dataset_id)
-> GroupAggregate (cost=24072.88..25207.88 rows=200 width=40) (actual time=6.862..7.602 rows=14 loops=1)
Output: ds_1.dataset_id, string_agg(DISTINCT (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), '; '::text ORDER BY (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))))
Group Key: ds_1.dataset_id
-> Sort (cost=24072.88..24450.38 rows=151000 width=40) (actual time=6.688..6.744 rows=259 loops=1)
Output: ds_1.dataset_id, (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
Sort Key: ds_1.dataset_id
Sort Method: quicksort Memory: 44kB
-> ProjectSet (cost=0.41..5443.72 rows=151000 width=40) (actual time=4.419..6.469 rows=259 loops=1)
Output: ds_1.dataset_id, jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))
-> Nested Loop (cost=0.41..4673.62 rows=1510 width=459) (actual time=0.028..4.285 rows=1749 loops=1)
Output: cfiles.property_values, ds_1.dataset_id
-> CTE Scan on ds ds_1 (cost=0.00..0.40 rows=20 width=8) (actual time=0.001..0.012 rows=20 loops=1)
Output: ds_1.dataset_id, ds_1.dataset_name
-> Index Scan using index_cfiles_dataset_id_path on public.cfiles (cost=0.41..232.91 rows=75 width=459) (actual time=0.012..0.129 rows=87 loops=20)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Index Cond: (cfiles.dataset_id = ds_1.dataset_id)
-> Hash (cost=0.40..0.40 rows=20 width=40) (actual time=0.382..0.383 rows=20 loops=1)
Output: ds.dataset_name, ds.dataset_id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> CTE Scan on ds (cost=0.00..0.40 rows=20 width=40) (actual time=0.067..0.356 rows=20 loops=1)
Output: ds.dataset_name, ds.dataset_id
Planning Time: 1.781 ms
Execution Time: 8.386 ms
(42 rows)
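@bobflux's query itself is not quoted above, but it can be reconstructed roughly from the plan: a CTE first picks the 20 datasets that come first by name (keeping only datasets that actually have cfiles, hence the semi join in the plan), the expensive JSONB unnesting and aggregation then run over those 20 datasets only, and the name is joined back at the end. The aliases dataset_id/dataset_name are taken from the plan's output lines; the exact original query may differ:

```sql
WITH ds AS (
    -- cheap step: the first 20 datasets by name that have at least one cfile
    SELECT datasets.id AS dataset_id, datasets.name AS dataset_name
    FROM datasets
    WHERE EXISTS (SELECT 1 FROM cfiles WHERE cfiles.dataset_id = datasets.id)
    ORDER BY datasets.name
    LIMIT 20
)
SELECT agg.dataset_id, ds.dataset_name, agg.sample_names
FROM (
    -- expensive step: unnest and aggregate, but only for the 20 datasets above
    SELECT ds.dataset_id,
           string_agg(DISTINCT sn.sample_names, '; ' ORDER BY sn.sample_names) AS sample_names
    FROM ds
    JOIN cfiles ON cfiles.dataset_id = ds.dataset_id
    CROSS JOIN jsonb_array_elements_text(cfiles.property_values -> 'Sample Names') AS sn(sample_names)
    GROUP BY ds.dataset_id
) agg
JOIN ds ON ds.dataset_id = agg.dataset_id
ORDER BY ds.dataset_name;
```

The key difference from the original cross-join query is that LIMIT 20 is applied before the unnesting instead of after it, so the aggregation touches ~1.7K rows instead of ~3.3M.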