PG插件pg_trgm实现模糊查询走索引
PG中的索引
参考:https://www.xmmup.com/pgzhongdesuoyin.html
索引类型:B-Tree、hash、Gin、Gist、SP-Gist、Brin
B-Tree:用来查找单个值或者扫描一个范围,比如大于、小于或者等于某个值。
B-Tree索引默认升序存储,从PostgreSQL8.3开始,空值也被存储在其中,但默认放在最后,使用如下SQL可以把这个顺序翻转过来:
Create index idx_111 on t( c1 desc NULLS first);
Hash:用于等值查询并且索引中无NULL值。
Gin:Gin存储了一个由键及其行倒排列表构成的列表,键的行倒排列表中的每一行都包含该键,一行也可能出现在多个键的倒排列表中。Gin也是一种实现全文搜索的方式,在PostgreSQL的contrib模块的pg_trgm扩展可以利用Gin索引来实现全文搜索。
GiST:通用搜索树(GiST)提供了一种用于存储数据的方式来构建平衡的树结构,只需要定义如何对待键即可,内建的B-tree也可以用它构建。这就允许使用该类索引来解决B-tree能处理的常规相对和范围比较之外的问题。例如,PostgreSQL中的几何数据类型包括的操作符允许索引根据项之间的距离以及项之间是否相交来排序。
SP-GiST:是Space-Partitioned GiST的缩写。SP-GiST支持划分搜索树,这种书可以用来开发范围广泛的非平衡数据结构,例如四叉树、k-d树和后缀树(字典树)。
Brin:代表块范围索引。块范围是指表中相邻页面的范围。对于每个块(block),在当前的实现当中,Brin索引存储了每个块的最小值和最大值。
创建索引:
建议使用create index concurrently 来构建索引,理由是不会对表的业务访问造成影响,这么做的代价就是create index concurrently的时长较长,另外,create index concurrently一旦创建失败,产生的索引将会被标记为INVALID,INVALID的索引对于查询不可用,但是对表的更改会更新索引中的值,这显然是一种浪费。参考:https://www.xmmup.com/pgbingfachuangjiansuoyinconcurrently.html
示例
脚本:
1 2 3 4 5 6 7 8 9 10 11 | CREATE TABLE t_hash AS SELECT id, md5(id::text) FROM generate_series(1, 50000) AS id; VACUUM ANALYZE; SELECT * FROM t_hash LIMIT 10; \timing create extension pg_trgm; CREATE INDEX idx_gin ON t_hash USING gin(md5 gin_trgm_ops); |
过程:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | [root@lhrpgcm1 /]# su - pg12 Last login: Fri Mar 4 11:23:17 CST 2022 on pts/5 [pg12@lhrpgcm1 ~]$ psql psql (12.9) Type "help" for help. postgres=# \dT List of data types Schema | Name | Description --------+------+------------- (0 rows) postgres=# CREATE TABLE t_hash AS postgres-# SELECT id, md5(id::text) postgres-# FROM generate_series(1, 50000) AS id; SELECT 50000 postgres=# VACUUM ANALYZE; VACUUM postgres=# SELECT * FROM t_hash LIMIT 10; id | md5 ----+---------------------------------- 1 | c4ca4238a0b923820dcc509a6f75849b 2 | c81e728d9d4c2f636f067f89cc14862c 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3 4 | a87ff679a2f3e71d9181a67b7542122c 5 | e4da3b7fbbce2345d7772b0674a318d5 6 | 1679091c5a880faf6fb5e6087eb1b2dc 7 | 8f14e45fceea167a5a36dedd4bea2543 8 | c9f0f895fb98ab9159f51fd0297e236d 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26 10 | d3d9446802a44259755d38e6d163e820 (10 rows) postgres=# \timing Timing is on. postgres=# \dx List of installed extensions Name | Version | Schema | Description ---------+---------+------------+------------------------------------ plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language repmgr | 5.1 | repmgr | Replication manager for PostgreSQL (2 rows) postgres=# create extension pg_trgm; CREATE EXTENSION Time: 45.121 ms postgres=# CREATE INDEX idx_gin ON t_hash USING gin(md5 gin_trgm_ops); CREATE INDEX Time: 1174.704 ms (00:01.175) postgres=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_hash (cost=88.04..106.46 rows=5 width=37) (actual time=0.301..0.302 rows=0 loops=1) Recheck Cond: (md5 ~~ '%e2345679a%'::text) -> Bitmap Index Scan on idx_gin (cost=0.00..88.04 rows=5 width=0) (actual time=0.294..0.295 rows=0 loops=1) Index Cond: (md5 ~~ '%e2345679a%'::text) Planning Time: 0.417 ms Execution Time: 0.355 ms (6 rows) Time: 1.630 ms postgres=# drop index idx_gin; DROP INDEX Time: 13.914 ms postgres=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%'; QUERY PLAN ------------------------------------------------------------------------------------------------------ Seq Scan on t_hash (cost=0.00..1042.00 rows=5 width=37) (actual time=18.913..18.914 rows=0 loops=1) Filter: (md5 ~~ '%e2345679a%'::text) Rows Removed by Filter: 50000 Planning Time: 0.221 ms Execution Time: 18.938 ms (5 rows) Time: 19.832 ms postgres=# CREATE INDEX idx_gin ON t_hash USING gin(md5 gin_trgm_ops); CREATE INDEX Time: 1258.731 ms (00:01.259) postgres=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e23%'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_hash (cost=19.91..464.89 rows=505 width=37) (actual time=0.164..0.745 rows=334 loops=1) Recheck Cond: (md5 ~~ '%e23%'::text) Heap Blocks: exact=229 -> Bitmap Index Scan on idx_gin (cost=0.00..19.79 rows=505 width=0) (actual time=0.107..0.107 rows=334 loops=1) Index Cond: (md5 ~~ '%e23%'::text) Planning Time: 0.327 ms Execution Time: 0.824 ms (7 rows) Time: 1.861 ms postgres=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e2%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on t_hash (cost=0.00..1042.00 rows=6061 width=37) (actual time=0.015..19.450 rows=5806 loops=1) Filter: (md5 ~~ '%e2%'::text) Rows Removed by Filter: 44194 Planning Time: 0.181 ms Execution Time: 19.890 ms (5 rows) Time: 20.624 ms |
注意
pg_trgm使用限制
- 数据库排序规则需要使用LC_CTYPE 'zh_CN.utf8';
- 查询字符不能少于3个
- 存储的内容不能大于8k