博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Greenplum merge insert 用法与性能 (insert on conflict)
阅读量:6626 次
发布时间:2019-06-25

本文共 10296 字,大约阅读时间需要 34 分钟。

标签

PostgreSQL , Greenplum , merge insert , insert on conflict , 合并插入 , 有则更新 , 无则插入


背景

PostgreSQL insert on conflict语法非常强大,支持合并写入(当违反某唯一约束时,冲突则更新,不冲突则写入),同时支持流式计算。

流计算例子链接:

PostgreSQL insert on conflict语法如下:

Command:     INSERT    Description: create new rows in a table    Syntax:    [ WITH [ RECURSIVE ] with_query [, ...] ]    INSERT INTO table_name [ AS alias ] [ ( column_name [, ...] ) ]        [ OVERRIDING { SYSTEM | USER} VALUE ]        { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }        [ ON CONFLICT [ conflict_target ] conflict_action ]        [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]        where conflict_target can be one of:            ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]        ON CONSTRAINT constraint_name        and conflict_action is one of:            DO NOTHING        DO UPDATE SET { column_name = { expression | DEFAULT } |                        ( column_name [, ...] ) = [ ROW ] ( { expression | DEFAULT } [, ...] ) |                        ( column_name [, ...] ) = ( sub-SELECT )                      } [, ...]                  [ WHERE condition ]

Greenplum的版本较低,还不支持insert on conflict的语法。

如果需要在Greenplum中实现类似的功能该如何操作?

DEMO

ID为PK,以它为合并列,举例。

1、目标表,也就是需要合并写入的目标:

create table t( id int primary key, c1 int , c2 int, c3 int, c4 int, c5 int, crt_time timestamp);

2、中间表,也就是用户只管插入的表:

create table t_tmp(like t);

写入一些中间记录。

insert into t_tmp values(1,1,2,3,null,null,now());    insert into t_tmp values(1,1,2,4,null,null,now());    insert into t_tmp values(1,1,2,3,null,7,now());    insert into t_tmp values(1,1,null,3,5,6,now());
postgres=# select * from t_tmp;     id | c1 | c2 | c3 | c4 | c5 |          crt_time              ----+----+----+----+----+----+----------------------------      1 |  1 |  2 |  3 |    |    | 2017-12-13 17:03:16.28482      1 |  1 |  2 |  4 |    |    | 2017-12-13 17:03:16.286302      1 |  1 |  2 |  3 |    |  7 | 2017-12-13 17:03:16.635121      1 |  1 |    |  3 |  5 |  6 | 2017-12-13 17:03:25.434191    (4 rows)

3、窗口合并,按唯一值约束,仅提取一条(可能存在窗口内合并的需求,例如按时间取最新,比如以最后一条为准,又或者以有值,且最新的为准)。

以有值切最新为准例子:

select distinct on (id)       id,      first_value(c1) over (partition by id order by (case when c1 is null then null else crt_time end) desc nulls last) as c1,      first_value(c2) over (partition by id order by (case when c2 is null then null else crt_time end) desc nulls last) as c2,      first_value(c3) over (partition by id order by (case when c3 is null then null else crt_time end) desc nulls last) as c3,      first_value(c4) over (partition by id order by (case when c4 is null then null else crt_time end) desc nulls last) as c4,      first_value(c5) over (partition by id order by (case when c5 is null then null else crt_time end) desc nulls last) as c5,      first_value(crt_time) over (partition by id order by crt_time desc) as crt_time      from t_tmp ;         id | c1 | c2 | c3 | c4 | c5 |          crt_time              ----+----+----+----+----+----+----------------------------      1 |  1 |  2 |  3 |  5 |  6 | 2017-12-13 17:03:25.434191    (1 row)

存储中间结果:

create table t_tmp1 (like t) ;        insert into t_tmp1     select distinct on (id)       id,      first_value(c1) over (partition by id order by (case when c1 is null then null else crt_time end) desc nulls last) as c1,      first_value(c2) over (partition by id order by (case when c2 is null then null else crt_time end) desc nulls last) as c2,      first_value(c3) over (partition by id order by (case when c3 is null then null else crt_time end) desc nulls last) as c3,      first_value(c4) over (partition by id order by (case when c4 is null then null else crt_time end) desc nulls last) as c4,      first_value(c5) over (partition by id order by (case when c5 is null then null else crt_time end) desc nulls last) as c5,      first_value(crt_time) over (partition by id order by crt_time desc) as crt_time      from t_tmp ;

4、合并写入:

将窗口提取的结果,合并写入目标表。

4.1、INNER JOIN,覆盖旧记录,同时补齐旧的字段(以NULL为判断条件。如果新的记录没有值,则取旧记录的值。)提取。

create table t_tmp2 (like t);        insert into t_tmp2    select     t_tmp.id,     coalesce(t_tmp.c1, t.c1),     coalesce(t_tmp.c2, t.c2),     coalesce(t_tmp.c3, t.c3),     coalesce(t_tmp.c4, t.c4),     coalesce(t_tmp.c5, t.c5),     coalesce(t_tmp.crt_time, t.crt_time)     from    t_tmp1 as t_tmp    inner join     t    using (id);

4.2、DELETE USING,删除全量表的符合条件的记录。

delete from t using t_tmp2 where t.id=t_tmp2.id;

4.3、INSERT

insert into t    select t_tmp1.* from t_tmp1 left join t_tmp2 using (id) where t_tmp2.* is null    union all    select * from t_tmp2;

Greenplum merge insert 性能

硬件:使用一台64线程机器,单机启动48个segment。

1、全量数据20亿。

create table t(id int, c1 int , c2 int, c3 int, c4 int, c5 int, crt_time timestamp) with (APPENDONLY=true, ORIENTATION=column);        insert into t select id, null,null,null,null,10000, now() from generate_series(1,2000000000) t(id);

2、增量数据1000万条,涉及500万个ID。

create table t_tmp(like t);        insert into t_tmp select random()*1000000, random()*100,null,null,null,null, clock_timestamp() from generate_series(1,2000000) t(id);    insert into t_tmp select random()*2000000, null,random()*100,null,null,null, clock_timestamp() from generate_series(1,2000000) t(id);    insert into t_tmp select random()*3000000, null,null,random()*100,null,null, clock_timestamp() from generate_series(1,2000000) t(id);    insert into t_tmp select random()*4000000, null,null,null,random()*100,null, clock_timestamp() from generate_series(1,2000000) t(id);    insert into t_tmp select random()*5000000, null,null,null,null,random()*100, clock_timestamp() from generate_series(1,2000000) t(id);        总耗时4.5秒。

3、合并。

增量数据,窗口合并去重。

create table t_tmp1 (like t) ;        insert into t_tmp1     select distinct on (id)       id,      first_value(c1) over (partition by id order by ((case when c1 is null then null else crt_time end) is null), (case when c1 is null then null else crt_time end) desc) as c1,      first_value(c2) over (partition by id order by ((case when c2 is null then null else crt_time end) is null), (case when c2 is null then null else crt_time end) desc) as c2,      first_value(c3) over (partition by id order by ((case when c3 is null then null else crt_time end) is null), (case when c3 is null then null else crt_time end) desc) as c3,      first_value(c4) over (partition by id order by ((case when c4 is null then null else crt_time end) is null), (case when c4 is null then null else crt_time end) desc) as c4,      first_value(c5) over (partition by id order by ((case when c5 is null then null else crt_time end) is null), (case when c5 is null then null else crt_time end) desc) as c5,      first_value(crt_time) over (partition by id order by crt_time desc) as crt_time      from t_tmp ;        INSERT 0 3628283    Time: 5208.968 ms

使用增量数据,提取并合并旧数据。

create table t_tmp2 (like t);        insert into t_tmp2    select     t_tmp.id,     coalesce(t_tmp.c1, t.c1),     coalesce(t_tmp.c2, t.c2),     coalesce(t_tmp.c3, t.c3),     coalesce(t_tmp.c4, t.c4),     coalesce(t_tmp.c5, t.c5),     coalesce(t_tmp.crt_time, t.crt_time)     from    t_tmp1 as t_tmp    inner join     t    using (id);        INSERT 0 3628282    Time: 9504.092 ms

删除旧数据。

delete from t using t_tmp2 where t.id=t_tmp2.id;        DELETE 3628282    Time: 15356.920 ms

插入新增、以及合并的增量数据。

insert into t    select t_tmp1.* from t_tmp1 left join t_tmp2 using (id) where t_tmp2.* is null    union all    select * from t_tmp2;        INSERT 0 3628283    Time: 778.014 ms

数据校验

-- 中间结果        postgres=# select * from t_tmp where id=9;     id | c1 | c2 | c3 | c4 | c5 |          crt_time              ----+----+----+----+----+----+----------------------------      9 | 99 |    |    |    |    | 2017-12-13 23:18:14.65243      9 |  7 |    |    |    |    | 2017-12-13 23:18:14.817107      9 |    |  9 |    |    |    | 2017-12-13 23:18:15.292311      9 |    | 56 |    |    |    | 2017-12-13 23:18:15.449415    (4 rows)        -- 中间结果        postgres=# select * from t_tmp where id=446;     id  | c1 | c2 | c3 | c4 | c5 |          crt_time              -----+----+----+----+----+----+----------------------------     446 | 43 |    |    |    |    | 2017-12-13 23:18:14.291335     446 | 16 |    |    |    |    | 2017-12-13 23:18:14.715026     446 | 45 |    |    |    |    | 2017-12-13 23:18:15.048879     446 |    | 34 |    |    |    | 2017-12-13 23:18:15.646904     446 |    |  7 |    |    |    | 2017-12-13 23:18:15.81838     446 |    |    | 12 |    |    | 2017-12-13 23:18:16.220083     446 |    |    | 22 |    |    | 2017-12-13 23:18:16.26496     446 |    |    |    | 97 |    | 2017-12-13 23:18:17.464355     446 |    |    |    |    | 56 | 2017-12-13 23:18:18.427068    (9 rows)        -- 使用窗口合并后结果        postgres=# select * from t_tmp1 limit 10;     id  | c1 | c2 | c3 | c4 | c5 |          crt_time              -----+----+----+----+----+----+----------------------------       9 |  7 | 56 |    |    |    | 2017-12-13 23:18:15.449415  -- 验证      25 | 69 |    |  1 |    |    | 2017-12-13 23:18:16.161339     169 | 74 | 33 |  3 |    |    | 2017-12-13 23:18:16.71554     185 | 22 |    |    |    |    | 2017-12-13 23:18:14.93206     217 | 11 | 20 | 26 |    | 59 | 2017-12-13 23:18:17.911174     270 | 55 |    | 42 |    |    | 2017-12-13 23:18:16.494782     286 | 65 | 77 | 17 |    | 75 | 2017-12-13 23:18:17.895121     430 | 12 |    | 56 |    |    | 2017-12-13 23:18:16.744847     446 | 45 |  7 | 22 | 97 | 56 | 2017-12-13 23:18:18.427068  -- 验证     478 | 23 | 56 |    | 25 | 77 | 2017-12-13 23:18:18.293153    (10 rows)        -- 合并到全量表后结果        postgres=# select * from t where id=9;     id | c1 | c2 | c3 | c4 |  c5   |          crt_time              ----+----+----+----+----+-------+----------------------------      9 |  7 | 56 |    |    | 10000 | 2017-12-13 23:18:15.449415    (1 row)        postgres=# select * from t where id=446;     id  | c1 | c2 | c3 | c4 | c5 |          crt_time              -----+----+----+----+----+----+----------------------------     446 | 45 |  7 | 22 | 97 | 56 | 2017-12-13 23:18:18.427068    (1 row)

4、合并总耗时:

35秒

5、耗时分布

增量数据1000万条,涉及500万个ID。

4.5秒

增量数据,窗口合并去重。

5.2秒

使用增量数据,提取并合并旧数据。

9.5秒

删除旧数据。

15秒

插入新增、以及合并的增量数据。

0.7秒

Greenplum merge insert 限制

比较复杂、而且不支持新值使用NULL值(要支持的话,得修改一下覆盖逻辑)。

参考

转载地址:http://ybxpo.baihongyu.com/

你可能感兴趣的文章
企业Windows域环境中的组策略应用案例二
查看>>
ASP.NET Aries 入门开发教程6:列表数据表格的格式化处理及行内编辑
查看>>
51CTO首届卡拉OK大赛:我唱,为欢聚而歌
查看>>
LVM逻辑卷管理详解
查看>>
如何实现 Service 伸缩?- 每天5分钟玩转 Docker 容器技术(97)
查看>>
java socket编程
查看>>
MySQL基础建设之硬盘篇
查看>>
Java格式化时间
查看>>
安装Hyper-V
查看>>
配置XenDesktop一例报错-序列不包含任何元素
查看>>
数组循环移位
查看>>
一个优秀的公众号运营者需要具备哪些能力?
查看>>
桌面云
查看>>
教大家如何在word 2007中同时打出对齐上下标以及字母头上有波浪线(非编辑器)...
查看>>
Spring Boot五:使用properties配置文件实现多环境配置
查看>>
vim取消高亮显示
查看>>
设计从“心“开始
查看>>
windows7 系统盘 瘦身软件介绍: 冗余文件清理工具
查看>>
SSH整合步骤
查看>>
myeclipse tomcat内存溢出解决方法
查看>>