ClickHouse Data Replication: Principles and Practice

云的事随心讲 2024-04-13 16:41:15

Author: 俊达

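Overview

In ClickHouse, replicated tables store multiple copies of the data, which improves data availability and query performance. The table engines that support replication are: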

ReplicatedMergeTree
ReplicatedSummingMergeTree
ReplicatedReplacingMergeTree
ReplicatedAggregatingMergeTree
ReplicatedCollapsingMergeTree
ReplicatedVersionedCollapsingMergeTree
ReplicatedGraphiteMergeTree

Replicated tables require ClickHouse to be configured with ZooKeeper. The ZooKeeper nodes are declared in the configuration file, for example:

<zookeeper>
    <node>
        <host>example1</host>
        <port>2181</port>
    </node>
    <node>
        <host>example2</host>
        <port>2181</port>
    </node>
    <node>
        <host>example3</host>
        <port>2181</port>
    </node>
</zookeeper>
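To double-check such a fragment before deploying it, the node list can be extracted with a few lines of Python. This is only a sketch: `zookeeper_endpoints` is a hypothetical helper written for this article, not a ClickHouse tool, and it assumes every `<node>` carries both a `<host>` and a `<port>`.

```python
import xml.etree.ElementTree as ET

# The same fragment as above; example1..example3 are placeholder host names.
ZOOKEEPER_CONFIG = """
<zookeeper>
    <node><host>example1</host><port>2181</port></node>
    <node><host>example2</host><port>2181</port></node>
    <node><host>example3</host><port>2181</port></node>
</zookeeper>
"""


def zookeeper_endpoints(xml_text):
    """Return a list of (host, port) pairs, one per <node> element."""
    root = ET.fromstring(xml_text)
    if root.tag != "zookeeper":
        raise ValueError("expected a <zookeeper> root element")
    endpoints = []
    for node in root.findall("node"):
        host = node.findtext("host")
        port = node.findtext("port")
        # Assumption of this sketch: both fields must be present.
        if not host or not port:
            raise ValueError("each <node> needs a <host> and a <port>")
        endpoints.append((host.strip(), int(port)))
    return endpoints


if __name__ == "__main__":
    for host, port in zookeeper_endpoints(ZOOKEEPER_CONFIG):
        print(f"{host}:{port}")
```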

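Replication in ClickHouse works at the table level, so different tables can use different replication strategies.

Note that ClickHouse does not replicate CREATE, DROP, ATTACH, DETACH, or RENAME operations.

Adding a column with alter table, on the other hand, is replicated.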

Creating a replicated table

Syntax:

CREATE TABLE table_name ( ... )
ENGINE = ReplicatedMergeTree('path_in_zookeeper', 'replica_name')
...

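Creating a replicated table requires two key parameters:

path_in_zookeeper: the path in ZooKeeper. All replicas of the same table must use the same value.
replica_name: each replica must use a different replica_name.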

In practice, tables are usually created with macros such as {shard} and {replica}:

create table rep_table(id int, val String)
engine = ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}')
order by id;

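In the example above, {shard_id} and {replica} are macros defined in the macros section of the configuration. The macro definitions of the current instance can be viewed in the system table system.macros.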

## Node ck01
ck01 :) select * from system.macros;

SELECT *
FROM system.macros

Query id: a85a2f99-e2dd-4ba4-9b5e-519e7b5c9f40

┌─macro────┬─substitution───┐
│ cluster  │ cluster-zero   │
│ replica  │ 172.16.121.248 │
│ shard_id │ 01             │
└──────────┴────────────────┘

4 rows in set. Elapsed: 0.001 sec.

## Node ck02
ck02 :) select * from system.macros;

SELECT *
FROM system.macros

Query id: a11a1a07-0757-414a-954a-dd716d0cda3d

┌─macro────┬─substitution──┐
│ cluster  │ cluster-zero  │
│ replica  │ 172.16.121.48 │
│ shard_id │ 01            │
└──────────┴───────────────┘

4 rows in set. Elapsed: 0.002 sec.
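To see why the same CREATE TABLE statement yields one shared table path but a distinct replica on each node, the substitution can be imitated in Python. This is a simplified model of macro expansion, not ClickHouse's own code; `expand_macros` and `replica_path` are hypothetical helpers, and the macro values are the ones shown in the system.macros output above.

```python
import re


def expand_macros(template, macros):
    """Replace every {name} in template with macros[name]; unknown names fail."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"no macro named {name!r}")
        return macros[name]
    return re.sub(r"\{(\w+)\}", substitute, template)


# Macro values taken from the system.macros output of each node.
CK01 = {"cluster": "cluster-zero", "replica": "172.16.121.248", "shard_id": "01"}
CK02 = {"cluster": "cluster-zero", "replica": "172.16.121.48", "shard_id": "01"}

TABLE_PATH = "/clickhouse/tables/{shard_id}/rep/rep_table"
REPLICA_NAME = "{replica}"


def replica_path(macros):
    """Path of this node's replica below the shared table path."""
    table = expand_macros(TABLE_PATH, macros)
    return f"{table}/replicas/{expand_macros(REPLICA_NAME, macros)}"


if __name__ == "__main__":
    print(expand_macros(TABLE_PATH, CK01))  # shared table path
    print(replica_path(CK01))               # replica node of ck01
    print(replica_path(CK02))               # replica node of ck02
```

Both nodes expand the table path to the same string, while the replica paths differ. Changing shard_id on one node changes the expanded table path for that node only.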

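A few points to note:

1. If macros such as replica or shard_id change after a table has been created, the tables that depend on them may stop working correctly.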

## On ck01, change the shard_id macro to 02, restart clickhouse, then insert data
ck01 :) insert into rep_table values(2, 'two', 'xx');

INSERT INTO rep_table FORMAT Values

Query id: 882aa305-a7a7-48d4-b27e-d9564c046693

0 rows in set. Elapsed: 0.004 sec.

Received exception from server (version 22.6.3):
Code: 242. DB::Exception: Received from localhost:9000. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/02/rep/rep_table/replicas/172.16.121.248). (TABLE_IS_READ_ONLY)

Because the path /clickhouse/tables/02/rep/rep_table does not exist in ZooKeeper, the table is read-only and the insert fails. The clickhouse startup log contains the corresponding warning:

2022.12.20 03:42:42.505570 [ 177009 ] {} <Warning> rep.rep_table (01b4ad06-4d45-4451-808b-8403e8b1b6c8): No metadata in ZooKeeper for /clickhouse/tables/02/rep/rep_table: table will be in readonly mode.

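2. All replicas of a replicated table must have the same table structure. If the structure given at creation time differs from the one already registered, the table cannot be created: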

ck01 :) create table rep_table(id int, val String, id2 int) engine ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}') order by id;

CREATE TABLE rep_table
(
    `id` int,
    `val` String,
    `id2` int
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}')
ORDER BY id

Query id: 8aa10add-590a-4129-9420-a625edd9d5e1

0 rows in set. Elapsed: 0.240 sec.

Received exception from server (version 22.6.3):
Code: 122. DB::Exception: Received from localhost:9000. DB::Exception: Table columns structure in ZooKeeper is different from local table structure. Local columns:
columns format version: 1
3 columns:
`id` Int32
`val` String
`id2` Int32

Zookeeper columns:
columns format version: 1
2 columns:
`id` Int32
`val` String
. (INCOMPATIBLE_COLUMNS)

3. Every replica must have a unique replica name. If the path for the replica already exists in ZooKeeper when the table is created, creation fails:

ck02 :) create table rep_table2(id int, val String) engine ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}') order by id ;

CREATE TABLE rep_table2
(
    `id` int,
    `val` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}')
ORDER BY id

Query id: 6231b6f8-3421-4c5f-90b4-d13e1cdb6463

0 rows in set. Elapsed: 0.750 sec.

Received exception from server (version 22.6.3):
Code: 253. DB::Exception: Received from localhost:9000. DB::Exception: Replica /clickhouse/tables/01/rep/rep_table/replicas/172.16.121.48 already exists. (REPLICA_IS_ALREADY_EXIST)

What a replicated table stores in ZooKeeper

ZooKeeper plays a key role in ClickHouse data replication.

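As the original figure (not reproduced here) illustrates, operations on a replicated table are recorded as log entries in ZooKeeper, and each replica replicates the data by reading the information recorded there.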
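The idea of a shared log that every replica consumes at its own pace can be sketched as a small in-memory model. This is an illustration under strong simplifying assumptions, not ClickHouse's implementation: `SharedLog` and `Replica` are hypothetical classes, a plain Python list stands in for the log node in ZooKeeper, and only inserts are modeled. The names log_pointer and queue mirror the per-replica nodes described later in the article.

```python
class SharedLog:
    """Stands in for the log node in ZooKeeper: an append-only list of entries."""

    def __init__(self):
        self.entries = []

    def append(self, source_replica, action, part):
        self.entries.append({"source": source_replica, "action": action, "part": part})


class Replica:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.log_pointer = 0  # index of the next log entry this replica will read
        self.queue = []       # entries pulled from the log but not yet executed
        self.parts = set()    # data parts stored locally

    def insert(self, part):
        """Write a part locally and publish a 'get' entry to the shared log."""
        self.parts.add(part)
        self.log.append(self.name, "get", part)

    def pull_log(self):
        """Copy unseen log entries into the local queue and advance the pointer."""
        self.queue.extend(self.log.entries[self.log_pointer:])
        self.log_pointer = len(self.log.entries)

    def process_queue(self, replicas):
        """Execute queued entries: fetch each missing part from its source replica."""
        while self.queue:
            entry = self.queue.pop(0)
            if entry["part"] in self.parts:
                continue  # already present, e.g. this replica wrote it itself
            source = replicas[entry["source"]]
            if entry["part"] not in source.parts:
                raise LookupError(f"{source.name} does not hold {entry['part']}")
            self.parts.add(entry["part"])


if __name__ == "__main__":
    log = SharedLog()
    replicas = {name: Replica(name, log) for name in ("ck01", "ck02")}
    replicas["ck01"].insert("all_9_9_0")
    replicas["ck02"].pull_log()
    replicas["ck02"].process_queue(replicas)
    print(sorted(replicas["ck02"].parts))
```

The writer never contacts the other replica directly. It only appends to the log, and the other replica catches up whenever it next reads it.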

The following operations all record a log entry in ZooKeeper:

insert
merge (optimize table)
alter table attach/detach partition/part
alter table update/delete
alter table add column

ZooKeeper node contents of a replicated table

The information stored under a replicated table's ZooKeeper node can be inspected either with a ZooKeeper client or through the system table system.zookeeper:

ck01 :) select name from system.zookeeper where path='/clickhouse/tables/01/rep/rep_table';

SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/rep/rep_table'

Query id: 59cb84f5-87ad-4ec4-8fd6-974dbce358a1

┌─name───────────────────────┐
│ alter_partition_version    │
│ metadata                   │
│ temp                       │
│ table_shared_id            │
│ log                        │
│ leader_election            │
│ columns                    │
│ blocks                     │
│ nonincrement_block_numbers │
│ replicas                   │
│ quorum                     │
│ pinned_part_uuids          │
│ block_numbers              │
│ mutations                  │
│ zero_copy_s3               │
│ zero_copy_hdfs             │
│ part_moves_shard           │
└────────────────────────────┘

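metadata: the table structure.

log: the key information for data replication. Each entry under the log node corresponds to one action performed on the table.

replicas: each replica has its own node under replicas.

mutations: the mutation operations on the table (such as alter table update/delete).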

Contents of the log node

ck02 :) select name, value from system.zookeeper where path='/clickhouse/tables/01/rep/rep_table/log' order by name\G

SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/rep/rep_table/log'
ORDER BY name ASC

Query id: 82f00616-9c18-4bc6-a107-189ac4a67aa1

-- log entry for a mutation
Row 2:
──────
name:  log-0000000008
value: format version: 4
create_time: 2022-12-20 06:06:11
source replica: 172.16.121.248
block_id:
mutate
all_1_1_0_2
to
all_1_1_0_3

-- log entry for a merge
Row 3:
──────
name:  log-0000000009
value: format version: 4
create_time: 2022-12-20 06:06:20
source replica: 172.16.121.248
block_id:
merge
all_0_0_0_3
all_1_1_0_3
into
all_0_1_1_3
deduplicate: 0
part_type: Compact

-- log entry for alter table
Row 8:
──────
name:  log-0000000014
value: format version: 4
create_time: 2022-12-20 06:08:14
source replica: 172.16.121.248
block_id:
alter
alter_version
6
have_mutation
1
columns_str_size:
61
columns format version: 1
2 columns:
`id` Int32
`val` String
metadata_str_size:
192
metadata format version: 1
date column:
sampling expression:
index granularity: 8192
mode: 0
sign column:
primary key: id
data format version: 1
partition key:
granularity bytes: 10485760

-- log entry for alter table attach partition
Row 10:
───────
name:  log-0000000016
value: format version: 4
create_time: 2022-12-20 06:11:59
source replica: 172.16.121.248
block_id:
REPLACE_RANGE
drop_range_name: all_0_0_0
from_database: rep
from_table: tmp_rep
source_parts: ['all_1_1_0']
new_parts: ['all_8_8_0']
part_checksums: ['5381E04F17BD6299E7C1F56B445FB8DB']
columns_version: -1

-- log entry for an insert
Row 11:
───────
name:  log-0000000017
value: format version: 4
create_time: 2022-12-20 06:17:38
source replica: 172.16.121.248
block_id: all_1659522035524593032_2034088950575960742
get
all_9_9_0
part_type: Compact

Contents of the mutations node

ck02 :) select name, value from system.zookeeper where path='/clickhouse/tables/01/rep/rep_table/mutations'\G

SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/rep/rep_table/mutations'

Query id: 2cea07ee-0b0b-4a6d-85f0-f73fcb129f06

Row 1:
──────
name:  0000000001
value: format version: 1
create time: 2022-12-20 06:06:11
source replica: 172.16.121.248
block numbers count: 1
all 3
commands: DELETE WHERE id = 1
alter version: -1

Row 2:
──────
name:  0000000000
value: format version: 1
create time: 2022-12-20 06:05:40
source replica: 172.16.121.248
block numbers count: 1
all 2
commands: UPDATE val = \'updated\' WHERE 1
alter version: -1

Row 3:
──────
name:  0000000003
value: format version: 1
create time: 2022-12-20 06:08:01
source replica: 172.16.121.248
block numbers count: 1
all 5
commands: DROP COLUMN padding2
alter version: 5

The replicas node

Each replica creates its own node under the replicas path.

ck02 :) select name from system.zookeeper where path='/clickhouse/tables/01/rep/rep_table/replicas/172.16.121.48' ;

SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/rep/rep_table/replicas/172.16.121.48'

Query id: 9f4dd647-39fa-42ed-a8a2-beb642673018

┌─name────────────────────────┐
│ is_lost                     │
│ metadata                    │
│ is_active                   │
│ mutation_pointer            │
│ columns                     │
│ max_processed_insert_time   │
│ flags                       │
│ log_pointer                 │
│ min_unprocessed_insert_time │
│ host                        │
│ parts                       │
│ queue                       │
│ metadata_version            │
└─────────────────────────────┘

Key information under a replica's node:

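log_pointer: the log position this replica has processed up to
queue: the queue of tasks this replica still has to process
metadata_version: the metadata version

Adding a new replica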

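When a new replica is added to an existing table, the new replica picks one of the existing replicas and performs a full data synchronization from it.

The clickhouse debug log shows the rough flow of this process: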

executeQuery: (from [::ffff:127.0.0.1]:42530) CREATE TABLE rep.rep_table ( `id` Int32, `val` String ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard_id}/rep/rep_table', '{replica}') ORDER BY id SETTINGS index_granularity = 8192; (stage: Complete)
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Loading data parts
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): There are no data parts
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): This table /clickhouse/tables/01/rep/rep_table is already created, will add new replica
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Creating replica /clickhouse/tables/01/rep/rep_table/replicas/ck03
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Became leader
rep.rep_table (ReplicatedMergeTreeRestartingThread): Activating replica.
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Replica 172.16.121.48 has log pointer '18', approximate 0 queue lag and 0 queue size
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Replica 172.16.121.248 has log pointer '18', approximate 0 queue lag and 0 queue size
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Will mimic 172.16.121.48
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Queued 3 parts to be fetched, 0 parts ignored
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Fetching part all_0_1_1_6 from /clickhouse/tables/01/rep/rep_table/replicas/172.16.121.248
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Fetching part all_9_9_0 from /clickhouse/tables/01/rep/rep_table/replicas/172.16.121.248
rep.rep_table (b2fe260c-ca4f-4e6d-903a-6a4f0358777e): Fetching part all_8_8_0 from /clickhouse/tables/01/rep/rep_table/replicas/172.16.121.248

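The log corresponds to the following steps:

1. Add the new replica's node in ZooKeeper.

2. Choose the source replica to copy from, based on the log pointer and queue information of the other replicas.

3. Add the source replica's parts to the new replica's own queue.

4. Download each part to the new replica and attach it to the table.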
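These four steps can be walked through in a toy Python model. Everything here is hypothetical and simplified: `choose_source` and `add_replica` are names invented for this sketch, plain dictionaries stand in for ZooKeeper and for local storage, and the selection rule (highest log pointer, then shortest queue) is an assumption, since the log only shows that the log pointer and queue information are consulted. Copying the source's log pointer to the new replica is likewise a modeling assumption.

```python
def choose_source(candidates, replicas):
    """Pick the candidate with the highest log pointer, then the shortest queue.

    Assumed rule for this sketch only; ties go to the first candidate listed.
    """
    if not candidates:
        raise ValueError("no existing replica to copy from")
    return max(
        candidates,
        key=lambda name: (replicas[name]["log_pointer"], -len(replicas[name]["queue"])),
    )


def add_replica(new_name, replicas, parts):
    """Add new_name as a replica and fully synchronize it from one existing replica.

    replicas: dict of replica name -> {"log_pointer": int, "queue": list}
    parts:    dict of replica name -> set of part names held by that replica
    Returns the name of the replica that was used as the source.
    """
    if new_name in replicas:
        raise ValueError(f"replica {new_name} already exists")

    # Step 1: register the new replica.
    replicas[new_name] = {"log_pointer": 0, "queue": []}
    parts[new_name] = set()

    # Step 2: choose the source among the replicas that existed before.
    candidates = [name for name in replicas if name != new_name]
    source = choose_source(candidates, replicas)

    # Step 3: start from the source's log position and queue all of its parts.
    replicas[new_name]["log_pointer"] = replicas[source]["log_pointer"]
    replicas[new_name]["queue"] = sorted(parts[source])

    # Step 4: fetch each queued part and attach it locally.
    while replicas[new_name]["queue"]:
        part = replicas[new_name]["queue"].pop(0)
        parts[new_name].add(part)
    return source


if __name__ == "__main__":
    # Hypothetical cluster state; the part names are borrowed from the log above.
    replicas = {
        "replica_a": {"log_pointer": 18, "queue": []},
        "replica_b": {"log_pointer": 17, "queue": ["pending"]},
    }
    parts = {
        "replica_a": {"all_0_1_1_6", "all_8_8_0", "all_9_9_0"},
        "replica_b": {"all_0_1_1_6", "all_8_8_0"},
    }
    print(add_replica("ck03", replicas, parts))
    print(sorted(parts["ck03"]))
```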
