Comparison and summary of applicable scenarios for ClickHouse and MySQL databases
2022-04-23 13:24:00 【JavaGaga】
At the moment, there is plenty of material online claiming that ClickHouse's query response is tens to hundreds of times faster than MySQL's. In practice, ClickHouse and MySQL have different application scenarios and limitations. I recently ran into some pitfalls while applying ClickHouse to queries on a table with a large amount of data, so I am summarizing them here as a memo for future data storage and processing, and as a reference for anyone who wants to replace MySQL with ClickHouse for storing some of their data.
Use the VersionedCollapsingMergeTree engine for data that will be modified, but not frequently
ClickHouse itself is not suited to frequent modification and deletion of data; deletes and updates consume a lot of performance, especially frequent single-row modifications. That is why you often see advice to write data in batches as much as possible, whether a batch is 1,000 rows, 10,000 rows, or some other size, in order to improve ClickHouse's performance. A minimal illustration follows.
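As a hedged illustration of what batched writes look like in practice (the table name and values below are made up, not from the original article):

```sql
-- One INSERT carrying many rows (in practice 1,000-10,000 or more) creates a
-- single data part, whereas one INSERT per row creates one part per row and
-- leaves far more background merge work for ClickHouse.
INSERT INTO events (event_id, user_id, event_time) VALUES
    (1, 100, '2022-04-01 10:00:00'),
    (2, 101, '2022-04-01 10:00:01'),
    (3, 102, '2022-04-01 10:00:02');
```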
In addition, writes, updates, and deletes in ClickHouse are asynchronous. If data that has just been written, updated, or deleted needs to be queried immediately, ClickHouse is a poor fit for storage. ClickHouse also does not support transactions, so it should not be used in scenarios with strict data consistency requirements.
In some cases, the data usually does not need to be modified, but certain scenarios require modifying it once or a few times. If you still want to move to ClickHouse for fast responses, you can use the VersionedCollapsingMergeTree engine for data storage. I will only briefly introduce this storage engine here; for details, refer to the ClickHouse documentation.
The general principle of this storage engine is that each row carries a sign column and a version column: when a row is written, sign takes the value 1 or -1, and version stores the row's version number. The rules for the two scenarios are as follows (a minimal sketch follows the list):

- For a deletion, re-insert the same row with version unchanged and sign set to -1. The table then temporarily contains two rows that are duplicates except for sign. ClickHouse merges and collapses parts periodically, matching such pairs and discarding rows whose sign values add up to 0. Note that this merge and cleanup is scheduled, not immediate, so at query time you still need to group by the key and then use having sum(sign) > 0 to manually exclude deleted rows.
- For a modification, perform the same operation as above and also insert a new row with sign set to 1 and version incremented by 1 from the previous version. When querying, you likewise need to manually exclude rows belonging to older versions.
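Below is a minimal sketch of this pattern; the table name, columns, and values are hypothetical and only illustrate the sign/version mechanics described above.

```sql
-- Hypothetical table using VersionedCollapsingMergeTree
-- (table and column names are illustrative, not from the original article)
CREATE TABLE user_profile
(
    user_id  UInt64,
    nickname String,
    sign     Int8,    -- 1 = state row, -1 = cancels a previously written row
    version  UInt32   -- incremented on every modification
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
ORDER BY user_id;

-- Initial writes
INSERT INTO user_profile VALUES (1, 'alice', 1, 1), (2, 'bob', 1, 1);

-- "Delete" user 1: re-insert the same row with sign = -1 and version unchanged
INSERT INTO user_profile VALUES (1, 'alice', -1, 1);

-- "Modify" user 2: cancel the old version, then insert the new one with version + 1
INSERT INTO user_profile VALUES (2, 'bob', -1, 1), (2, 'bob_new', 1, 2);

-- Collapsing only happens during background merges, so cancelled rows must be
-- excluded manually at query time
SELECT user_id, argMax(nickname, version) AS nickname
FROM user_profile
GROUP BY user_id
HAVING sum(sign) > 0;
```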
Sparse indexes are not suitable for precise queries
Before explaining why the sparse index is a poor fit for precise queries, let me first describe what I mean by a precise query:
- Querying a single row by an indexed condition such as id or user_id;
- Fetching a paginated list of data by an indexed condition such as id or user_id.
ClickHouse uses a sparse index, which is different from MySQL's B+ tree index. (I will not introduce sparse indexes and B+ tree indexes here; explaining both clearly would take a while, so if you are not familiar with them, please look them up on your own.) As a result, when you run an exact-match query, ClickHouse has to scan a large amount of data, and the actual response time falls short of the ideal.
Because of its sparse index, ClickHouse is especially well suited to group by queries: after one or more group by steps, a huge number of rows collapses into a small result set. So the most suitable scenario is statistical queries, where, on large data volumes, the response can be dozens of times faster than MySQL, and even hundreds of times faster. A sketch of the two query shapes follows.
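To make the two query shapes concrete, here is a hedged sketch; the orders table and its columns are hypothetical and only illustrate the point.

```sql
-- Point query: the sparse primary index only locates a granule
-- (8192 rows by default), so ClickHouse still reads and filters the whole
-- granule to return one row, which keeps single-row lookups relatively slow.
SELECT *
FROM orders
WHERE user_id = 123456
LIMIT 1;

-- Statistical query: only the columns involved are read, and the aggregation
-- collapses a huge number of rows into a small result set, which is the
-- scenario where ClickHouse is observed to be much faster than MySQL.
SELECT user_id,
       count()     AS order_cnt,
       sum(amount) AS total_amount
FROM orders
GROUP BY user_id
HAVING total_amount > 1000;
```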
A columnar database is not suitable for querying a large number of columns at once
The other consideration is ClickHouse's columnar storage. Building on the point above, in scenarios that require precise queries and need to return a large number of columns at once, the response time is not ideal. Even with several group by conditions, because many columns have to be scanned, ClickHouse is usually not faster than a properly indexed MySQL table. An illustration follows.
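As a hedged illustration of the columnar trade-off (using the same hypothetical orders table as above): every column referenced is a separate column file to read and decompress, so the wider the SELECT, the more I/O per returned row.

```sql
-- Wide query: nine separate column files must be read for every matching row,
-- so even a well-targeted filter gets slower as the column list grows.
SELECT order_id, user_id, status, amount, currency,
       created_at, updated_at, shipping_address, remark
FROM orders
WHERE user_id = 123456;

-- Narrow query over the same rows touches only two column files.
SELECT order_id, amount
FROM orders
WHERE user_id = 123456;
```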
ClickHouse's query performance is more stable than MySQL's
I will not go into the specific tests here, only the scenario. With both tables correctly indexed and about 2 billion rows of data, MySQL's paginated list queries were noticeably slow for the first few pages, roughly 500 ms to 800 ms, but subsequent page queries settled at roughly 50 ms to 80 ms, presumably because MySQL had warmed its caches. ClickHouse, by contrast, stayed stable at roughly 230 ms to 300 ms.
Summary
Based on current tests and observations: if you need statistical queries and the data is not modified frequently, use ClickHouse to store the data and serve the queries. If you need frequent modifications or large list queries, the better solution is to query with MySQL and split the data across multiple tables; the resulting response performance will be much better than ClickHouse's.
Copyright notice
This article was created by [JavaGaga]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/html/tUMyfu.html