Note: this test uses the dataset from the ClickHouse official website. The official examples contain many filter operations, so I wrote some aggregation queries without filters. This is a limit (stress) test; in real business workloads, aggregation queries usually involve many filters.

One. Table creation and data import

Create the table. The table must be created on every node, because DDL statements are not synchronized across the cluster.

    CREATE DATABASE test;
    USE test;

    CREATE TABLE ontime (
        Year UInt16, Quarter UInt8, Month UInt8, DayofMonth UInt8, DayOfWeek UInt8,
        FlightDate Date, UniqueCarrier FixedString(7), AirlineID Int32,
        Carrier FixedString(2), TailNum String, FlightNum String,
        OriginAirportID Int32, OriginAirportSeqID Int32, OriginCityMarketID Int32,
        Origin FixedString(5), OriginCityName String, OriginState FixedString(2),
        OriginStateFips String, OriginStateName String, OriginWac Int32,
        DestAirportID Int32, DestAirportSeqID Int32, DestCityMarketID Int32,
        Dest FixedString(5), DestCityName String, DestState FixedString(2),
        DestStateFips String, DestStateName String, DestWac Int32,
        CRSDepTime Int32, DepTime Int32, DepDelay Int32, DepDelayMinutes Int32,
        DepDel15 Int32, DepartureDelayGroups String, DepTimeBlk String,
        TaxiOut Int32, WheelsOff Int32, WheelsOn Int32, TaxiIn Int32,
        CRSArrTime Int32, ArrTime Int32, ArrDelay Int32, ArrDelayMinutes Int32,
        ArrDel15 Int32, ArrivalDelayGroups Int32, ArrTimeBlk String,
        Cancelled UInt8, CancellationCode FixedString(1), Diverted UInt8,
        CRSElapsedTime Int32, ActualElapsedTime Int32, AirTime Int32,
        Flights Int32, Distance Int32, DistanceGroup UInt8,
        CarrierDelay Int32, WeatherDelay Int32, NASDelay Int32,
        SecurityDelay Int32, LateAircraftDelay Int32,
        FirstDepTime String, TotalAddGTime String, LongestAddGTime String,
        DivAirportLandings String, DivReachedDest String,
        DivActualElapsedTime String, DivArrDelay String, DivDistance String,
        Div1Airport String, Div1AirportID Int32, Div1AirportSeqID Int32,
        Div1WheelsOn String, Div1TotalGTime String, Div1LongestGTime String,
        Div1WheelsOff String, Div1TailNum String,
        Div2Airport String, Div2AirportID Int32, Div2AirportSeqID Int32,
        Div2WheelsOn String, Div2TotalGTime String, Div2LongestGTime String,
        Div2WheelsOff String, Div2TailNum String,
        Div3Airport String, Div3AirportID Int32, Div3AirportSeqID Int32,
        Div3WheelsOn String, Div3TotalGTime String, Div3LongestGTime String,
        Div3WheelsOff String, Div3TailNum String,
        Div4Airport String, Div4AirportID Int32, Div4AirportSeqID Int32,
        Div4WheelsOn String, Div4TotalGTime String, Div4LongestGTime String,
        Div4WheelsOff String, Div4TailNum String,
        Div5Airport String, Div5AirportID Int32, Div5AirportSeqID Int32,
        Div5WheelsOn String, Div5TotalGTime String, Div5LongestGTime String,
        Div5WheelsOff String, Div5TailNum String
    ) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

    CREATE TABLE ontimetest AS ontime
    ENGINE = Distributed(perftest_3shards_1replicas, test, ontime, rand());
Download the data:

    for s in `seq 1987 2017`; do
      for m in `seq 1 12`; do
        wget ${s}_${m}.zip    # URL prefix before ${s} is missing in the original
      done
    done

Import the data:

    for i in *.zip; do
      echo $i
      unzip -cq $i '*.csv' | sed 's/\.00//g' | \
        clickhouse-client --host --query="INSERT INTO test.ontimetest FORMAT CSVWithNames"
    done
Two. Performance testing

Configuration description

* Cluster: 3 machines, each with 32 GB of memory and a 200 GB disk
* Data volume: 10 GB, 340 million rows, 109 fields per row
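As a rough sanity check on these figures, the average on-disk footprint per row follows directly from them (a back-of-the-envelope sketch; 10 GB is taken as 10×10⁹ bytes):

```python
# Average compressed footprint implied by the configuration above:
# 10 GB of data (taken as 10e9 bytes) holding 340 million rows of 109 fields.
total_bytes = 10e9
rows = 3.4e8
fields = 109
per_row = total_bytes / rows
print(f"~{per_row:.1f} bytes/row, ~{per_row / fields:.2f} bytes/field")
```

That is only about 29 bytes per row, under a byte per field on average, which hints at how aggressively the columnar format compresses this data.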

count performance test (the LIMIT is varied to match each tested row count; the 200-million-row case is shown):

    SELECT count(1)
    FROM (SELECT ArrTime, CRSArrTime, FlightDate FROM ontimetest LIMIT 200000000) t1;

* 10 million rows: 0.543 s, 15.81 MB
* 100 million rows: 0.863 s, 187.58 MB
* 200 million rows: 0.913 s, 261.31 MB
* 300 million rows: 1.181 s, 485.01 MB
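One way to read these timings is as an effective scan rate; a small sketch using the numbers reported above:

```python
# Effective scan rate of the count() test, using the timings listed above.
results = [(10_000_000, 0.543), (100_000_000, 0.863),
           (200_000_000, 0.913), (300_000_000, 1.181)]
for rows, secs in results:
    print(f"{rows:>11,} rows in {secs:.3f} s -> {rows / secs / 1e6:6.1f}M rows/s")
```

The rate climbs from roughly 18M rows/s at 10 million rows to roughly 254M rows/s at 300 million, i.e. fixed per-query overhead dominates the smaller runs.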

GROUP BY performance test (again, the LIMIT is varied per run; the 100-million-row case is shown):

    SELECT OriginCityName, DestCityName, count(*) AS flights
    FROM (SELECT OriginCityName, DestCityName FROM ontimetest LIMIT 100000000) t1
    GROUP BY OriginCityName, DestCityName
    ORDER BY flights DESC
    LIMIT 20;

* 1 million rows: 0.258 s, 42.72 MB
* 10 million rows: 1.336 s, 438.28 MB
* 100 million rows: 11.186 s, 4.47 GB
* 200 million rows: 21.983 s, 8.89 GB
* 300 million rows: 30.538 s, 13.32 GB
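The GROUP BY timings above scale close to linearly once the input is large; normalizing them makes that visible:

```python
# Runtime per 100M input rows for the GROUP BY test above; from 100M rows
# up, the cost per row is roughly constant, i.e. near-linear scaling.
results = [(100_000_000, 11.186), (200_000_000, 21.983), (300_000_000, 30.538)]
norm = [secs / (rows / 1e8) for rows, secs in results]
for (rows, _), n in zip(results, norm):
    print(f"{rows:>11,} rows: {n:5.2f} s per 100M rows")
```

Each extra 100 million rows costs a little over 10 seconds, so runtime for larger inputs can be extrapolated fairly safely.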
JOIN performance test (the two LIMITs are varied per run; the 100-million-join-10-million case is shown):

    SELECT *
    FROM (SELECT ArrTime, CRSArrTime, FlightDate FROM ontimetest LIMIT 100000000) t1
    ALL INNER JOIN
         (SELECT ArrTime, CRSArrTime, FlightDate FROM ontimetest LIMIT 10000000) t2
      ON t1.ArrTime = t2.CRSArrTime
    LIMIT 100;

* 10 million join 100 thousand: 0.606 s, 8.29 MB
* 100 million join 100 thousand: 0.684 s, 8.55 MB
* 200 million join 100 thousand: 0.962 s, 7.99 MB
* 300 million join 100 thousand: 1.088 s, 8.08 MB
* 10 million join 1 million: 11.756 s, 13.03 MB
* 100 million join 1 million: 11.795 s, 13.27 MB
* 200 million join 1 million: 11.972 s, 13.50 MB
* 300 million join 1 million: 12.825 s, 14.47 MB
* 10 million join 10 million: 150.931 s, 42.33 MB
* 100 million join 10 million: 164.897 s, 43.50 MB
* 200 million join 10 million: 168.973 s, 46.99 MB
* 300 million join 10 million: 169.694 s, 49.63 MB
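Grouped by the size of the right-hand table, the timings above show that the join cost is dominated by the side being hashed, not the probe side:

```python
# Join timings from above, keyed by the size of the right-hand table.
# The left side growing 30x (10M -> 300M rows) barely changes the runtime,
# while the right side growing 10x (1M -> 10M rows) costs ~13x more.
by_right = {
    100_000:    [0.606, 0.684, 0.962, 1.088],
    1_000_000:  [11.756, 11.795, 11.972, 12.825],
    10_000_000: [150.931, 164.897, 168.973, 169.694],
}
for right, times in by_right.items():
    print(f"right side {right:>10,} rows: {min(times):7.3f} .. {max(times):7.3f} s")
```

In practice this means the smaller table should always be placed on the right-hand side of the join.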

Summary of the performance tests:

* count: very fast, but memory consumption is large
* GROUP BY: very fast, but consumes a lot of memory
* JOIN: fast compared with Spark, though still slow for a real-time query system; memory consumption is low, but it is CPU-intensive
Other observations:

* Concurrency is limited: the official site recommends at most 100 queries/second, so ClickHouse is not suitable for high-concurrency business queries.
* Column storage: values within a column are very similar, so they compress well, reducing IO (Hadoop stores data row-wise, so loading data is time-consuming). Queries over a few columns are very fast, but queries that touch many columns are slower.
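The compressibility claim is easy to illustrate: a single column holds few distinct, repetitive values, while row-oriented bytes interleave unrelated fields. A minimal sketch with synthetic data and stock zlib (not ClickHouse's actual codecs):

```python
import random
import zlib

random.seed(0)
# A "column" of carrier codes: only four distinct values, highly repetitive.
column = b"".join(random.choice([b"AA", b"UA", b"DL", b"WN"])
                  for _ in range(10_000))
# Stand-in for row-oriented storage: unrelated fields interleaved make the
# byte stream look nearly random, so it compresses poorly.
rowish = bytes(random.randrange(256) for _ in range(len(column)))

print("columnar:", len(zlib.compress(column)), "bytes compressed")
print("row-like:", len(zlib.compress(rowish)), "bytes compressed")
```

The columnar byte stream compresses several times smaller than the row-like one, which is exactly the IO saving described above.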
Positioning:

* Real-time processing: ClickHouse handles far more data than MySQL, but count and GROUP BY consume a lot of memory, so under high concurrency the business servers cannot keep up; GROUP BY and JOIN latency still falls short of interactive-response requirements; and concurrency is capped at around 100 queries/second. It is therefore not suitable for high-concurrency real-time business queries.
* Batch processing: thanks to its columnar design and reduced IO, its compute performance far exceeds Spark and Hive, and 100 queries/second is much higher than the dozens of concurrent tasks Spark sustains. It can therefore replace a Hadoop cluster as an offline batch-processing framework.