1. Preface

We've talked about concurrency for so long that everyone has probably grown numb to it, so this time let's move on to the next topic: databases (corrections and additions are welcome).

After reading, test yourself with one question: how should I choose a NoSQL database?

Database ranking: https://db-engines.com/en/ranking

1.1. Classification

There are three main categories (regardless of how old the database is):

1. Traditional databases (SQL):

* Relational databases: SQLite, MySQL, SQLServer, PostgreSQL, Oracle...
2. High-concurrency products (NoSQL):

* Key-value databases: Redis, MemCached...
* Document databases: MongoDB, CouchBase, CouchDB, RavenDB...
* Column databases: Cassandra, HBase, BigTable...
* Search engines: Elasticsearch, Solr, Sphinx...
* Graph databases: Neo4J, ArangoDB, Flockdb, OrientDB, Infinite Graph, InfoGrid...
PS: ArangoDB is a native multi-model database that flexibly combines document, graph and key-value models.

3. Products of the new era (TSDB):

* Time-series databases: InfluxDB, LogDevice, Graphite, OpenTSDB...
Let's look at an authoritative picture (red marks the recommended NoSQL products, grey the traditional SQL ones):

1.2. Concept

First, to be clear: NoSQL does not mean giving up traditional SQL. It means "not only" traditional SQL (Not Only SQL).

1. Advantages and disadvantages of relational databases

Let's look at the benefits of traditional databases first:

* Data consistency is maintained through transactions
* Join and similar operations are supported
* The community is mature (if you hit a problem, a quick search usually solves it)
Of course, there are also shortcomings:

* Changing the table structure is painful when there is a lot of data. E.g. adding a field: if that field also gets an index, the database stalls badly, so nobody dares to do it during working hours
* It hurts even more when the columns are not fixed. A database design is rarely perfect from the start; it is refined over time, and even reserving spare fields in advance is not user-friendly
* Writing and processing large amounts of data is troublesome, e.g.:
* When the amount of data is modest, writing in batches is enough.
* When the data volume itself is large and master-slave replication is in place, reading from the Slave is fine, but the large number of connections lands on the Master, which cannot keep up, so another master database has to be added.
* Adding one brings a new problem: the load is now split across two masters, but data inconsistency occurs easily (the same data gets updated to different values on the two masters). At that point you have to combine database sharding and table sharding, distributing the tables across different masters.
Is that the end of it? No, no, no. Think about how Join between tables works now: isn't it a cross-database, cross-server join? That is robbing Peter to pay Paul, which is why all kinds of middleware were born 【SQLServer scales very well in this respect, column storage is built in, and it is cross-platform too (running it in Docker is recommended) (an article I wrote a few years ago: https://www.cnblogs.com/dunitian/p/6041323.html)】
* Additions are welcome ~ (To be honest, for small and medium-sized companies SQLServer is absolutely the best choice; it saves a lot of time)

Now let's talk about NoSQL (you can think of NoSQL as an extension of and supplement to traditional SQL):

* When sharding tables and databases, related tables are usually placed on the same server to make join convenient. NoSQL does not support join, so it is free of that restriction and data is easier to distribute
* Processing large amounts of data: for reads, traditional SQL is not at much of a disadvantage, and NoSQL mainly serves as a cache; for bulk writes, NoSQL often tests much faster than traditional SQL, and it is also very easy to scale out
* NoSQL types for many scenarios (key-value, document, column, graph)
If you still don't know how to choose a NoSQL database, let's go through the characteristics of each type:

* Key-value databases: everyone is familiar with these; they mainly store key-value pairs. Representative => Redis (supports persistence and data recovery; we'll talk about it later)
* Document databases: representative => MongoDB (Youku's online comments are built on MongoDB)
* Generally no transactions (MongoDB supports ACID transactions starting from 4.0)
* No Join support (the Value is a flexible JSON-like format, so changing the table structure is easy)
* Column databases: representatives => Cassandra, HBase
* Good at updating a small number of columns across a large number of rows (adding a field or running that kind of batch operation is very convenient ~ reads and writes work with the column as the unit)
* Highly scalable: growth in data volume does not slow down processing (especially writes)
* Search engines: representative => Elasticsearch, a classic that needs no introduction (traditional fuzzy search can only use like, which is too crude, hence these)
* Graph databases: representatives => Neo4J, Flockdb, ArangoDB (the data model is a graph structure, mainly used for designs with complicated relationships, such as drawing a visual graph of QQ group relationships or a microblog follower graph)

Back to the remaining concurrency topics: if you look carefully, you'll find that whatever the language, the underlying implementation is almost the same.

Take processes: underneath is the os.fork we covered in the first part. Then there is inter-process (and inter-thread) communication: PIPE, FIFO, Lock and Semaphore may rarely be used directly, but they are what Queue is built on underneath. Without knowing them, how would you read the source code?

Remember that Java's CountDownLatch was mentioned when Queue was introduced? If you don't understand Condition, how would you quickly build for yourself a feature that Python doesn't have?
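
As an illustration of that last point, here is a minimal sketch of a CountDownLatch built on threading.Condition. The class name mirrors Java's, but the method names (count_down, wait) and the run_workers helper are my own choices for this example, not anything from Python's standard library or from the earlier articles.

```python
import threading


class CountDownLatch:
    """Minimal latch: wait() blocks until count_down() was called `count` times."""

    def __init__(self, count):
        if count < 0:
            raise ValueError("count must be >= 0")
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            if self._count > 0:
                self._count -= 1
                if self._count == 0:
                    # Wake every thread blocked in wait().
                    self._cond.notify_all()

    def wait(self, timeout=None):
        """Return True once the count reached zero, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


def run_workers(n):
    """Hypothetical helper: start n workers and wait for all of them via the latch."""
    latch = CountDownLatch(n)
    finished = []
    guard = threading.Lock()

    def worker(i):
        with guard:
            finished.append(i)
        latch.count_down()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    released = latch.wait(timeout=5)
    for t in threads:
        t.join()
    return released, sorted(finished)
```

Calling run_workers(3) starts three threads; the main thread is released only after each of them has called count_down().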

Knowing that something works without knowing why is absolutely not advisable. Later, when we get to MQ, we'll need our knowledge of Queue again; it really is one link chained to the next ~

Since we can't get by on being the company's cute girl, well ~ we have to rely on improving our skills ^_^. That's it for now; to finish, here is a collection of common solutions:

Python, NetCore common solutions (continuously updated)
https://github.com/LessChina

2. Concept

ACID was mentioned in the first part, so let's cover it this time, and then talk about CAP and data consistency.

2.1. ACID transactions

Let's continue with the example of Xiaoming and Xiaozhang's transfer :

* A: Atomicity (Atomic)
* Xiaoming transfers 1000 to Xiaozhang: Xiaoming -= 1000 => Xiaozhang += 1000. This (transaction) is an indivisible whole; if something goes wrong after Xiaoming's -1000, the 1000 has to be given back to Xiaoming
* C: Consistency (Consistent)
* When Xiaoming transfers 1000 to Xiaozhang, the total of Xiaoming + Xiaozhang must remain unchanged (assuming no other transfers (transactions) interfere)
* I: Isolation (Isolated)
* While Xiaoming is transferring money to Xiaozhang, Xiaopan is also transferring money to Xiaozhang; the two transfers must not interfere with each other (this is mainly about isolation under concurrency)
* D: Durability (Durable)
* The bank must keep a record of Xiaoming's transfer to Xiaozhang, so that even if the two dispute it later the ledger can be produced 【once a transaction succeeds it is persisted (even if the database crashes, it can be recovered from the log)】
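
The atomicity and consistency parts of the transfer example can be sketched with Python's built-in sqlite3 module. The account table, its column names and the transfer/demo functions are assumptions made up for this illustration; the only library behaviour relied on is that `with conn:` commits on success and rolls back when an exception escapes.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; undo everything on failure."""
    try:
        with conn:  # commit on success, rollback if an exception escapes the block
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                # Something went wrong after the -amount step: abort the whole thing.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("Xiaoming", 1500), ("Xiaozhang", 0)])
    conn.commit()
    first = transfer(conn, "Xiaoming", "Xiaozhang", 1000)   # enough money
    second = transfer(conn, "Xiaoming", "Xiaozhang", 1000)  # not enough, rolled back
    balances = dict(conn.execute("SELECT name, balance FROM account"))
    conn.close()
    return first, second, balances
```

In demo() the second transfer fails halfway, and the rollback returns the deducted 1000 to Xiaoming, so the total of the two accounts stays the same before and after.
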
2.2. CAP concept

CAP <https://baike.baidu.com/item/CAP principle > names three properties that a distributed system has to consider; a data-sharing system can satisfy at most two of them at the same time, never all three:

* C: Consistency (Consistency)
* All nodes access the same, latest copy of the data (all replicas in the distributed system hold the same value at the same moment)
* eg: after an update in a distributed system, every user should read the latest value
* A: Availability (Availability)
* After some nodes in the cluster fail, can the cluster as a whole still respond to clients' read and write requests? (high availability for data updates)
* eg: in a distributed system every operation always returns a result within a bounded time (a timeout does not count 【who would keep waiting when shopping online?】, and a few servers going down in the machine room should not affect this)
* P: Partition tolerance (Partition Tolerance)
* In practical terms, a partition amounts to a time limit on communication: if the system cannot reach data consistency within the time limit, a partition has occurred, and it must choose between C and A.
* eg: in a distributed system, requests are still accepted when there is network delay (a partition), with the system satisfying either consistency or availability

Representative of CA (giving up P): traditional relational databases

If you want to avoid partition tolerance problems, one approach is to put all the (transaction-related) data on a single machine. That does not guarantee 100% that the system never fails, but you will never run into the negative effects of a partition (at the cost of seriously hurting the system's scalability).

For a distributed system, giving up P amounts to giving up being distributed: once concurrency is high, a single machine cannot bear the load at all. Many banking services really do give up P and rely on a single high-performance minicomputer to guarantee service availability. (All NoSQL databases assume that P exists.)


Representative of CP (giving up A): Zookeeper, Redis (distributed database, distributed lock)

The opposite of giving up partition tolerance is giving up availability. When a partition occurs, the affected services have to wait until the data is consistent (the system cannot serve external requests while it waits).


Representative of AP (giving up strong C): the DNS database (a distributed database mapping domain names to IPs; recall why, after changing an IP, the TTL means it takes about 10 minutes before all resolution takes effect)

Reverse DNS lookup: https://www.cnblogs.com/dunitian/p/5074773.html

Give up strong consistency and guarantee eventual consistency. All NoSQL databases sit somewhere between CP and AP and try to lean towards AP. (Traditional relational databases focus on data consistency, whereas for distributed processing of massive data, availability and partition tolerance take priority over data consistency.)

Different data has different consistency requirements, eg:

* User comments are insensitive to consistency; being inconsistent for quite a while does not affect the user experience
* Commodity prices are another matter: would you dare show users a stale one? The consistency requirement is high and the tolerance must be below 10s; even if a cache is used, the price in the order must be up to date (next time, look at the cache notice under JD's product pages; if even JD does it this way, nothing more needs to be said about the rest)

2.3. Data consistency

Traditional relational databases usually use pessimistic locks, but in scenarios like flash sales they simply cannot hold up, so optimistic locks are often used instead (the CAS mechanism, mentioned earlier when we covered concurrency and locks). As noted above, different businesses have different consistency requirements, and the three properties of CAP cannot all be satisfied at once. There are two main kinds of consistency:

* Strong consistency: whichever replica an update is applied to, every subsequent operation must see the latest data. Multi-replica data needs distributed transactions to guarantee consistency (which is why people so often ask about them in projects)
* Eventual consistency: under this constraint, users will eventually read the latest data. A few examples:
* Causal consistency: A, B and C are three independent processes. A modifies the data and notifies B, so B gets the latest data; A did not notify C, so C may not be up to date
* Session consistency: after a user submits an update, they see the updated data until the session ends; after the session ends (or for other users) the data may not be the latest (e.g. modifying the local value with JQ after submitting, which does not guarantee the data is up to date)
* Read-your-writes consistency: much the same as above, but not limited to a session. After updating the data the user sees the latest data themselves; other users may not yet (there is some delay)
* Monotonic read consistency: once a user has read a value, later operations never read an earlier version of the data (new version >= the version already read)
* Monotonic write consistency (timeline consistency): all replicas perform all update operations in the same order (a bit like Redis's AOF)
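
The optimistic lock (CAS) mentioned at the start of this section can be sketched with a version column: read the row, then update it only if the version is still the one you read. The item table, its columns and the buy_one/demo functions are hypothetical names for this example, and a single sqlite3 connection stands in for what would be many concurrent clients in a real flash sale.

```python
import sqlite3


def buy_one(conn, item_id, max_retries=3):
    """Decrement stock with a version-checked update; retry if the row changed."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT stock, version FROM item WHERE id = ?", (item_id,)).fetchone()
        if row is None or row[0] <= 0:
            return False  # unknown item or sold out
        stock, version = row
        with conn:
            cur = conn.execute(
                "UPDATE item SET stock = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (stock - 1, item_id, version))
        if cur.rowcount == 1:
            return True   # our compare-and-set won
        # rowcount == 0: someone else updated the row first, so re-read and retry
    return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)")
    conn.execute("INSERT INTO item VALUES (1, 2, 0)")
    conn.commit()
    results = [buy_one(conn, 1) for _ in range(3)]
    stock, version = conn.execute(
        "SELECT stock, version FROM item WHERE id = 1").fetchone()
    # A stale writer still holding version 0 matches no row, so nothing is overwritten.
    stale_rows = conn.execute(
        "UPDATE item SET stock = 99, version = version + 1 "
        "WHERE id = 1 AND version = 0").rowcount
    conn.commit()
    conn.close()
    return results, stock, version, stale_rows
```

No row is locked while the buyer is "thinking"; a conflicting update simply affects zero rows and the caller decides whether to retry.
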
2.4. Consistency implementation methods

Quorum system NRW strategy ( Commonly used )

A Quorum system is a collection A of subsets of a universal set U such that any two sets B and C taken from A intersect.

NRW algorithm :

* N: the number of replicas the data has.
* R: the minimum number of replicas that must be read for a read operation to complete (the minimum number of nodes that take part in one read)
* W: the minimum number of replicas that must be written for a write operation to complete (the minimum number of nodes that take part in one write)
* As long as R + W > N, strong consistency is guaranteed (the nodes that are read and the nodes that are written synchronously overlap), e.g. N=3, W=2, R=2 (at least one node is both read and written)
Extension:
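
The R + W > N rule can be checked by brute force: enumerate every possible read set and write set and see whether they always share a node. This is only a small sketch for intuition; the function names are made up for the example.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read set of size r shares a node with every write set of size w."""
    nodes = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(nodes, r)
               for write_set in combinations(nodes, w))


def satisfies_nrw_rule(n, r, w):
    """The rule from the text: R + W > N."""
    return r + w > n
```

For N=3, W=2, R=2 every read overlaps every write, while N=3, W=2, R=1 allows a read that only touches the one node the write skipped.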

* In a relational database, if N=2 you can set W=2, R=1 (writes cost performance): the system must update the data on both nodes before confirming the result and returning it to the user
* If R + W <= N, a read and a write are not guaranteed to touch a common node, so the system can only guarantee eventual consistency. How long the replicas take to become consistent depends on how the system propagates updates asynchronously; inconsistency window = the time from the first node being updated until all nodes have been updated asynchronously
* The R and W settings directly affect the system's performance, scalability and consistency:
* If W is set to 1, the write returns to the user once one replica is updated, and the remaining N-W nodes are updated asynchronously afterwards
* If R is set to 1, a read completes as soon as one replica is read. Smaller values of R and W hurt consistency; larger ones hurt performance
* When W=1, R=N ==> suited to systems with high demands on writes, but reads are slower (if 1 of the N nodes is down, reads cannot complete)
* When R=1, W=N ==> suited to systems with high demands on reads, but write performance is low (if 1 of the N nodes is down, writes cannot complete)
* Common practice: usually set R = W = N/2 + 1 (integer division), which gives good value for the cost, eg: N=3, W=2, R=2 (3 nodes ==> 1 written, 1 read, 1 both read and written)
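
To make the trade-offs in the list above concrete, here is a small sketch that computes the usual R = W = N/2 + 1 setting and how many nodes may be down before reads or writes can no longer gather enough replicas. Both function names are assumptions for this example.

```python
def majority_quorum(n):
    """Return (R, W) for the common setting R = W = N // 2 + 1."""
    q = n // 2 + 1
    return q, q


def tolerated_failures(n, r, w):
    """How many nodes may be down while reads / writes can still complete."""
    return {"read": n - r, "write": n - w}
```

With N=3 the majority setting gives R=2, W=2 and tolerates one failed node for both reads and writes, whereas W=1, R=N tolerates no failed node on the read side, which matches the note in the list.
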

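* Mainly the log of a relational database ==> it records transaction operations so that data can be recovered
* Then there is parallel data storage: since the data is spread across different nodes, each node only needs to care about data updates + message communication (data synchronization):
* Make sure the timestamp of a later update > the timestamp of an earlier update
* Message receive time > message send time (clock differences between servers have to be considered ~ use a time synchronization server)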
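
One well-known way to satisfy both rules without relying on synchronized wall clocks is a Lamport logical clock. The text above does not name it and suggests a time synchronization server instead, so treat this as an alternative sketch; the class and method names are made up for the example.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event such as a data update: a later update gets a larger time."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the sender's current time."""
        return self.tick()

    def receive(self, message_time):
        """Jump past the sender's stamp so that receive time > send time."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

For example, if node a performs two updates and then sends a message stamped 3, a node b whose clock is still at 0 moves to 4 on receipt, so the receive time is greater than the send time even though the two nodes never compared wall clocks.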