<>A Redis KEYS command caused an RDS database avalanche: two RDS outages and millions in losses

* Original article: A Redis KEYS command caused an RDS database avalanche: two RDS outages and millions in losses
<https://mp.weixin.qq.com/s/SGOyGGfA6GOzxwD5S91hLw>
* Author: Chen Haoxiang
Reprinted with authorization, for learning purposes only. The copyright belongs to the original author.

Production incidents in the internet industry have been frequent lately. On September 19, 2018, Shunfeng (SF Express) had a production database deletion incident, which I won't go into here.

Instead, I want to talk about the recent incidents at our own company: how they could have been avoided, and how we handled them and optimized afterwards.

There were many indirect causes: technology failing to keep up with business growth as daily volume jumped from millions to tens of millions, system optimization being given low priority by the company, and a shortage of engineers.

<> First outage

At some point on September 13, 2018, connections to the RDS instance of one of the company's service projects soared, CPU usage reached 100%, and all requests from other applications were refused.

The whole process went as follows:

* Monitoring alert: RDS CPU utilization exceeded 80%. The DBA stepped in and prepared to KILL slow SQL.
* Within 1 minute: no obviously blocking SQL was found, and CPU kept rising to 99%.
* Within 5 minutes: a large number of applications raised alarms and began refusing service. RDS monitoring showed many slow SQL queries, and the database provider was contacted for assistance.
* Within 8 minutes: switched from the primary to the standby database (the business would take a hit, but there was no alternative because the problem had not been identified).
* Within 9 minutes: part of the business recovered, but some services had accumulated a backlog of more than 200,000 orders, and CPU usage on the standby database kept rising as well.
* Within 15 minutes: CPU utilization on the standby exceeded 97% and the business was interrupted again. Switched back to the primary database and applied rate limiting.
* Within 20 minutes: closed the traffic entry points of some secondary applications.
* Within 25 minutes: CPU utilization on the primary returned to normal.
* Within 30 minutes: gradually reopened the rate-limited and closed applications.
* Within 35 minutes: all applications were back to normal.

Afterwards, an emergency response team was set up together with the database provider to urgently optimize the slow SQL that might have been responsible. This may have fixed some slow SQL, but the specific root cause was not found, which set the stage for another outage a few days later.

<> Impact of the accident

The service project was unavailable for tens of minutes. As a result, order volume dropped by several hundred thousand, and the loss ran into the millions.

<> Cause analysis

No specific cause was identified at the time, but the following factors were likely part of the reason for the downtime.

The service project's business was growing very fast. At peak, database QPS exceeded 35,000, and the system was under high load.

If several full-table-scan SQL statements run at the same time during peak hours, database pressure rises sharply and application timeouts increase. Front-end requests time out, users retry, traffic surges, and an avalanche effect forms.

The main cause was related to some old projects whose SQL queries performed poorly and ran against the primary database, which put heavy load on it. Database QPS was too high, yet the caching plan had not been implemented because of staffing constraints. Slow SQL should have been given higher priority.

<> Improvement plan

* Create a separate database account for each application, and use it in strict accordance with the specifications.
* Implement the caching optimization plan promptly, treat slow SQL as a priority, and fix the slow SQL already identified (query time over 1 s) in one batch.
* Upgrade the database configuration.
* Migrate non-core business to a new RDS instance.
<> Second outage

Because the cause of the first outage was never found, this second outage was predictable.

On September 19, 2018, it was the same recipe and the same taste: the same RDS instance, CPU soaring to 100%, then denial of service and downtime. Of course, with the first experience behind us, we switched between primary and standby right away, and all business was restored within tens of seconds. Even so, it seriously affected the company's business and image.

<> Cause analysis

After business resumed, the company held an emergency review meeting. Naturally, at my level I could not attend. Company executives, senior technical architects, DBAs, and the leads of each project met together.

In this meeting, after reviewing each project's logs and the backend monitoring data, they found that at the moment the RDS CPU soared, one Redis instance's memory usage was close to 100% and then dropped sharply. Looking back at the first outage, the pattern was similar.

The next step was to contact the database provider and pull every command run on that Redis instance over the previous week. It finally turned out that a keys *...* command had been executed at that point in time. One of the company's engineers had used keys with pattern matching to clean up useless keys, without considering that a pattern-matching keys call blocks Redis while it runs. Redis locked up and its CPU soared, so every call chain timed out and hung. Once the seconds of blocking were over, all the pending request traffic hit the RDS database, which caused the database avalanche and brought the database down.

<> Improvement plan

* All production operations may be executed only after going through the operations (ops) team. The ops department will gradually but quickly revoke all other permissions.
* Add new Redis instances so that workloads are separated.
* If there is a need for a pattern-matching command like keys, use the scan command instead.

<> Summary


Both outages in this incident were caused entirely by human operation. Had the engineer been familiar with Redis development guidelines, this could have been avoided; it is advisable to disable the keys command outright. In addition, commands on production should be run only after evaluation by ops. My guess is that the engineer was a long-time employee who had the permissions and simply ran the operation directly.

The company's business really is growing fast, and technology cannot keep up. This is very dangerous and greatly increases the probability of downtime.

When business volume was small, the engineer's operation would have caused no problem at all, since concurrency was low. But as the company has grown, business volume has multiplied, and the technology has not scaled as fast as the business.

The shortage of engineers is another factor. Most people build new features while maintaining old projects, so far less manpower is left for refactoring and optimization, and optimization work is given low priority. This is also a major cause. Similar situations are likely to happen again, and building new services is urgent.

One last thing: you cannot be too careful with any command you run against production, because you may not be able to afford the consequences of an incident caused by a single character.

<>Redis Development Suggestions

Finally, here are some development guidelines and suggestions for Redis.

<>1. Separate hot and cold data; don't put all data in Redis

Although Redis supports persistence, all of its data is held in memory, which is expensive. It is recommended to store only high-frequency hot data in Redis (QPS greater than 5000). Low-frequency cold data can go to disk-based storage such as MySQL, ElasticSearch, or MongoDB. This saves memory cost, and with a smaller data set, operations are faster and more efficient.

<>2. Different business data should be stored separately


Don't put unrelated business data in one Redis instance. New businesses should apply for their own separate instance. Because Redis processes commands on a single thread, separate storage reduces interference between businesses and speeds up request responses. It also avoids data bloat in a single instance and allows faster service recovery when something goes wrong.

In practice, the biggest bottleneck in Redis is usually CPU. Because it works on a single thread, it easily saturates one logical CPU. A Redis proxy or a distributed solution can be used to make use of more CPU capacity.

<>3. Always set an expiration time on stored keys

If the application uses Redis as a cache, be sure to set an expiration time on every stored key. Without one, these keys occupy memory forever, which is a huge waste. Memory consumption grows over time until the server's memory limit is reached. The expiration time should also be evaluated against the needs of the business; longer is not always better.
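
As a minimal sketch of this rule, a small wrapper can refuse to write a cache key without an expiration. It assumes a redis-py style client, that is, any object with a `set(name, value, ex=...)` method; `cache_set` and `DEFAULT_TTL_SECONDS` are hypothetical names chosen for illustration.

```python
from typing import Any

# Hypothetical default; the right value depends on the business.
DEFAULT_TTL_SECONDS = 3600


def cache_set(client: Any, key: str, value: Any,
              ttl_seconds: int = DEFAULT_TTL_SECONDS) -> None:
    """Write a cache entry that always carries an expiration time.

    `client` is assumed to behave like a redis-py client, whose
    `set(name, value, ex=seconds)` stores the value with a TTL.
    """
    if ttl_seconds <= 0:
        raise ValueError("cache keys must have a positive TTL")
    client.set(key, value, ex=ttl_seconds)
```

Routing every cache write through one helper like this makes a missing expiration a code review issue rather than something discovered when memory fills up.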

<>4. Large text data must be compressed before it is stored

Large text (over 500 bytes) must be compressed before it is written to Redis. Besides consuming a lot of memory, large text stored in Redis can easily saturate the network card when traffic is high. All services on that server then become unavailable, which triggers an avalanche effect and paralyzes every system.
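
The following is one possible sketch of threshold-based compression using Python's standard `zlib` module. The 500-byte threshold comes from the text above; the one-byte marker that records whether a value was compressed is an assumption made for this example, as are the function names.

```python
import zlib

COMPRESS_THRESHOLD = 500  # bytes, the threshold suggested in the text

# Hypothetical one-byte markers so the reader knows how to decode a value.
_RAW = b"\x00"
_ZLIB = b"\x01"


def encode_value(text: str) -> bytes:
    """Return the bytes to store, compressing only text above the threshold."""
    raw = text.encode("utf-8")
    if len(raw) > COMPRESS_THRESHOLD:
        return _ZLIB + zlib.compress(raw)
    return _RAW + raw


def decode_value(blob: bytes) -> str:
    """Reverse encode_value, decompressing when the marker says so."""
    marker, payload = blob[:1], blob[1:]
    if marker == _ZLIB:
        payload = zlib.decompress(payload)
    return payload.decode("utf-8")
```

The encoded bytes would be what the application writes to Redis; small values skip compression because the CPU cost is not worth it for a few hundred bytes.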

<>5. Do not use KEYS pattern matching on production Redis

Redis is single-threaded. When the number of keys in production is large, the KEYS operation is extremely inefficient (time complexity O(N)). Once it runs, it seriously blocks the normal requests of every other command, and under high QPS it can directly crash the Redis service. If you have this kind of need, use the scan command instead.
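
A cleanup like the one that triggered the second outage could be written with incremental scanning instead. This is a minimal sketch, assuming a redis-py style client that exposes `scan_iter(match=..., count=...)` and `delete(*names)`; the function name and batch size are hypothetical.

```python
from typing import Any, List


def delete_by_pattern(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in small batches instead of using KEYS.

    `client.scan_iter` is assumed to walk the keyspace with SCAN, a few
    keys per call, so no single command blocks the server for the whole
    keyspace. Returns the number of keys deleted.
    """
    deleted = 0
    batch: List[Any] = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:  # flush the final partial batch
        deleted += client.delete(*batch)
    return deleted
```

SCAN still visits every key over the course of the iteration, so the total work is the same as KEYS. The difference is that it is spread across many short commands, and other requests get served in between.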

<>6. Reliable message queue service

A Redis List is often used as a message queue. Suppose the consumer program crashes right after it takes a message off the queue. The message has been removed but never processed, so it is effectively lost. This can lead to lost business data or inconsistent business state.

To avoid this, Redis provides the RPOPLPUSH command. The consumer atomically takes a message from the main queue and inserts it into a backup queue, and removes it from the backup queue only after the normal processing logic has completed. A daemon can also be provided: when it finds a message in the backup queue that has expired, it puts the message back into the main queue so that other consumers can continue processing it.
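
The pattern described above might be sketched as follows, assuming a redis-py style client with `rpoplpush(src, dst)` and `lrem(name, count, value)`. The queue names and both function names are hypothetical. The recovery function is deliberately simplified: it requeues everything in the backup queue, whereas a real daemon would need a per-message timestamp to decide what has expired.

```python
from typing import Any, Callable, Optional

# Hypothetical key names for illustration.
MAIN_QUEUE = "queue:main"
BACKUP_QUEUE = "queue:processing"


def consume_one(client: Any, handler: Callable[[Any], None]) -> Optional[Any]:
    """Process one message without losing it if the handler crashes."""
    # Atomically move the message from the main queue to the backup queue.
    message = client.rpoplpush(MAIN_QUEUE, BACKUP_QUEUE)
    if message is None:
        return None  # main queue is empty
    # If the handler raises, the message stays in the backup queue.
    handler(message)
    # Processing succeeded, so drop one copy from the backup queue.
    client.lrem(BACKUP_QUEUE, 1, message)
    return message


def requeue_all(client: Any) -> int:
    """Simplified recovery: move every backed-up message to the main queue.

    Assumption: this runs only when no consumer is mid-processing.
    """
    moved = 0
    while client.rpoplpush(BACKUP_QUEUE, MAIN_QUEUE) is not None:
        moved += 1
    return moved
```

Note that RPOPLPUSH is marked deprecated as of Redis 6.2 in favor of LMOVE, which offers the same atomic move; the structure of the pattern does not change.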

<>7. Be careful with full-collection operations on Hash, Set, and similar structures

When a HASH is used to store object properties, there are often only a few dozen fields at first, and fetching all members with HGETALL is efficient. As the business grows, the number of fields expands to a hundred or even several hundred. HGETALL then becomes sharply less efficient (time complexity O(N)), and the network card is frequently saturated. At that point it is recommended to split the data into multiple Hash structures by business. Alternatively, if most operations fetch all the properties anyway, serialize all properties into a single STRING value. The same applies to SMEMBERS on a SET.
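
The second option, serializing all properties into one STRING, could look like this sketch. It assumes a redis-py style client with `set` and `get`, and uses JSON as the serialization format, which is a choice made for this example rather than something the article prescribes.

```python
import json
from typing import Any, Dict, Optional


def save_object(client: Any, key: str, attrs: Dict[str, Any]) -> None:
    """Store all properties of an object as a single JSON string value."""
    client.set(key, json.dumps(attrs))


def load_object(client: Any, key: str) -> Optional[Dict[str, Any]]:
    """Load the object back, or return None if the key does not exist."""
    raw = client.get(key)  # str or bytes, depending on client settings
    if raw is None:
        return None
    return json.loads(raw)
```

The trade-off is that a single property can no longer be read or updated on its own, so this only fits objects that are almost always read as a whole.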

<>8. Choose data structure types according to the business scenario

Redis currently supports many data structure types: String, Hash, List, Set, Sorted Set, Bitmap, HyperLogLog, geospatial index, and others. Choose the appropriate type for the business scenario.

Common examples: String can be used for ordinary key-value data and for counters; Hash can be used for objects with many attributes, such as products or brokers; List can be used for message queues and follower/following lists; Set can be used for recommendations; Sorted Set can be used for leaderboards.

<>9. Naming conventions

Although Redis supports multiple databases (16 by default, and more can be configured), every database other than the default database 0 requires an extra request to select it. Using prefixes as namespaces is therefore usually the wiser choice.

When prefixes are used as namespaces to distinguish keys, it is best to keep them in global configuration in the program. Writing prefixes directly in code should be strictly avoided because it is very hard to maintain.

For example: system name:business name:business data:other

Keep in mind that keys should not be too long. Aim for names that are concise, clear, and easy to understand, and weigh the trade-off yourself.
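
One way to keep prefixes in a single place is a small key-building helper, sketched below. The prefix value, the length limit, and the function name are all hypothetical and only illustrate the convention.

```python
# Hypothetical global configuration: "system name:business name".
KEY_PREFIX = "shop:order"
# Hypothetical limit to catch keys that have grown too long.
MAX_KEY_LENGTH = 64


def build_key(*parts: object) -> str:
    """Join the configured prefix and the given parts with ':'."""
    key = ":".join([KEY_PREFIX, *(str(p) for p in parts)])
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError(f"key too long ({len(key)} chars): {key}")
    return key
```

For instance, `build_key("detail", 42)` produces `shop:order:detail:42`, and changing the namespace later means editing one constant instead of searching the code base.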

<>10. Do not use the monitor command in production

Do not use the monitor command in a production environment. Under high concurrency, monitor carries the risk of memory blowing up and of degrading Redis performance.

<>11. No large strings

Large string keys of 1 MB are forbidden in core clusters (even though Redis supports strings of up to 512 MB). If a 1 MB key is rewritten 10 times per second, the network write IO alone reaches 10 MB per second.

<>12. Redis capacity

The memory size of a single instance should not be too large; 10 to 20 GB or less is recommended. The number of keys in one Redis instance should be kept within 10 million. Too many keys in a single instance can delay the reclamation of expired keys.

<>13. Reliability

Monitor the health of Redis regularly, using any of the available Redis health monitoring tools. If none is available, poll the output of the Redis info command at regular intervals. Clients should use a connection pool where possible (long-lived connections with automatic reconnection).
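
A periodic check based on the info output might be sketched like this. It assumes a redis-py style client whose `info()` returns a dict that includes the `used_memory`, `maxmemory`, and `rejected_connections` fields; the function name and the 80% threshold are hypothetical.

```python
from typing import Any, List


def health_warnings(client: Any, memory_warn_ratio: float = 0.8) -> List[str]:
    """Return human-readable warnings derived from the INFO output."""
    info = client.info()
    warnings: List[str] = []

    used = info.get("used_memory", 0)
    limit = info.get("maxmemory", 0)  # 0 means no memory limit is configured
    if limit and used / limit >= memory_warn_ratio:
        warnings.append(f"memory usage at {used / limit:.0%} of maxmemory")

    rejected = info.get("rejected_connections", 0)
    if rejected > 0:
        warnings.append(f"{rejected} connections rejected")

    return warnings
```

With redis-py, the client passed in would typically be built on a shared pool, for example `redis.Redis(connection_pool=pool)` where `pool` is a `redis.ConnectionPool`, so that the check reuses long-lived connections.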

<>Copyright notice

When reprinting, please credit the author Fundebug <https://www.fundebug.com/> and include the address of this article:
<https://blog.fundebug.com/2018/09/21/redis_incident/>