Programmers learn every language from printing “hello
world” Starting , Logs are also learning for novice programmers , A powerful tool for debugging programs . When the project goes online , There will be all kinds of logs , For example, record the behavior of users , Status of the server , Exceptions, etc . Printing a log seems like a simple thing , Something not worth mentioning . however , When you see online projects with such logs , Can't help but ask us if we really print logs ?

***********************************

Error Happened, F*ck

User xxx can not login  

   in my opinion , Printing log is an art , And it's been ignored by programmers for a long time , This paper attempts to answer the following questions , Share my thoughts on Printing log :

  (1) why or when to log

  (2) what to log

  (3) how to log and tips

   be careful : This article will use “ Printing " Log to uniformly represent the output of log , But not the output of our narrow sense understanding to the terminal or stdout,stderr,
It generally refers to the output of log content to any target , Including but not limited to terminal , file , network transmission .

   Address of this article :http://www.cnblogs.com/xybaby/p/7954610.html
<http://www.cnblogs.com/xybaby/p/7954610.html> 

why or when to log


   Why to print the log , Or when to print the log , It seems to me the same question , That's the purpose of printing logs . Different purposes , Determines the format of log content , Frequency of output , Destination of output . What are the possible reasons for printing logs ?

  (1) Debugging and development


   The purpose is to use the debugging program in the development period , This kind of log is large , And there's no real significance , Should only appear in the development period , It should not be output after the project goes online . How to control whether this type of log is output , It will be discussed in detail later .

  (2) User behavior log


   This type of log , Record the operation behavior of users , For big data analysis , For example, monitoring , risk management , Recommendations, etc . This kind of log , It is generally used for analysis by other teams , And it could be multiple teams , Therefore, there are certain format requirements , Developers should record in this format , Easy for other teams to use . of course , What behaviors to document , operation , Generally agreed , therefore , The main role of developers is to execute .

  (3) Program running log


   Record the health of the program , Especially unexpected behavior , Abnormal conditions , This kind of log , Mainly for development , Use by maintenance personnel . When to record , What to record , It's all up to the developers , Developers are highly autonomous . This paper mainly discusses this type of log , Because as a server development , Operation and maintenance personnel , Program running log is often the life-saving straw to solve online problems .

  (4) Record the status of the system or machine

   For example, network request , system CPU, Memory ,IO Usage, etc , This kind of log is mainly used by operation and maintenance personnel , Generate a variety of more intuitive forms of presentation , Alarm in case of system failure .

 

what to log

   What should a log contain , I think it should include when,where,how,what,who,context, The specific meaning will be described in the next section .


   We need to keep a log , Always at a certain time , For example, a user's request , A network call , Or the internal state has changed . In the following text , Collectively referred to as events (event), That is, the log content is a description of an event .

when: the time event happens


  when, That's when we print the logs ( time stamp ), Note that the time here refers to the time when the logged event occurred , Instead of the time when the log was finally output . For example, if the output target of a log is a file , So here when Not write to file time , Because there are often delays .

    The importance of time , stay 《Distributed systems for fun and profit
<http://book.mixu.net/distsys/time.html>》 It's a good story .


   first , Time can be read (Interpretaion) To be concrete , Realistic date , Let us know the time environment when the event happened , For example, is there any activity when there is a problem , Analyze the network of the server at that time combined with the system log ,CPU,IO How is it . Specific time points can also help us analyze whether there is a certain rule in the occurrence of events , Like every day , weekly , There's a problem at a fixed time every month .


   secondly , Time can represent a ( a set ) Duration of the event (duration). such as , It can monitor the execution time of a section of code , It can also record the time consumption of a network request . This time difference can also give us a lot of information , For example, predict the operation status of the program at that time based on experience , Compare or not ‘ card ’.


   last , Time represents the sequence of events (order), We sort multiple logs by time , This sequence can help us debug How did the problem come about , In what order . This is for multi process ( Multithreading ), asynchronous , Very important for distributed programs . Although we know the time of global maintenance in distributed system (global
clock) It's a complicated thing , But we use NTP agreement , Basically solve most of the problems .

where: where the event happens


  where, Where are the logs recorded , In essence , It's where the incident happened . According to the situation , Can be specific to which module , Which file , Even which function , Which line of code . Practice notification , At least module information should be included .

  where The point is to be able to let yourself or other programmers see at a glance where this log is generated , So we can roughly locate the problem , Instead of global log content grep code

how:how importance of the event

  how
important, Represents the importance of the event , We print many kinds of logs , But the importance of different logs is different . such as , Debug logs are the least important , There should be no online projects , But the error log of program running needs to be taken seriously , Because an exception has occurred in the delegate program . Even the error log of the program , There are also different degrees of urgency , A network request timeout is totally different from the abnormal exit of a subprocess .

   therefore , Every log should have log level,log
level Represents the importance of logs , Urgency . Different languages , Framed level There are some differences in segmentation , But it will include debug,info,warn,error,fatal(critical). Its importance can be seen from its name .


   of course , occasionally ,warn And error, perhaps error And fatal The line between them is not so obvious , This needs to be agreed between teams . In our online project , about error,fatal Level log will alarm , If it does error journal , Then we need to deal with it the next day at the latest , If it is fatal journal , Even in the middle of the night , We need to get up and analyze it immediately , handle .

what:what is the log message


   what Is the main content of the log , What happened should be described briefly and concisely . Requirements can be obtained through the log itself , Instead of Rereading the code that generated the log , To get a sense of what happened . therefore , The following log is unqualified :
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('Error Happened') 4 return 5 # some others code
   yes , I Know , Something's wrong , But the log should tell me what's wrong , So the log should at least be like this :
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('user_login failed due to unvalid_username') 4 return 5 # some
others code
 

who:the uniq identify


  who Unique identifier representing the event producer (identity), Used to distinguish the same events . Especially on the server side , It's a lot of users , Concurrency of requests , If the log content does not contain unique identification information , Then the log will be submerged in the vast sea , For example, the following log :
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password') 5 return 6 # some others code
 

   What's wrong with the code above , But it doesn't include who's wrong , At least it should be
def user_login(username, password): # some code has check the username if not
valid_password(password) or not check_password(username, password): logger.warn(
'user_login failed due to password, username %s', username) return

   The unique identification of the log can also help us retrieve , filter , Find all logs of the last period of time for this unique ID , Sort by time , It can restore the active trace of the log body , For example, a user's operation on the website will spread to multiple processes , service . When through user name ( Unique identification of the user ) When searching for logs , Can restore the player's operation record , This is very effective when looking up online problems .


   of course , This unique logo is very extensive , It needs to be decided according to the specific situation , If network requests , Maybe better requestid,sessionid; If it is system log , So it could be a process , thread ID; If it's a distributed cluster , Then it could be the only copy id

context: environment when event happens


   The context in which logged events occur is intuitive and important , Be able to tell you what happened . of course , As mentioned above when,where,who All belong to context , These are fixed , General . And in this section ,context Information highly dependent on specific log content , This information , Is used to locate the specific cause of the problem . such as
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password, username %s', username) 5 return
6 # some others code
 

    Read code , When password Non conformance to specifications , Or verification password When you fail, you make mistakes , But that's what happened , The content of the log does not indicate , So this log is still unqualified .

   in my opinion , This part is the most elastic , But the most important part , There are many exceptions , What information to record , It all depends on the programmer who wrote the code . But it's also very simple , Just follow one principle :
Think about what information can be added to locate the cause of the problem !

how to log and tips

use logging framework


   The logging functions provided by the early programming languages are relatively rudimentary , Usually output to terminal or file , as C++,Java, And it does not support the configuration of unified output format . For these languages , There's usually a separate one log frame , as glog,log4cpp,log4j,elf4j wait . And newer languages , Generally built-in powerful logging modular , as python.

   I'm talking about it earlier “what to
log” When , We pointed out that we should record the time when the log happened , place , Grade and other information . Is this information printed in the log every time , These contents are really necessary for a complete log , But if you want programmers to write every log, you have to add these things manually , So it's a burden . frame , The function of modules is to free programmers from these red tape . For example, in many logging Frame module (thon
logging,log4j) in , All of them named logger This concept , this name It could be module, filename,
classname perhaps instance, This solves the problem mentioned in the previous chapter “where the event happen” This question .

   ad locum , Yes logging The introduction of the framework is not detailed , Just to summarize logging frame , Function of the module

  (1) set up log Output level of


   In this way, the program code can not be changed , Modify only log Output level of , Which log output can be controlled , Which logs are not output . For example, our debugging log in the development period , Can be set to DEBUG, Set the output level to INFO, Then these debug logs will not be output .

  (2) Set the default content of each log

   with Python logging
<https://docs.python.org/2/library/logging.html#logrecord-attributes>
take as an example , Through formatter Set the default information of each log , Like time , file , Line No , process , Thread information .

  (3) Set the output target of the log


   Through configuration , You can specify that the log is output to stdout, Or a document , Or the Internet . Especially in linux On the server , Output log to syslog, Reuse syslog Powerful handling , Distribution function , coordination elk System analysis , Is a common practice for many applications .

log never throw


   We print logs , To record the scene of the accident , In order to find out the problem , solve the problem . Then you have to promise , The act of printing logs by itself cannot introduce new problems , such as , Can't throw exception in error . It's like , The fire engines that deal with the accident can't rush into the scene .

   however , More important logs , such as Error,
Fatal Level log , The lower the probability of occurrence , And we don't want to have a log like this , Because if it happens, it means something unusual or an online accident . Such logs , Unit test must be done well , Coverage test , Ensure that the log itself works properly .

log when u think something never happen

   this point , It should be aimed at why or when to
log for . That is to say , When you think something will never happen , According to Murphy's law , It's still likely to happen , Then it should be recorded in case of occurrence log,Error(Fatal) Level log. Especially in asynchronous , In the case of concurrency , The impossibility of a programmer's task becomes possible .

   For example, the following pseudo code
1 def magic_func(): 2 if check_ok(): 3 do_something() 4 5 return
 


   The code is simple , If the conditions are met , Then deal with it accordingly , Otherwise, return directly . There are two kinds 可能,第一种是条件不满足是可能的,可预期的情况,只是说,在这种情况下什么都不用作;第二种情况是程序员觉得条件一定会满足,一定不会出现else的情况,所以什么都没做.如果是第二种情况,那么就一定得加日志,因为一切都可能发生.

lazy logging

  日志的内容,在最终输出的时候,应该是一个字符串,那么这个字符串是什么时候产生呢?前面提到,我们可以通过log
level来控制一条日志是否被输出,那么字符串的生成越迟越好,因为有可能日志根本就无需输出,也就无需生成这个字符串,这也是我之前提到过的lazy思想
<https://www.cnblogs.com/xybaby/p/6425735.html>.

  至于lazy logging,各个语言,框架都是大同小异的,在这篇文章
<https://garygregory.wordpress.com/2015/10/08/the-art-of-logging-advanced-message-formatting/>
中,举出了log4j的例子,在这里还是用我比较熟悉的python讲解这个例子
1 #coding=utf-8 2 import logging 3 4 logger = logging.getLogger('
LazyLogging') 5 logger.setLevel(logging.DEBUG) 6 hander =
logging.StreamHandler() 7 formatter = logging.Formatter('%(asctime)s - %(name)s
- %(levelname)s - %(message)s') 8 hander.setFormatter(formatter) 9
logger.addHandler(hander)10 11 def getUserCount(): 12 logger.info('getUserCount
is called') 13 return 1 14 15 logger.debug("There are " + str(getUserCount()) +
" users logged in now.")
 

  上面的代码可谓麻雀虽小五脏俱全,包含了使用logging框架的方方面面.第4行是named logger,第5行设置log
level,第6行指定输出目标,第7行指定输出格式(包括时间,logger name,log level,具体的日志内容).

  运行结果如下:

2017-12-02 17:52:20,049 - LazyLogging - DEBUG - getUserCount is called
2017-12-02 17:52:20,049 - LazyLogging - INFO - There are 1 users logged in now.

  由于当前的log level是DEBUG,那么第15行的日志会输出,这是正常的.

  但是将log level改成INFO,“getUserCount is called”
这条日志仍然会被输出.这就暴露出两个问题,第一,即使日志内容最终不会被输出,仍然会生成一个字符串;第二,一定会调用getUserCount
这个函数,而如果根据log lebel不需要输出日志,这两步都是没有必要的

  要解决第一个问题很简单,将第15行改成

logger.debug("There are %s users logged in now.", getUserCount()) 

  即可,但是“getUserCount is called”这条日志仍然会被输出,即第二个问题仍然没有解决

  按照stackoverflow上的这篇文章
<https://stackoverflow.com/questions/4148790/lazy-logger-message-string-evaluation>
,也是可以解决第二个问题的,方案如下
1 class lazyEval: 2 def __init__(self, f, *args): 3 self.func = f 4 self.args =
args5 6 def __str__(self): 7 return str(self.func(*self.args)) 8 9
logger.debug("There are %s users logged in now.", lazyEval(getUserCount))
 

  在log lever为INFO的时候,可以发现“getUserCount is called”
也不会被输出了,即不会调用getUserCount函数,做到了真正的lazy logging.


  但是,这是非常不好的写法!除非你们保证所调用的函数是没有副作用的(side-effect),即函数的调用无状态,不会改变程序的状态,否则程序的状态居然会依赖于log
level,这是一个陷阱,一个坑,一点都不pythonic.

consider async logging


  日志的记录也是要消耗时间的,这也是我们不能到处打印日志的原因.一般来说,logging框架的效率都是比较高的,但是如果profile发现日志的输出确实带来了不容忽视的消耗,那么也是值得优化的,其中的一个方法就是异步log,以避免日志输出阻塞线程

conclusion

  在网上看到这么一句话

Logs are like car insurance. Nobody wants to pay for it, but when something
goes wrong everyone wants the best available  

  log就像车辆保险,没人愿意为保险付钱,但是一旦出了问题谁都又想有保险可用.

  我们打印日志的时候都很随意,于是在差bug的时候,就会骂别人,或者几天前的自己:SB!

  从今天起,写好每一条日志吧,与君共勉!

references

Distributed systems for fun and profit <http://book.mixu.net/distsys/time.html>

python logging#logrecord-attributes
<https://docs.python.org/2/library/logging.html#logrecord-attributes>

the-art-of-logging-advanced-message-formatting
<https://garygregory.wordpress.com/2015/10/08/the-art-of-logging-advanced-message-formatting/>