Programmers learn every language from printing “hello
world” Starting , Logs are also learning for novice programmers , A powerful tool for debugging programs . When the project goes online , There will be all kinds of logs , For example, record the behavior of users , Status of the server , Exceptions, etc . Printing a log seems like a simple thing , Something not worth mentioning . however , When you see online projects with such logs , Can't help but ask us if we really print logs ?


Error Happened, F*ck

User xxx can not login  

   in my opinion , Printing log is an art , And it's been ignored by programmers for a long time , This paper attempts to answer the following questions , Share my thoughts on Printing log :

  (1) why or when to log

  (2) what to log

  (3) how to log and tips

   be careful : This article will use “ Printing " Log to uniformly represent the output of log , But not the output of our narrow sense understanding to the terminal or stdout,stderr,
It generally refers to the output of log content to any target , Including but not limited to terminal , file , network transmission .

   Address of this article :

why or when to log

   Why to print the log , Or when to print the log , It seems to me the same question , That's the purpose of printing logs . Different purposes , Determines the format of log content , Frequency of output , Destination of output . What are the possible reasons for printing logs ?

  (1) Debugging and development

   The purpose is to use the debugging program in the development period , This kind of log is large , And there's no real significance , Should only appear in the development period , It should not be output after the project goes online . How to control whether this type of log is output , It will be discussed in detail later .

  (2) User behavior log

   This type of log , Record the operation behavior of users , For big data analysis , For example, monitoring , risk management , Recommendations, etc . This kind of log , It is generally used for analysis by other teams , And it could be multiple teams , Therefore, there are certain format requirements , Developers should record in this format , Easy for other teams to use . of course , What behaviors to document , operation , Generally agreed , therefore , The main role of developers is to execute .

  (3) Program running log

   Record the health of the program , Especially unexpected behavior , Abnormal conditions , This kind of log , Mainly for development , Use by maintenance personnel . When to record , What to record , It's all up to the developers , Developers are highly autonomous . This paper mainly discusses this type of log , Because as a server development , Operation and maintenance personnel , Program running log is often the life-saving straw to solve online problems .

  (4) Record the status of the system or machine

   For example, network request , system CPU, Memory ,IO Usage, etc , This kind of log is mainly used by operation and maintenance personnel , Generate a variety of more intuitive forms of presentation , Alarm in case of system failure .


what to log

   What should a log contain , I think it should include when,where,how,what,who,context, The specific meaning will be described in the next section .

   We need to keep a log , Always at a certain time , For example, a user's request , A network call , Or the internal state has changed . In the following text , Collectively referred to as events (event), That is, the log content is a description of an event .

when: the time event happens

  when, That's when we print the logs ( time stamp ), Note that the time here refers to the time when the logged event occurred , Instead of the time when the log was finally output . For example, if the output target of a log is a file , So here when Not write to file time , Because there are often delays .

    The importance of time , stay 《Distributed systems for fun and profit
<>》 It's a good story .

   first , Time can be read (Interpretaion) To be concrete , Realistic date , Let us know the time environment when the event happened , For example, is there any activity when there is a problem , Analyze the network of the server at that time combined with the system log ,CPU,IO How is it . Specific time points can also help us analyze whether there is a certain rule in the occurrence of events , Like every day , weekly , There's a problem at a fixed time every month .

   secondly , Time can represent a ( a set ) Duration of the event (duration). such as , It can monitor the execution time of a section of code , It can also record the time consumption of a network request . This time difference can also give us a lot of information , For example, predict the operation status of the program at that time based on experience , Compare or not ‘ card ’.

   last , Time represents the sequence of events (order), We sort multiple logs by time , This sequence can help us debug How did the problem come about , In what order . This is for multi process ( Multithreading ), asynchronous , Very important for distributed programs . Although we know the time of global maintenance in distributed system (global
clock) It's a complicated thing , But we use NTP agreement , Basically solve most of the problems .

where: where the event happens

  where, Where are the logs recorded , In essence , It's where the incident happened . According to the situation , Can be specific to which module , Which file , Even which function , Which line of code . Practice notification , At least module information should be included .

  where The point is to be able to let yourself or other programmers see at a glance where this log is generated , So we can roughly locate the problem , Instead of global log content grep code

how:how importance of the event

important, Represents the importance of the event , We print many kinds of logs , But the importance of different logs is different . such as , Debug logs are the least important , There should be no online projects , But the error log of program running needs to be taken seriously , Because an exception has occurred in the delegate program . Even the error log of the program , There are also different degrees of urgency , A network request timeout is totally different from the abnormal exit of a subprocess .

   therefore , Every log should have log level,log
level Represents the importance of logs , Urgency . Different languages , Framed level There are some differences in segmentation , But it will include debug,info,warn,error,fatal(critical). Its importance can be seen from its name .

   of course , occasionally ,warn And error, perhaps error And fatal The line between them is not so obvious , This needs to be agreed between teams . In our online project , about error,fatal Level log will alarm , If it does error journal , Then we need to deal with it the next day at the latest , If it is fatal journal , Even in the middle of the night , We need to get up and analyze it immediately , handle .

what:what is the log message

   what Is the main content of the log , What happened should be described briefly and concisely . Requirements can be obtained through the log itself , Instead of Rereading the code that generated the log , To get a sense of what happened . therefore , The following log is unqualified :
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('Error Happened') 4 return 5 # some others code
   yes , I Know , Something's wrong , But the log should tell me what's wrong , So the log should at least be like this :
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('user_login failed due to unvalid_username') 4 return 5 # some
others code

who:the uniq identify

  who Unique identifier representing the event producer (identity), Used to distinguish the same events . Especially on the server side , It's a lot of users , Concurrency of requests , If the log content does not contain unique identification information , Then the log will be submerged in the vast sea , For example, the following log :
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password') 5 return 6 # some others code

   What's wrong with the code above , But it doesn't include who's wrong , At least it should be
def user_login(username, password): # some code has check the username if not
valid_password(password) or not check_password(username, password): logger.warn(
'user_login failed due to password, username %s', username) return

   The unique identification of the log can also help us retrieve , filter , Find all logs of the last period of time for this unique ID , Sort by time , It can restore the active trace of the log body , For example, a user's operation on the website will spread to multiple processes , service . When through user name ( Unique identification of the user ) When searching for logs , Can restore the player's operation record , This is very effective when looking up online problems .

   of course , This unique logo is very extensive , It needs to be decided according to the specific situation , If network requests , Maybe better requestid,sessionid; If it is system log , So it could be a process , thread ID; If it's a distributed cluster , Then it could be the only copy id

context: environment when event happens

   The context in which logged events occur is intuitive and important , Be able to tell you what happened . of course , As mentioned above when,where,who All belong to context , These are fixed , General . And in this section ,context Information highly dependent on specific log content , This information , Is used to locate the specific cause of the problem . such as
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password, username %s', username) 5 return
6 # some others code

    Read code , When password Non conformance to specifications , Or verification password When you fail, you make mistakes , But that's what happened , The content of the log does not indicate , So this log is still unqualified .

   in my opinion , This part is the most elastic , But the most important part , There are many exceptions , What information to record , It all depends on the programmer who wrote the code . But it's also very simple , Just follow one principle :
Think about what information can be added to locate the cause of the problem !

how to log and tips

use logging framework

   The logging functions provided by the early programming languages are relatively rudimentary , Usually output to terminal or file , as C++,Java, And it does not support the configuration of unified output format . For these languages , There's usually a separate one log frame , as glog,log4cpp,log4j,elf4j wait . And newer languages , Generally built-in powerful logging modular , as python.

   I'm talking about it earlier “what to
log” When , We pointed out that we should record the time when the log happened , place , Grade and other information . Is this information printed in the log every time , These contents are really necessary for a complete log , But if you want programmers to write every log, you have to add these things manually , So it's a burden . frame , The function of modules is to free programmers from these red tape . For example, in many logging Frame module (thon
logging,log4j) in , All of them named logger This concept , this name It could be module, filename,
classname perhaps instance, This solves the problem mentioned in the previous chapter “where the event happen” This question .

   ad locum , Yes logging The introduction of the framework is not detailed , Just to summarize logging frame , Function of the module

  (1) set up log Output level of

   In this way, the program code can not be changed , Modify only log Output level of , Which log output can be controlled , Which logs are not output . For example, our debugging log in the development period , Can be set to DEBUG, Set the output level to INFO, Then these debug logs will not be output .

  (2) Set the default content of each log

   with Python logging
take as an example , Through formatter Set the default information of each log , Like time , file , Line No , process , Thread information .

  (3) Set the output target of the log

   Through configuration , You can specify that the log is output to stdout, Or a document , Or the Internet . Especially in linux On the server , Output log to syslog, Reuse syslog Powerful handling , Distribution function , coordination elk System analysis , Is a common practice for many applications .

log never throw

   We print logs , To record the scene of the accident , In order to find out the problem , solve the problem . Then you have to promise , The act of printing logs by itself cannot introduce new problems , such as , Can't throw exception in error . It's like , The fire engines that deal with the accident can't rush into the scene .

   however , More important logs , such as Error,
Fatal Level log , The lower the probability of occurrence , And we don't want to have a log like this , Because if it happens, it means something unusual or an online accident . Such logs , Unit test must be done well , Coverage test , Ensure that the log itself works properly .

log when u think something never happen

   this point , It should be aimed at why or when to
log for . That is to say , When you think something will never happen , According to Murphy's law , It's still likely to happen , Then it should be recorded in case of occurrence log,Error(Fatal) Level log. Especially in asynchronous , In the case of concurrency , The impossibility of a programmer's task becomes possible .

   For example, the following pseudo code
1 def magic_func(): 2 if check_ok(): 3 do_something() 4 5 return

   The code is simple , If the conditions are met , Then deal with it accordingly , Otherwise, return directly . There are two kinds 可能,第一种是条件不满足是可能的,可预期的情况,只是说,在这种情况下什么都不用作;第二种情况是程序员觉得条件一定会满足,一定不会出现else的情况,所以什么都没做.如果是第二种情况,那么就一定得加日志,因为一切都可能发生.

lazy logging


  至于lazy logging,各个语言,框架都是大同小异的,在这篇文章
1 #coding=utf-8 2 import logging 3 4 logger = logging.getLogger('
LazyLogging') 5 logger.setLevel(logging.DEBUG) 6 hander =
logging.StreamHandler() 7 formatter = logging.Formatter('%(asctime)s - %(name)s
- %(levelname)s - %(message)s') 8 hander.setFormatter(formatter) 9
logger.addHandler(hander)10 11 def getUserCount(): 12'getUserCount
is called') 13 return 1 14 15 logger.debug("There are " + str(getUserCount()) +
" users logged in now.")

  上面的代码可谓麻雀虽小五脏俱全,包含了使用logging框架的方方面面.第4行是named logger,第5行设置log
level,第6行指定输出目标,第7行指定输出格式(包括时间,logger name,log level,具体的日志内容).


2017-12-02 17:52:20,049 - LazyLogging - DEBUG - getUserCount is called
2017-12-02 17:52:20,049 - LazyLogging - INFO - There are 1 users logged in now.

  由于当前的log level是DEBUG,那么第15行的日志会输出,这是正常的.

  但是将log level改成INFO,“getUserCount is called”
这个函数,而如果根据log lebel不需要输出日志,这两步都是没有必要的


logger.debug("There are %s users logged in now.", getUserCount()) 

  即可,但是“getUserCount is called”这条日志仍然会被输出,即第二个问题仍然没有解决

1 class lazyEval: 2 def __init__(self, f, *args): 3 self.func = f 4 self.args =
args5 6 def __str__(self): 7 return str(self.func(*self.args)) 8 9
logger.debug("There are %s users logged in now.", lazyEval(getUserCount))

  在log lever为INFO的时候,可以发现“getUserCount is called”
也不会被输出了,即不会调用getUserCount函数,做到了真正的lazy logging.


consider async logging




Logs are like car insurance. Nobody wants to pay for it, but when something
goes wrong everyone wants the best available  





Distributed systems for fun and profit <>

python logging#logrecord-attributes