Programmers learn every language from printing“hello
world” Beginning, Logs are also learning for novice programmers, A powerful tool for debugging programs. When the project goes online, There will be all kinds of logs, For example, record the behavior of users, Status of the server, Exceptions, etc. Printing a log seems like a simple thing, Something not worth mentioning. however, When you see online projects with such logs, Can't help but ask us if we really print logs?

***********************************

Error Happened, F*ck

User xxx can not login  

   In my opinion, Printing log is an art, And it's been ignored by programmers for a long time, This paper attempts to answer the following questions, Share my thoughts on Printing log:

  (1) why or when to log

  (2) what to log

  (3) how to log and tips

   Be careful: This article will use“ Printing" Log to uniformly represent the output of log, But not the output of our narrow sense understanding to the terminal orstdout,stderr,
It generally refers to the output of log content to any target, Including but not limited to terminal, file, network transmission.

   Address of this article:http://www.cnblogs.com/xybaby/p/7954610.html
<http://www.cnblogs.com/xybaby/p/7954610.html> 

why or when to log


   Why to print the log, Or when to print the log, It seems to me the same question, That's the purpose of printing logs. Different purposes, Determines the format of log content, Frequency of output, Destination of output. What are the possible reasons for printing logs?

  (1) Debug development


   The purpose is to use the debugging program in the development period, This kind of log is large, And there's no real significance, Should only appear in the development period, It should not be output after the project goes online. How to control whether this type of log is output, It will be discussed in detail later.

  (2) User behavior log


   This type of log, Record the operation behavior of users, For big data analysis, For example monitoring, Risk management, Recommendation, etc.. This log, It is generally used for analysis by other teams, And it could be multiple teams, Therefore, there are certain format requirements, Developers should record in this format, Easy for other teams to use. Of course, What behaviors to document, operation, Generally agreed, therefore, The main role of developers is to execute.

  (3) Program running log


   Record the health of the program, Especially unexpected behavior, Abnormal situation, This log, Mainly for development, Use by maintenance personnel. When to record, What to record, It's all up to the developers, Developers are highly autonomous. This paper mainly discusses this type of log, Because as a server development, Operation and maintenance personnel, Program running log is often the life-saving straw to solve online problems.

  (4) Record the status of the system or machine

   For example, network request, systemCPU, Memory,IO Usage, etc, This kind of log is mainly used by operation and maintenance personnel, Generate a variety of more intuitive forms of presentation, Alarm in case of system failure.

 

what to log

   What should a log contain, I think it should include when,where,how,what,who,context, The specific meaning will be described in the next section.


   We need to keep a log, Always at a certain time, For example, a user's request, A network call, Or the internal state has changed. In the latter article, Collectively referred to as events(event), That is, the log content is a description of an event.

when: the time event happens


  when, That's when we print the logs( time stamp), Note that the time here refers to the time when the logged event occurred, Instead of the time when the log was finally output. For example, if the output target of a log is a file, So herewhen Not write to file time, Because there are often delays.

    The importance of time, stay《Distributed systems for fun and profit
<http://book.mixu.net/distsys/time.html>》 It's a good story.


   First, Time can be read(Interpretaion) Concrete, Realistic date, Let us know the time environment when the event happened, For example, is there any activity when there is a problem, Analyze the network of the server at that time combined with the system log,CPU,IO What is it like?. Specific time points can also help us analyze whether there is a certain rule in the occurrence of events, Like every day, weekly, There's a problem at a fixed time every month.


   Secondly, Time can represent a( a set) Duration of the event(duration). such as, It can monitor the execution time of a section of code, It can also record the time consumption of a network request. This time difference can also give us a lot of information, For example, predict the operation status of the program at that time based on experience, Are they compared?‘ card’.


   Last, Time represents the sequence of events(order), We sort multiple logs by time, This sequence can help usdebug How did the problem come about, In what order. This is for multi process( Multithreading), asynchronous, Very important for distributed programs. Although we know the time of global maintenance in distributed system(global
clock) It's a complicated thing, But we useNTP Agreement, Basically solve most of the problems.

where: where the event happens


  where, Where are the logs recorded, In essence, It's where the incident happened. According to the situation, Can be specific to which module, Which document? Even which function, Which line of code. Practice informing, At least module information should be included.

  where The point is to be able to let yourself or other programmers see at a glance where this log is generated, So we can roughly locate the problem, Instead of global log contentgrep Code

how:how importance of the event

  how
important, Represents the importance of the event, We print many kinds of logs, But the importance of different logs is different. such as, Debug logs are the least important, There should be no online projects, But the error log of program running needs to be taken seriously, Because an exception has occurred in the delegate program. Even the error log of the program, There are also different degrees of urgency, A network request timeout is totally different from the abnormal exit of a subprocess.

   therefore, Every log should havelog level,log
level Represents the importance of logs, Emergency level. Different languages, Framelevel There are some differences in segmentation, But it will includedebug,info,warn,error,fatal(critical). Its importance can be seen from its name.


   Of course, Occasionally,warn Anderror, perhapserror Andfatal The line between them is not so obvious, This needs to be agreed between teams. In our online project, abouterror,fatal Level log will alarm, If it doeserror Journal, Then we need to deal with it the next day at the latest, If it isfatal Journal, Even in the middle of the night, We need to get up and analyze it immediately, Handle.

what:what is the log message


   what Is the main content of the log, What happened should be described briefly and concisely. Requirements can be obtained through the log itself, Instead of Rereading the code that generated the log, To get a sense of what happened. therefore, The following log is unqualified:
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('Error Happened') 4 return 5 # some others code
   Yes, That's true. Something's wrong, But the log should tell me what's wrong, So the log should at least be like this:
1 def user_login(username, password): 2 if not valid_username(username): 3
logger.warn('user_login failed due to unvalid_username') 4 return 5 # some
others code
 

who:the uniq identify


  who Unique identifier representing the event producer(identity), Used to distinguish the same events. Especially on the server side, It's a lot of users, Concurrency of requests, If the log content does not contain unique identification information, Then the log will be submerged in the vast sea, For example, the following log:
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password') 5 return 6 # some others code
 

   What's wrong with the code above, But it doesn't include who's wrong, At least it should be
def user_login(username, password): # some code has check the username if not
valid_password(password) or not check_password(username, password): logger.warn(
'user_login failed due to password, username %s', username) return

   The unique identification of the log can also help us retrieve, filter, Find all logs of the last period of time for this unique ID, Sort by time, It can restore the active trace of the log body, For example, a user's operation on the website will spread to multiple processes, service. When through user name( Unique identification of the user) When searching for logs, Can restore the player's operation record, This is very effective when looking up online problems.


   Of course, This unique logo is very extensive, It needs to be decided according to the specific situation, If network requests, Maybe betterrequestid,sessionid; If it is system log, So it could be a process, threadID; If it's a distributed cluster, Then it could be the only copyid

context: environment when event happens


   The context in which logged events occur is intuitive and important, Be able to tell you what happened. Of course, As mentioned abovewhen,where,who All belong to context, These are fixed, General purpose. And in this section,context Information highly dependent on specific log content, Such information, Is used to locate the specific cause of the problem. such as
1 def user_login(username, password): 2 # some code has check the username 3 if
not valid_password(password) or not check_password(username, password): 4
logger.warn('user_login failed due to password, username %s', username) 5 return
6 # some others code
 

    Reading code, Whenpassword Non conformance to specifications, Or checkpassword When you fail, you make mistakes, But that's what happened, The content of the log does not indicate, So this log is still unqualified.

   In my opinion, This part is the most elastic, But the most important part, There are many exceptions, What information to record, It all depends on the programmer who wrote the code. But it's also very simple, Just follow one principle:
Think about what information can be added to locate the cause of the problem!

how to log and tips

use logging framework


   The logging functions provided by the early programming languages are relatively rudimentary, Usually output to terminal or file, asC++,Java, And it does not support the configuration of unified output format. For these languages, There's usually a separate onelog frame, asglog,log4cpp,log4j,elf4j Wait. And newer languages, Generally built-in powerfullogging Modular, aspython.

   I'm talking about it earlier“what to
log” When, We pointed out that we should record the time when the log happened, place, Grade and other information. Is this information printed in the log every time, These contents are really necessary for a complete log, But if you want programmers to write every log, you have to add these things manually, So it's a burden. frame, The function of modules is to free programmers from these red tape. For example, in manylogging Frame module(thon
logging,log4j) in, Both havenamed logger This concept, thisname It can bemodule, filename,
classname perhapsinstance, This solves the problem mentioned in the previous chapter“where the event happen” This problem.

   Ad locum, Yeslogging The introduction of the framework is not detailed, Just to summarizelogging frame, Function of the module

  (1) Set uplog Output level of


   In this way, the program code can not be changed, Just modifylog Output level of, Which log output can be controlled, Which logs are not output. For example, our debugging log in the development period, Can be set toDEBUG, Set the output level toINFO, Then these debug logs will not be output.

  (2) Set the default content of each log

   withPython logging
<https://docs.python.org/2/library/logging.html#logrecord-attributes>
take as an example, Can passformatter Set the default information of each log, Such as time, file, Line number, process, Thread information.

  (3) Set the output target of the log


   By configuration, You can specify that the log is output tostdout, Or document? Or network?. Especially inlinux Server, Output log tosyslog, Reusesyslog Powerful handling, Distribution function, Coordinationelk System analysis, Is a common practice for many applications.

log never throw


   We print logs, To record the scene of the accident, In order to find out the problem, Solve the problem. Then make sure, The act of printing logs by itself cannot introduce new problems, such as, Can't throw exception in error. This is like comparison. The fire engines that deal with the accident can't rush into the scene.

   however, More important logs, such asError,
Fatal Level log, The lower the probability of occurrence, And we don't want to have a log like this, Because if it happens, it means something unusual or an online accident. Such logs, Unit test must be done well, Coverage test, Ensure that the log itself works properly.

log when u think something never happen

   this point, It should be aimed atwhy or when to
log For. That is to say, When you think something will never happen, According to Murphy's law, It's still likely to happen, Then it should be recorded in case of occurrencelog,Error(Fatal) Gradelog. Especially in asynchronous, In the case of concurrency, The impossibility of a programmer's task becomes possible.

   For example, the following pseudo code
1 def magic_func(): 2 if check_ok(): 3 do_something() 4 5 return
 


   The code is simple, If the conditions are met, Then deal with it accordingly, Otherwise, return directly. There are two kinds here可能,第一种是条件不满足是可能的,可预期的情况,只是说,在这种情况下什么都不用作;第二种情况是程序员觉得条件一定会满足,一定不会出现else的情况,所以什么都没做.如果是第二种情况,那么就一定得加日志,因为一切都可能发生.

lazy logging

  日志的内容,在最终输出的时候,应该是一个字符串,那么这个字符串是什么时候产生呢?前面提到,我们可以通过log
level来控制一条日志是否被输出,那么字符串的生成越迟越好,因为有可能日志根本就无需输出,也就无需生成这个字符串,这也是我之前提到过的lazy思想
<https://www.cnblogs.com/xybaby/p/6425735.html>.

  至于lazy logging,各个语言,框架都是大同小异的,在这篇文章
<https://garygregory.wordpress.com/2015/10/08/the-art-of-logging-advanced-message-formatting/>
中,举出了log4j的例子,在这里还是用我比较熟悉的python讲解这个例子
1 #coding=utf-8 2 import logging 3 4 logger = logging.getLogger('
LazyLogging') 5 logger.setLevel(logging.DEBUG) 6 hander =
logging.StreamHandler() 7 formatter = logging.Formatter('%(asctime)s - %(name)s
- %(levelname)s - %(message)s') 8 hander.setFormatter(formatter) 9
logger.addHandler(hander)10 11 def getUserCount(): 12 logger.info('getUserCount
is called') 13 return 1 14 15 logger.debug("There are " + str(getUserCount()) +
" users logged in now.")
 

  上面的代码可谓麻雀虽小五脏俱全,包含了使用logging框架的方方面面.第4行是named logger,第5行设置log
level,第6行指定输出目标,第7行指定输出格式(包括时间,logger name,log level,具体的日志内容).

  运行结果如下:

2017-12-02 17:52:20,049 - LazyLogging - DEBUG - getUserCount is called
2017-12-02 17:52:20,049 - LazyLogging - INFO - There are 1 users logged in now.

  由于当前的log level是DEBUG,那么第15行的日志会输出,这是正常的.

  但是将log level改成INFO,“getUserCount is called”
这条日志仍然会被输出.这就暴露出两个问题,第一,即使日志内容最终不会被输出,仍然会生成一个字符串;第二,一定会调用getUserCount
这个函数,而如果根据log lebel不需要输出日志,这两步都是没有必要的

  要解决第一个问题很简单,将第15行改成

logger.debug("There are %s users logged in now.", getUserCount()) 

  即可,但是“getUserCount is called”这条日志仍然会被输出,即第二个问题仍然没有解决

  按照stackoverflow上的这篇文章
<https://stackoverflow.com/questions/4148790/lazy-logger-message-string-evaluation>
,也是可以解决第二个问题的,方案如下
1 class lazyEval: 2 def __init__(self, f, *args): 3 self.func = f 4 self.args =
args5 6 def __str__(self): 7 return str(self.func(*self.args)) 8 9
logger.debug("There are %s users logged in now.", lazyEval(getUserCount))
 

  在log lever为INFO的时候,可以发现“getUserCount is called”
也不会被输出了,即不会调用getUserCount函数,做到了真正的lazy logging.


  但是,这是非常不好的写法!除非你们保证所调用的函数是没有副作用的(side-effect),即函数的调用无状态,不会改变程序的状态,否则程序的状态居然会依赖于log
level,这是一个陷阱,一个坑,一点都不pythonic.

consider async logging


  日志的记录也是要消耗时间的,这也是我们不能到处打印日志的原因.一般来说,logging框架的效率都是比较高的,但是如果profile发现日志的输出确实带来了不容忽视的消耗,那么也是值得优化的,其中的一个方法就是异步log,以避免日志输出阻塞线程

conclusion

  在网上看到这么一句话

Logs are like car insurance. Nobody wants to pay for it, but when something
goes wrong everyone wants the best available  

  log就像车辆保险,没人愿意为保险付钱,但是一旦出了问题谁都又想有保险可用.

  我们打印日志的时候都很随意,于是在差bug的时候,就会骂别人,或者几天前的自己:SB!

  从今天起,写好每一条日志吧,与君共勉!

references

Distributed systems for fun and profit <http://book.mixu.net/distsys/time.html>

python logging#logrecord-attributes
<https://docs.python.org/2/library/logging.html#logrecord-attributes>

the-art-of-logging-advanced-message-formatting
<https://garygregory.wordpress.com/2015/10/08/the-art-of-logging-advanced-message-formatting/>