This article explains what happens to a socket at each stage of a TCP connection. I hope it helps readers without a network programming background understand what a socket is and what role it plays. If you find an error, please point it out.


Background

1. The full socket format: {protocol, src_addr, src_port, dest_addr, dest_port}.

This is often called the 5-tuple of a socket. protocol specifies whether the connection is TCP or UDP; the remaining fields are the source address, source port, destination address, and destination port. But where do these values come from?
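As a rough illustration of the 5-tuple, the sketch below connects two sockets over the loopback interface and reads the tuple from each end. The names FiveTuple, five_tuple, and loopback_pair are invented for this example, and the protocol field is simply hard-coded to "TCP".

```python
import socket
from typing import NamedTuple


class FiveTuple(NamedTuple):
    protocol: str
    src_addr: str
    src_port: int
    dest_addr: str
    dest_port: int


def five_tuple(sock: socket.socket) -> FiveTuple:
    """Build the 5-tuple of a connected TCP socket, seen from its own side."""
    src_addr, src_port = sock.getsockname()[:2]
    dest_addr, dest_port = sock.getpeername()[:2]
    return FiveTuple("TCP", src_addr, src_port, dest_addr, dest_port)


def loopback_pair():
    """Return (client, server_side) sockets connected over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    listener.close()
    return client, server_side


if __name__ == "__main__":
    client, server_side = loopback_pair()
    print(five_tuple(client))
    print(five_tuple(server_side))  # same connection, source and destination swapped
    client.close()
    server_side.close()
```

Each end sees the same connection with source and destination swapped, which is why the tuple is always described from one side's point of view.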

2. The TCP protocol stack maintains two socket buffers: the send buffer and the recv buffer.

Data to be sent over a TCP connection is first copied into the send buffer, either from an app buffer in user space or from a kernel buffer. The copy is performed by send(). Because write() can also be used to write the data, this step is also called writing data, and the send buffer is correspondingly also known as the write buffer. On a socket, send() differs from write() only in taking an extra flags argument; with flags set to 0 the two are equivalent.

The data ultimately leaves through the network card, so the data in the send buffer has to be copied to the network card. Because one end is memory and the other is a device, the copy can be done by DMA without CPU involvement. In other words, the data in the send buffer is copied to the network card by DMA and transmitted to the other end of the TCP connection: the receiver.

When data is received over a TCP connection, it first arrives through the network card and is copied by DMA into the recv buffer; recv() then copies the data from the recv buffer into the app buffer.
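The size of these two buffers can be inspected from user space. The following is a minimal sketch that assumes the standard SO_SNDBUF and SO_RCVBUF socket options; the helper name buffer_sizes is made up for this example, and the values it reports vary by operating system and configuration.

```python
import socket


def buffer_sizes(sock: socket.socket) -> dict:
    """Return the kernel's current send/recv buffer sizes for a socket, in bytes."""
    return {
        "send_buffer": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "recv_buffer": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(buffer_sizes(s))  # values depend on the operating system
```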

The general process is as follows:



3. Two kinds of sockets: listening sockets and connected sockets.


A listening socket is created when the service process reads its configuration file, parses the address and port to listen on, and calls socket(); bind() then binds this listening socket to that address and port. After that, the process/thread can call listen() to listen on the port (strictly speaking, to watch the listening socket).

A connected socket is the socket returned by accept() after a TCP connection request has been received and the three-way handshake has completed. The process/thread then uses the connected socket for TCP communication with the client.


To tell apart the socket descriptors returned by socket() and accept(), some people use listenfd and connfd for the listening socket and the connected socket respectively. The names are descriptive and are used occasionally below.
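To make the listenfd/connfd distinction concrete, here is a small sketch over the loopback interface. The function name listen_and_accept_once is invented for this example; it returns the descriptor numbers of the listening socket and of the socket produced by accept().

```python
import socket


def listen_and_accept_once():
    """Return (listenfd, connfd) for one loopback connection."""
    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listen_sock.bind(("127.0.0.1", 0))  # port 0: kernel picks a free port
    listen_sock.listen(5)

    client = socket.create_connection(listen_sock.getsockname())
    conn_sock, _peer = listen_sock.accept()  # a new socket, not the listening one

    fds = (listen_sock.fileno(), conn_sock.fileno())
    for s in (client, conn_sock, listen_sock):
        s.close()
    return fds


if __name__ == "__main__":
    listenfd, connfd = listen_and_accept_once()
    print("listenfd:", listenfd, "connfd:", connfd)
```

The two descriptors are different numbers because accept() hands back a separate socket for the established connection, leaving the listening socket free to take further requests.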

The functions are analyzed one by one below; going through them also covers how a connection is established and torn down.


Detailed analysis of the connection process

See the following figure:




socket() function

socket() creates a socket file descriptor, sockfd, to be used for communication (in the words of the man page: socket() creates an endpoint for communication and returns a descriptor). This socket descriptor can later be used as the object that bind() binds.


bind() function


The service program parses its configuration file to obtain the address and port it should listen on. Together with the socket sockfd created by socket(), it can then call bind() to bind that socket to the address and port combination "addr:port" to be listened on. A socket bound to a port can then be handed to listen() as the socket to listen on.

A socket bound to an address and port has a source address and a source port (source from the server's own point of view). Adding the protocol type specified in the configuration file, three elements of the 5-tuple are now filled in, namely:
{protocol, src_addr, src_port}
It is, however, common for a service program to be configured to listen on several addresses and ports to run multiple instances. This is done by calling socket()+bind() several times to create and bind multiple sockets.
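The repeated socket()+bind() pattern might look like the sketch below. The helper names bound_listeners and has_peer are invented for this example, and the addresses are placeholders on the loopback interface. has_peer shows that a socket that is only bound has no destination half of the 5-tuple yet.

```python
import socket


def bound_listeners(addresses):
    """Create one bound (not yet listening) TCP socket per (addr, port) pair."""
    socks = []
    for addr, port in addresses:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((addr, port))  # fills in src_addr and src_port
        socks.append(s)
    return socks


def has_peer(sock):
    """True if the socket has a destination address, i.e. it is connected."""
    try:
        sock.getpeername()
        return True
    except OSError:  # not connected: dest_addr and dest_port are still unknown
        return False


if __name__ == "__main__":
    # Port 0 asks the kernel for any free port; a real server would use
    # the addresses and ports from its configuration file.
    for s in bound_listeners([("127.0.0.1", 0), ("127.0.0.1", 0)]):
        print(s.getsockname(), "connected:", has_peer(s))
        s.close()
```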


listen() and connect() functions


As the name suggests, listen() listens on the addr+port already bound with bind(). Once listening starts, the socket moves from the CLOSED state to the LISTEN state and can serve as an entry point for incoming TCP connections.


connect(), for its part, initiates a connection request to a listening socket; that is, it starts TCP's three-way handshake. It follows that the side requesting the connection (such as a client) is the one that calls connect(). Before calling connect(), the initiator also has to create a sockfd, and it most likely uses a socket bound to a random port. Since connect() initiates a connection to a socket, the call naturally carries the destination of the connection, that is, the destination address and port, which are the address and port bound to the server's listening socket. It also carries its own address and port, which from the server's point of view are the source address and port of the connection request. At this point the sockets at both ends of the TCP connection have the complete 5-tuple format.
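The "random port" on the initiating side can be observed directly. This sketch assumes a Unix-like system, where getsockname() on an unbound socket reports port 0; the function name client_ports_around_connect is invented for this example.

```python
import socket


def client_ports_around_connect():
    """Return the client's local port before and after connect(), plus the peer."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    port_before = client.getsockname()[1]   # 0: nothing bound yet
    client.connect(listener.getsockname())  # kernel binds an ephemeral port here
    port_after = client.getsockname()[1]
    peer = client.getpeername()             # the listener's bound addr and port

    client.close()
    listener.close()
    return port_before, port_after, peer


if __name__ == "__main__":
    print(client_ports_around_connect())
```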


In-depth analysis of listen()

Now for listen() in more detail. If a server listens on several address+port combinations, it has to watch several sockets, so the process/thread responsible for listening uses select() or poll() to poll them (epoll can be used as well). The same mechanisms are used when only one socket is watched; select() or poll() then simply has a single socket descriptor of interest.

Whether select() or poll() is used (how epoll monitors differently is not discussed here), the listening process/thread (the listener) blocks in select() or poll() while it waits. When data (a SYN) is written to the sockfd it is watching (that is, into the recv buffer), the kernel is woken up. Note that it is the kernel, not the app process, that is woken, because TCP's three-way handshake and four-way teardown are carried out by the kernel in kernel space and do not involve user space. The kernel copies the SYN into a kernel buffer for processing (for example, checking whether the SYN is valid) and prepares the SYN+ACK, which is copied from the kernel buffer into the send buffer and then to the network card for transmission. At this point a new entry for the connection is created in the incomplete connection queue (syn queue) and set to the SYN_RECV state.

The listener then goes back to watching listenfd with select()/poll() until data is written to listenfd again and the kernel is woken once more. If the data written this time is an ACK, it is the client's reply to the SYN+ACK sent by the server kernel, so the data is copied into the kernel buffer, processed, and the corresponding entry is moved from the incomplete connection queue to the completed connection queue (accept queue / established queue) and set to the ESTABLISHED state. If it is not an ACK, it must be a SYN, that is, a new connection request, which goes through the same steps as above and is placed in the incomplete connection queue.

Connections in the completed queue wait to be consumed by accept() (the accept() system call is issued by a user-space process; the consumption itself is carried out by the kernel). Once a connection has been accepted, it is removed from the completed queue and the TCP connection is fully established. The user-space processes at both ends can then exchange real data over it until close() or shutdown() starts the four-way teardown; the kernel's connection handling is not needed in between. This is the loop by which the listener handles TCP connections.

In other words, listen() also maintains two queues: the incomplete connection queue (syn queue) and the completed connection queue (accept queue). When the listener has received a client's SYN and replied with SYN+ACK, an entry for that client is appended to the end of the incomplete connection queue with its state set to SYN_RECV. This entry must contain the client's address and port (possibly in hashed form; I am not sure). When the server later receives the client's ACK, the listener thread examines it to find which entry in the incomplete connection queue it answers, moves that entry to the completed connection queue, and sets its state to ESTABLISHED. The entry then waits for accept() to consume it. From then on the kernel steps off the stage until the four-way teardown.
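The bookkeeping described above can be mimicked with a toy model. This is only a sketch of the two-queue idea, not how the kernel is implemented; the class ListenQueues and its method names are invented for this example.

```python
from collections import deque


class ListenQueues:
    """Toy model of the two queues behind a listening socket (not kernel code)."""

    def __init__(self):
        self.syn_queue = {}          # client (addr, port) -> state
        self.accept_queue = deque()  # clients whose handshake is complete

    def on_syn(self, client):
        # SYN received, SYN+ACK sent: remember the half-open connection.
        self.syn_queue[client] = "SYN_RECV"

    def on_ack(self, client):
        # Final ACK received: move the matching entry to the accept queue.
        if self.syn_queue.pop(client, None) is not None:
            self.accept_queue.append((client, "ESTABLISHED"))

    def accept(self):
        # accept() consumes the oldest completed connection, if any.
        return self.accept_queue.popleft()[0] if self.accept_queue else None


if __name__ == "__main__":
    q = ListenQueues()
    q.on_syn(("10.0.0.1", 40000))
    print(q.accept())              # nothing to accept: handshake not finished
    q.on_ack(("10.0.0.1", 40000))
    print(q.accept())              # the completed connection
```

An ACK from a client that never sent a SYN is ignored in this model, which mirrors the idea that the listener must find the matching entry in the incomplete queue first.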

When the incomplete connection queue is full, the listener is blocked and accepts no new connection requests; it waits via select()/poll() for the two queues to trigger a writable event. When the completed connection queue is full, the listener likewise accepts no new connection requests, and entries about to be moved into the completed connection queue are held back. Before Linux 2.2, the backlog parameter of listen() set the maximum combined length of the two queues (strictly speaking there was a single queue with two states; see the note on the two types of TCP socket below). Since Linux 2.2, backlog specifies only the maximum length of the completed queue (accept queue), while /proc/sys/net/ipv4/tcp_max_syn_backlog sets the maximum length of the incomplete queue (syn queue / syn backlog). /proc/sys/net/core/somaxconn is a hard limit on the length of the completed queue. Its default is 128 (4096 since Linux 5.4); if backlog is greater than somaxconn, it is silently truncated to this hard limit.
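The truncation rule is simple enough to write down. The sketch below assumes a Linux system for the /proc path and falls back to a default elsewhere; read_somaxconn and effective_backlog are names invented for this example.

```python
from pathlib import Path

SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(default=128):
    """Read the somaxconn hard limit; fall back to `default` if unavailable."""
    try:
        return int(SOMAXCONN_PATH.read_text().strip())
    except (OSError, ValueError):
        return default


def effective_backlog(requested, somaxconn):
    """The accept-queue limit actually applied: backlog capped at somaxconn."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    limit = read_somaxconn()
    print("somaxconn:", limit)
    print("listen(1024) is treated as backlog", effective_backlog(1024, limit))
```

With a somaxconn of 128, a call such as listen(1024) behaves as if the backlog were 128, while listen(50) keeps its requested value.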
Once a connection in the completed queue has been accepted with accept(), the TCP connection is established and uses its own socket buffer to exchange data with the client. Both this socket buffer and the listening socket's socket buffer hold data received and sent over TCP, but they serve different purposes: the listening socket's socket buffer only handles the SYN and ACK data of connection requests, while the socket buffer of an established TCP connection mainly holds the "real" data exchanged between the two ends, such as response data built by the server or HTTP request data sent by the client.

A note on the two types of TCP socket: there are actually two different TCP socket implementations. The two-queue type described above is the one Linux has used since 2.2. The other (BSD-derived) type uses a single queue that holds all connections going through the three-way handshake, and each connection in the queue is in one of two states: syn-recv or established.

Explanation of Recv-Q and Send-Q

The Send-Q and Recv-Q columns of the netstat command show information related to the socket buffer. Here is the explanation from man netstat.

Recv-Q
    Established: The count of bytes not copied by the user program connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.

For a listening socket, Recv-Q is the current backlog, that is, the number of connections currently queued, and Send-Q is the maximum value of that backlog. Note that although the man page says "syn backlog", on current Linux kernels the values reported for a listening socket come from the accept queue: Recv-Q is the number of completed connections waiting for accept(), and the Send-Q shown by ss is that queue's limit, which is why 128 (the somaxconn default) appears in the ss output below.
For an established TCP connection, Recv-Q is the amount of data in the recv buffer that the user process has not yet copied out, and Send-Q is the amount of data for which the remote host has not yet returned an ACK.

The reason for distinguishing sockets of established TCP connections from listening sockets is that sockets in these two states use their socket buffers differently: a listening socket is concerned with queue length, while the socket of an established TCP connection is concerned with the amount of data received and sent.
# netstat -tnl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 ::1:25                  :::*                    LISTEN

# ss -tnl
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    *:22                 *:*
LISTEN     0      100    127.0.0.1:25         *:*
LISTEN     0      128    :::80                :::*
LISTEN     0      128    :::22                :::*
LISTEN     0      100    ::1:25               :::*

Note that for sockets in the LISTEN state, the Send-Q column of netstat and the Send-Q column of ss show different values, because netstat does not report the maximum queue length at all. So to determine whether the queue still has room for new TCP connection requests, use the ss command rather than netstat whenever possible.


Impact of SYN flood


In addition, if the client's ACK does not arrive after the listener has sent SYN+ACK, the listener is woken by the timeout set on select()/poll() and resends the SYN+ACK to the client, in case the message was lost somewhere in the network. This retransmission has a problem, though. If the client forged its source address when calling connect(), the listener's SYN+ACK can never reach the real peer, so the listener never receives the ACK and keeps resending SYN+ACK. Both the repeated timeout wake-ups of select()/poll() and the repeated copying of data into the send buffer consume CPU, and each SYN+ACK in the send buffer must then be copied to the network card (this copy is done by DMA and needs no CPU). If the client is an attacker sending thousands or tens of thousands of SYNs in a row, the listener is all but brought down and the network card becomes heavily congested. This is the so-called SYN flood attack.

There are many ways to mitigate a SYN flood: for example, shrinking the maximum length of the two queues maintained by listen(), reducing the number of SYN+ACK retransmissions, increasing the retransmission interval, reducing the timeout for waiting for the ACK, using syncookies, and so on. But none of the approaches that directly modify TCP options is good for performance and efficiency. Filtering packets before the connection reaches the listener thread is therefore an extremely important measure.
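To get a feel for why the retransmission count matters, the sketch below estimates how long one half-open entry occupies the incomplete queue when the ACK never arrives. It assumes a 1-second initial timeout that doubles after every retransmission, which is a simplification; synack_wait_seconds is a name invented for this example. On Linux the retry count corresponds to the tcp_synack_retries sysctl.

```python
def synack_wait_seconds(retries, initial_timeout=1.0):
    """Total time a half-open entry lingers if no ACK ever arrives.

    Assumes the first SYN+ACK waits `initial_timeout` seconds and each
    retransmission doubles the wait (simple exponential backoff).
    """
    # One wait after the original SYN+ACK plus one after each retransmission.
    return sum(initial_timeout * 2 ** i for i in range(retries + 1))


if __name__ == "__main__":
    for retries in (5, 2, 1):
        print(retries, "retries ->", synack_wait_seconds(retries), "seconds")
```

Under these assumptions, 5 retries keep a forged entry in the queue for 1+2+4+8+16+32 = 63 seconds, while 2 retries cut that to 7 seconds, at the price of giving up sooner on clients behind slow or lossy links.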


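accept() function

accept() reads the first entry in the completed connection queue (removing it from the queue once read) and creates for it a socket descriptor to be used for the rest of the connection; call it connfd. With this new connected socket, a worker process/thread (the worker) can exchange data with the client, while the listening socket (sockfd) mentioned earlier is still watched by the listener.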


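Take httpd in prefork mode as an example. Each child process is both a listener and a worker. When a client initiates a connection request, the child process that is listening takes it in and releases its watch on the listening socket so that other child processes can listen on it. After several round trips, accept() finally produces the new connected socket, and the child process can then devote itself to interacting with the client over that socket, although it may block or sleep many times along the way while waiting on I/O. This is really inefficient: considering only the stages from the moment the child process receives the SYN to the creation of the new connected socket, the child process is blocked again and again. The listening socket can of course be set to non-blocking I/O mode, but even then the process has to keep checking the state.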


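Now consider the worker/event modes. Each child process uses one dedicated listener thread and N worker threads. The listener thread is solely responsible for listening and for creating new connected socket descriptors, which it places in Apache's socket queue. Listener and workers are thus separated, and the workers can keep working freely while listening goes on. From the listening perspective alone, worker/event mode performs far better than prefork mode.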


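When the listener issues the accept() system call and the completed connection queue is empty, the listener blocks. The socket can be set to non-blocking mode, in which case accept() returns the error EWOULDBLOCK or EAGAIN when nothing is available. select(), poll(), or epoll can be used to wait for a readable event on the completed connection queue. The socket can also be set to signal-driven I/O mode, so that new entries in the completed connection queue notify the listener, which then copies the data into the app buffer and processes it with accept().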
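The non-blocking case can be sketched as follows. In Python, EWOULDBLOCK/EAGAIN surfaces as BlockingIOError; the function name nonblocking_accept_demo is invented for this example, and the 2-second select() timeout is an arbitrary choice.

```python
import select
import socket


def nonblocking_accept_demo():
    """Show accept() failing on an empty queue, then succeeding after select()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.setblocking(False)

    try:
        listener.accept()            # completed connection queue is empty
        would_block = False
    except BlockingIOError:          # EWOULDBLOCK / EAGAIN
        would_block = True

    client = socket.create_connection(listener.getsockname())
    # A completed connection makes the listening socket readable.
    readable, _, _ = select.select([listener], [], [], 2.0)
    became_readable = listener in readable
    conn, _peer = listener.accept()  # now there is something to consume

    for s in (conn, client, listener):
        s.close()
    return would_block, became_readable


if __name__ == "__main__":
    print(nonblocking_accept_demo())
```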


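The terms synchronous connection and asynchronous connection come up often; how exactly do they differ? With a synchronous connection, from the moment the listener sees a client's SYN it must keep waiting until the connected socket is created and the data exchange with that client has finished; until the connection with this client is closed, it accepts no connection requests from other clients. In more detail, a synchronous connection has to keep the socket buffer and the app buffer consistent. When connections are handled synchronously, the listener and the worker are usually the same process, as in httpd's prefork model. An asynchronous connection, by contrast, can accept and handle other connection requests at any stage of connection setup or data exchange. Asynchronous handling is usually used when the listener and the worker are not the same process, as in httpd's event model. In the worker model the listener and the workers are separate, yet connections are still handled synchronously: after the listener takes in a connection request and creates the connected socket, it hands it straight to a worker thread, which serves only that client until the connection is closed. Even the asynchrony of event mode only means that a worker thread handling a special connection (such as one in the keep-alive state) can hand it to the listener thread for safekeeping; normal connections are still handled in the same way as synchronous ones. The so-called asynchrony of httpd's event model is therefore really pseudo-asynchronous.

Put loosely and informally: with synchronous connections one process/thread handles one connection; with asynchronous connections one process/thread handles multiple connections.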
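The "one thread, many connections" idea can be sketched with Python's selectors module. This is not how httpd is implemented; serve_in_one_thread is a name invented for this example, each connection is served exactly one message, and the usage below uses socket.socketpair() instead of real TCP connections to keep it short.

```python
import selectors
import socket


def serve_in_one_thread(connections):
    """Answer one message per connection (upper-cased), all from a single thread."""
    sel = selectors.DefaultSelector()
    for conn in connections:
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ)

    pending = len(connections)
    while pending:
        events = sel.select(timeout=2.0)
        if not events:               # nothing arrived in time: stop waiting
            break
        for key, _mask in events:    # serve whichever connections are ready
            conn = key.fileobj
            data = conn.recv(1024)
            conn.sendall(data.upper())
            sel.unregister(conn)
            pending -= 1
    sel.close()


if __name__ == "__main__":
    pairs = [socket.socketpair() for _ in range(3)]
    for i, (client, _server) in enumerate(pairs):
        client.sendall(b"hello %d" % i)
    serve_in_one_thread([server for _client, server in pairs])
    for client, server in pairs:
        print(client.recv(1024))
        client.close()
        server.close()
```

The thread never waits on a single connection; it only acts on the ones the selector reports as ready, which is the essence of the asynchronous style described above.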


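Relationship between TCP connections and sockets

First, one thing to be clear about: each end of a TCP connection is associated with a socket and the file descriptor that refers to it.


As mentioned earlier, once the server has received the ACK, the three-way handshake is complete and the TCP connection with the client is established. At first the connection sits in the established queue opened by listen(), waiting to be consumed by accept(). At this point the socket associated with the TCP connection on the server side is the listen socket and its file descriptor.

When the connection in the established queue is consumed by accept(), it becomes associated with the socket designated by accept() and is assigned a new file descriptor. In other words, after accept() the connection no longer has anything to do with the listen socket.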




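Put differently, the connection is still the same connection; the server has merely quietly swapped the socket and file descriptor associated with it, and the client knows nothing of this. That does not affect communication between the two sides, because data transfer is based on the connection rather than on the socket: as long as data can be put from the file descriptor into the "pipe" that is the TCP connection, it reaches the other end.


In fact, accept() is not strictly required for TCP communication, because the connection is already established before accept(). It is just associated with the file descriptor of the listen socket, which only recognizes the data involved in the three-way handshake and four-way teardown, and the data in this socket is handled by the operating system kernel. Imagine a server that calls listen() but never accept(): as clients keep calling connect(), the server keeps establishing connections and doing nothing else with them, until the listen queue is full.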
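The listen()-without-accept() scenario is easy to try on the loopback interface. In the sketch below, connect_without_accept is a name invented for this example; the server side never calls accept(), yet the client's connect() and send() both succeed.

```python
import socket


def connect_without_accept():
    """Connect and send to a server that listens but never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)               # note: accept() is never called
    listener_addr = listener.getsockname()

    client = socket.create_connection(listener_addr, timeout=2.0)
    connected = client.getpeername() == listener_addr  # handshake already done
    sent = client.send(b"hello")     # the kernel takes the bytes anyway

    client.close()
    listener.close()
    return connected, sent


if __name__ == "__main__":
    print(connect_without_accept())
```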


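send() and recv() functions

send() copies data from the app buffer into the send buffer (or possibly directly from a kernel buffer), and recv() copies data from the recv buffer into the app buffer. For TCP sockets, write() and read() are in fact used more often to send and read socket buffer data; send()/recv() are used here only because their names are more specific.

Both functions involve the socket buffer, but when send() or recv() is called, whether the source buffer has data and whether the destination buffer is full (and therefore not writable) have to be considered. If either condition is not met, the calling process/thread blocks (assuming the socket uses the blocking I/O model). The socket can instead be set to the non-blocking I/O model, in which case calling send()/recv() when the buffer condition is not met returns the error EWOULDBLOCK or EAGAIN to the caller. Whether a buffer has data, or is full and not writable, can be monitored with select()/poll()/epoll on the corresponding file descriptor (for a socket buffer, the socket descriptor); once the condition is met, send()/recv() can be called normally. The socket can also be set to the signal-driven I/O or asynchronous I/O model, so that no futile calls to send()/recv() are made before the data is ready or copied.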
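Both failure conditions can be provoked on a loopback connection. In this sketch the helper names tcp_pair and nonblocking_errors are invented; the 64 KiB chunk size is arbitrary, and the number of bytes that fit before send() would block depends on the system's buffer settings.

```python
import socket


def tcp_pair():
    """Return two TCP sockets connected to each other over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    a = socket.create_connection(listener.getsockname())
    b, _ = listener.accept()
    listener.close()
    return a, b


def nonblocking_errors():
    """Return (recv_would_block, bytes_queued_before_send_would_block)."""
    a, b = tcp_pair()
    a.setblocking(False)
    b.setblocking(False)

    try:
        b.recv(1024)               # recv buffer is empty
        recv_would_block = False
    except BlockingIOError:        # EWOULDBLOCK / EAGAIN
        recv_would_block = True

    queued = 0
    chunk = b"x" * 65536
    try:
        while True:                # nobody reads from b, so the buffers fill up
            queued += a.send(chunk)
    except BlockingIOError:        # send buffer is full
        pass

    a.close()
    b.close()
    return recv_would_block, queued


if __name__ == "__main__":
    print(nonblocking_errors())
```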


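close() and shutdown() functions

The general-purpose close() function closes a file descriptor, including the descriptor of a connection-oriented network socket. When close() is called, it tries to send all the data remaining in the send buffer. But close() only decrements the socket's reference count by 1, much as rm removing a file only removes one hard link: the socket is really closed, and the four-way teardown begins, only when all references to it are gone. In a concurrent server where parent and child processes share a socket, calling close() in the child does not really close the socket, because the parent's copy is still open; if the parent never calls close(), the socket stays open and the four-way teardown never starts.


shutdown(), in contrast, is dedicated to closing the connection of a network socket. Unlike close(), which decrements the reference count, it cuts off the socket's connection directly, which triggers the four-way teardown. Three ways of shutting down can be specified:

1. Shut down writing (SHUT_WR). No more data can be written to the send buffer; data already in the send buffer is sent until it is all gone.
2. Shut down reading (SHUT_RD). No more data can be read from the recv buffer; data already in the recv buffer can only be discarded.
3. Shut down reading and writing (SHUT_RDWR). Neither reading nor writing is possible; data already in the send buffer is sent until it is all gone, but data already in the recv buffer is discarded.

Whether shutdown() or close() is used, each of them sends a FIN once the four-way teardown actually begins.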
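The reference-count difference between close() and shutdown() can be observed with a duplicated descriptor. In this sketch, dup() stands in for the parent/child sharing described above; the function name close_vs_shutdown and the timeout values are choices made for this example.

```python
import socket


def close_vs_shutdown():
    """Return (eof_after_close_of_one_copy, eof_after_shutdown)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    a = socket.create_connection(listener.getsockname())
    b, _ = listener.accept()
    listener.close()

    a_copy = a.dup()   # second descriptor referring to the same socket
    a.close()          # reference count drops by one, the socket stays open

    b.settimeout(0.2)
    try:
        eof_after_close = b.recv(1) == b""
    except socket.timeout:           # no FIN arrived: the connection is still up
        eof_after_close = False

    a_copy.shutdown(socket.SHUT_WR)  # sends FIN despite the remaining descriptor
    b.settimeout(2.0)
    eof_after_shutdown = b.recv(1) == b""  # empty read means the peer sent FIN

    a_copy.close()
    b.close()
    return eof_after_close, eof_after_shutdown


if __name__ == "__main__":
    print(close_vs_shutdown())
```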


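Address/port reuse

Normally an addr+port can be bound by only one socket. In other words, an addr+port cannot be reused, and different sockets can only be bound to different addr+port combinations. For example, to run two sshd instances, the configuration files of the instances started one after another must not specify the same addr+port. Likewise, when configuring web virtual hosts, two virtual hosts must not be configured with the same addr+port unless they are name-based. Name-based virtual hosts can share an addr+port because the HTTP request message contains the host name; connection requests of this kind still arrive through the same listening socket, and once one has been received, httpd's worker process/thread dispatches the connection to the corresponding host.


Since the above describes the normal case, there is of course an abnormal one: address reuse and port reuse, which together amount to socket reuse. Current Linux kernels provide the socket option SO_REUSEADDR for address reuse and the socket option SO_REUSEPORT for port reuse. With the port-reuse option set, binding another socket to the same addr+port no longer produces an error. Moreover, once an instance has bound two sockets to the addr+port (more are possible; two are used here as an example), two listener processes/threads can listen on them at the same time, and incoming client connections are distributed among the listening sockets in a load-balanced fashion (the Linux kernel picks the socket by hashing the connection's addresses and ports).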
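A minimal sketch of port reuse, assuming a platform that defines SO_REUSEPORT (Linux 3.9 or later, for instance). The function name bind_twice_with_reuseport is invented for this example.

```python
import socket


def bind_twice_with_reuseport():
    """Bind two listening sockets to the same addr:port using SO_REUSEPORT."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    first.bind(("127.0.0.1", 0))   # port 0: kernel picks a free port
    first.listen(5)
    addr = first.getsockname()

    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind(), and on both sockets.
    second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    second.bind(addr)              # without the option: OSError, address in use
    second.listen(5)

    same = first.getsockname() == second.getsockname()
    first.close()
    second.close()
    return same


if __name__ == "__main__":
    if hasattr(socket, "SO_REUSEPORT"):
        print("both sockets share one addr:port:", bind_twice_with_reuseport())
    else:
        print("SO_REUSEPORT is not available on this platform")
```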

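From the point of view of the listener processes/threads, each reused socket is called a listener bucket; that is, every listening socket is a listener bucket.

Take httpd's worker or event model as an example, and suppose there are currently 3 child processes, each with one listener thread and N worker threads.


Without address reuse, the listener threads compete for the right to listen. At any given moment only one listener thread can be listening on the listening socket (the right to listen is obtained by acquiring a mutex). When that listener thread receives a request, it gives up the right to listen, the other listener threads compete for it, and only one of them wins. As shown in the following figure: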




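With address reuse and port reuse, several sockets can be bound to the same addr+port. In the following figure, for example, one more listener bucket is used, so there are two sockets and two listener threads can listen at the same time; when one of them receives a request, it gives up its right and lets the other listener threads compete for it.



If one more socket is bound, none of the three listener threads has to give up its right to listen, and all of them can listen indefinitely. As shown in the following figure.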




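At first glance performance should be very good: contention for the right to listen (the mutex) is reduced, the "starvation problem" is avoided, listening is more efficient, and load balancing relieves the pressure on the listener threads. In practice, however, every listener thread consumes CPU while listening. With a single-core CPU, reuse shows no advantage and even lowers performance because of switching between listener threads. So before using port reuse, consider whether the listener processes/threads are isolated on their own CPUs; in other words, whether to reuse, and how many times, depends on the number of CPU cores and on whether processes are bound to CPUs.

That is all for now.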