This article walks through the operations performed on sockets at each stage of a TCP connection. I hope it helps readers without a network-programming background understand what a socket is and what role it plays. If you find an error, please point it out.


Background

1. The full socket format is {protocol, src_addr, src_port, dest_addr, dest_port}.

This is often called the five-tuple of a socket. `protocol` specifies whether the connection is TCP or UDP, and the remaining fields specify the source address, source port, destination address, and destination port respectively. But how does this five-tuple come into being?

2. The TCP protocol stack maintains two socket buffers: the send buffer and the recv buffer.

Data to be sent over a TCP connection is first copied into the send buffer, either from a user-space app buffer or from a kernel buffer. The copy is performed by the send() function. Since write() can also be used to write the data, this is also called writing data, and the send buffer is accordingly also known as the write buffer. The difference is only that send() is the socket-specific interface (it accepts socket flags), whereas write() is the generic file-writing interface.

The data ultimately leaves through the network card, so the data in the send buffer must be copied to the network card. Because one end is memory and the other is the NIC device, DMA can be used for this copy without CPU involvement. In other words, the data in the send buffer is DMA-copied to the network card and transmitted to the other end of the TCP connection: the receiver.

When receiving data over a TCP connection, the data first flows in through the network card, is then DMA-copied into the recv buffer, and is finally copied from the recv buffer into the app buffer by the recv() function.

The general process is as follows:



3. Two kinds of sockets: listening sockets and connected sockets.


The listening socket is created when the service process reads its configuration file, resolves from it the address and port to listen on, and creates a socket with the socket() function; bind() then binds this listening socket to the corresponding address and port. After that, the process/thread can listen on this port (strictly speaking, listen on this listening socket) via the listen() function.

A connected socket is the socket returned by the accept() function after a TCP connection request has arrived on the listening socket and the three-way handshake has completed. The process/thread can then use this connected socket for TCP communication with the client.


To distinguish the two socket descriptors returned by socket() and accept(), some people use listenfd and connfd to denote the listening socket and the connected socket respectively. This convention is convenient and is occasionally used below.

The functions involved are analyzed below; analyzing them is also walking through the process of connecting and disconnecting.


Detailed analysis of the connection process

As shown below:




socket() function

The socket() function generates a socket file descriptor (sockfd) for communication ("socket() creates an endpoint for communication and returns a descriptor"). This socket descriptor can later serve as the binding target of the bind() function.


bind() function


The service program parses its configuration file and resolves the address and port it wants to listen on. Together with the sockfd generated by the socket() function, bind() can then bind the socket to that "addr:port" combination. A socket with a bound port can serve as the listening target of the listen() function.

A socket bound to an address and port has a source address and source port (source from the server's own point of view). Together with the protocol type specified in the configuration file, that fills 3 of the five-tuple's fields:

{protocol, src_addr, src_port}

However, it is common to see service programs that can be configured to listen on multiple addresses and ports, giving multiple instances. This is implemented by multiple socket()+bind() calls, generating and binding multiple sockets.


listen() and connect() functions


As the name suggests, listen() listens on the addr+port already bound via bind(). Once listening begins, the socket transitions from the CLOSE state to the LISTEN state, and can then serve as the window through which TCP connections are offered to the outside.


The connect() function, on the other hand, initiates a connection request to a listening socket, i.e. starts the TCP three-way handshake. From this it is clear that the connection initiator (e.g. the client) is the one using connect(). Of course, before calling connect(), the initiator also needs to generate a sockfd of its own, most likely bound to a random port. Since connect() initiates a connection to a socket, connect() naturally carries the destination of the connection, i.e. the destination address and destination port, which are exactly the address and port bound to the server's listening socket. At the same time, the initiator carries its own address and port, which for the server are the source address and source port of the connection request. Thus the sockets at both ends of the TCP connection now hold the complete five-tuple.


In-depth analysis of listen()

Let's look at the listen() function in more detail. If you listen on multiple addr+port combinations, you need to listen on multiple sockets, so the process/thread in charge of listening will use select() or poll() to poll these sockets (of course, the epoll pattern can also be used). Even when only one socket is monitored, these modes are still used for polling; it's just that select() or poll() then has only one socket descriptor of interest.

Regardless of whether the select() or poll() pattern is used (the differences of epoll's monitoring model are not discussed here), the monitoring process/thread (the listener) blocks in select() or poll() until data (a SYN message) is written into a sockfd it listens on (i.e. into its recv buffer). The kernel is then woken up (note: it is not the app process that wakes, because the TCP three-way handshake and four-way close are completed by the kernel in kernel space, with no user-space involvement), copies the SYN data to a kernel buffer for processing (e.g. checking whether the SYN is valid), and prepares the SYN+ACK data, which must be copied from the kernel buffer into the send buffer and then into the network card to be sent out. At the same time, the kernel creates a new entry for this connection in the incomplete-connection queue (syn queue) and sets it to the SYN_RECV state. Then select()/poll() monitors the listenfd again, until data is once more written into this listenfd and the kernel is woken again. If the data written this time is an ACK message, it is a client's reply to the SYN+ACK sent by the server's kernel, so the data is copied to the kernel buffer and processed, and the corresponding entry is moved from the incomplete-connection queue to the completed-connection queue (accept queue / established queue) and set to the ESTABLISHED state. If what arrives this time is not an ACK but another SYN, it is a new connection request, handled by the same process as above and placed into the incomplete queue. Connections placed in the completed queue wait to be consumed via accept() (the accept() system call is initiated by a user-space process; the consumption operation is completed by the kernel). Once a connection has been accept()ed, it is removed from the completed queue, which means the TCP connection is fully established; the user-space processes at both ends can then transfer real data over this connection, with the kernel no longer needed in the middle, until close() or shutdown() closes the connection and triggers the four-way close. This is the loop through which the listener handles TCP connections.

In other words, the listen()ing socket also maintains two queues: the incomplete-connection queue (syn queue) and the completed-connection queue (accept queue). When the listener receives a SYN from a client and replies with SYN+ACK, an entry for this client is created at the tail of the incomplete-connection queue, with its state set to SYN_RECV. Obviously, this entry must contain the client's address and port information (possibly hashed; I'm not sure). When the server then receives the ACK sent by the client, the listener determines by analyzing the data which entry in the incomplete queue this message answers, moves that entry into the completed-connection queue, sets its state to ESTABLISHED, and finally waits for accept() to consume and take over this connection. From this point on, the kernel temporarily leaves the stage, until the four-way close.

When the incomplete-connection queue is full, the listener blocks, no longer receives new connection requests, and waits via select()/poll() for the two queues to trigger writable events. When the completed-connection queue is full, the listener likewise will not receive new connection requests, and the action of moving entries into the completed queue is blocked. Before Linux 2.2, listen()'s backlog parameter set the maximum total length of these two queues (there was actually only one queue, but with two kinds of states; see "A bit of trivia" below). Since Linux 2.2, this parameter only specifies the maximum length of the completed queue (accept queue), while /proc/sys/net/ipv4/tcp_max_syn_backlog sets the maximum length of the incomplete queue (syn queue / syn backlog). /proc/sys/net/core/somaxconn is a hard limit on the maximum length of the completed queue, defaulting to 128; if the backlog parameter is greater than somaxconn, backlog is truncated to this hard limit.
Once a connection in the completed queue is accept()ed, the TCP connection is established, and this connection uses its own socket buffers for data transfer with the client. These socket buffers and the listening socket's socket buffers all store data that TCP receives and sends, but their purposes differ: the listening socket's buffers only hold the syn and ack data of connection requests, while the socket buffers of an established TCP connection mainly hold the "real" data transferred between the two ends, such as response data built by the server, or Http request data initiated by the client.

A bit of trivia: two kinds of TCP sockets. There are actually two different styles of TCP socket implementation. The two-queue design described above is the one used since Linux 2.2. There is another (BSD-derived) socket style that uses only a single queue; this single queue holds connections in all phases of the 3-way handshake, but each connection in the queue is in one of two states: syn-recv or established.

Interpretation of Recv-Q and Send-Q

The Send-Q and Recv-Q columns of the netstat command show contents related to the socket buffers and queues. Here is the interpretation from man netstat:

Recv-Q
    Established: The count of bytes not copied by the user program connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
For listening sockets, Recv-Q represents the current syn backlog, i.e. the number of accumulated syn messages, which is the current number of connections in the incomplete queue; Send-Q represents the maximum value of the syn backlog, i.e. the maximum number of connections the incomplete-connection queue can hold.

For established tcp connections, the Recv-Q column shows the size of the data in the recv buffer not yet copied out by the user process, and the Send-Q column shows the size of the data for which the remote host has not yet returned an ACK.

Why distinguish between the sockets of established TCP connections and listening sockets? Because sockets in these two states use their socket buffers differently: a listening socket cares more about the lengths of its queues, while the socket of an established TCP connection cares more about the sizes of the data received and sent.
# netstat -tnl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp        0      0 0.0.0.0:22         0.0.0.0:*          LISTEN
tcp        0      0 127.0.0.1:25       0.0.0.0:*          LISTEN
tcp6       0      0 :::80              :::*               LISTEN
tcp6       0      0 :::22              :::*               LISTEN
tcp6       0      0 ::1:25             :::*               LISTEN

# ss -tnl
State   Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN  0      128    *:22                 *:*
LISTEN  0      100    127.0.0.1:25         *:*
LISTEN  0      128    :::80                :::*
LISTEN  0      128    :::22                :::*
LISTEN  0      100    ::1:25               :::*

Note that for sockets in the Listen state, netstat's Send-Q column and ss's Send-Q column show different values, because netstat simply does not write the maximum queue length at all. Therefore, to determine whether there is any free room in the queue to receive a new tcp connection request, prefer the ss command over netstat.


Impact of syn flood


In addition, if after the listener sends SYN+ACK it fails to receive the client's returned ACK, the monitor will be woken by the timeout set in select()/poll() and will resend the SYN+ACK to the client, in case the message was lost somewhere in the vast network. But this retransmission creates a problem: if the client forged its source address when calling connect(), the SYN+ACK the listener replies with can never reach the other party's host; the listener will never get the ACK, and so it resends SYN+ACK again. Whether it is the monitor being woken again and again by the select()/poll() timeout, or data being copied again and again into the send buffer, all of this consumes CPU time; only the copy of the SYN+ACK from the send buffer to the network card is a DMA copy requiring no CPU. If this client is an attacker continuously sending SYNs by the thousands and tens of thousands, the listener will all but collapse, and the network card will be severely congested. This is the so-called syn flood attack.

There are many ways to mitigate syn flood: for example, shrinking the maximum lengths of the two queues maintained by listen(), reducing the number of syn+ack retransmissions, increasing the retransmission interval, reducing the timeout for waiting for the ack, using syncookies, and so on. But directly modifying these tcp options is never good for performance and efficiency. Therefore, filtering packets before connection requests reach the listener thread is an extremely important means of defense.


accept() function

The accept() function reads the first entry in the completed-connection queue (removing it from the queue once read) and generates a socket descriptor for the subsequent connection; suppose we denote it connfd. With this new connected socket, the worker process/thread (call it the worker) can transfer data with the client through it, while the listening socket (sockfd) mentioned earlier continues to be monitored by the listener.


For example, with prefork-mode httpd, each child process is both listener and worker. When a client initiates a connection request, a child process receives it while listening and releases its hold on the listening socket, so that other child processes can go listen on that socket. After several round trips, a new connected socket is finally generated via accept(), and this child process can then interact with the client exclusively through that socket; of course, along the way it may be blocked or put to sleep many times by various io waits. This is really inefficient: considering only the stages from the child process receiving the SYN message to finally generating the new connected socket, the child process is blocked again and again. Of course, the listening socket can be set to non-blocking IO mode, but even in non-blocking mode it must keep checking the status.


Now consider the worker/event processing modes, where each child process uses one dedicated listening thread and N worker threads. The listening thread is solely responsible for listening and establishing new connected socket descriptors, which it places into apache's socket queue. Listener and workers are thus separated, and while listening goes on, the workers can still work freely. From the standpoint of listening alone, the worker/event modes outperform the prefork mode by far.


When the listener issues the accept() system call and the completed-connection queue contains no data, the listener blocks. Of course, the socket can be set to non-blocking mode, in which case accept() returns an EWOULDBLOCK or EAGAIN error when no data is available. select(), poll(), or epoll can be used to wait for readability events on the completed-connection queue. The socket can also be set to signal-driven IO mode, letting newly added entries in the completed-connection queue notify the listener to copy the data into the app buffer and process it with accept().


The notions of synchronous connection and asynchronous connection come up often; how exactly are they distinguished? A synchronous connection means that from the moment the listener sees the SYN sent by some client, it must wait until the connected socket is established and data interaction with the client has finished; until the connection with this client is closed, it accepts no other client's connection requests in between. Explained a bit more precisely, a synchronous connection requires the socket buffer and the app buffer data to stay consistent. When handling connections synchronously, listener and worker are usually the same process, as in httpd's prefork model. An asynchronous connection, on the other hand, can receive and process other connection requests during any phase of connection establishment or data interaction. Usually, the asynchronous style is used when listener and worker are not the same process, as in httpd's event model. Although the worker model separates listener from workers, it still uses synchronous connections: after the listener brings a connection request in and creates the connected socket, it immediately hands it to a worker thread, and that worker thread serves only that client until the connection closes. The event mode's asynchrony only goes as far as letting a worker thread hand off special connections (such as connections in keep-alive state) to the listening thread for safekeeping; for normal connections it is still equivalent to the synchronous style, so the so-called asynchrony of httpd's event mode is really pseudo-asynchrony.

Loosely and imprecisely speaking, a synchronous connection is one process/thread handling one connection; an asynchronous connection is one process/thread handling multiple connections.


Relationship between tcp connections and sockets

First, be clear on one point: each end of every tcp connection is associated with a socket and the file descriptor pointing to that socket.


As mentioned earlier, when the server receives the ack message, the three-way handshake is complete, meaning the tcp connection with the client has been established. Right after being established, the tcp connection sits in the established queue opened by listen(), waiting to be consumed by accept(). At this point, the socket associated with the tcp connection on the server side is the listen socket and the file descriptor pointing to it.

When a tcp connection in the established queue is consumed by accept(), the connection becomes associated with the socket designated by accept() and is assigned a new file descriptor. In other words, after accept(), the connection no longer has anything to do with the listen socket.




Put another way, the connection is still the same connection; the server has merely swapped, behind the scenes, the socket and file descriptor associated with this tcp connection, and the client knows nothing of it. But this does not affect communication between the two sides, because data transfer is based on the connection, not on the socket: as long as data can be put from a file descriptor into the "pipe" that is the tcp connection, it reaches the other end.


In fact, accept() is not strictly required for tcp communication, because the connection is already established before accept(); it is just associated with the file descriptor of the listen socket, and that socket only recognizes the data involved in the three-way handshake and four-way close, with the data in it handled by the operating system kernel. Imagine a server that only calls listen() and never accept() while clients keep calling connect(): the server will keep establishing connections without doing anything with them, until the listen queue fills up.


send() and recv() functions

The send() function copies data from the app buffer into the send buffer (though it may also copy directly from a kernel buffer), and the recv() function copies data from the recv buffer into the app buffer. Of course, for tcp sockets it is more common to use the write() and read() functions to send and read socket buffer data; send()/recv() are used here merely because their names are more to the point.

Both functions involve the socket buffers, but when calling send() or recv() one must consider whether the source buffer contains data, and whether the destination buffer is full and therefore unwritable. Whichever side fails its condition, the process/thread blocks in the send()/recv() call (assuming the socket is set to the blocking IO model). Of course, the socket can be set to the non-blocking IO model, in which case calling send()/recv() while the buffer fails the condition makes the calling process/thread get back the error status EWOULDBLOCK or EAGAIN. Whether a buffer has data, or is full and unwritable, can in fact be monitored with select()/poll()/epoll on the corresponding file descriptor (for a socket buffer, monitor that socket descriptor); once the condition is met, calling send()/recv() works normally. The socket can also be set to the signal-driven IO or asynchronous IO model, so that no futile send()/recv() calls are made before the data is ready and copied.


close() and shutdown() functions

The generic close() function can close a file descriptor, including, of course, connection-oriented network socket descriptors. When close() is called, an attempt is made to send all the data in the send buffer. But close() merely decrements the socket's reference count by 1; much like rm, which when deleting a file only removes one hard-link count, the socket descriptor is truly closed, and the subsequent four-way close begins, only when all references to the socket have been released. For concurrent server programs in which parent and child processes share a socket, calling close() on the child's socket does not really close the socket, because the parent's socket is still open; if the parent never calls close(), the socket stays open forever and can never enter the four-way close.


The shutdown() function, by contrast, is dedicated to shutting down a network socket's connection. Unlike close(), which decrements the reference count by one, it severs all of the socket's connections directly, triggering the four-way close. Three shutdown modes can be specified:

1. Shut down writing. No more data can be written into the send buffer; data already in the send buffer keeps being sent until finished.
2. Shut down reading. No more data can be read from the recv buffer; data already in the recv buffer can only be discarded.
3. Shut down both reading and writing. Neither reading nor writing is possible; data already in the send buffer is sent until finished, but data already in the recv buffer is discarded.

Whether it is shutdown() or close(), each time they are called and the four-way close truly begins, a FIN is sent.


Address/port reuse

Normally, one addr+port can be bound by only one socket. In other words, an addr+port cannot be reused; different sockets can only bind to different addr+port combinations. For example, to start two sshd instances, the configuration files of the successively started instances must not configure the same addr+port. Likewise, when configuring web virtual hosts, unless they are name-based, two virtual hosts must not configure the same addr+port. Name-based virtual hosts can bind the same addr+port because the http request message carries the host name: when such connection requests arrive, they are still listened for through the same socket, but once received, httpd's worker process/thread can dispatch the connection to the corresponding virtual host.


Since the above describes the normal case, there are of course abnormal cases too: address reuse and port reuse, together making socket reuse. Current Linux kernels already support the SO_REUSEADDR socket option for address reuse and the SO_REUSEPORT socket option for port reuse. With the port-reuse option set, binding the socket no longer raises an error. Moreover, once an instance has bound two addr+port combinations (more are possible; two serve as the example here), two listening processes/threads can listen on them at the same time, and incoming client connections can be served in turn through a round-robin balancing algorithm.

From the listening process/thread's perspective, each reused socket is called a listener bucket; that is, each listening socket is one listener bucket.

Take httpd's worker or event model as an example: suppose there are currently 3 child processes, each with one listening thread and N worker threads.


Then, without address reuse, the listening threads contend for the right to listen. At any given moment, only one listening thread can be listening on the listening socket (it obtains the right to listen by acquiring a mutex); after this listening thread receives a request, it gives up the listening right, whereupon the other listening threads race for it, and only one can win. As shown below:




With address reuse and port reuse, multiple sockets can be bound to the same addr+port. For example, in the figure below one extra listener bucket is used, giving two sockets, so two listening threads can listen simultaneously; when one of them receives a request, it yields its right, and the other listening threads contend for it.



If one more socket is bound, then all three listening threads no longer need to yield the listening right and can listen without interruption. As shown below.




On the face of it, performance looks great: contention for the listening right (the mutex) is reduced, the "starvation problem" is avoided, listening is more efficient, and because the load can be balanced, the pressure on each listening thread is relieved. In reality, though, each listening thread's listening consumes CPU; with only one CPU core, even with reuse the advantage cannot show, and performance may instead drop because of switching between listening threads. Therefore, to use port reuse, you must consider whether the listening processes/threads have each been isolated on their own cpu; that is, whether to reuse, and how many times, must take into account the number of cpu cores and whether the processes are bound to cpus.

That's all for now.