This article explains what happens to a socket at each stage of a TCP connection. I hope it helps readers without a network programming background understand what a socket is and what role it plays. If you find an error, please point it out.



1. The full socket format is {protocol, src_addr, src_port, dest_addr, dest_port}.

This is often called the five-tuple of a socket. protocol specifies whether the connection is TCP or UDP; the remaining fields specify the source address, source port, destination address, and destination port respectively. But where do these values come from?
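As a rough illustration, the five-tuple of a connected socket can be read back from the socket itself. The sketch below assumes IPv4 on the loopback interface; the FiveTuple type and the five_tuple() and demo_five_tuple() helpers are hypothetical names introduced only for this example.

```python
import socket
from collections import namedtuple

# Hypothetical helper type; the field names follow the five-tuple above.
FiveTuple = namedtuple("FiveTuple", "protocol src_addr src_port dest_addr dest_port")

def five_tuple(sock):
    """Read the five-tuple of a connected IPv4 TCP socket."""
    src_addr, src_port = sock.getsockname()    # local end of the connection
    dest_addr, dest_port = sock.getpeername()  # remote end of the connection
    return FiveTuple("TCP", src_addr, src_port, dest_addr, dest_port)

def demo_five_tuple():
    # Port 0 asks the kernel to pick any free port on loopback.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _ = listener.accept()
            with conn:
                return five_tuple(client), five_tuple(conn)

client_view, server_view = demo_five_tuple()
print(client_view)
print(server_view)
```

The two ends see mirrored tuples: the source of one end is the destination of the other.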

2. The TCP protocol stack maintains two socket buffers: the send buffer and the recv buffer.

Data to be sent over a TCP connection is first copied into the send buffer. It may be copied from an app buffer in user space or from a kernel buffer in the kernel. The copy is done by the send() function. Because write() can also be used to write the data, this step is also called writing data, and the send buffer is correspondingly also known as the write buffer. On a socket, send() differs from write() only in that it accepts an extra flags argument.

The data ultimately leaves through the network card, so the data in the send buffer has to be copied to the network card. Since one end is memory and the other is a network card device, the copy can be done directly by DMA without involving the CPU. In other words, data in the send buffer is copied to the network card by DMA and then transmitted to the other end of the TCP connection: the receiver.

When data is received over a TCP connection, it first arrives through the network card, is copied into the recv buffer by DMA, and is then copied from the recv buffer into the app buffer by the recv() function.
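The sizes of the two buffers can be inspected from user space. The following is a minimal sketch, assuming the standard SO_SNDBUF and SO_RCVBUF socket options; buffer_sizes() is a hypothetical helper name, and the numbers printed depend on the operating system and its configuration.

```python
import socket

def buffer_sizes(sock):
    """Return (send buffer size, recv buffer size) in bytes as reported by the kernel."""
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    snd_size, rcv_size = buffer_sizes(s)
    print("send buffer:", snd_size, "bytes; recv buffer:", rcv_size, "bytes")
```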

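The general process is as follows: app buffer -> send() -> send buffer -> DMA -> network card -> network -> network card -> DMA -> recv buffer -> recv() -> app buffer.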

3. Two kinds of socket: listening sockets and connected sockets.

A listening socket is created when the service process reads its configuration file: it parses the address and port to listen on, creates the socket with socket(), and then binds it to that address and port with bind(). After that, the process/thread can call listen() to listen on this port (strictly speaking, it monitors this listening socket).

A connected socket is the socket returned by accept() after a TCP connection request has been received on the listening socket and the three-way handshake has completed. From then on, the process/thread uses the connected socket to communicate with the client over TCP.

To distinguish the two socket descriptors returned by socket() and accept(), some people use listenfd and connfd for the listening socket and the connected socket respectively. This is a handy convention and is used occasionally below.
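The difference between the two descriptors can be seen in a small loopback experiment. This is only a sketch: listen_and_accept_once() is a hypothetical helper, and the client is created in the same process purely to give accept() something to return.

```python
import socket

def listen_and_accept_once():
    listenfd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket()
    listenfd.bind(("127.0.0.1", 0))                               # bind(), port chosen by the kernel
    listenfd.listen(5)                                            # listen()
    client = socket.create_connection(listenfd.getsockname())    # client side connect()
    connfd, peer = listenfd.accept()                              # accept() returns a new socket
    result = {
        "listen_fileno": listenfd.fileno(),
        "conn_fileno": connfd.fileno(),
        "listen_local": listenfd.getsockname(),
        "conn_local": connfd.getsockname(),
        "conn_peer": peer,
        "client_local": client.getsockname(),
    }
    for s in (client, connfd, listenfd):
        s.close()
    return result

info = listen_and_accept_once()
print(info)
```

listenfd and connfd are different descriptors that share the same local address and port; only connfd has a peer.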

The sections below analyze each of these functions. Going through them also walks through the process of establishing and tearing down a connection.


Detailed analysis of the connection process

As shown below :


socket() function

The socket() function creates a socket file descriptor, sockfd, for communication (socket() creates an endpoint for communication and returns a descriptor). This socket descriptor can later be passed to bind() as the object to bind.


bind() function

The service program parses its configuration file to obtain the address and port to listen on. Together with the sockfd created by socket(), it can then call bind() to bind the socket to the "addr:port" combination it wants to listen on. A socket that has been bound to a port can be passed to listen() as the object to listen on.

A socket bound to an address and port has a source address and source port (source from the server's own point of view). Adding the protocol type specified in the configuration file gives three of the five elements of the five-tuple, namely: {protocol, src_addr, src_port}.

However, it is common to see service programs configured to listen on multiple addresses and ports in order to run multiple instances. This is done by calling socket() + bind() multiple times to create and bind multiple sockets.
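Creating one socket per configured endpoint might look like the sketch below. The endpoint list stands in for whatever a real configuration file would contain, and bind_all() is a hypothetical helper name.

```python
import socket

def bind_all(endpoints):
    """Create and bind one TCP socket per (address, port) pair."""
    sockets = []
    for addr, port in endpoints:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket()
        s.bind((addr, port))                                   # bind()
        sockets.append(s)
    return sockets

# Assumed configuration: two loopback endpoints; port 0 lets the kernel choose.
bound_sockets = bind_all([("127.0.0.1", 0), ("127.0.0.1", 0)])
bound = [s.getsockname() for s in bound_sockets]
print(bound)
for s in bound_sockets:
    s.close()
```

Each bound socket now carries its own {protocol, src_addr, src_port} and can be handed to listen() separately.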


listen() and connect() functions

As the name suggests, listen() listens on the addr+port bound by bind(). Once listening starts, the socket moves from the CLOSE state to the LISTEN state, and it becomes a window through which TCP connections can be offered to the outside.

connect() initiates a connection request to a listening socket, that is, it starts the TCP three-way handshake. This shows that connect() is used by the party requesting the connection (such as a client). Of course, before calling connect(), the initiator also has to create a sockfd, and it most likely uses a socket bound to a random port. Since connect() initiates a connection to a socket, it naturally takes the destination of the connection, i.e. the destination address and destination port, which are the address and port bound to the server's listening socket. The initiator also brings its own address and port, which from the server's point of view are the source address and port of the connection request. As a result, the sockets at both ends of the TCP connection now have the complete five-tuple format.
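How connect() fills in the missing half of the five-tuple can be observed on the client socket. In this sketch, peer_or_none() and connect_stages() are hypothetical helpers, and the server never calls accept() because only the client side is of interest here.

```python
import socket

def peer_or_none(sock):
    """Return the destination (addr, port) of a socket, or None if it has none yet."""
    try:
        return sock.getpeername()
    except OSError:  # raised while the socket is not connected
        return None

def connect_stages():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        server.listen(1)
        dest = server.getsockname()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            before = peer_or_none(client)   # no destination yet
            client.connect(dest)            # the three-way handshake happens here
            source = client.getsockname()   # source address and kernel-chosen port
            after = peer_or_none(client)    # destination is now filled in
    return before, source, after, dest

before, source, after, dest = connect_stages()
print(before, source, after, dest)
```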


In-depth analysis of listen()

Let's look at listen() in more detail. If multiple address+port combinations are listened on, there are multiple sockets to monitor, and the process/thread responsible for monitoring uses select() or poll() to poll them (epoll can of course be used as well). When only one socket is monitored, the same mechanisms are used; select() or poll() is simply interested in a single socket descriptor.

Whether select() or poll() is used (the different monitoring mechanism of epoll is not discussed here), the monitoring process/thread (the listener) blocks in select() or poll() while it waits. When data (a SYN) is written to the sockfd it is monitoring (that is, to the recv buffer), the kernel is woken up. Note that it is not the app process that is woken up, because TCP's three-way handshake and four-way wave are carried out by the kernel in kernel space without involving user space. The kernel copies the SYN data into a kernel buffer for processing (for example, checking whether the SYN is valid) and prepares the SYN+ACK data, which is copied from the kernel buffer into the send buffer and then into the network card to be sent out. At this point a new entry for the connection is created in the incomplete connection queue (syn queue) and set to the SYN_RECV state.

The listener then goes back to monitoring listenfd with select()/poll() until data is written to listenfd again and the kernel is woken up again. If the data written this time is an ACK, it is the client's response to the SYN+ACK sent by the server kernel. After copying the data into a kernel buffer and processing it, the kernel moves the corresponding entry from the incomplete connection queue to the completed connection queue (accept queue/established queue) and sets it to the ESTABLISHED state. If what arrives is not an ACK, it must be a SYN, i.e. a new connection request, which is handled as above and placed in the incomplete connection queue.

Connections in the completed queue wait to be consumed by accept() (the accept() system call is issued by a user-space process, and the consume operation itself is performed by the kernel). Once a connection has been accept()ed, it is removed from the completed queue and the TCP connection is established. The user-space processes at both ends can now transfer real data over the connection. The kernel is not needed for connection handling again until close() or shutdown() is called to close the connection and the four-way wave takes place. This is the loop in which the listener handles TCP connections.
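The user-space half of this loop can be sketched with select(). One detail worth knowing: select() reports a listening socket as readable once a completed connection is waiting in its accept queue. The helper names accept_ready() and demo_select() are hypothetical, and the loopback client exists only to trigger the event.

```python
import select
import socket

def accept_ready(listeners, timeout=1.0):
    """Block in select(), then accept one connection from each readable listening socket."""
    readable, _, _ = select.select(listeners, [], [], timeout)
    accepted = []
    for listenfd in readable:
        connfd, peer = listenfd.accept()   # consume one entry from the accept queue
        accepted.append((listenfd.getsockname(), peer))
        connfd.close()
    return accepted

def demo_select():
    listeners = []
    for _ in range(2):                     # two listening sockets to monitor
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", 0))
        s.listen(5)
        listeners.append(s)
    target = listeners[1].getsockname()
    client = socket.create_connection(target)   # connect to the second listener only
    events = accept_ready(listeners)
    client.close()
    for s in listeners:
        s.close()
    return target, events

target, events = demo_select()
print(target, events)
```

Only the socket that actually received a connection is reported, so the same loop works for one listening socket or for many.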
In other words, listen() also maintains two queues: the incomplete connection queue (syn queue) and the completed connection queue (accept queue). When the listener receives a SYN from a client and replies with SYN+ACK, an entry for that client is created at the tail of the incomplete connection queue with its state set to SYN_RECV. This entry obviously has to contain the client's address and port (possibly in hashed form; I'm not sure). When the server then receives the ACK sent by the client, the listener thread analyzes the data to find which entry in the incomplete connection queue the message belongs to, moves that entry to the completed connection queue, and sets its state to ESTABLISHED. Finally, the connection waits to be consumed by accept(). From this point on, the kernel temporarily leaves the stage until the four-way wave.
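That the handshake is finished by the kernel, and that accept() merely takes a finished connection off the completed queue, can be checked with a short experiment. This is a sketch on loopback; connect_without_accept() is a hypothetical helper name.

```python
import socket

def connect_without_accept():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listenfd:
        listenfd.bind(("127.0.0.1", 0))
        listenfd.listen(5)
        listenfd.settimeout(2)
        # accept() has not been called yet, but connect() still returns successfully
        # because the kernel completes the three-way handshake on its own.
        with socket.create_connection(listenfd.getsockname(), timeout=2) as client:
            established_before_accept = client.getpeername() == listenfd.getsockname()
            client_addr = client.getsockname()
            # accept() now removes the already established connection from the queue.
            connfd, peer = listenfd.accept()
            connfd.close()
    return established_before_accept, client_addr, peer

established_before_accept, client_addr, peer = connect_without_accept()
print(established_before_accept, client_addr, peer)
```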

When the incomplete connection queue is full, the listener blocks and stops receiving new connection requests, and waits through select()/poll() for the two queues to trigger writable events. When the completed connection queue is full, the listener also stops receiving new connection requests, and any move into the completed connection queue that is in progress is blocked. Before Linux 2.2, the backlog parameter of listen() set the maximum combined length of these two queues (there was actually only one queue with two states; see the note on the two types of TCP socket below). Since Linux 2.2, this parameter only specifies the maximum length of the completed queue (accept queue), while /proc/sys/net/ipv4/tcp_max_syn_backlog sets the maximum length of the incomplete queue (syn queue/syn backlog). /proc/sys/net/core/somaxconn is a hard limit on the maximum length of the completed queue. Its default is 128 (4096 since Linux 5.4). If the backlog parameter is greater than somaxconn, backlog is truncated to this hard limit.
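The two limits can be read from the /proc files named above, and the truncation rule is a simple minimum. The sketch below assumes a Linux system; on other systems the files do not exist and the reader returns None. read_int_sysctl() and effective_backlog() are hypothetical helper names.

```python
from pathlib import Path

def read_int_sysctl(path):
    """Return the integer stored in a /proc/sys file, or None if it is unavailable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None

def effective_backlog(requested, somaxconn):
    """Backlog actually used for the accept queue after truncation to the hard limit."""
    return min(requested, somaxconn)

somaxconn = read_int_sysctl("/proc/sys/net/core/somaxconn")
syn_backlog = read_int_sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog")
print("somaxconn:", somaxconn, "tcp_max_syn_backlog:", syn_backlog)

# Example with assumed numbers: a requested backlog of 511 against a limit of 128.
print(effective_backlog(511, 128))
```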
Once a connection in the queue has been accept()ed, the TCP connection is established, and the connection uses its own socket buffers to exchange data with the client. These socket buffers and those of the listening socket both store data received and sent over TCP, but they no longer mean the same thing: the socket buffers of the listening socket only hold the SYN and ACK data of TCP connection requests, while the socket buffers of an established TCP connection mainly hold the "real" data exchanged between the two ends, such as the response data built by the server and the HTTP request data sent by the client.
A note on the two types of TCP socket: there are actually two different types of TCP socket implementation. The two-queue type described above is the one used since Linux 2.2. The other (BSD-derived) type uses only one queue, which holds all connections going through the three-way handshake, but each connection in the queue is in one of two states: syn-recv or established.

Interpreting Recv-Q and Send-Q

The Send-Q and Recv-Q columns of the netstat command show information related to the socket buffers. Here is the explanation from man netstat.
Recv-Q
    Established: The count of bytes not copied by the user program connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
For a listening socket, Recv-Q shows the current syn backlog, i.e. the number of SYN messages that have piled up, which is the current number of connections in the incomplete queue. Send-Q shows the maximum syn backlog, i.e. the maximum number of connections the incomplete connection queue can hold.
For an established TCP connection, Recv-Q shows the amount of data in the recv buffer that has not yet been copied by the user process, and Send-Q shows the amount of data for which the remote host has not yet returned an ACK.

The reason for distinguishing sockets of established TCP connections from listening sockets is that sockets in these two states use different socket buffers. For a listening socket what matters is the length of the queue, while for a socket of an established TCP connection what matters is the amount of data received and sent.
# netstat -tnl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0* LISTEN
tcp 0 0* LISTEN
tcp6 0 0 :::80 :::* LISTEN
tcp6 0 0 :::22 :::* LISTEN
tcp6 0 0 ::1:25 :::* LISTEN
# ss -tnl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 *:*
LISTEN 0 128 :::80 :::*
LISTEN 0 128 :::22 :::*
LISTEN 0 100 ::1:25 :::*

Note that for sockets in the LISTEN state, the Send-Q column of netstat and the Send-Q column of ss show different values, because netstat does not report the maximum length of the incomplete queue at all. Therefore, to determine whether the queue still has room for new TCP connection requests, use the ss command rather than netstat whenever possible.
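A check of this kind could be scripted by parsing rows of ss -tnl output. The sketch below is deliberately simplified: SsRow, parse_ss_row() and queue_has_room() are hypothetical names, the parser assumes the five-column layout shown above, and "has room" is approximated as Recv-Q being smaller than Send-Q.

```python
from collections import namedtuple

# Hypothetical record type for one row of `ss -tnl` output.
SsRow = namedtuple("SsRow", "state recv_q send_q local peer")

def parse_ss_row(line):
    """Split one `ss -tnl` data row into its five columns."""
    state, recv_q, send_q, local, peer = line.split()[:5]
    return SsRow(state, int(recv_q), int(send_q), local, peer)

def queue_has_room(row):
    """Simplified check: a listening socket has room while Recv-Q is below Send-Q."""
    return row.state == "LISTEN" and row.recv_q < row.send_q

row = parse_ss_row("LISTEN 0 128 *:22 *:*")
print(row, queue_has_room(row))
```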


Impact of SYN flood

In addition, if the listener has sent SYN+ACK but does not receive the ACK from the client, it is woken up by the timeout set on select()/poll() and resends the SYN+ACK to the client, in case the message was lost somewhere in the network. This retransmission causes a problem, however. If the client forged the source address of its connection request, the SYN+ACK sent back by the listener can never reach the other host. In other words, the listener never receives the ACK, so it keeps resending SYN+ACK. Both the repeated wake-ups of the listener by the select()/poll() timeout and the repeated copying of data into the send buffer require the CPU, and the SYN+ACK in the send buffer also has to be copied to the network card (this time by DMA, which does not need the CPU). If the client is an attacker that keeps sending thousands or tens of thousands of SYNs, the listener is brought close to collapse and the network card becomes badly congested. This is the so-called SYN flood attack.
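To get a feel for how long one forged SYN can occupy a slot in the incomplete queue, the retransmission schedule can be added up. The sketch below assumes an exponential backoff that doubles after every transmission, with an initial timeout of 1 second and 5 retransmissions; both numbers are assumptions for illustration (modeled on common Linux defaults) and should be checked against the actual settings of the system in question. half_open_lifetime() is a hypothetical helper name.

```python
def half_open_lifetime(initial_rto=1.0, retries=5):
    """Seconds a half-open entry lingers if no SYN+ACK is ever answered.

    The wait after each transmission doubles: rto, 2*rto, 4*rto, ...
    There is one wait after the first SYN+ACK and one after each retransmission.
    """
    return sum(initial_rto * 2 ** i for i in range(retries + 1))

print(half_open_lifetime())        # 1 + 2 + 4 + 8 + 16 + 32 = 63.0
print(half_open_lifetime(1.0, 2))  # 1 + 2 + 4 = 7.0
```

Under these assumptions, reducing the number of retransmissions from 5 to 2 shortens the time each forged entry holds a queue slot from about a minute to a few seconds.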

There are many ways to deal with SYN flood, for example: shrinking the maximum lengths of the two queues maintained by listen(), reducing the number of SYN+ACK retransmissions, increasing the retransmission interval, shortening the timeout for waiting for the ACK, using syncookies, and so on. But none of the approaches that modify TCP options directly is good for performance and efficiency. Filtering packets before the connection reaches the listener thread is therefore an extremely important measure.

















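The send() function copies data from the app buffer into the send buffer (it may, of course, also copy directly from a kernel buffer in the kernel), while the recv() function copies data from the recv buffer into the app buffer.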
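The two copies can be made visible in a loopback round trip. In this sketch, round_trip() is a hypothetical helper; a bytearray plays the role of the receiving app buffer, and recv_into() is used so that the copy from the recv buffer into that app buffer is explicit.

```python
import socket

def round_trip(payload):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as sender:
            receiver, _ = listener.accept()
            with receiver:
                receiver.settimeout(2)
                # send() returns how many bytes were copied into the send buffer.
                copied = sender.send(payload)
                app_buffer = bytearray(len(payload))   # user-space app buffer
                view = memoryview(app_buffer)
                received = 0
                while received < copied:
                    # recv_into() copies from the recv buffer into the app buffer.
                    n = receiver.recv_into(view[received:])
                    if n == 0:        # peer closed the connection
                        break
                    received += n
                return copied, bytes(app_buffer[:received])

copied, data = round_trip(b"hello")
print(copied, data)
```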






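1. Close writing. No more data can be written to the send buffer; data already in the send buffer continues to be sent until it is all gone.
2. Close reading. No more data can be read from the recv buffer; data already in the recv buffer can only be discarded.
3. Close reading and writing. Neither reading nor writing is possible; data already in the send buffer is sent until it is all gone, but data already in the recv buffer is discarded.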
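Assuming these three cases correspond to shutdown() called with SHUT_WR, SHUT_RD and SHUT_RDWR (an assumption; the list itself does not name the call), the first case can be sketched as follows. half_close_demo() is a hypothetical helper name, and the exact error raised by the failed write may vary by platform, so the sketch only catches the general OSError.

```python
import socket

def half_close_demo():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        a = socket.create_connection(listener.getsockname())
        b, _ = listener.accept()
        with a, b:
            a.settimeout(2)
            b.settimeout(2)
            a.sendall(b"last words")       # data already handed to the send buffer
            a.shutdown(socket.SHUT_WR)     # case 1: close the write side only
            try:
                a.send(b"more")            # further writes are refused
                write_failed = False
            except OSError:
                write_failed = True
            chunks = []
            while True:                    # the peer still receives the buffered data,
                chunk = b.recv(1024)       # then sees end-of-stream (empty bytes)
                if not chunk:
                    break
                chunks.append(chunk)
            b.sendall(b"reply")            # the read side of a is still open
            reply = a.recv(1024)
            return write_failed, b"".join(chunks), reply

write_failed, delivered, reply = half_close_demo()
print(write_failed, delivered, reply)
```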






For the listening process/thread, each reused socket is called a listener bucket, i.e. each listening socket is a listener bucket.