1.1 Background: The whole process of network data transmission

In every network I/O operation, data must pass through several buffers before it is sent out, as shown below:

The browser is on the right and, as an example, an httpd server is on the left.

* When the httpd service receives a request from the browser for the file index.html, the httpd child process/thread handling the request first issues a system call asking the kernel to load index.html from the storage device. The data is loaded into a kernel-space buffer, the kernel buffer, not directly into the memory of the process/thread. Because this transfer between the storage device and memory does not involve the CPU, it is a DMA operation.
* When the data is ready, the kernel wakes the httpd child process/thread, which uses read() to copy the data into its own buffer, the app buffer in the figure. Once the data is in the app buffer, it belongs to the process/thread, which can read it, modify it, and so on. This copy is performed by the CPU, so it consumes CPU resources. This phase also crosses from kernel space to user space, so a context switch takes place.
* After the data has been modified (or perhaps left untouched), it has to be sent back to the browser, which means transmitting it over the TCP connection. The TCP stack has its own buffers, and data must be written into them to be sent: the send buffer on the sending side and the recv buffer on the receiving side. So write() is used to copy the data from the app buffer to the send buffer. This is again a CPU copy, so it consumes CPU, and a context switch may also occur.
* Data destined for a remote host is eventually transmitted by the network card, so the kernel then hands the data in the send buffer to the network card, which sends it out. Because this is a transfer between memory and a device that does not involve the CPU, it is also a DMA operation.
* When the network card of the browser's host receives the response data (the data arrives as a continuous stream, of course), it is transferred to TCP's recv buffer. This is a DMA operation.
* Data keeps filling the recv buffer, but the browser does not necessarily read it right away. The browser process has to be notified so that it uses recv() to take the data out of the recv buffer. This is a CPU operation (it is not marked in the figure).
Note that on the httpd side, if the network is slow and the data the httpd child process/thread has to send is large (larger than the send buffer), the socket buffer is likely to fill up. When that happens, a blocking write() puts the child process/thread into a waiting state, while on a non-blocking socket write() fails with EWOULDBLOCK or EAGAIN.

On the browser side, if the browser process is slow to take data out of the socket buffer (recv buffer), that buffer is likely to fill up as well.

Now let's look at what the network data "goes through" on the httpd side, as shown below:

Every time a process/thread needs to send a piece of data, the data is copied to the kernel buffer, then to the app buffer, then to the socket buffer, and finally to the network card. In other words, it always goes through four copies.
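
The two CPU copies in the middle of that path correspond to an ordinary read()/write() loop. The following is a minimal sketch, not httpd's actual code; the helper name copy_via_user_space and the 8 KiB buffer size are arbitrary choices for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * user-space buffer. Returns the total number of bytes copied, or -1. */
long copy_via_user_space(int in_fd, int out_fd)
{
    char app_buffer[8192];              /* the "app buffer" in user space */
    long total = 0;

    for (;;) {
        /* kernel buffer -> app buffer (CPU copy) */
        ssize_t n = read(in_fd, app_buffer, sizeof app_buffer);
        if (n == 0)
            return total;               /* end of input */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        /* app buffer -> socket buffer (CPU copy); handle short writes */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, app_buffer + off, (size_t)(n - off));
            if (w == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The zero-copy calls in the following sections all aim to remove the app_buffer step from this loop.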

But think about it: under normal circumstances, the copy from the storage device to the kernel buffer is required, and so is the copy from the socket buffer to the NIC. Is the copy from the kernel buffer to the app buffer required, though? Does the process really need to access or modify the data? Not always. Even for a web service, if the HTTP response message does not have to be modified, the data never needs to enter user space. That is, there is no need to copy it from the kernel buffer to the app buffer. This is the idea behind zero-copy.

Zero-copy means avoiding copies of data between kernel space and user space. Its main purpose is to eliminate unnecessary copies so that the CPU does not spend its time on bulk data copying.

Note: the above describes only the normal case. Some hardware can do the work of the TCP/IP protocol stack itself, in which case data may bypass the socket buffer and move directly between the app buffer and the hardware. RDMA technology is built on this basis.


1.2 zero-copy: mmap()

The mmap() function maps a file directly into the memory of a user program and, on success, returns a pointer to the mapped area. This memory can be used as shared memory between processes, and the kernel can also operate on it directly.

Mapping a file does not copy any data into memory right away. Only when the mapped memory is accessed and the data is found to be missing does a page fault occur, and the data is then copied into the area with a DMA operation. From there the data can be copied directly to the socket buffer, so this counts as zero-copy. As shown in the figure:

The prototype is as follows:
#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
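
As a rough sketch of how mmap() replaces the read() step, the helper below maps a file read-only and writes the mapped bytes straight to an output descriptor. The name send_file_with_mmap is hypothetical, and the code assumes a regular file that is not truncated while it is mapped.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the file at `path` read-only and write its
 * contents to out_fd. Returns the number of bytes written, or -1. */
long send_file_with_mmap(const char *path, int out_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0)      { close(fd); return 0; }  /* mmap rejects length 0 */

    size_t len = (size_t)st.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    /* Pages are faulted in on first access; there is no read() into an
     * app buffer. write() copies from the mapping to the socket buffer. */
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(out_fd, data + off, len - off);
        if (w == -1) {
            if (errno == EINTR)
                continue;
            munmap(data, len);
            return -1;
        }
        off += (size_t)w;
    }

    munmap(data, len);
    return (long)off;
}
```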

1.3 zero-copy: sendfile()

The man page describes this function as follows:
sendfile() copies data between one file descriptor and another. Because this
copying is done within the kernel, sendfile() is more efficient than the
combination of read(2) and write(2), which would require transferring data to
and from user space.

The sendfile() function copies data between file descriptors: it copies data directly from the file descriptor in_fd to the file descriptor out_fd, where in_fd is the data provider and out_fd is the data receiver. The copy is performed inside the kernel without passing through user space, so the data never has to be copied to the app buffer, which achieves zero-copy. As shown below:

The prototype of sendfile() is as follows:
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
However, in_fd of sendfile() must refer to a file that supports mmap()-like operations, that is, a real file; it cannot be a socket, a pipe, or a similar file. Before Linux 2.6.33, out_fd additionally had to be a socket descriptor, which is why people tend to think of sendfile() as being specific to network data copying. Since Linux 2.6.33, out_fd can be any file, and if it is a regular file, sendfile() updates its file offset appropriately.

Take nginx with sendfile and tcp_nopush enabled as an example. With tcp_nopush on, nginx first builds the response header in user space and puts it in the socket send buffer, then writes to the send buffer an identifier for the file to be loaded (for example, declaring that the data of a.txt will be read and sent later). These two parts are sent to the client first. The disk file is then loaded (via sendfile), and the data is sent each time the send buffer fills up, until everything has been sent.


1.4 zero-copy: splice()

The man page describes this function as follows:
splice() moves data between two file descriptors without copying between
kernel address space and user address space.
It transfers up to len bytes of data from the file descriptor fd_in to the
file descriptor fd_out, where one of the descriptors must refer to a pipe.
The splice() function moves data between two file descriptors, one of which must be a pipe descriptor. Because no data has to be copied between the kernel buffer and the app buffer, zero-copy is achieved. As shown in the figure:

Note: because one of the descriptors must be a pipe, if the target in the figure above is a socket file descriptor, the source has to be the pipe, so there is no storage-->kernel buffer DMA operation.

The prototype is as follows:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

1.5 zero-copy: tee()

The man page describes this function as follows:
tee() duplicates up to len bytes of data from the pipe referred to by the file
descriptor fd_in to the pipe referred to by the file descriptor fd_out. It does
not consume the data that is duplicated from fd_in; therefore, that data can be
copied by a subsequent splice(2).

The tee() function duplicates data between two pipe descriptors. Copying from fd_in to another pipe fd_out does not consume the data in fd_in, so after the copy the data in fd_in can still be moved with splice(). Because the data never passes through user space, zero-copy is achieved. As shown in the figure:

A basic version of the tee program on Linux can be built by combining tee() and splice(), as the example in the tee(2) man page does: the data is first duplicated into a pipe with tee(), and splice() is then used to move the data to another file descriptor.

The prototype is as follows:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
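
The tee()-then-splice() combination can be sketched as a single step. This is an illustration only: the helper name tee_then_splice and the parameter name log_fd are hypothetical, pipe_in and pipe_out must both be pipes, and pipe_in is assumed to already hold data.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: duplicate up to `len` bytes waiting in pipe_in
 * into pipe_out, then move the original bytes from pipe_in to log_fd.
 * Returns the number of bytes handled, 0 at end of input, -1 on error. */
long tee_then_splice(int pipe_in, int pipe_out, int log_fd, size_t len)
{
    /* Duplicate the data; pipe_in still holds it afterwards. */
    ssize_t n = tee(pipe_in, pipe_out, len, 0);
    if (n <= 0)
        return (long)n;

    /* Consume exactly the duplicated bytes from pipe_in. */
    ssize_t left = n;
    while (left > 0) {
        ssize_t m = splice(pipe_in, NULL, log_fd, NULL,
                           (size_t)left, SPLICE_F_MOVE);
        if (m <= 0)
            return -1;
        left -= m;
    }
    return (long)n;
}
```

Calling this in a loop with standard input as pipe_in and standard output as pipe_out gives roughly the behavior of the tee program, provided both really are pipes.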

1.6 Copy-on-write (COW)

Without copy-on-write, a parent process that calls fork() to create a child would have to copy all of its memory pages. This causes at least two problems: it consumes a lot of memory, and the copy itself takes time. In particular, when exec is used after fork to load a new program, the memory space is reinitialized anyway, so the copy is almost entirely wasted.

With copy-on-write, fork does not copy memory pages when creating the child process; the pages are shared instead (in other words, the child also points to the parent's physical memory). Only when the child needs to modify a piece of data is that piece copied into the child's own memory and changed there. From then on it is private to the child, which can freely access, modify, and copy it. This achieves zero-copy to some extent: even when some blocks do get copied, they are copied gradually, only as they are needed.
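
Copy-on-write happens inside the kernel and is not directly visible to a program, but its effect can be observed: a write in the child does not change the parent's data. The sketch below demonstrates only that visible behavior; the function name and the 1 MiB buffer size are arbitrary choices for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's buffer is unchanged after the child wrote
 * to its own view of it, 0 if it changed, -1 on error. */
int parent_unaffected_by_child_write(void)
{
    size_t len = (size_t)1 << 20;
    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', len);              /* touch the pages so they exist */

    pid_t pid = fork();                 /* pages are shared at this point */
    if (pid == -1) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        buf[0] = 'C';                   /* first write: the kernel gives the
                                           child a private copy of this page */
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        free(buf);
        return -1;
    }

    int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && buf[0] == 'P';
    free(buf);
    return ok;
}
```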

There is much more to copy-on-write; the above is only a brief overview.