# SSE Image algorithm optimization series 7： Be based onSSE Fast rectangular core corrosion and expansion（ Maximum and minimum） algorithm.

The algorithm time and efficiency of other authors are not tested, This article dare not claim to be the fastest, But it's also pretty fast, In one stationI5 Single core resource processing on machine 3000 *

2000 Gray data time of 20ms, And the algorithm is independent of the size of the core, So calledo(1) algorithm.

Before implementing this algorithm, We also refer to the algorithm in a reference paper proposed by he Kaiming in the dark channel defogging： STREAMING MAXIMUM-MINIMUM FILTER

USING NO MORE THAN THREE COMPARISONS PER ELEMENT

<https://wenku.baidu.com/view/ebabbf6a25c52cc58bd6bedc.html>

, The process described in this article is alsoo(1) Of, It's also pretty fast, But not fast enough.

I had an idea of my own, It's also based on the separation of rows and columns, Faster than the code above, And alsoo(1) algorithm, But the speed of the algorithm is related to the content of the picture, For example, after an algorithm is performed on a graph, Perform the same algorithm on the result again, Maybe it'll be a lot slower later, It has something to do with my improved algorithm, But it's rare.

The algorithm introduced in this paper was also seen a long time ago, But it never attracted my attention, The corresponding reference paper is A fast algorithm for local

minimum and maximum filters on rectangular and octagonal kernels

<http://www.docin.com/p-125421756.html> , At that time, I thought that the realization of the paper might not be as fast as my own idea, So I didn't go into it.

The implementation steps of this paper are also based on row column separation, That is to say, one-dimensional operation of row direction is carried out first, Then one-dimensional calculation of column direction is carried out for the results, Specific theoretical description, let's go to the research paper.

In fact, the core of the paper is the following figure.

In Input data representing one-dimensional row direction,g/h Two intermediate data in separate operation, Same size as input data.

As shown in the figure above, We assume that the kernel size to be calculated isR, Then divide a row into multiple sizes as D =（2R+1) Segmentation, For example, in the pictureR=2, D=5

, Preprocess each segment, among x Position No. stores the maximum value of the point on the straight line segment where the arrow is located（ minimum value）, We can deal with it in this way g and h

Two arrays, So for a point（ Index isI）, Its radiusR Maximum inside（ Small） The value is：Max/ Min(g(I+R),h(I-R)).

It's a simple process.

Let's take a set of data to illustrate the process, Suppose a row of data is as follows, We're going to expand（ Maximum value）, Nuclear radius2：

In: 20 12 35 9 10 7 32 15 20 45 28

Correspondingg/h by：

g: 20 20 35 35 35 7 32 32 32 45 45

h: 35 35 35 10 9 45 45 45 45 45 28

If we want to calculate the4 The radius of points is2 Maximum value, Corresponding g(I+R) = g(4+2) = 7, h(I-R)=h(4-2)=35,

The result ismax(7,35) = 35;

If we want to calculate the6 The radius of points is2 Maximum value, Corresponding g(I+R) = g(6+2) = 32, h(I-R)=h(6-2)=10,

The result ismax(32,10) = 32;

Note that the index above is based on1 Of the starting point of counting.

Edge processing：

Notice at the edge, Like the left edge, When the index of the point to be calculated is less thanR Time, Thenh Value is invalid, On the right edge,g Value is unreachable, But watch carefully, The problem is simple, Take the above data as an example, If you want to calculate the index as2 The radius is2 Maximum value, Becauseh(2-2) Is out of index（ As I said before, in this case1 Is the starting point of subscript）, So we can't use the above method, Then we return to the essence of the problem, Calculation2 The radius is2 Maximum value, That is computation.max(In(1), In(2), In(3), In(4)),

And that's not the valueg The index in the data is2+2 Data at, Before that, I've helped us with the algorithm, Just take a direct value.

At the end of the data（ Right）, Things are different, We shouldH Value in.

From the above analysis, we can see that, This algorithm has a feature, The larger the radius is, The less time the algorithm takes, Because edge processing only requires copying data, Without more calculation, Maybe a lot of people don't believe it.

Algorithm implementation：

With the above description, To implement a fast corrosion or expansion algorithm, it is believed that it should be a very easy thing for parts, Advance direction processing, In column direction, Very simple..

I've been fascinated recentlySSE Algorithm optimization, So I thought about this algorithmSSE optimization, I was looking at it before.SSE Function time, I've been thinking_mm_max_epi8/_mm_min_epi8 This one-time access16 Can the most valued function of byte data be used for corrosion and expansion, But because they're comparing two consecutive numbers, It's hard to use in the direction of business, But if the data comparison is in the column direction, Then it can be used.

We have a comparison of column directions for the above algorithms, No, there's a place.

First, We give the update in column directiong value/h Values within the range of each segmentC Language implementation code, For example, gettingg The approximate code for the value is as follows：

memcpy(G + StartY * ValidDataLength, Src + StartY * Stride, ValidDataLength);

for (int Y = StartY + 1; Y < EndY; Y++) { unsigned char *LinePS = Src + Y *

Stride; unsignedchar *LinePD = G + Y * ValidDataLength; unsigned char *LinePD =

G + (Y -1) * ValidDataLength; for (int X = 0; X < ValidDataLength; X++) {

LinePD[X]= IM_Max(LinePS[X], LinePL[X]); } }

StartY Is the starting point of the calculated segment range,EndY Is the end of the segment range, We observeg The law of data, Know that the maximum value of the first row in the range of segments is the data itself, And the latter is the result compared with the previous line.

Obviously：

for (int X = 0; X < Width * Channel; X++) { LinePD[X] = IM_Max(LinePS[X],

LinePL[X]); }

This code is easy to vectorize, If this is a floating-point operation, The compiler will directly help us with vector processing, But for bytes, It seems that the compiler is not so smart, We can quantify it manually, The code is as follows：

memcpy(G + StartY * ValidDataLength, Dest + StartY * ValidDataLength,

ValidDataLength);// Each paragraphG The first line of data is the original data for (int Y = StartY + 1; Y < EndY; Y++) {

unsignedchar *LinePS = Dest + Y * ValidDataLength; unsigned char *LinePD = G +

Y * ValidDataLength; unsigned char *LinePL = G + (Y - 1) * ValidDataLength; for

(int X = 0; X < Block * BlockSize; X += BlockSize) { _mm_storeu_si128((__m128i

*)(LinePD + X), _mm_max_epu8(_mm_loadu_si128((__m128i *)(LinePS + X)),

_mm_loadu_si128((__m128i *)(LinePL + X)))); } for (int X = Block * BlockSize; X

< ValidDataLength; X++) { LinePD[X] = IM_Max(LinePS[X], LinePL[X]); } }

amongBlockSize = 16, Block = ValidDataLength / BlockSize;

This code is very simple, abouth The same is true of.

When we getg,h After data, Of the following processesC The code is very simple：

for (int Y = Radius; Y < IM_Min(BlockV * Size - Radius, Width); Y++) //

The data in the middle is consistent with G+Radius and R - Radius Take the big demand { unsigned char *LinePG = G +

IM_Min(Y + Radius, Width -1) * Width; unsigned char *LinePH = H + (Y - Radius) *

ValidDataLength; unsignedchar *LinePD = T + Y * ValidDataLength;for (int X =

0; X < ValidDataLength; X++) { LinePD[X] = IM_Max(LinePG[X], LinePH[X]); } }

It's the same thing, Internalfor Circulation can be used directlySSE replace.

But it's just the direction of the column, Is it possible to use the line directionSSE How to deal with it? To be sure, Absolutely. But unless you're very patient, Think of all kinds ofpack andunpack perhapsshuffle It's going to drive you crazy, And the final efficiency may not be as good as using ordinary ones directlyC language.

So how to deal with it, I think you can think of transposition, Exactly, After the data is transposed, the column direction is processed, And then transpose it back is equivalent to processing the row direction of the original data.

About transposition, It's always been a time-consuming process, But I transposed itSSE optimization（ Support8 position,24 position,32 position）, Increase speed4-6 times

<http://www.cnblogs.com/Imageshop/p/6796485.html>

The use ofSSE High speed transpose operation, Using it to realize the flow of this paper is very reliable.

So we post most of the overall processing code：

Vertical core：

int IM_Vert_MaxFilter(unsigned char *Src, unsigned char *Dest, int Width, int

Height,int Stride, int Radius) { int Channel = Stride / Width; if ((Src ==

NULL) || (Dest == NULL))return IM_STATUS_NULLREFRENCE; if ((Width <= 0) ||

(Height <=0)) return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel

!=3) && (Channel != 4)) return IM_STATUS_INVALIDPARAMETER; //

From the perspective of memory saving, It can only take twice the extra memory unsigned char *G = (unsigned char *)malloc(Height *

Width * Channel *sizeof(unsigned char)); unsigned char *H = (unsigned char

*)malloc(Height * Width * Channel *sizeof(unsigned char)); if ((G == NULL) ||

(H == NULL)) { if (G != NULL) free(G); if (H != NULL) free(H); return

IM_STATUS_OUTOFMEMORY; }// Vertical processing int Size = Radius * 2 + 1, ValidDataLength =

Width * Channel; int BlockSize = 16, Block = ValidDataLength / BlockSize; int

BlockV = ((Height % Size) ==0 ? Height / Size : (Height / Size) + 1); for (int

Index =0; Index < BlockV; Index++) { int StartY = Index * Size, EndY =

IM_Min(Index * Size + Size, Height); memcpy(G + StartY * ValidDataLength, Src +

StartY * Stride, ValidDataLength);// Each paragraphG The first line of data is the original data for (int Y = StartY + 1; Y

< EndY; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD

= G + Y * ValidDataLength; unsigned char *LinePL = G + (Y - 1) *

ValidDataLength;for (int X = 0; X < Block * BlockSize; X += BlockSize) {

_mm_storeu_si128((__m128i*)(LinePD + X), _mm_max_epu8(_mm_loadu_si128((__m128i

*)(LinePS + X)), _mm_loadu_si128((__m128i *)(LinePL + X)))); } for (int X =

Block * BlockSize; X < ValidDataLength; X++) { LinePD[X] = IM_Max(LinePS[X],

LinePL[X]); } } memcpy(H+ StartY * ValidDataLength, G + (EndY - 1) *

ValidDataLength, ValidDataLength);// Each paragraphH The first row of data is correspondingG Last row of data memcpy(H + (EndY - 1

) * ValidDataLength, Src + (EndY -1) * Stride, ValidDataLength); //

Each paragraphH The last row of data is the original data for (int Y = EndY - 2; Y > StartY; Y--) // Pay attention to the change of cycle times {

unsignedchar *LinePS = Src + Y * Stride; unsigned char *LinePD = H + Y *

ValidDataLength; unsignedchar *LinePL = H + (Y + 1) * ValidDataLength; for (int

X =0; X < Block * BlockSize; X += BlockSize) { _mm_storeu_si128((__m128i

*)(LinePD + X), _mm_max_epu8(_mm_loadu_si128((__m128i *)(LinePS + X)),

_mm_loadu_si128((__m128i *)(LinePL + X)))); } for (int X = Block * BlockSize; X

< ValidDataLength; X++) { LinePD[X] = IM_Max(LinePS[X], LinePL[X]); } } //

For the maximum algorithm, The last block in the column direction is notSize Big hour, The following data can only be duplicate edge pixels, It's like thisG/H Value sumHeight - 1 The size is the same } //

The whole data is divided into three parts,[0, Radius] Group I,[Radius, BlockV * Size - Radius] The second group.[BlockV * Size

- Radius, BlockV * Size] The third group.// Results of the first set of dataG in[Radius, 2 * Radius] Value, Access to the second group of dataG +

Radius andH - Radius Small values in, Third sets ofH - Radius Value.// Top half data, At this pointH Invalid data // // Several codes are deleted here //

for (int Y = Radius; Y < IM_Min(BlockV * Size - Radius, Height); Y++) //

The data in the middle is consistent with G + Radius and R - Radius Take the big demand { unsigned char *LinePG = G +

IM_Min(Y + Radius, Height -1) * ValidDataLength; // May be out of range unsigned char

*LinePH = H + (Y - Radius) * ValidDataLength; unsigned char *LinePD = Dest + Y *

Stride;for (int X = 0; X < Block * BlockSize; X += BlockSize) {

_mm_storeu_si128((__m128i*)(LinePD + X), _mm_max_epu8(_mm_loadu_si128((__m128i

*)(LinePG + X)), _mm_loadu_si128((__m128i *)(LinePH + X)))); } for (int X =

Block * BlockSize; X < ValidDataLength; X++) { LinePD[X] = IM_Max(LinePG[X],

LinePH[X]); } }// Bottom half data, At this pointG Data useless // // Several codes are deleted here // free(G); free(H); return

IM_STATUS_OK; }

Integrated call：

int IM_MaxFilter(unsigned char *Src, unsigned char *Dest, int Width, int

Height,int Stride, int Radius) { int Channel = Stride / Width; if ((Src ==

NULL) || (Dest == NULL))return IM_STATUS_NULLREFRENCE; if ((Width <= 0) ||

(Height <=0)) return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel

!=3) && (Channel != 4)) return IM_STATUS_INVALIDPARAMETER; int Status =

IM_STATUS_OK; unsignedchar *T = (unsigned char *)malloc(Height * Stride * sizeof

(unsignedchar)); if (T == NULL) return IM_STATUS_OUTOFMEMORY; Status =

IM_Vert_MaxFilter(Src, T, Width, Height, Stride, Radius);if (Status !=

IM_STATUS_OK)goto FreeMemory; Status = IM_Transpose(T, Dest, Width, Height,

Stride, Height, Width, Height * Channel);// Transposition, Be carefulDest I only used it.Height * Channel Data if

(Status != IM_STATUS_OK)goto FreeMemory; Status = IM_Vert_MaxFilter(Dest, T,

Height, Width, Height * Channel, Radius); if (Status != IM_STATUS_OK) goto

FreeMemory; Status= IM_Transpose(T, Dest, Height, Width, Height * Channel,

Width, Height, Stride); FreeMemory: free(T);return Status; }

There are many details in the above code, Including the processing of incomplete data at the end of the block, You can understand.

Two parts of the code have been deleted, Deleted code is easy to fill up, Because I don't like my code being copied and pasted directly by others.

Further analysis：

You can see from the above code, To realize the whole process, We need the original3 Extra memory times the size, Is it possible to reduce this, Yes, there are. When processing column direction, We can only deal with it at one time16 Column or32 column, suchg/h Data only neededHeight

* 16（32） * sizeof(unsigned

char) Size of memory, And the other advantage of this is that when comparing each segment internally, Due to less updated content, You can use onexmm Register holds the most valuable temporary result, In this way, it is not necessary to load the memory data of the previous row, Many times less memory reading and writing process, A simple example code is as follows：

unsigned char *LinePS = Src + StartY * Stride + X; unsigned char *LinePD = G

+ StartY *32; __m128i Max1 = _mm_setzero_si128(), Max2 = _mm_setzero_si128(); //

This can reduce one memory load for (int Y = StartY; Y < EndY; Y++) { Max1 =

_mm_max_epu8(_mm_loadu_si128((__m128i *)(LinePS +0)), Max1); // Or use oneAVX instructions

Max2 = _mm_max_epu8(_mm_loadu_si128((__m128i *)(LinePS +16)), Max2);

_mm_store_si128((__m128i*)(LinePD + 0), Max1); _mm_store_si128((__m128i

*)(LinePD +16), Max2); LinePS += Stride; LinePD += 32; }

Test in my notebook, This is a little faster than the previous version, And it takes up less memory, Kill two birds with one stone.

You are welcome to provide a faster way to implement the algorithm.

This paperDemo Download address：

http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar

<http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar>

, All the algorithms in it are based onSSE Realized.

If you think this article will help you, Please like this article.

标签

30天阅读排行