# SSE Image algorithm optimization series IX ： Flexible use SIMD instructions 16 Double increase Sobel Speed of edge detection （4000*3000 Of 24 Bit image time by 480ms Down to 30ms）.

More than half a year , Basically, we're struggling with some basic optimizations , A lot of it was a decade ago , From the perspective of following the trend , It's a waste of time for many people to study these things , That is, we can't make money , It's not helpful for the improvement of work ability . But I think the so-called happiness of human beings , Can be divided into material enjoyment , And more complicated spiritual wealth , It's worth it even if it's only in a short period of self satisfaction .

Talk less ,

SIMD Instruction set , This ancient thing , From the first generation , It's close 20 Years of history , From the beginning MMX technology , reach SSE, And later SSE2,SSE3,SSE4,AVX as well as 11 Years later AVX2, Gradually mature and rich , But for now, we're looking at commonality ,AVX The radiation range is still limited , Most of them are still considered for optimization 128 Bitwise SSE Instruction set , In my previous series of articles , There are also many articles related to the optimization of this aspect .

Let's learn today Sobel Optimization of algorithm , first , We give the traditional C++ Implemented algorithm code ：

int IM_Sobel(unsigned char *Src, unsigned char *Dest, int Width, int Height,

int Stride) { int Channel = Stride / Width; if ((Src == NULL) || (Dest == NULL))

return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0)) return

IM_STATUS_INVALIDPARAMETER;if ((Channel != 1) && (Channel != 3)) return

IM_STATUS_INVALIDPARAMETER; unsignedchar *RowCopy = (unsigned char *)malloc

((Width +2) * 3 * Channel); if (RowCopy == NULL) return IM_STATUS_OUTOFMEMORY;

unsignedchar *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) *

Channel; unsignedchar *Third = RowCopy + (Width + 2) * 2 * Channel;

memcpy(Second, Src, Channel); memcpy(Second+ Channel, Src, Width * Channel); //

Copy data to the middle memcpy(Second + (Width + 1) * Channel, Src + (Width - 1) * Channel,

Channel); memcpy(First, Second, (Width+ 2) * Channel); // The first line is the same as the second

memcpy(Third, Src+ Stride, Channel); // Copy the second row of data memcpy(Third + Channel, Src +

Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride +

(Width -1) * Channel, Channel); for (int Y = 0; Y < Height; Y++) { unsigned char

*LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y !=

0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp;

}if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else {

memcpy(Third, Src+ (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y

+1) * Stride, Width * Channel); // Because the data in the previous row is backed up , Even though Src and Dest There is no problem with the same

memcpy(Third + (Width +1) * Channel, Src + (Y + 1) * Stride + (Width - 1) *

Channel, Channel); }if (Channel == 1) { for (int X = 0; X < Width; X++) { int

GX = First[X] - First[X +2] + (Second[X] - Second[X + 2]) * 2 + Third[X] -

Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) *

2 - Third[X] - Third[X + 2]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY

+0.0F)); } } else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] -

First[X +6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY

= First[X] + First[X +6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] -

Third[X +6]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F)); } } }

free(RowCopy); return IM_STATUS_OK; }

Very short code , But this code is classic , first , This code supports In-Place operation , that is Src and Dest Can be the same block of memory , second , This code essentially supports the edge . Many reference codes on the network only deal with the middle effective area . I don't want to talk about the specific implementation details , See for yourself .

that Sobel The core is computing X Gradient of direction GX and Y Gradient of direction GY, Finally, a time-consuming operation is to GX*GX+GY*GY Square of .

The above code , Without opening the compiler SSE Optimization and fast floating point computing , Direct use FPU, Yes 4000*3000 The color map of 480ms, When on SSE after , The approximate time is 220ms

, So the system compiler's SSE Optimization is also very powerful , After decompilation, you can see such parts in the assembly ：

59AD12F8 movd xmm0,ecx 59AD12FC cvtdq2ps xmm0,xmm0 59AD12FF sqrtss xmm0,xmm0

59AD1303 cvttss2si ecx,xmm0

It can be seen that it is a single floating-point number called sqrt optimization .

Because of the Sobel The last step is to record the data in the form of images , therefore ,IM_ClampToByte(sqrtf(GX * GX + GY * GY +

0.0F)) It can be realized by looking up the table . Simply change to the following version , Avoid floating point calculations .

int IM_SobelTable(unsigned char *Src, unsigned char *Dest, int Width, int

Height,int Stride) { int Channel = Stride / Width; if ((Src == NULL) || (Dest

== NULL))return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0))

return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel != 3)) return

IM_STATUS_INVALIDPARAMETER; unsignedchar *RowCopy = (unsigned char *)malloc

((Width +2) * 3 * Channel); if (RowCopy == NULL) return IM_STATUS_OUTOFMEMORY;

unsignedchar *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) *

Channel; unsignedchar *Third = RowCopy + (Width + 2) * 2 * Channel;

memcpy(Second, Src, Channel); memcpy(Second+ Channel, Src, Width * Channel); //

Copy data to the middle memcpy(Second + (Width + 1) * Channel, Src + (Width - 1) * Channel,

Channel); memcpy(First, Second, (Width+ 2) * Channel); // The first line is the same as the second

memcpy(Third, Src+ Stride, Channel); // Copy the second row of data memcpy(Third + Channel, Src +

Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride +

(Width -1) * Channel, Channel); int BlockSize = 16, Block = (Width * Channel) /

BlockSize; unsignedchar Table[65026]; for (int Y = 0; Y < 65026; Y++) Table[Y]

= (sqrtf(Y +0.0f) + 0.5f); for (int Y = 0; Y < Height; Y++) { unsigned char

*LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y !=

0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp;

}if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else {

memcpy(Third, Src+ (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y

+1) * Stride, Width * Channel); // Because the data in the previous row is backed up , Even though Src and Dest There is no problem with the same

memcpy(Third + (Width +1) * Channel, Src + (Y + 1) * Stride + (Width - 1) *

Channel, Channel); }if (Channel == 1) { for (int X = 0; X < Width; X++) { int

GX = First[X] - First[X +2] + (Second[X] - Second[X + 2]) * 2 + Third[X] -

Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) *

2 - Third[X] - Third[X + 2]; LinePD[X] = Table[IM_Min(GX * GX + GY * GY, 65025

)]; } }else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] - First[X

+6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY =

First[X] + First[X +6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X

+6]; LinePD[X] = Table[IM_Min(GX * GX + GY * GY, 65025)]; } } } free(RowCopy);

return IM_STATUS_OK; }

Yes 4000*3000 The color map of 180ms, More systematic SSE Optimization is fast 40ms, There is no floating-point calculation in this process , therefore , Can know the calculation GX and GY In this case, the time consuming of .

This process is best suited to SSE Handled .

We analyze .

First, take a look at these two sentences ：

int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X]

- Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1])

*2 - Third[X] - Third[X + 2];

It involves 8 Different pixels , Consider the characteristics of the calculation and the range of data , In internal calculation, this int It can be used short replace , That is, to convert the loaded byte image data into short Type first , such SSE Optimized for use 8 individual SSE Variables are recorded separately 8 Color value of continuous pixel air volume , For each color value 16 Bit data representation .

This can be used _mm_unpacklo_epi8 coordination _mm_loadl_epi64 realization ：

__m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)),

Zero); __m128i FirstP1= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X

+3)), Zero); __m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i

*)(First + X +6)), Zero); __m128i SecondP0 =

_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero); __m128i

SecondP2= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)),

Zero); __m128i ThirdP0= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third +

X)), Zero); __m128i ThirdP1= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i

*)(Third + X +3)), Zero); __m128i ThirdP2 =

_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X +6)), Zero);

And then there's building blocks , use SSE Instead of ordinary C Function instruction implementation of GX and GY Calculation of .

__m128i GX16 =

_mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, FirstP2),

_mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2),1)), _mm_sub_epi16(ThirdP0,

ThirdP2))); __m128i GY16=

_mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2),

_mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1),1)), _mm_add_epi16(ThirdP0,

ThirdP2)));

Find a time GX16 and GY16 What's in there 8 individual 16 Intermediate result of bit , because SSE Floating point only sqrt operation , We have to convert them to floating-point numbers , The first step of this transformation is to convert them to int Shaping number of , such , It has to be broken down one by one 2 individual , Namely ：

__m128i GX32L = _mm_unpacklo_epi16(GX16, Zero); __m128i GX32H =

_mm_unpackhi_epi16(GX16, Zero); __m128i GY32L= _mm_unpacklo_epi16(GY16, Zero);

__m128i GY32H= _mm_unpackhi_epi16(GY16, Zero);

And then there's building blocks ：

__m128i ResultL =

_mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32L,

GX32L), _mm_mullo_epi32(GY32L, GY32L))))); __m128i ResultH=

_mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32H,

GX32H), _mm_mullo_epi32(GY32H, GY32H)))));

Convert integer to float , After the final calculation, you need to convert floating-point numbers to integer numbers .

Last step , obtain 8 individual int Results of type , This result has to be converted to byte type , And the data may be beyond the range of bytes , So we need to use SSE Anti saturation downward packing function , As follows ：

_mm_storel_epi64((__m128i *)(LinePD + X),

_mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero));

Ok, It's all done , There are still some details to deal with. Let's add them slowly .

function , wow , as long as 37ms 了 , It's fast N times , But the result seems to be different from other ways , What's going on? .

I've been looking for it for a long time, and I haven't found the problem , Then step by step testing , The final problem lies in 16 Bit to 32 I'm going there .

usually , We all expand the byte data of pixels upward , They are all positive numbers , So use unpack Something like that zero Put high 8 Bit or high 16 The data of bits is filled with 0 That's it , But in this case ,GX16 perhaps GY16 It's probably a negative number , And the highest bit of a negative number is the sign bit , If all are filled with 0, It becomes a positive number , Obviously changed the original data , So we got the wrong result .

How to solve the problem , For this example , It's simple , Because there's only one square operation , therefore , Yes GX Taking absolute value first will not change the result of calculation , So there's no negative data , After modification , It turned out to be right .

You can continue to optimize , Let's look at the final calculation GX*GX+GY*GY The process of , We know ,SSE3 Provides a _mm_madd_epi16 instructions , Its function is ：

r0 := (a0 * b0) + (a1 * b1) r1 := (a2 * b2) + (a3 * b3) r2 := (a4 * b4) + (a5 *

b5) r3 := (a6 * b6) + (a7 * b7)

If we can GX and GY The other two data ：

GXYL = GX0 GY0 GX1 GY1 GX2 GY2 GX3 GY3

GXYH = GX4 GY4 GX5 GY5 GX6 GY6 GX7 GY7

Then call _mm_madd_epi16(GXYL ,GXYL ) and _mm_madd_epi16(GXYH ,GXYH

) Is it possible to get the same result as before , And this splice SSE There are ready-made functions ：

__m128i GXYL = _mm_unpacklo_epi16(GX16, GY16); __m128i GXYH =

_mm_unpackhi_epi16(GX16, GY16);

In this way, the original needs 10 Instructions to 4 Instructions , Code is simpler and faster .

Try to modify it like this , The whole calculation process time is reduced to 32ms about .

in addition , The other thing that can be optimized is borrowing _mm_maddubs_epi16 Function to add, subtract, multiply, divide and expand pixels .

This function does the following ：

r0 := SATURATE_16((a0 * b0) + (a1 * b1)) r1 := SATURATE_16((a2 * b2) + (a3 *

b3)) ... r7 := SATURATE_16((a14 * b14) + (a15 * b15))

His first parameter is 16 Unsigned byte data , The second parameter is 16 Signed char data .

coordination unpack Using a technology like the one above, you can do it all at once 16 The number of pixels in bytes has been reduced , This will probably speed up the whole process 2ms, To the end 30ms.

Source code address ：http://files.cnblogs.com/files/Imageshop/Sobel.rar

<http://files.cnblogs.com/files/Imageshop/Sobel.rar> （ Of which SSE Please sort out the code according to the idea of this article .）

http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar

<http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar>

, Here is one I use all SSE Optimized image processing Demo, Interested friends can have a look .

Welcome to praise and reward .