More than half a year, Basically, I'm struggling with some basic optimizations, A lot of it was a decade ago, From the perspective of following the trend, It's a waste of time for many people to study these things, That is, we can't make money, It's not helpful for the improvement of work ability. But I think the so-called happiness of human beings, Can be divided into material enjoyment, And more complicated spiritual wealth, It's worth it even if it's only in a short period of self satisfaction. 

      Gossip,
SIMD Instruction set, This ancient thing, From the first generation, Soon enough20 Years of history, From the beginningMMX technology, reachSSE, And laterSSE2,SSE3,SSE4,AVX as well as11 After the yearAVX2, Gradually mature and rich, But for now, we're looking at commonality,AVX The radiation range is still limited, Most of them are still considered for optimization128 BitSSE Instruction set, In my previous series of articles, There are also many articles related to the optimization of this aspect.

      Let's learn todaySobel Optimization of algorithm, First, We give the traditionalC++ Implemented algorithm code:
int IM_Sobel(unsigned char *Src, unsigned char *Dest, int Width, int Height,
int Stride) { int Channel = Stride / Width; if ((Src == NULL) || (Dest == NULL))
return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0)) return
IM_STATUS_INVALIDPARAMETER;if ((Channel != 1) && (Channel != 3)) return
IM_STATUS_INVALIDPARAMETER; unsignedchar *RowCopy = (unsigned char *)malloc
((Width +2) * 3 * Channel); if (RowCopy == NULL) return IM_STATUS_OUTOFMEMORY;
unsignedchar *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) *
Channel; unsignedchar *Third = RowCopy + (Width + 2) * 2 * Channel;
memcpy(Second, Src, Channel); memcpy(Second+ Channel, Src, Width * Channel); //
Copy data to the middle memcpy(Second + (Width + 1) * Channel, Src + (Width - 1) * Channel,
Channel); memcpy(First, Second, (Width+ 2) * Channel); // The first line is the same as the second
memcpy(Third, Src+ Stride, Channel); // Copy the second row of data memcpy(Third + Channel, Src +
Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride +
(Width -1) * Channel, Channel); for (int Y = 0; Y < Height; Y++) { unsigned char
*LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y !=
0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp;
}if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else {
memcpy(Third, Src+ (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y
+1) * Stride, Width * Channel); // Because the data in the previous row is backed up, Even hereSrc andDest There is no problem with the same
memcpy(Third + (Width +1) * Channel, Src + (Y + 1) * Stride + (Width - 1) *
Channel, Channel); }if (Channel == 1) { for (int X = 0; X < Width; X++) { int
GX = First[X] - First[X +2] + (Second[X] - Second[X + 2]) * 2 + Third[X] -
Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) *
2 - Third[X] - Third[X + 2]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY
+0.0F)); } } else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] -
First[X +6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY
= First[X] + First[X +6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] -
Third[X +6]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F)); } } }
free(RowCopy); return IM_STATUS_OK; }

   The code is very short. But this code is classic, first, This code supportsIn-Place operation, that isSrc andDest Can be the same block of memory, Second, This code essentially supports the edge. Many reference codes on the network only deal with the middle effective area. I don't want to talk about the specific implementation details, See for yourself.

   thatSobel The core is computingX Gradient of directionGX andY Gradient of directionGY, Finally, a time-consuming operation is toGX*GX+GY*GY Squared.

     
The above code, Without opening the compilerSSE Optimization and fast floating point computing, Direct useFPU, Yes4000*3000 The color map of480ms, When openSSE after, The approximate time is220ms
, So the system compiler'sSSE Optimization is also very powerful, After decompilation, you can see such parts in the assembly:
59AD12F8 movd xmm0,ecx 59AD12FC cvtdq2ps xmm0,xmm0 59AD12FF sqrtss xmm0,xmm0
59AD1303 cvttss2si ecx,xmm0
   It can be seen that it is a single floating-point number calledsqrt optimization.

    Because of thisSobel The last step is to record the data in the form of images, therefore,IM_ClampToByte(sqrtf(GX * GX + GY * GY +
0.0F)) It can be realized by looking up the table. Simply change to the following version, Avoid floating point calculations.
int IM_SobelTable(unsigned char *Src, unsigned char *Dest, int Width, int
Height,int Stride) { int Channel = Stride / Width; if ((Src == NULL) || (Dest
== NULL))return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0))
return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel != 3)) return
IM_STATUS_INVALIDPARAMETER; unsignedchar *RowCopy = (unsigned char *)malloc
((Width +2) * 3 * Channel); if (RowCopy == NULL) return IM_STATUS_OUTOFMEMORY;
unsignedchar *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) *
Channel; unsignedchar *Third = RowCopy + (Width + 2) * 2 * Channel;
memcpy(Second, Src, Channel); memcpy(Second+ Channel, Src, Width * Channel); //
Copy data to the middle memcpy(Second + (Width + 1) * Channel, Src + (Width - 1) * Channel,
Channel); memcpy(First, Second, (Width+ 2) * Channel); // The first line is the same as the second
memcpy(Third, Src+ Stride, Channel); // Copy the second row of data memcpy(Third + Channel, Src +
Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride +
(Width -1) * Channel, Channel); int BlockSize = 16, Block = (Width * Channel) /
BlockSize; unsignedchar Table[65026]; for (int Y = 0; Y < 65026; Y++) Table[Y]
= (sqrtf(Y +0.0f) + 0.5f); for (int Y = 0; Y < Height; Y++) { unsigned char
*LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y !=
0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp;
}if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else {
memcpy(Third, Src+ (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y
+1) * Stride, Width * Channel); // Because the data in the previous row is backed up, Even hereSrc andDest There is no problem with the same
memcpy(Third + (Width +1) * Channel, Src + (Y + 1) * Stride + (Width - 1) *
Channel, Channel); }if (Channel == 1) { for (int X = 0; X < Width; X++) { int
GX = First[X] - First[X +2] + (Second[X] - Second[X + 2]) * 2 + Third[X] -
Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) *
2 - Third[X] - Third[X + 2]; LinePD[X] = Table[IM_Min(GX * GX + GY * GY, 65025
)]; } }else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] - First[X
+6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY =
First[X] + First[X +6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X
+6]; LinePD[X] = Table[IM_Min(GX * GX + GY * GY, 65025)]; } } } free(RowCopy);
return IM_STATUS_OK; }
  
Yes4000*3000 The color map of180ms, System specificSSE Optimization is fast.40ms, There is no floating-point calculation in this process, therefore, Can know the calculationGX andGY In this case, the time consuming of.

   This process is best suited toSSE Processed.

   We analyze.

   First, take a look at these two sentences:
int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X]
- Third[X +2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1])
*2 - Third[X] - Third[X + 2];
     
  It involves8 Different pixels, Consider the characteristics of the calculation and the range of data, In internal calculation, thisint Can useshort replace, That is, to convert the loaded byte image data intoshort Type first, suchSSE Optimized for use8 individualSSE Variables are recorded separately8 Color value of continuous pixel air volume, For each color value16 Bit data representation.

   This can be used_mm_unpacklo_epi8 Coordination_mm_loadl_epi64 Realization:
__m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)),
Zero); __m128i FirstP1= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X
+3)), Zero); __m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i
*)(First + X +6)), Zero); __m128i SecondP0 =
_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero); __m128i
SecondP2= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)),
Zero); __m128i ThirdP0= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third +
X)), Zero); __m128i ThirdP1= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i
*)(Third + X +3)), Zero); __m128i ThirdP2 =
_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X +6)), Zero);
    And then there's building blocks, useSSE Instead of ordinaryC Function instruction implementation ofGX andGY Calculation.
__m128i GX16 =
_mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, FirstP2),
_mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2),1)), _mm_sub_epi16(ThirdP0,
ThirdP2))); __m128i GY16=
_mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2),
_mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1),1)), _mm_add_epi16(ThirdP0,
ThirdP2)));
  
Find a timeGX16 andGY16 What's in there8 individual16 Intermediate result of bit, BecauseSSE Floating point onlysqrt operation, We have to convert them to floating-point numbers, The first step of this transformation is to convert them toint Shaping number, such, It has to be broken down one by one2 individual, Namely:
__m128i GX32L = _mm_unpacklo_epi16(GX16, Zero); __m128i GX32H =
_mm_unpackhi_epi16(GX16, Zero); __m128i GY32L= _mm_unpacklo_epi16(GY16, Zero);
__m128i GY32H= _mm_unpackhi_epi16(GY16, Zero);
   And then there's building blocks:
__m128i ResultL =
_mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32L,
GX32L), _mm_mullo_epi32(GY32L, GY32L))))); __m128i ResultH=
_mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32H,
GX32H), _mm_mullo_epi32(GY32H, GY32H)))));
   Convert integer to float, After the final calculation, you need to convert floating-point numbers to integer numbers.


   Last step, obtain8 individualint Type result, This result has to be converted to byte type, And the data may be beyond the range of bytes, So we need to useSSE Anti saturation downward packing function, As shown below:
_mm_storel_epi64((__m128i *)(LinePD + X),
_mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero));
  Ok, It's all done, There are still some details to deal with. Let's add them slowly.

   Function, wow, as long as37ms 了, Speed is fast.N times, But the result seems to be different from other ways, What's going on?.

   I've been looking for it for a long time, but I haven't found the problem, Then step by step testing, The final problem lies in16 Bit conversion to32 I'm going there.

  
usually, We all expand the byte data of pixels upward, They are all positive numbers, So useunpack Something like thatzero Put high8 Position or height16 The data of bits is filled with0 That's all right. But in this case,GX16 perhapsGY16 It's probably a negative number, And the highest bit of a negative number is the sign bit, If all are filled with0, It becomes a positive number, Obviously changed the original data, So we got the wrong result.

  
How to solve the problem, For this example, It's simple, Because there's only one square operation, therefore, YesGX Taking absolute value first will not change the result of calculation, So there's no negative data, After modification, It turned out to be right.

   You can continue to optimize, Let's look at the final calculationGX*GX+GY*GY Process, We know,SSE3 Provides a_mm_madd_epi16 instructions, Its role is:
r0 := (a0 * b0) + (a1 * b1) r1 := (a2 * b2) + (a3 * b3) r2 := (a4 * b4) + (a5 *
b5) r3 := (a6 * b6) + (a7 * b7)
      If we canGX andGY The other two data:

         GXYL   =   GX0  GY0  GX1  GY1  GX2  GY2  GX3  GY3

         GXYH   =   GX4  GY4  GX5  GY5  GX6  GY6  GX7  GY7

   So called_mm_madd_epi16(GXYL ,GXYL ) and_mm_madd_epi16(GXYH ,GXYH
) Is it possible to get the same result as before, And this spliceSSE There are ready-made functions:
__m128i GXYL = _mm_unpacklo_epi16(GX16, GY16); __m128i GXYH =
_mm_unpackhi_epi16(GX16, GY16);
   In this way, the original needs10 Instructions to4 Instruction, Code is simpler and faster.

   Try to modify it like this, The whole calculation process time is reduced to32ms About.

   in addition, The other thing that can be optimized is borrowing  _mm_maddubs_epi16   Function to add, subtract, multiply, divide and expand pixels.

   This function does the following:
r0 := SATURATE_16((a0 * b0) + (a1 * b1)) r1 := SATURATE_16((a2 * b2) + (a3 *
b3)) ... r7 := SATURATE_16((a14 * b14) + (a15 * b15))
   His first parameter is16 Unsigned byte data, The second parameter is16 Signedchar data.

   Coordinationunpack Using a technology like the one above, you can do it all at once16 The number of pixels in bytes has been reduced, This will probably speed up the whole process2ms, To the end30ms.

   Source code address:http://files.cnblogs.com/files/Imageshop/Sobel.rar
<http://files.cnblogs.com/files/Imageshop/Sobel.rar> ( Among themSSE Please sort out the code according to the idea of this article.)

  http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar
<http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar>
, Here is one I use allSSE Optimized image processingDemo, Interested friends can have a look.

       

   Welcome to praise and reward.