staySSE Image algorithm optimization series 5: Implementation and optimization of super high speed exponential fuzzy algorithm(10000*10000 stay100ms Implementation around)
  In one article, I once said optimizedExpBlur thanBoxBlur Faster, At that time, I comparedBoxBlur The algorithm is based on integral graph+SSE Realized, I am here09 This article was once provided on another blog account in
Lazy algorithm for high speed blur of color image <>
, It also introduces a fast image blur algorithm, The execution time of this algorithm is basically independent of the radius. In this year'sSSE On the way of optimizing learning, I have also considered using this algorithmSSE Realization, But at that time, it was thought that the algorithm was pixel by pixel and line by line dependent( Simple pixel by pixel dependency algorithm I mentioned how to use in exponential blurSSE optimization), It can't be used.SSE Processed, Never thought about it, Until recently, a friend proposed an algorithm based on local local variance, hoping to speed up, I was inspired again, Finally, the instruction set of this algorithm is implemented, And the test speed is twice faster than the integral diagram, than
analysisopencv inBox Filter And put forward the scheme of further acceleration( Source code sharing)
This article is fast3 times, thanopencv OfcvSmooth Fast function5 times, In an old oneI3 Handle on notebook3000*2000 The gray scale of6ms Speed, This paper shares the optimization process and provides the gray-scale version of the optimization code for you to learn and discuss.

   Lazy algorithm for high speed blur in color image
In the code attached(VB6 Code) Aimed at24 Bit image, For the convenience of discussion, Let's post it first8 Bit grayscaleC++ Code:
1 /// <summary> 2 /// according toTile Extend data by, Get the position in the original dimension after expansion, with0 Subscript. 3 /// </summary>
4 int IM_GetMirrorPos(int Length, int Pos) 5 { 6 if (Pos < 0) 7 return -Pos;
8 else if (Pos >= Length) 9 return Length + Length - Pos - 2; 10 else 11
return Pos; 12 } 13 14 void FillLeftAndRight_Mirror_C(int *Array, int Length,
int Radius) 15 { 16 for (int X = 0; X < Radius; X++) 17 { 18 Array[X] =
Array[Radius + Radius - X]; 19 Array[Radius + Length + X] = Array[Radius +
Length - X -2]; 20 } 21 } 22 23 int SumofArray_C(int *Array, int Length)
24 { 25 int Sum = 0; 26 for (int X = 0; X < Length; X++) 27 { 28 Sum +=
Array[X]; 29 } 30 return Sum; 31 } 32 33 /// <summary> 34 ///
Limit integer to byte data type. 35 /// </summary> 36 inline unsigned char IM_ClampToByte(int
Value)// modernPC It's better to write fast 37 { 38 if (Value < 0) 39 return 0; 40 else if
(Value >255) 41 return 255; 42 else 43 return (unsigned char)Value; 44 //
return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >>
31)); 45 } 46 47 int IM_BoxBlur_C(unsigned char *Src, unsigned char *Dest,
int Width, int Height, int Stride, int Radius) 48 { 49 int Channel = Stride /
Width; 50 if ((Src == NULL) || (Dest == NULL)) return IM_STATUS_NULLREFRENCE;
51 if ((Width <= 0) || (Height <= 0) || (Radius <= 0)) return
IM_STATUS_INVALIDPARAMETER; 52 if (Radius < 1) return
IM_STATUS_INVALIDPARAMETER; 53 if ((Channel != 1) && (Channel != 3) && (Channel
!=4)) return IM_STATUS_NOTSUPPORTED; 54 55 Radius = IM_Min(IM_Min(Radius,
Width -1), Height - 1); // Due to the requirements of image, Required radius cannot be greater than width or height-1 Data 56 57 int SampleAmount
= (2 * Radius + 1) * (2 * Radius + 1); 58 float Inv = 1.0 / SampleAmount; 59
60 int *ColValue = (int *)malloc((Width + Radius + Radius) * (Channel == 1 ?
Channel :4) * sizeof(int)); 61 int *ColOffset = (int *)malloc((Height + Radius
+ Radius) *sizeof(int)); 62 if ((ColValue == NULL) || (ColOffset == NULL)) 63
{ 64 if (ColValue != NULL) free(ColValue); 65 if (ColOffset != NULL) free
(ColOffset); 66 return IM_STATUS_OUTOFMEMORY; 67 } 68 for (int Y = 0; Y <
Height + Radius + Radius; Y++) 69 ColOffset[Y] = IM_GetMirrorPos(Height, Y -
Radius); 70 71 if (Channel == 1) 72 { 73 for (int Y = 0; Y < Height; Y++)
74 { 75 unsigned char *LinePD = Dest + Y * Stride; 76 if (Y == 0) 77 { 78
memset(ColValue + Radius,0, Width * sizeof(int)); 79 for (int Z = -Radius; Z
<= Radius; Z++) 80 { 81 unsigned char *LinePS = Src + ColOffset[Z + Radius] *
Stride; 82 for (int X = 0; X < Width; X++) 83 { 84 ColValue[X + Radius] +=
LinePS[X];// Update column data 85 } 86 } 87 } 88 else 89 { 90 unsigned char
*RowMoveOut = Src + ColOffset[Y -1] * Stride; // First address of the line to be subtracted 91 unsigned char
*RowMoveIn = Src + ColOffset[Y + Radius + Radius] * Stride;// The first address of the line to be added 92
for (int X = 0; X < Width; X++) 93 { 94 ColValue[X + Radius] -=
RowMoveOut[X] - RowMoveIn[X];// Update column data 95 } 96 } 97
FillLeftAndRight_Mirror_C(ColValue, Width, Radius);// Mirror fill left and right data 98 int LastSum
= SumofArray_C(ColValue, Radius *2 + 1); // Process the first data in each row 99 LinePD[0] =
IM_ClampToByte(LastSum * Inv); 100 for (int X = 1; X < Width; X++) 101 { 102
int NewSum = LastSum - ColValue[X - 1] + ColValue[X + Radius + Radius]; 103
LinePD[X] = IM_ClampToByte(NewSum * Inv); 104 LastSum = NewSum; 105 } 106 }
107 } 108 else if (Channel == 3) 109 { 110 111 } 112 else if (Channel == 4)
113 { 114 115 } 116 free(ColValue); 117 free(ColOffset); 118 return
Didn't realize it before, This is a simple codeC It's also attractive to be able to generate speed after writing,3000*2000 The chart of39ms, If you check the compiler's“ Enable enhanced instruction set: Streaming processing
SIMD extend 2 (/arch:SSE2)”, The system will use the relevantSIMD Instruction Optimization, As shown in the figure below:


   This time3000*2000 The chart of25ms,, Basically close to what I've improvedOPENCV Code speed of.

   In a simple description, the function first.

  IM_GetMirrorPos Function is mainly to get a certain positionPos When processing in a mirror like manner, theLength Coordinates of direction, HerePos Can be negative, This is mainly used to obtain the later coordinate offset.      

  FillLeftAndRight_Mirror_C It is mainly used to obtain image data on both sides( Direct access, No callIM_GetMirrorPos function), For example, for exampleArray The original data is
***abcdefgh*** (Length = 8, Radius = 3), The result isdcbabcdefghgfe.

  SumofArray_C It mainly calculates the sum of all elements of an array.

  IM_BoxBlur_C Inside the function is the fuzzy function body, Overall and arbitrary radius median filtering( Extend to percentage filter)O(1) The principle of time complexity algorithm, Realization and effect
It is the same.. When the radius isR Time, The value of the box blur is equal to a point centered, Left, right, up and downR Synthesis of all pixels in the range of pixels, The total number of pixels is(2*R+1)*(2*R+1) individual, Of course, we can also divide him into(2*R+1) column, Each column has(2*R+1) The optimization method of this example is to divide the accumulated data into columns, Make full use of repeated information to improve speed.
  We define aRadius + Width +
Radius Memory data forColValue, Used to save the accumulated data of column direction, For the first row of data, Special treatment required, That is, the sum of all elements in a row of pixels in the column direction is calculated in the original way,

See for details78 As for86 Line code, Of course, only the middle is calculated hereWidth Data in scope, When it's not the first line, As shown on the left of the figure below, The new cumulative value only needs to subtract the pixel value of the line moved out and add the pixel value of the line moved in, See for details90 reach96
Line code.
   It's just the middleWidth Accumulated value in the range, In order to facilitate the subsequent code writing and useSSE optimization, UnderneathFillLeftAndRight_Mirror_C The main function is to fill in the left and right data respectively, And it is filled in the way of image.
        After updating the above accumulated value, We're working on the average, For the first point of each line,SumofArray_C Calculated before2*R +
1 Cumulative value of columns, This cumulative value is the periphery of this point(2*R+1)*(2*R+1) Cumulative sum of elements, For other pixels in a row, In fact, it is similar to the update of accumulated value of row and column, Subtract the removed column and add the new column, As shown on the right side of the figure below,102 reach104 Line implements the process.

   That's basically how it works, This algorithm obviously takes up very little extra memory, But not supportedIn-Place operation.
   In pursuit of speed, We can use the whole processSSE Optimized placesSSE optimization.
   The first is the first.79 to86 Row data, This is easy to optimize: for (int Z = -Radius; Z <= Radius; Z++) { unsigned
char *LinePS = Src + ColOffset[Z + Radius] * Stride; int BlockSize = 8, Block =
Width / BlockSize; for (int X = 0; X < Block * BlockSize; X += BlockSize) { int
*DestP = ColValue + X + Radius; __m128i Sample =
_mm_cvtepu8_epi16(_mm_loadl_epi64((__m128i *)(LinePS + X)));
_mm_storeu_si128((__m128i*)DestP, _mm_add_epi32(_mm_loadu_si128((__m128i *
)DestP), _mm_cvtepi16_epi32(Sample))); _mm_storeu_si128((__m128i*)(DestP + 4),
_mm_add_epi32(_mm_loadu_si128((__m128i *)(DestP +4)),
_mm_unpackhi_epi16(Sample, _mm_setzero_si128()))); } //   Processing surplus cannot beSSE Optimized data }

   use_mm_loadl_epi64 Load8 Bytes of data to memory, And use_mm_cvtepu8_epi16 Convert it to16 Bit__m128i variable, Lower the back4 Bit and height4 Bits of data are converted to32 Bit, Then use
_mm_add_epi32 accumulation, Notice that later I used two different ways to convert functions, Because the data here is absolutely positive, So it can also be used_mm_cvtepi16_epi32 and
_mm_unpackhi_epi16 combinationZero Realization.

   Look again.92 reach95 Line code, This is very simple.
int BlockSize = 8, Block = Width / BlockSize; __m128i Zero =
_mm_setzero_si128();for (int X = 0; X < Block * BlockSize; X += BlockSize) { int
*DestP = ColValue + X + Radius; __m128i MoveOut =
_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(RowMoveOut + X)), Zero); __m128i
MoveIn= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(RowMoveIn + X)), Zero);
__m128i Diff= _mm_sub_epi16(MoveIn, MoveOut); //
Notice that there are both negative and positive numbers, Convert to when there is a negative number32 Bit is not available_mm_unpackxx_epi16 Functions of the system _mm_storeu_si128((__m128i
*)DestP, _mm_add_epi32(_mm_loadu_si128((__m128i *)DestP),
_mm_cvtepi16_epi32(Diff))); _mm_storeu_si128((__m128i*)(DestP + 4),
_mm_add_epi32(_mm_loadu_si128((__m128i *)(DestP +4)),
_mm_cvtepi16_epi32(_mm_srli_si128(Diff,8)))); } //   Processing surplus cannot beSSE Optimized data
   This is also a one-time process8 Pixel, Here I use another conversion technique to8 Bit conversion to16 position, But the backDiff Because there are positive and negative, To convert to32 Bit must be used
_mm_cvtepi16_epi32 Conversion, Out-of-serviceunpack Series of combined functions, becauseunpack Symbol bits will not be moved, I've suffered several losses here.

   Next isFillLeftAndRight_Mirror_C Function optimization, Rewrite as follows:
void FillLeftAndRight_Mirror_SSE(int *Array, int Length, int Radius) { int
BlockSize =4, Block = Radius / BlockSize; for (int X = 0; X < Block *
BlockSize; X += BlockSize) { __m128i SrcV1 = _mm_loadu_si128((__m128i *)(Array
+ Radius + Radius - X -3)); __m128i SrcV2 = _mm_loadu_si128((__m128i *)(Array +
Radius + Length - X -5)); _mm_storeu_si128((__m128i *)(Array + X),
_mm_shuffle_epi32(SrcV1, _MM_SHUFFLE(0, 1, 2, 3))); _mm_storeu_si128((__m128i
*)(Array + Radius + Length + X), _mm_shuffle_epi32(SrcV2, _MM_SHUFFLE(0, 1, 2, 3
))); }//   Processing surplus cannot beSSE Optimized data }
   Mirror image is the process of reverse order, Direct useSSE Ofshuffle Functions are easy to implement.

   It is also convenient to calculate the accumulation and optimization of arrays.
int SumofArray_SSE(int *Array, int Length) { int BlockSize = 8, Block = Length
/ BlockSize; __m128i Sum1 = _mm_setzero_si128(); __m128i Sum2 =
_mm_setzero_si128();for (int X = 0; X < Block * BlockSize; X += BlockSize) {
Sum1= _mm_add_epi32(Sum1, _mm_loadu_si128((__m128i *)(Array + X + 0))); Sum2 =
_mm_add_epi32(Sum2, _mm_loadu_si128((__m128i *)(Array + X +4))); } int Sum =
SumI32(_mm_add_epi32(Sum1, Sum2));//   Processing surplus cannot beSSE Optimized data return Sum; }
   Use two__m128i Variables are mainly used to make full use ofXMM register, amongSumI32 The function is as follows, Mainly for calculation__m128i The accumulated value of the inner four integers.
// Convert 4 32-bit integer values to 4 unsigned char values.
void _mm_storesi128_4char(__m128i Src, unsigned char *Dest) { __m128i T =
_mm_packs_epi32(Src, Src); T= _mm_packus_epi16(T, T); *((int*)Dest) =
_mm_cvtsi128_si32(T); }
   Code not interpreted.

   The last is100 reach104 Line code conversion.
int BlockSize = 4, Block = (Width - 1) / BlockSize; __m128i OldSum =
_mm_set1_epi32(LastSum); __m128 Inv128= _mm_set1_ps(Inv); for (int X = 1; X <
Block * BlockSize +1; X += BlockSize) { __m128i ColValueOut =
_mm_loadu_si128((__m128i *)(ColValue + X -1)); __m128i ColValueIn =
_mm_loadu_si128((__m128i *)(ColValue + X + Radius + Radius)); __m128i
ColValueDiff= _mm_sub_epi32(ColValueIn, ColValueOut); // P3 P2 P1 P0 __m128i
Value_Temp = _mm_add_epi32(ColValueDiff, _mm_slli_si128(ColValueDiff,4)); //
P3+P2 P2+P1 P1+P0 P0 __m128i Value = _mm_add_epi32(Value_Temp,
_mm_slli_si128(Value_Temp,8)); // P3+P2+P1+P0 P2+P1+P0 P1+P0 P0 __m128i NewSum =
_mm_add_epi32(OldSum, Value); OldSum= _mm_shuffle_epi32(NewSum, _MM_SHUFFLE(3,
3, 3, 3)); // Reassign to latest value __m128 Mean = _mm_mul_ps(_mm_cvtepi32_ps(NewSum),
Inv128); _mm_storesi128_4char(_mm_cvtps_epi32(Mean), LinePD+ X); }

   Previously, this algorithm was thought to be difficult to useSSE optimization, The main difficulty is here, But learningOpencv Integral graph technology of, The problem is solved, Further analysis also found thatOpencv The code of is not perfect, There are still changes.
Space for entry, See code above, The accumulation can be obtained by shifting the same data twice, fromP3 P2 P1 P0 Turn intoP3+P2+P1+P0 P2+P1+P0 P1+P0
P0. Let's do a little analysis.             

  __m128i ColValueDiff = _mm_sub_epi32(ColValueIn, ColValueOut);
This code gets the difference between the move in and move out sequences, We plan to;  P3 P2 P1 P0 ( High to low) Due to the additive characteristics of the algorithm, IfOldSum The original value of isA3 A3 A3
A3, So newNewSum It should be:

    A3+P3+P2+P1+P0  A3+P2+P1+P0  A3+P1+P0  A3+P0;
__m128i Value_Temp = _mm_add_epi32(ColValueDiff, _mm_slli_si128(ColValueDiff,
4)); This sentence_mm_slli_si128 Get intermediate results P2 P1 P0 0, Re joinP3 P2 P1 P0 Additivity     Value_Temp =  
P3+P2   P2+P1  P1+P0   P0 __m128i Value = _mm_add_epi32(Value_Temp,
_mm_slli_si128(Value_Temp, 8)); This sentence_mm_slli_si128 Get intermediate resultsP1+P0 P0 0 0, Re joinP3+P2
P2+P1 P1+P0 P0 Additivity:     Value = P3+P2+P1+P0   P2+P1+P0   P1+P0   P0
What a simple process.   OldSum = _mm_shuffle_epi32(NewSum, _MM_SHUFFLE(3, 3, 3, 3));
Why is this sentence written like this, I'm afraid it's better for the readers to experience it, It seems to be difficult to express in words.

    The last one_mm_storesi128_4char It's my own definition of a will1 individual__m128i Inside4 individual32 A function that converts bit integers to bytes and stores them in memory, See attachment code for details.

   As for24 Optimization of bit image, It's an awkward situation, becauseSSE One time treatment4 individual32 digit, and24 Bits per pixel only3 Each component, This situation generally extends him to4 Each component, For instanceColValue Can be changed to4 Passageway, Then the cumulative sum also needs to be treated as4 Passageway, Of course, it needs some tricks, I don't want to talk about it here, Leave something for yourself. Of course, we can also consider24 Bit decomposition to3 Gray memory, Then the gray level algorithm is used to calculate, After that, we are synthesizing24 position, This decomposition and synthesis process can also be usedSSE Accelerated, If you combine multiple threads, Can also put3 Parallel processing of gray processes, In addition to taking up more memory, Speed ratio to secondary processing24 Place fast( Because direct processing algorithms can't be parallel, The reason of dependence). In addition, in the end, the process of calculating the cumulative average of columns becomes more natural, There will be no grayscale__mm128i Internal accumulation process, It's twoSSE Variable accumulation.

   A little more. Now most of themCPU All supportAVX256 了, You can also useAVX Further acceleration, It seems that the code is not very difficult, Interested friends can try it on their own.

   So to speak, At present, the speed has basically reachedCPU The limit. But it was testedIPP Speed, Seems to be faster than this, I don't rule it outAVX, And it doesn't exclude him from using multi-core resources.

   This optimization is forBoxBlur It's an important step, But more importantly, it can be used in many situations, For example, local mean square error calculation of image, Similar technology can also be used for acceleration, The large local square difference of two images can also be optimized in this way, I will talk about his accelerated application in the future.

   Source code download:

   Color drawing engineering:


   Excuse me? The graph is too small. Speed is0ms了.