stay SSE Image algorithm optimization series 5 ： Implementation and optimization of super high speed exponential fuzzy algorithm （10000*10000 stay 100ms Left right implementation ）
<http://www.cnblogs.com/Imageshop/p/6849611.html>
In one article , I once said optimized ExpBlur than BoxBlur Faster , At that time, I compared BoxBlur The algorithm is based on integral graph +SSE Realized , I am here 09 This article was once provided on another blog account in
Lazy algorithm for high speed blur of color image <http://www.cnblogs.com/laviewpbt/archive/2009/07/23/1529250.html>
, It also introduces a fast image blur algorithm , The execution time of this algorithm is basically independent of the radius . In this year's SSE On the way of optimizing learning, I have also considered using this algorithm SSE realization , But at that time, it was thought that the algorithm was pixel by pixel and line by line dependent （ Simple pixel by pixel dependency algorithm I mentioned how to use in exponential blur SSE optimization ）, It can't be used SSE Processed , Never thought about it , Until recently, a friend proposed an algorithm based on local local variance, hoping to speed up , I was inspired again , Finally, the instruction set of this algorithm is implemented , And the test speed is twice faster than the integral diagram , than
analysis opencv in Box Filter And put forward the scheme of further acceleration （ Source sharing ）
<http://www.cnblogs.com/Imageshop/p/5053013.html>
This article is fast 3 times , than opencv Of cvSmooth Fast function 5 times , In an old one I3 Handle on notebook 3000*2000 The gray scale of 6ms Speed of , This paper shares the optimization process and provides the gray-scale version of the optimization code for you to learn and discuss .

Lazy algorithm for high speed blur in color image
<http://www.cnblogs.com/laviewpbt/archive/2009/07/23/1529250.html>
In the code attached （VB6 Code of ） It's about 24 Bitwise image , For the convenience of discussion , Let's post it first 8 Bitwise grayscale C++ Code of ：
1 /// <summary> 2 /// according to Tile Extend data by , Get the position in the original dimension after expansion , with 0 Is subscript . 3 /// </summary>
4 int IM_GetMirrorPos(int Length, int Pos) 5 { 6 if (Pos < 0) 7 return -Pos;
8 else if (Pos >= Length) 9 return Length + Length - Pos - 2; 10 else 11
return Pos; 12 } 13 14 void FillLeftAndRight_Mirror_C(int *Array, int Length,
int Radius) 15 { 16 for (int X = 0; X < Radius; X++) 17 { 18 Array[X] =
Length - X -2]; 20 } 21 } 22 23 int SumofArray_C(int *Array, int Length)
24 { 25 int Sum = 0; 26 for (int X = 0; X < Length; X++) 27 { 28 Sum +=
Array[X]; 29 } 30 return Sum; 31 } 32 33 /// <summary> 34 ///
Limit integer to byte data type . 35 /// </summary> 36 inline unsigned char IM_ClampToByte(int
Value)// modern PC It's better to write fast 37 { 38 if (Value < 0) 39 return 0; 40 else if
(Value >255) 41 return 255; 42 else 43 return (unsigned char)Value; 44 //
return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >>
31)); 45 } 46 47 int IM_BoxBlur_C(unsigned char *Src, unsigned char *Dest,
int Width, int Height, int Stride, int Radius) 48 { 49 int Channel = Stride /
Width; 50 if ((Src == NULL) || (Dest == NULL)) return IM_STATUS_NULLREFRENCE;
51 if ((Width <= 0) || (Height <= 0) || (Radius <= 0)) return
IM_STATUS_INVALIDPARAMETER; 52 if (Radius < 1) return
IM_STATUS_INVALIDPARAMETER; 53 if ((Channel != 1) && (Channel != 3) && (Channel
Width -1), Height - 1); // Due to the requirements of image , The required radius cannot be greater than the width or height -1 Data for 56 57 int SampleAmount
= (2 * Radius + 1) * (2 * Radius + 1); 58 float Inv = 1.0 / SampleAmount; 59
60 int *ColValue = (int *)malloc((Width + Radius + Radius) * (Channel == 1 ?
Channel :4) * sizeof(int)); 61 int *ColOffset = (int *)malloc((Height + Radius
+ Radius) *sizeof(int)); 62 if ((ColValue == NULL) || (ColOffset == NULL)) 63
{ 64 if (ColValue != NULL) free(ColValue); 65 if (ColOffset != NULL) free
(ColOffset); 66 return IM_STATUS_OUTOFMEMORY; 67 } 68 for (int Y = 0; Y <
Height + Radius + Radius; Y++) 69 ColOffset[Y] = IM_GetMirrorPos(Height, Y -
Radius); 70 71 if (Channel == 1) 72 { 73 for (int Y = 0; Y < Height; Y++)
74 { 75 unsigned char *LinePD = Dest + Y * Stride; 76 if (Y == 0) 77 { 78
memset(ColValue + Radius,0, Width * sizeof(int)); 79 for (int Z = -Radius; Z
<= Radius; Z++) 80 { 81 unsigned char *LinePS = Src + ColOffset[Z + Radius] *
Stride; 82 for (int X = 0; X < Width; X++) 83 { 84 ColValue[X + Radius] +=
LinePS[X];// Update column data 85 } 86 } 87 } 88 else 89 { 90 unsigned char
*RowMoveOut = Src + ColOffset[Y -1] * Stride; // First address of the line to be subtracted 91 unsigned char
for (int X = 0; X < Width; X++) 93 { 94 ColValue[X + Radius] -=
RowMoveOut[X] - RowMoveIn[X];// Update column data 95 } 96 } 97
FillLeftAndRight_Mirror_C(ColValue, Width, Radius);// Mirror fill left and right data 98 int LastSum
= SumofArray_C(ColValue, Radius *2 + 1); // Process the first data in each row 99 LinePD =
IM_ClampToByte(LastSum * Inv); 100 for (int X = 1; X < Width; X++) 101 { 102
int NewSum = LastSum - ColValue[X - 1] + ColValue[X + Radius + Radius]; 103
LinePD[X] = IM_ClampToByte(NewSum * Inv); 104 LastSum = NewSum; 105 } 106 }
107 } 108 else if (Channel == 3) 109 { 110 111 } 112 else if (Channel == 4)
113 { 114 115 } 116 free(ColValue); 117 free(ColOffset); 118 return
IM_STATUS_OK;119 }

Didn't realize it before , This is a simple code C It's also attractive to be able to generate speed after writing ,3000*2000 The chart of 39ms, If you check the compiler's “ Enable enhanced instruction set ： Flow processing
SIMD extend 2 (/arch:SSE2)”, The system will use the relevant SIMD Instruction Optimization , As shown in the figure below ：

This time 3000*2000 The chart of 25ms,, Basically close to what I've improved OPENCV Code speed of .

In a simple description, the function first .

IM_GetMirrorPos Function is mainly to get a certain position Pos When processing in a mirror like manner, the Length Coordinates of direction , here Pos Can be negative , This is mainly used to obtain the later coordinate offset .

FillLeftAndRight_Mirror_C It is mainly used to obtain image data on both sides （ Direct access , Do not call IM_GetMirrorPos function ）, For example Array The original data is
***abcdefgh*** (Length = 8, Radius = 3), Then the result is dcbabcdefghgfe.

SumofArray_C It mainly calculates the sum of all elements of an array .

IM_BoxBlur_C Inside the function is the fuzzy function body , Overall and arbitrary radius median filtering （ Extend to percentage filter ）O(1) The principle of time complexity algorithm , Realization and effect
<http://www.cnblogs.com/Imageshop/archive/2013/04/26/3045672.html>
Is consistent . When the radius is R Time , The value of the box blur is equal to a point centered , Left, right, up and down R Synthesis of all pixels in the range of pixels , The total number of pixels is （2*R+1）*（2*R+1） individual , Of course, we can also divide him into （2*R+1） column , Each column has （2*R+1） The optimization method of this example is to divide the accumulated data into columns , Make full use of repeated information to improve speed .
We define a Radius + Width +
Radius Memory data for ColValue, Used to save the accumulated data of column direction , For the first row of data , Special treatment required , That is, the sum of all elements in a row of pixels in the column direction is calculated in the original way ,

See 78 as for 86 Line code , Of course, only the middle is calculated here Width Data in scope , When it's not the first line , As shown on the left of the figure below , The new cumulative value only needs to subtract the pixel value of the line moved out and add the pixel value of the line moved in , See 90 reach 96
Line code .
It's just the middle Width Accumulated value in the range , In order to facilitate the subsequent code writing and use SSE optimization , Below FillLeftAndRight_Mirror_C The main function is to fill in the left and right data respectively , And it is filled in the way of image .
After updating the above accumulated value , We're working on the average , For the first point of each line ,SumofArray_C Before calculation 2*R +
1 Cumulative value of columns , This cumulative value is the periphery of this point （2*R+1）*（2*R+1） Cumulative sum of elements , For other pixels in a row , In fact, it is similar to the update of accumulated value of row and column , Subtract the removed column and add the new column , As shown on the right side of the figure below ,102 reach 104 Line implements the process .

That's basically how it works , This algorithm obviously takes up very little extra memory , But not supported In-Place operation .
For speed , We can use the whole process SSE Optimized places SSE optimization .
First of all, No 79 to 86 Row data , This is easy to optimize ： for (int Z = -Radius; Z <= Radius; Z++) { unsigned
char *LinePS = Src + ColOffset[Z + Radius] * Stride; int BlockSize = 8, Block =
Width / BlockSize; for (int X = 0; X < Block * BlockSize; X += BlockSize) { int
*DestP = ColValue + X + Radius; __m128i Sample =
)DestP), _mm_cvtepi16_epi32(Sample))); _mm_storeu_si128((__m128i*)(DestP + 4),
_mm_unpackhi_epi16(Sample, _mm_setzero_si128()))); } //　　 Processing surplus cannot be SSE Optimized data }

use _mm_loadl_epi64 load 8 Bytes of data to memory , Combined use _mm_cvtepu8_epi16 Convert it to 16 Bitwise __m128i variable , Lower the back 4 Bit and height 4 Bits of data are converted to 32 Bitwise , Then use
_mm_add_epi32 accumulation , Notice that later I used two different ways to convert functions , Because the data here is absolutely positive , So it can also be used _mm_cvtepi16_epi32 and
_mm_unpackhi_epi16 combination Zero realization .

Let's see 92 reach 95 Line code , This is very simple .
int BlockSize = 8, Block = Width / BlockSize; __m128i Zero =
_mm_setzero_si128();for (int X = 0; X < Block * BlockSize; X += BlockSize) { int
*DestP = ColValue + X + Radius; __m128i MoveOut =
_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(RowMoveOut + X)), Zero); __m128i
MoveIn= _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(RowMoveIn + X)), Zero);
__m128i Diff= _mm_sub_epi16(MoveIn, MoveOut); //
Notice that there are both negative and positive numbers , Convert to when there is a negative number 32 Bit is not available _mm_unpackxx_epi16 Functions of the system _mm_storeu_si128((__m128i
_mm_cvtepi16_epi32(Diff))); _mm_storeu_si128((__m128i*)(DestP + 4),
_mm_cvtepi16_epi32(_mm_srli_si128(Diff,8)))); } //　　 Processing surplus cannot be SSE Optimized data
This is also a one-time process 8 Pixels , Here I use another conversion technique to 8 Bit to 16 position , But the back Diff Because there are positive and negative , To convert to 32 Bit must be used
_mm_cvtepi16_epi32 To convert , out-of-service unpack Series of combined functions , because unpack Symbol bits will not be moved , I've suffered several losses here .

Then there was FillLeftAndRight_Mirror_C Function optimization , Rewrite as follows ：
void FillLeftAndRight_Mirror_SSE(int *Array, int Length, int Radius) { int
BlockSize =4, Block = Radius / BlockSize; for (int X = 0; X < Block *
BlockSize; X += BlockSize) { __m128i SrcV1 = _mm_loadu_si128((__m128i *)(Array
Radius + Length - X -5)); _mm_storeu_si128((__m128i *)(Array + X),
_mm_shuffle_epi32(SrcV1, _MM_SHUFFLE(0, 1, 2, 3))); _mm_storeu_si128((__m128i
*)(Array + Radius + Length + X), _mm_shuffle_epi32(SrcV2, _MM_SHUFFLE(0, 1, 2, 3
))); }//　　 Processing surplus cannot be SSE Optimized data }
Mirror image is the process of reverse order , Direct use SSE Of shuffle Functions are easy to implement .

It is also convenient to calculate the accumulation and optimization of arrays .
int SumofArray_SSE(int *Array, int Length) { int BlockSize = 8, Block = Length
/ BlockSize; __m128i Sum1 = _mm_setzero_si128(); __m128i Sum2 =
_mm_setzero_si128();for (int X = 0; X < Block * BlockSize; X += BlockSize) {
SumI32(_mm_add_epi32(Sum1, Sum2));//　　 Processing surplus cannot be SSE Optimized data return Sum; }
Use two __m128i Variables are mainly used to make full use of XMM register , among SumI32 The function is as follows , Mainly for calculation __m128i The accumulated value of the inner four integers .
// Convert 4 32-bit integer values to 4 unsigned char values.
void _mm_storesi128_4char(__m128i Src, unsigned char *Dest) { __m128i T =
_mm_packs_epi32(Src, Src); T= _mm_packus_epi16(T, T); *((int*)Dest) =
_mm_cvtsi128_si32(T); }
Code not interpreted .

In the end 100 reach 104 Line code conversion .
int BlockSize = 4, Block = (Width - 1) / BlockSize; __m128i OldSum =
_mm_set1_epi32(LastSum); __m128 Inv128= _mm_set1_ps(Inv); for (int X = 1; X <
Block * BlockSize +1; X += BlockSize) { __m128i ColValueOut =
_mm_loadu_si128((__m128i *)(ColValue + X -1)); __m128i ColValueIn =
ColValueDiff= _mm_sub_epi32(ColValueIn, ColValueOut); // P3 P2 P1 P0 __m128i
P3+P2 P2+P1 P1+P0 P0 __m128i Value = _mm_add_epi32(Value_Temp,
_mm_slli_si128(Value_Temp,8)); // P3+P2+P1+P0 P2+P1+P0 P1+P0 P0 __m128i NewSum =
3, 3, 3)); // Reassign to latest value __m128 Mean = _mm_mul_ps(_mm_cvtepi32_ps(NewSum),
Inv128); _mm_storesi128_4char(_mm_cvtps_epi32(Mean), LinePD+ X); }

Previously, this algorithm was thought to be difficult to use SSE optimization , The main difficulty is here , But learning Opencv Integral graph technology of , The problem is solved , Further analysis also found that Opencv The code of is not perfect , There are changes
Space in , See code above , The accumulation can be obtained by shifting the same data twice , from P3 P2 P1 P0 Change to P3+P2+P1+P0 P2+P1+P0 P1+P0
P0. Let's do a little analysis .

__m128i ColValueDiff = _mm_sub_epi32(ColValueIn, ColValueOut);
This code gets the difference between the move in and move out sequences , We plan to ;　　P3 P2 P1 P0 （ High to low ） Due to the additive characteristics of the algorithm , if OldSum The original value of is A3 A3 A3
A3, So new NewSum It should be ：

A3+P3+P2+P1+P0　　A3+P2+P1+P0　　A3+P1+P0　　A3+P0;
4)); In this sentence _mm_slli_si128 Get intermediate results 　P2 P1 P0 0, Reconnection P3 P2 P1 P0 Add to get 　　　　Value_Temp =
P3+P2 　　P2+P1　　P1+P0 　　P0 __m128i Value = _mm_add_epi32(Value_Temp,
_mm_slli_si128(Value_Temp, 8)); In this sentence _mm_slli_si128 Get intermediate results P1+P0 P0 0 0, Reconnection P3+P2
P2+P1　P1+P0 P0 Add to get ： 　　　　Value =　P3+P2+P1+P0 　　P2+P1+P0 　　P1+P0 　　P0
What a simple process . 　　OldSum = _mm_shuffle_epi32(NewSum, _MM_SHUFFLE(3, 3, 3, 3));
Why is this sentence written like this , I'm afraid it's better for the readers to experience it , It seems to be difficult to express in words .

the last one _mm_storesi128_4char It's my own definition of a will 1 individual __m128i Inside 4 individual 32 A function that converts bit integers to bytes and stores them in memory , See attachment code for details .

as for 24 Optimization of bit image , It's an awkward situation , because SSE One time treatment 4 individual 32 digit , and 24 Bits per pixel only 3 Components , This situation generally extends him to 4 Components , for instance ColValue Can be changed to 4 Channeled , Then the cumulative sum also needs to be treated as 4 Channeled , Of course, it needs some tricks , I don't want to talk about it here , Leave something for yourself . Of course, we can also consider 24 Bit decomposition to 3 Gray memory , Then the gray level algorithm is used to calculate , After that, we are synthesizing 24 position , This decomposition and synthesis process can also be used SSE Accelerated , If you combine multiple threads , You can also 3 Parallel processing of gray processes , In addition to taking up more memory , Speed ratio to secondary processing 24 Hurry up （ Because direct processing algorithms can't be parallel , The reason of dependence ）. In addition, in the end, the process of calculating the cumulative average of columns becomes more natural , There will be no grayscale __mm128i Internal accumulation process , It's two SSE Variable accumulation .

A little more , Now most of them CPU All support AVX256 了 , You can also use AVX Further acceleration , It seems that the code is not very difficult , Interested friends can try it on their own .

so to speak , At present, the speed has basically reached CPU The limit of , But it was tested IPP Speed of , Seems to be faster than this , I don't rule it out AVX, And it doesn't exclude him from using multi-core resources .

This optimization is for BoxBlur It's an important step , But more importantly, it can be used in many situations , For example, local mean square error calculation of image , Similar technology can also be used for acceleration , The large local square difference of two images can also be optimized in this way , I will talk about his accelerated application in the future .

<https://files.cnblogs.com/files/Imageshop/FastBlur.rar>

Color drawing engineering ：https://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar
<https://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar>

sorry , The picture is too small , The speed is 0ms了.