-------------------------------------------------------------------------------
                       Matrox Imaging Library (7.0)
                           SSE.txt Readme File
                              June 28, 2001
   Copyright © 2000 by Matrox Electronic Systems Ltd. All rights reserved.
-------------------------------------------------------------------------------

The following file contains a list of all the functions that have been
optimized with SSE code. A supplementary section also suggests the data
alignment required to obtain the best performance with SSE when a buffer is
created with the MbufCreate2d()/MbufCreateColor() function. Another section
indicates how to enable/disable the use of SSE optimization by MIL.

Contents

1. Image processing commands.
2. Buffer management commands.
3. Measurements commands.
4. Pattern matching commands.
5. Blob analysis commands.
6. Graphics commands.
7. Data alignment.
8. Known difference with FPU equivalent instruction.

-------------------------------------------------------------------------------
Symbols used in the file
-------------------------------------------------------------------------------

Buffers:
   Dst : Destination
   Src : Source
   Cnd : Condition

Data type
   UChar : unsigned char
   Char  : signed char
   UShort: unsigned short
   Short : signed short
   ULong : unsigned long
   Long  : signed long
   Float : float
   Bin   : binary

"All the buffer bit and sign" means all buffers except floating-point buffers.

*******************************************************************************
1. Image processing commands.
*******************************************************************************

1.1 MimArith().

   1.1.1 Optimized versions:

      Dst     Src1    Src2
      ------  ------  ------
      UChar   UChar   UChar
      Char    Char    Char
      UChar   UShort  UChar
      Char    Short   Char
      UChar   UChar   UShort
      Char    Char    Short
      UShort  UChar   UChar
      Short   Char    Char
      UShort  UShort  UChar
      Short   Short   Char
      UShort  UChar   UShort
      Short   Char    Short
      UShort  UShort  UShort
      Short   Short   Short
      ULong   ULong   ULong
      Long    Long    Long

   1.1.2 Notes.

      -Saturation is not optimized for 32-bit src buffers.
      -M_MULT / M_MULT_CONST is not optimized for 32-bit src buffers.

   1.1.3 Options optimized.

      1.1.3.1 Operations where saturation is optimized.

         M_ADD_CONST
         M_ADD
         M_SUB_CONST
         M_SUB
         M_CONST_SUB
         M_MULT
         M_MULT_CONST

      1.1.3.2 Operations without saturation (optimized).

         M_AND          M_AND_CONST
         M_OR           M_OR_CONST
         M_XOR          M_XOR_CONST
         M_NAND         M_NAND_CONST
         M_NOR          M_NOR_CONST
         M_XNOR         M_XNOR_CONST
         M_MAX          M_MAX_CONST
         M_MIN          M_MIN_CONST
         M_NOT
         M_NEG
         M_ABS
         M_MULT
         M_SUB_ABS

      1.1.3.3 Operations not optimized.

         M_PASS
         M_CONST_PASS
         M_DIV
         M_DIV+M_FIXED_POINT
         M_CONST_DIV
         M_CONST_DIV+M_FIXED_POINT
         M_DIV_CONST+M_FIXED_POINT

      1.1.3.4 Particular cases optimized.

         M_DIV_CONST (Only for positive constant and cases where the source
                      is 8-bit signed or unsigned.)

      1.1.3.5 Mixed sign cases optimized.

         - Logical ( M_AND, M_NOR, etc. ) operations on mixed types are
           optimized when the size of all buffers is the same and with mix
           of 8 and 16-bit buffer when the destination is 8-bit.

1.2 MimMorphic().

   M_ERODE, M_DILATE, M_THIN, M_THICK, M_HIT_OR_MISS, M_MATCH:

      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized
         with SSE.
         EXCEPTION: 32-bit buffer is not optimized.

      M_BINARY operation:
         Not optimized with SSE.

1.3 MimThin().

      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized
         with SSE.
         EXCEPTION: 32-bit buffer is not optimized.

      M_BINARY operation:
         Not optimized with SSE.

1.4 MimThick().

      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized
         with SSE.
         EXCEPTION: 32-bit buffer is not optimized.

      M_BINARY operation:
         Not optimized with SSE.

1.5 MimConvert().

   RGB_TO_L, BGR_TO_YUV16, RGB_TO_YUV16, RGB_TO_YUV24, YUV16_TO_BGR are
   optimized with SSE. No significant gains were found in other cases.

*******************************************************************************
2. Buffer management commands.
*******************************************************************************

2.1 MbufCopy()

   The following versions of the function are optimized with SSE:

      Dst     Src
      ------  ------
      (**via MilMemCopy**)
      Char    Char
      UChar   UChar
      Short   Short
      UShort  UShort
      Long    Long
      ULong   ULong
      Float   Float
      (**via SSE_Copy**)
      UChar   Float
      Char    Float
      UShort  Float
      Short   Float
      ULong   Float
      Long    Float
      Float   UChar
      Float   Char
      Float   UShort
      Float   Short
      Float   ULong
      Float   Long

   Note that in the case of float to long or float to unsigned long, the
   result can differ if the value to convert is outside the range of a long
   or unsigned long, respectively. This is because a cast in C++ converts the
   value to an __int64 before it is copied to the destination. In the other
   cases, the result can differ if the value to convert is outside the range
   of a long, even if you have an unsigned destination. This is done to
   optimize the speed.

   For copies where Src and Dst have the same type, MbufCopy() is optimized
   only for buffers larger than 125 KB (smaller buffers gain no speed
   advantage). The buffers must also have a pitch, in bytes, aligned on
   16 bytes.

   The call to MilMemCopy is either done from the Ho level or via the
   DataExchange (some "conversions" having been integrated into the
   DataExchange, like 1-band 8-bit to 1-band 8-bit (CopyMONO8inMONO8 in
   rgb.cpp)).

2.2 MbufBayer().

   Only the following versions of the function are optimized with SSE:

      Dst     Src
      ------  ------
      UChar   UChar
      UShort  UShort

      Src               Dst               Restriction
      ------            ------            -----------
      M_MONO8 (Bayer)   M_MONO8           DstSizeX > 10, DstSizeY > 3,
                                          SrcSizeX > 10, SrcSizeY > 3
      M_MONO8 (Bayer)   M_RGB24+M_PLANAR  DstSizeX > 10, DstSizeY > 3,
                                          SrcSizeX > 10, SrcSizeY > 3
      M_MONO8 (Bayer)   M_BGR32+M_PACKED  DstSizeX > 10, DstSizeY > 3,
                                          SrcSizeX > 10, SrcSizeY > 3
      M_MONO16 (Bayer)  M_RGB48+M_PLANAR  DstSizeX > 6, DstSizeY > 3,
                                          SrcSizeX > 6, SrcSizeY > 3
      M_MONO8 (Bayer)   M_YUV16_YUYV      See table below.

   If the Src is M_MONO8 (Bayer) and the Dst is M_YUV16_YUYV, the
   restrictions depend on the SizeX and the AncestorOffsetX as follows:

      AncestorOffsetX  SizeX  Restriction
      ---------------  -----  -----------
      Odd              Even   DstSizeX > 10, DstSizeY > 3,
                              SrcSizeX > 10, SrcSizeY > 3
      Odd              Odd    DstSizeX > 11, DstSizeY > 3,
                              SrcSizeX > 11, SrcSizeY > 3
      Even             Even   DstSizeX > 12, DstSizeY > 3,
                              SrcSizeX > 12, SrcSizeY > 3
      Even             Odd    DstSizeX > 11, DstSizeY > 3,
                              SrcSizeX > 11, SrcSizeY > 3

   Finally, the white balance coefficients must respect the following
   restrictions for the SSE-optimized version of the function to be used:

      Dst               Coefficient #0  Coefficient #1  Coefficient #2
      ------            --------------  --------------  --------------
      M_MONO8           < 64            Don't care      Don't care
      M_RGB24+M_PLANAR  < 64            < 64            < 64
      M_BGR32+M_PACKED  < 64            < 64            < 64
      M_RGB48+M_PLANAR  < 64            < 64            < 64
      M_YUV16_YUYV      < 64            Don't care      Don't care

   For other conversions, see MbufCopy() and MimConvert().

*******************************************************************************
3. Measurements commands.
*******************************************************************************

*******************************************************************************
4. Pattern matching commands.
*******************************************************************************

*******************************************************************************
5. Blob analysis commands.
*******************************************************************************

*******************************************************************************
6. Graphics commands.
*******************************************************************************

*******************************************************************************
7. Data alignment.
*******************************************************************************

When a MIL buffer is created using MbufCreate2d()/MbufCreateColor(), its image
row data (scanline) should be aligned on 32-byte boundaries to give the best
performance in conjunction with the SSE-enabled functions. When it is not
possible to align on 32-byte boundaries, then the buffer should at least be
aligned on xmmword (128-bit) or doubleword (32-bit) boundaries.

Note that, by using the MbufAlloc2d()/MbufAllocColor() function, you don't
have to worry about data alignment since in this case, MIL automatically
allocates the buffer with the proper alignment.

Moreover, 32 extra readable bytes should be available at the beginning and at
the end of the buffer so that the SSE-enabled algorithms can perform
prefetching. The performance could decrease dramatically if those extra bytes
are not available. When they are available, the define M_SSE_ENABLED must be
added to the attribute parameter at buffer creation time
(MbufCreate2d()/MbufCreateColor()) so that the SSE-enabled algorithms know
that prefetching can be performed on them.

It is also possible to set this flag after buffer creation time using the
MbufControl(...M_FORMAT...) command. In that case, use the following syntax:

   MbufControl(MilImage, M_FORMAT,
               M_SSE_ENABLED|MbufInquire(MilImage, M_FORMAT, NULL));

(Note that this control is usually reserved for internal use only and thus
does not appear in the official documentation)

*******************************************************************************
8. Known difference with FPU equivalent instruction.
*******************************************************************************

We have noted some differences in the conversion instructions from float to
int. These arise because the conversion function in MSDEV performs the
conversion in an __int64 before copying it into the long.
The SSE instructions, however, perform the conversion directly on 32 bits.
The result is exactly the same for values that fit in the range of a long,
but differs for values that overflow that range.
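
The difference described in this section can be illustrated with a small
portable C sketch. It is not MIL code and does not use SSE instructions; it
only models the two behaviors. The function names convert_via_int64 and
convert_direct_32 are hypothetical. The first assumes that "copying it in the
long" keeps the low 32 bits of the 64-bit intermediate; the second assumes the
documented behavior of the SSE scalar conversion instructions (for example
CVTTSS2SI), which return 0x80000000 when the input does not fit in 32 bits.
The actual MSDEV run-time and MIL code paths may differ in detail.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the FPU/compiler path: convert to a 64-bit integer
   first, then keep only the low 32 bits. Assumes the input fits in an
   int64_t (larger inputs are outside the scope of this sketch). */
static int32_t convert_via_int64(float value)
{
    int64_t  wide = (int64_t)value;         /* truncates toward zero        */
    uint32_t low  = (uint32_t)wide;         /* reduction modulo 2^32        */
    int32_t  result;
    memcpy(&result, &low, sizeof result);   /* reinterpret the 32-bit bits  */
    return result;
}

/* Hypothetical model of a direct 32-bit SSE conversion: inputs outside the
   signed 32-bit range (and NaN) yield 0x80000000, i.e. INT32_MIN. */
static int32_t convert_direct_32(float value)
{
    if (!(value >= -2147483648.0f && value < 2147483648.0f))
        return INT32_MIN;                   /* "integer indefinite" value   */
    return (int32_t)value;                  /* truncates toward zero        */
}
```

With an input of 1234.75f, both functions return 1234. With an input of
3000000000.0f, which overflows a long, convert_via_int64 returns -1294967296
while convert_direct_32 returns -2147483648 (0x80000000).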