-------------------------------------------------------------------------------
                       Matrox Imaging Library (7.0)
                           SSE2.txt Readme File
                             August 21, 2001
  Copyright © 2001 by Matrox Electronic Systems Ltd. All rights reserved.
-------------------------------------------------------------------------------

This file lists all the functions that have been optimized with SSE2 code.
A supplementary section also suggests the data alignment required to obtain
the best performance with SSE2 when a buffer is created with the
MbufCreate2d()/MbufCreateColor() function. Another section indicates how to
enable/disable the use of SSE2 optimization by MIL.

Contents

1. Image processing commands.
2. Buffer management commands.
3. Measurements commands.
4. Pattern matching commands.
5. Blob analysis commands.
6. Graphics commands.
7. Data alignment.
8. Known difference with FPU equivalent instruction.

-------------------------------------------------------------------------------
Symbols used in the file
-------------------------------------------------------------------------------

Buffers:
   Dst : Destination
   Src : Source
   Cnd : Condition

Data type
   UChar : unsigned char
   Char  : signed char
   UShort: unsigned short
   Short : signed short
   ULong : unsigned long
   Long  : signed long
   Float : float
   Bin   : binary

   "All the buffer bit and sign" means all buffer types except
   floating-point buffers.

*******************************************************************************
1. Image processing commands.
*******************************************************************************

1.1 MimConvolve().
   1.1.1 Optimized versions:

         Dst     Src     Kernel
         ------  ------  ------
         Char    Char    Char    (*) (128, -128, 128, -128)
         Char    Char    UChar   (*) (256,     , 256,     )
         Char    UChar   Char    (*) (128, -128, 128, -128)
         Char    UChar   UChar   (*) (256,     , 128,     )
         UChar   Char    Char    (*) (128, -128, 128, -128)
         UChar   Char    UChar   (*) (256,     , 256,     )
         UChar   UChar   Char    (*) (128, -128, 128, -128)
         UChar   UChar   UChar   (*) (256,     , 128,     )

         (*) For these versions, the sums of the kernel values are checked
             against the limits given in parentheses: each sum of positive
             values must be less than or equal to its limit, and each sum of
             negative values must be greater than or equal to its limit.
             The first value is the sum of the positive values in the
             kernel, the second is the sum of the negative values in the
             kernel, the third is the sum of the positive values divided by
             the normalization factor, and the fourth is the sum of the
             negative values divided by the normalization factor.

             If these conditions are met, the MMX version with a 16-bit
             accumulator is called. If they are not met and the number of
             elements in the kernel is smaller than 32025, the MMX function
             with a 32-bit accumulator is called. If the number of elements
             in the kernel is greater than or equal to 32025, the non-MMX
             version is called.

             The internal accumulator contains the sum of the products of
             kernel elements by image values before normalization.

   1.1.2 Additional restriction:

         Src and Dst buffer pitchbytes must be multiples of 16.

*******************************************************************************
2. Buffer management commands.
*******************************************************************************

*******************************************************************************
3. Measurements commands.
*******************************************************************************

*******************************************************************************
4. Pattern matching commands.
*******************************************************************************

*******************************************************************************
5. Blob analysis commands.
*******************************************************************************

*******************************************************************************
6. Graphics commands.
*******************************************************************************

*******************************************************************************
7. Data alignment.
*******************************************************************************

When a MIL buffer is created using MbufCreate2d()/MbufCreateColor(), its image
row data (scanline) should be aligned on 32-byte boundaries to give the best
performance with the SSE2-enabled functions. When it is not possible to align
on 32-byte boundaries, the buffer should at least be aligned on xmmword
(128-bit) or doubleword (32-bit) boundaries. Note that if you use the
MbufAlloc2d()/MbufAllocColor() function, you do not have to worry about data
alignment, since MIL then automatically allocates the buffer with the proper
alignment.

Moreover, 32 extra bytes should be readable before the beginning and after
the end of the buffer so that the MMX-enabled algorithms can perform
prefetching. Performance could decrease dramatically if those extra bytes
are not available. When they are available, the M_SSE2_ENABLED define must be
added to the attribute parameter at buffer creation time
(MbufCreate2d()/MbufCreateColor()) so that the SSE2-enabled algorithms know
that prefetching can be performed on the buffer.

It is also possible to set this flag after buffer creation time using the
MbufControl(...M_FORMAT...) command.
In that case, use the following syntax:

   MbufControl(MilImage, M_FORMAT,
               M_SSE2_ENABLED|MbufInquire(MilImage, M_FORMAT, NULL));

(Note that this control is usually reserved for internal use only and thus
does not appear in the official documentation.)

*******************************************************************************
8. Known difference with FPU equivalent instruction.
*******************************************************************************

8.1 We have noted some differences in the float-to-int conversion
    instructions. These arise because the conversion function in MSDEV
    converts the value to an __int64 before copying it into the long. The
    SSE2 instructions, however, convert directly to 32 bits. Both methods
    give exactly the same result for values that fit in the range of a long,
    but not for values that overflow the range of a long.
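
    The difference described in 8.1 can be illustrated with a minimal sketch.
    It is not MIL code: the helper names convert_via_int64() and
    convert_direct_32() are hypothetical, int32_t stands in for a 32-bit long,
    and the choice of the truncating SSE2 conversion (CVTTPS2DQ) is an
    assumption, since this file does not say which conversion instruction MIL
    uses. The first helper also assumes the input fits in a 64-bit integer.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* FPU-style path from 8.1 (hypothetical helper): convert to a 64-bit
   integer first, then keep only the low 32 bits. Assumes the value fits
   in a 64-bit integer. */
static int32_t convert_via_int64(float value)
{
    int64_t wide = (int64_t)value;              /* truncates toward zero */
    uint32_t low = (uint32_t)(uint64_t)wide;    /* low 32 bits, well defined */
    /* Unsigned-to-signed narrowing is implementation-defined in C;
       mainstream compilers reinterpret the bit pattern. */
    return (int32_t)low;
}

/* Direct 32-bit path (hypothetical helper): the SSE2 packed conversion
   returns 0x80000000 when the input does not fit in 32 bits. */
static int32_t convert_direct_32(float value)
{
#if SKETCH_HAVE_SSE2
    /* Assumption: truncating conversion (CVTTPS2DQ) on the low lane. */
    return _mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_set_ss(value)));
#else
    /* Portable emulation of the same out-of-range behaviour. */
    if (!(value >= -2147483648.0f && value < 2147483648.0f))
        return INT32_MIN;
    return (int32_t)value;
#endif
}
```

    For in-range inputs such as 100.75f, both helpers return 100. For 3.0e9f,
    which overflows a 32-bit long, convert_via_int64() returns -1294967296 on
    common compilers (the low 32 bits of 3000000000), while
    convert_direct_32() returns INT32_MIN (0x80000000).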