-------------------------------------------------------------------------------
                Matrox Imaging Library (7.0) SSE.txt Readme File
                                  June 28, 2001
    Copyright © 2000 by Matrox Electronic Systems Ltd. All rights reserved.
-------------------------------------------------------------------------------


This file lists all the functions that have been optimized with SSE code. A 
supplementary section describes the data alignment required to obtain the 
best performance with SSE when a buffer is created with the 
MbufCreate2d()/MbufCreateColor() function. Another section indicates how to 
enable or disable the use of SSE optimization by MIL. 


Contents

1. Image processing commands.
2. Buffer management commands.
3. Measurements commands.
4. Pattern matching commands.
5. Blob analysis commands.
6. Graphics commands.
7. Data alignment.
8. Known difference with FPU equivalent instruction.


-------------------------------------------------------------------------------
Symbols used in the file
-------------------------------------------------------------------------------

Buffers:    Dst   : Destination
            Src   : Source
            Cnd   : Condition

Data type   UChar : unsigned char
            Char  : signed char
            UShort: unsigned short
            Short : signed short
            ULong : unsigned long
            Long  : signed long
            Float : float
            Bin   : binary

"All the buffer depths and sign combinations" means all buffer types except 
floating-point buffers.


*******************************************************************************
1. Image processing commands.
*******************************************************************************

1.1   MimArith ().

      1.1.1 Optimized versions:
      
         Dst     Src1     Src2          
         ------  ------   ------       
         UChar   UChar    UChar
         Char    Char     Char
         UChar   UShort   UChar
         Char    Short    Char
         UChar   UChar    UShort 
         Char    Char     Short
         UShort  UChar    UChar
         Short   Char     Char
         UShort  UShort   UChar
         Short   Short    Char
         UShort  UChar    UShort
         Short   Char     Short
         UShort  UShort   UShort
         Short   Short    Short
         ULong   ULong    ULong
         Long    Long     Long
       
      1.1.2 Notes.
      
         -Saturation is not optimized for 32-bit src buffers.

         -M_MULT / M_MULT_CONST is not optimized for 32-bit src buffers.
         
      1.1.3 Options optimized.
      
         1.1.3.1 Operations where saturation is optimized.
         
                 M_ADD_CONST     M_ADD
                 M_SUB_CONST     M_SUB
                 M_CONST_SUB     M_MULT
                 M_MULT_CONST
                  
         1.1.3.2 Operations without saturation (optimized).
         
                 M_AND           M_AND_CONST       
                 M_OR            M_OR_CONST       
                 M_XOR           M_XOR_CONST       
                 M_NAND          M_NAND_CONST       
                 M_NOR           M_NOR_CONST       
                 M_XNOR          M_XNOR_CONST       
                 M_MAX           M_MAX_CONST
                 M_MIN           M_MIN_CONST
                 M_NOT
                 M_NEG           M_ABS
                 M_MULT          
                 M_SUB_ABS
                                     
         1.1.3.3 Operations not optimized.
                    
                 M_PASS          M_CONST_PASS
                 M_DIV           M_DIV+M_FIXED_POINT
                 M_CONST_DIV     M_CONST_DIV+M_FIXED_POINT
                                 M_DIV_CONST+M_FIXED_POINT
  
         1.1.3.4 Particular cases optimized.
                 
                 M_DIV_CONST (Only for positive constant and cases where 
                              the source is 8-bit signed or unsigned.)
                     
         1.1.3.5 Mixed sign cases optimized.
          
                 - Logical operations (M_AND, M_NOR, etc.) on mixed-sign 
                   buffers are optimized when all buffers have the same 
                   depth, and also for a mix of 8-bit and 16-bit buffers 
                   when the destination is 8-bit.
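
To illustrate what "saturation" means in sections 1.1.2 and 1.1.3.1, here is 
a minimal scalar sketch in plain C. It is not MIL code and says nothing about 
how MIL implements MimArith(); the function names are hypothetical. It only 
contrasts a wrap-around addition with one that clamps the result to the 
range of the destination type, which is the behavior the packed saturating 
add instructions of the processor provide for 8-bit and 16-bit data.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit unsigned addition that wraps around on overflow. */
static uint8_t add_u8_wrap(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Hypothetical helper: 8-bit unsigned addition clamped to [0, 255]. */
static uint8_t add_u8_sat(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical helper: 8-bit signed addition clamped to [-128, 127]. */
static int8_t add_s8_sat(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) sum = INT8_MAX;
    if (sum < INT8_MIN) sum = INT8_MIN;
    return (int8_t)sum;
}
```

For example, adding the UChar values 200 and 100 yields 44 with wrap-around 
and 255 with saturation.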
 

1.2   MimMorphic().

      M_ERODE, M_DILATE, M_THIN, M_THICK, M_HIT_OR_MISS, M_MATCH:
        
      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized with SSE.
         EXCEPTION: 32-bit buffer is not optimized.
        
      M_BINARY operation:
         Not optimized with SSE.

1.3   MimThin().

      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized with SSE.
         EXCEPTION: 32-bit buffer is not optimized.
        
      M_BINARY operation:
         Not optimized with SSE.


1.4   MimThick().

      M_GRAYSCALE operation:
         All the integer buffer depths and sign combinations are optimized with SSE.
         EXCEPTION: 32-bit buffer is not optimized.
        
      M_BINARY operation:
         Not optimized with SSE.

1.5   MimConvert().

      RGB_TO_L, BGR_TO_YUV16, RGB_TO_YUV16, RGB_TO_YUV24, YUV16_TO_BGR are optimized with
      SSE.
      
      No significant gains were found in other cases.

*******************************************************************************
2. Buffer management commands.
*******************************************************************************

2.1   MbufCopy()


      The following versions of the function are optimized with SSE:

      Dst      Src
      ------   ------
   
      (**via MilMemCopy**)
      Char     Char
      UChar    UChar
      Short    Short
      UShort   UShort
      Long     Long
      ULong    ULong
      Float    Float
   
      (**via SSE_Copy**)
      UChar    Float
      Char     Float
      UShort   Float
      Short    Float
      ULong    Float
      Long     Float

      Float    UChar 
      Float    Char  
      Float    UShort
      Float    Short 
      Float    ULong 
      Float    Long
      
      Note that when converting from float to long or from float to unsigned 
      long, the result can differ from the non-SSE result if the value to 
      convert is outside the range of a long or unsigned long, respectively. 
      This is because a cast in C++ converts the value to an __int64 before 
      copying it to the destination. In the other cases, the result can 
      differ if the value to convert is outside the range of a long, even if 
      the destination is unsigned. This was done to optimize speed.

      For copies where Src and Dst have the same type, MbufCopy() is 
      optimized only for buffers larger than 125 KB (smaller buffers show no 
      speed advantage). The pitch of the buffers, in bytes, must also be 
      aligned on 16 bytes. The call to MilMemCopy is made either from the Ho 
      level or via the DataExchange (some "conversions" have been integrated 
      into the DataExchange, such as 1-band 8-bit to 1-band 8-bit: 
      CopyMONO8inMONO8 in rgb.cpp).
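
The two conditions stated above for same-type copies (size and pitch 
alignment) can be expressed as a small check. The following is only a sketch 
in plain C, not part of MIL: the function and macro names are hypothetical, 
and it assumes that "125 KB" means 125 * 1024 bytes and that the pitch 
condition applies to both the source and the destination buffer.

```c
#include <stddef.h>

/* Assumption: 125 KB is taken as 125 * 1024 bytes. */
#define SSE_COPY_MIN_BYTES   ((size_t)125 * 1024)
/* Pitch, in bytes, must be a multiple of this value. */
#define SSE_COPY_PITCH_ALIGN ((size_t)16)

/* Hypothetical helper: returns 1 when a same-type copy meets the
   conditions listed in this section, 0 otherwise. */
static int same_type_copy_may_use_sse(size_t buffer_bytes,
                                      size_t src_pitch_bytes,
                                      size_t dst_pitch_bytes)
{
    if (buffer_bytes <= SSE_COPY_MIN_BYTES)
        return 0;                              /* too small: no speed advantage */
    if (src_pitch_bytes % SSE_COPY_PITCH_ALIGN != 0)
        return 0;                              /* source pitch not aligned */
    if (dst_pitch_bytes % SSE_COPY_PITCH_ALIGN != 0)
        return 0;                              /* destination pitch not aligned */
    return 1;
}
```

With these assumptions, a 640 x 480 8-bit buffer with a pitch of 640 bytes 
qualifies, while a 1000 x 1000 8-bit buffer with a pitch of 1000 bytes does 
not, because 1000 is not a multiple of 16.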


2.2   MbufBayer().

      Only the following versions of the function are optimized with SSE:
      
      Dst     Src
      ------  ------
      UChar   UChar 
      UShort  UShort
      
      Src                 Dst                   Restriction
      ------              ------                -----------
      M_MONO8 (Bayer)     M_MONO8               DstSizeX > 10, DstSizeY > 3, SrcSizeX > 10, SrcSizeY > 3
      M_MONO8 (Bayer)     M_RGB24+M_PLANAR      DstSizeX > 10, DstSizeY > 3, SrcSizeX > 10, SrcSizeY > 3
      M_MONO8 (Bayer)     M_BGR32+M_PACKED      DstSizeX > 10, DstSizeY > 3, SrcSizeX > 10, SrcSizeY > 3
      M_MONO16 (Bayer)    M_RGB48+M_PLANAR      DstSizeX > 6, DstSizeY > 3, SrcSizeX > 6, SrcSizeY > 3
      M_MONO8 (Bayer)     M_YUV16_YUYV          See table below.
      
      If the Src is M_MONO8 (Bayer) and the Dst is M_YUV16_YUYV, the restrictions
      depend on the SizeX and the AncestorOffsetX as follows:

      AncestorOffsetX     SizeX                 Restriction
      ---------------     -----                 -----------
      Odd                 Even                  DstSizeX > 10, DstSizeY > 3, SrcSizeX > 10, SrcSizeY > 3
      Odd                 Odd                   DstSizeX > 11, DstSizeY > 3, SrcSizeX > 11, SrcSizeY > 3
      Even                Even                  DstSizeX > 12, DstSizeY > 3, SrcSizeX > 12, SrcSizeY > 3
      Even                Odd                   DstSizeX > 11, DstSizeY > 3, SrcSizeX > 11, SrcSizeY > 3
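
The table above can be read as a lookup on the parity of AncestorOffsetX and 
SizeX. The following plain C sketch encodes it; it is not MIL code, the 
function names are hypothetical, and the caller is assumed to supply the 
AncestorOffsetX and SizeX values that the table refers to, since the table 
does not state which buffer they belong to.

```c
/* Hypothetical helper: returns the value that DstSizeX and SrcSizeX must
   exceed, according to the parity of AncestorOffsetX and SizeX. */
static int yuyv_min_size_x(int ancestor_offset_x, int size_x)
{
    int offset_odd = (ancestor_offset_x % 2) != 0;
    int size_odd   = (size_x % 2) != 0;

    if (offset_odd && !size_odd)
        return 10;   /* Odd offset, even size  */
    if (!offset_odd && !size_odd)
        return 12;   /* Even offset, even size */
    return 11;       /* Odd size, either offset parity */
}

/* Hypothetical helper: returns 1 when all four size restrictions of the
   selected table row are satisfied, 0 otherwise. */
static int bayer_yuyv_sizes_ok(int ancestor_offset_x, int size_x,
                               int dst_size_x, int dst_size_y,
                               int src_size_x, int src_size_y)
{
    int min_x = yuyv_min_size_x(ancestor_offset_x, size_x);

    return dst_size_x > min_x && src_size_x > min_x &&
           dst_size_y > 3 && src_size_y > 3;
}
```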

      Finally, the white balance coefficients must satisfy the following 
      restrictions for the SSE-optimized version of the function to be used:

      Dst                 Coefficient #0     Coefficient #1      Coefficient #2
      ------              --------------     --------------      --------------
      M_MONO8                  < 64            Don't care          Don't care
      M_RGB24+M_PLANAR         < 64               < 64                < 64
      M_BGR32+M_PACKED         < 64               < 64                < 64
      M_RGB48+M_PLANAR         < 64               < 64                < 64
      M_YUV16_YUYV             < 64            Don't care          Don't care

      For other conversions, see MbufCopy() and MimConvert().


*******************************************************************************
3. Measurements commands.
*******************************************************************************

*******************************************************************************
4. Pattern matching commands.
*******************************************************************************

*******************************************************************************
5. Blob analysis commands.
*******************************************************************************

*******************************************************************************
6. Graphics commands.
*******************************************************************************

*******************************************************************************
7. Data alignment.
*******************************************************************************

When a MIL buffer is created using MbufCreate2d()/MbufCreateColor(), its 
image row data (scanline) should be aligned on 32-byte boundaries to give 
the best performance in conjunction with the SSE-enabled functions. When it 
is not possible to align on 32-byte boundaries, then the buffer should at 
least be aligned on xmmword (128-bit) or doubleword (32-bit) boundaries. 
Note that, by using the MbufAlloc2d()/MbufAllocColor() function, you don't 
have to worry about data alignment since in this case, MIL automatically 
allocates the buffer with the proper alignment.

Moreover, 32 extra bytes should be readable before the beginning and after 
the end of the buffer so that the SSE-enabled algorithms can perform 
prefetching. Performance could decrease dramatically if those extra bytes 
are not available. When they are available, the M_SSE_ENABLED flag must be 
added to the attribute parameter at buffer creation time 
(MbufCreate2d()/MbufCreateColor()) so that the SSE-enabled algorithms know 
that prefetching can be performed on the buffer. It is also possible to set 
this flag after buffer creation using the MbufControl(...M_FORMAT...) 
command, with the following syntax:

MbufControl(MilImage,
            M_FORMAT,
            M_SSE_ENABLED|MbufInquire(MilImage, M_FORMAT, NULL));

(Note that this control is usually reserved for internal use only and thus 
does not appear in the official documentation.)
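
The alignment and padding recommendations of this section can be sketched in 
plain C as follows. This is not MIL code and only shows one possible way to 
prepare user memory: the PaddedImage type and the function names are 
hypothetical. The sketch rounds the pitch up to a multiple of 32 bytes, 
aligns the first scanline on a 32-byte boundary, and keeps at least 32 
readable bytes before the first scanline and after the last one. The 
resulting data pointer and pitch are what would then be handed to 
MbufCreate2d()/MbufCreateColor().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ROW_ALIGN   ((size_t)32)  /* scanline alignment, in bytes */
#define GUARD_BYTES ((size_t)32)  /* readable bytes before and after the image */

/* Hypothetical container for a user-allocated, padded image. */
typedef struct {
    void          *raw;          /* pointer to pass to free() */
    unsigned char *data;         /* first pixel of row 0, 32-byte aligned */
    size_t         pitch_bytes;  /* distance between rows, multiple of 32 */
} PaddedImage;

/* Returns 1 on success, 0 if the allocation failed. */
static int padded_image_alloc(PaddedImage *img, size_t width_bytes, size_t height)
{
    size_t pitch = (width_bytes + ROW_ALIGN - 1) / ROW_ALIGN * ROW_ALIGN;
    /* Leading guard + rows + trailing guard + slack used by the alignment. */
    size_t total = GUARD_BYTES + pitch * height + GUARD_BYTES + ROW_ALIGN;
    unsigned char *raw = (unsigned char *)malloc(total);
    uintptr_t start;

    if (raw == NULL)
        return 0;
    memset(raw, 0, total);

    /* Skip the leading guard, then round up to the next 32-byte boundary. */
    start = (uintptr_t)raw + GUARD_BYTES;
    start = (start + ROW_ALIGN - 1) & ~(uintptr_t)(ROW_ALIGN - 1);

    img->raw = raw;
    img->data = (unsigned char *)start;
    img->pitch_bytes = pitch;
    return 1;
}

static void padded_image_free(PaddedImage *img)
{
    free(img->raw);
    img->raw = NULL;
    img->data = NULL;
    img->pitch_bytes = 0;
}
```

Because every row starts at data + row * pitch_bytes and the pitch is a 
multiple of 32, all scanlines inherit the alignment of the first one.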


*******************************************************************************
8. Known difference with FPU equivalent instruction.
*******************************************************************************

We have noted some differences in the conversion instructions from float to 
int. They occur because the conversion function in MSDEV converts the value 
to an __int64 before copying it to the long, whereas the SSE instructions 
convert directly to 32 bits. The two approaches give exactly the same result 
for values that fit in the range of a long, but not for values that overflow 
the range of a long.
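
The difference can be reproduced with a small model in portable C. This is 
only a sketch: the function names are hypothetical, and neither function is 
the actual MSDEV or MIL code. The first function converts through a 64-bit 
integer and keeps the low 32 bits, which is assumed here to be what "copying 
it in the long" amounts to. The second function models the direct 32-bit SSE 
conversion, which returns the value 0x80000000 when the input does not fit in 
32 bits. Both functions truncate toward zero; rounding behavior is not 
modeled.

```c
#include <stdint.h>
#include <string.h>

/* Model of the compiler path: float -> 64-bit integer -> low 32 bits.
   Assumes the value fits in a 64-bit integer. */
static int32_t convert_via_int64(float value)
{
    int64_t  wide = (int64_t)value;
    uint32_t low  = (uint32_t)(uint64_t)wide;  /* keep the low 32 bits */
    int32_t  result;

    memcpy(&result, &low, sizeof result);      /* reinterpret as signed */
    return result;
}

/* Model of the direct 32-bit conversion: out-of-range inputs (and NaN)
   produce 0x80000000, which is INT32_MIN. */
static int32_t convert_direct_32(float value)
{
    if (!(value >= -2147483648.0f && value < 2147483648.0f))
        return INT32_MIN;
    return (int32_t)value;
}
```

For an in-range value such as 1234.75f, both functions return 1234. For 
3e9f, which overflows the range of a long, the first returns -1294967296 and 
the second returns -2147483648.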