TBCI Numerical high perf. C++ Library  2.8.0
SIMD arithmetics (SSE2)

If TBCI_NO_SIMD is NOT defined, we use SSE2 instructions on machines that support it (-msse2 on x86, always on x86-64). This greatly speeds up operations, as we process two doubles (or even four floats) per instruction. In the unrolled loops that TBCI generates, we can reach more than 1.5x the speed of normal SISD operation for doubles.

Note that this only holds if we are not memory bound, i.e. if the data fits in the cache and is already resident there. The hardware prefetch logic of SSE2-capable CPUs hides some of the latency, but memory bandwidth is a limiting factor for simple arithmetic operations on modern systems.

On x86, the default for floating point is to use the i387 FPU. It has the annoying(!) feature of providing extra accuracy (80-bit extended double format) – which can yield different results than the standard IEEE 754/854 floating point arithmetic that you get with SSE2 instructions and on all other architectures. You can, however, switch the FPU to double precision mode to get consistent results with or without SSE2. Note that x86-64 does not share this problem, as SSE2 is used by default there, though normally only on single data (scalar instructions).
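Switching the i387 FPU to 53-bit (double) precision can be done via the glibc control-word macros on Linux; a minimal sketch (Linux/glibc-specific, not a TBCI API):

```cpp
// Linux/glibc only: set the x87 control word to 53-bit (double) precision
// so i387 results match plain IEEE 754 double arithmetic (and thus SSE2).
#include <fpu_control.h>

void set_fpu_double_precision() {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    // _FPU_EXTENDED covers both precision-control bits, so masking with
    // its complement clears 80-bit mode before selecting 64-bit mode.
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
}
```

Note that this changes only the rounding precision of the mantissa; the extended exponent range of intermediate x87 results can still cause tiny differences near the overflow/underflow limits.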

Another mismatch can occur when SIMD instructions are used for summing up numbers; see the comments on TBCI_SIMD_SUM.

Note that we generate unrolled SSE2 loops using intrinsics and a heavy abuse of macros; we take care of proper loop heads and tails to account for unaligned data and for uneven numbers of elements. The compiler (gcc-4.0) does not automatically generate SIMD SSE2 instructions, so we do this manual work with the macros.
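The head/tail handling can be sketched in plain intrinsics (an illustrative kernel, not the actual TBCI macro output, and without the additional unrolling the macros perform): a scalar head runs until the destination is 16-byte aligned, the SSE2 body processes two doubles per iteration, and a scalar tail picks up the remainder.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

// dst[i] += f * src[i] for i in [0, n), with alignment head/tail handling.
void scale_add(double* dst, const double* src, double f, std::size_t n) {
    std::size_t i = 0;

    // Head: scalar iterations until dst is 16-byte aligned.
    while (i < n && (reinterpret_cast<std::uintptr_t>(dst + i) & 15) != 0) {
        dst[i] += f * src[i];
        ++i;
    }

    // Body: two doubles per iteration. src may still be misaligned
    // relative to dst, so it is read with an unaligned load.
    const __m128d vf = _mm_set1_pd(f);
    for (; i + 2 <= n; i += 2) {
        __m128d vs = _mm_loadu_pd(src + i);
        __m128d vd = _mm_load_pd(dst + i);
        _mm_store_pd(dst + i, _mm_add_pd(vd, _mm_mul_pd(vf, vs)));
    }

    // Tail: leftover element if n was odd (or the array was tiny).
    for (; i < n; ++i)
        dst[i] += f * src[i];
}
```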

Details

Unfortunately, the number of definitions needed per loop kernel is much larger for SIMD intrinsics; still, it is better than assembly.

The following environment is provided by the macros from unroll_prefetch_simd_def.h:

Local storage of the passed type is available as TMP and LD. Additional variables can be created via the PREP macro, and final evaluation can be done via the FIN macro. The scalars (f2, or f1 and f2) are passed to these macros by value if they are constants, otherwise by reference, never as pointers. Unlike the non-SIMD macros, the loop operations are passed pointers to the vectors.
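As a purely hypothetical illustration of that calling convention (the structure below is made up for exposition and is not the actual TBCI macro expansion): the scalar arrives by value or reference and is broadcast once in a PREP-like step, while the vector data arrives as pointers consumed by the loop body.

```cpp
#include <emmintrin.h>
#include <cstddef>

// Hypothetical sketch only. The scalar f2 comes in by value, the vector
// as a pointer; a PREP-like step broadcasts the scalar into a register
// once, the loop body works on the pointer data, and a FIN-like step
// would do any final reduction (none is needed for this elementwise op).
void mul_scalar(double* v, double f2, std::size_t n) {
    const __m128d vf = _mm_set1_pd(f2);   // "PREP": broadcast scalar once
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)            // loop body on the pointer data
        _mm_storeu_pd(v + i, _mm_mul_pd(_mm_loadu_pd(v + i), vf));
    for (; i < n; ++i)                    // scalar tail for an odd element
        v[i] *= f2;
    // "FIN": nothing to finalize for an elementwise kernel
}
```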
