vc/doc/examples.h

344 lines
16 KiB
C++

/**
* \page examples Examples
*
* There are several examples shipping with %Vc. If you have a suggestion for a useful or interesting
* example, please contact vc@compeng.uni-frankfurt.de.
*
* \li \subpage ex-polarcoord
* This example explains the basic approaches to vectorization of a given problem. It contains a
* discussion of storage issues.
* \li \subpage ex-finitediff
* \li \subpage ex-matrix
* \li \subpage ex-mandelbrot
* \li \subpage ex-buddhabrot
*
*********************************************************************************
*
* \page ex-polarcoord Polar Coordinates
*
* The \c polarcoord example generates 1000 random Cartesian 2D coordinates that are then
* converted to polar coordinates and printed to the terminal.
* This is a very simple example but shows the concept of vertical versus horizontal
* vectorization quite nicely.
*
* \section ex_polarcoord_background Background
*
* In our problem we start with the allocation and random initialization of 1000 Cartesian 2D
* coordinates. Thus every coordinate consists of two floating-point values (x and y).
* \code
* struct CartesianCoordinate
* {
* float x, y;
* };
* CartesianCoordinate input[1000];
* \endcode
* \image html polarcoord-cartesian.png "Cartesian coordinate"
*
* We want to convert them to 1000 polar coordinates.
* \code
* struct PolarCoordinate
* {
* float r, phi;
* };
* PolarCoordinate output[1000];
* \endcode
* \image html polarcoord-polar.png "Polar coordinate"
*
* Recall that:
* \f[
* r^2 = x^2 + y^2
* \f]\f[
* \tan\phi = y/x
* \f]
* (One typically uses \c atan2 to calculate \c phi efficiently.)
*
* \section ex_polarcoord_vectorization Identify Vectorizable Parts
*
* When you look into vectorization of your application/algorithm, the first task is to identify the
* data parallelism to use for vectorization.
* A scalar implementation of our problem could look like this:
* \code
* for (int i = 0; i < ArraySize; ++i) {
* const float x = input[i].x;
* const float y = input[i].y;
* output[i].r = std::sqrt(x * x + y * y);
* output[i].phi = std::atan2(y, x) * 57.295780181884765625f; // 180/pi
* if (output[i].phi < 0.f) {
* output[i].phi += 360.f;
* }
* }
* \endcode
* The data parallelism inside the loop is minimal. It basically consists of two multiplications
* that can be executed in parallel. This kind of parallelism is already exploited by all modern
* processors via pipelining, which is one form of instruction level parallelism (ILP).
* Thus, if one were to put the x and y values into a SIMD vector, this one multiplication could be
* executed with just a single SIMD instruction. This vectorization is called \e vertical
* vectorization, because the vector is placed vertically into the object.
*
* There is much more data parallelism in this code snippet, though. The different iteration steps
* are all independent, which means that subsequent steps do not depend on results of the preceding steps.
* Therefore, several steps of the loop can be executed in parallel. This is the most
* straightforward vectorization strategy for our problem:
* From a loop, always execute N steps in parallel, where N is the number of entries in the SIMD vector.
* The input values to the loop need to be placed into a vector.
* Then all intermediate values and results are also vectors. Using the %Vc datatypes a single loop
* step would then look like this:
* \code
* // x and y are of type float_v
* float_v r = Vc::sqrt(x * x + y * y);
* float_v phi = Vc::atan2(y, x) * 57.295780181884765625f; // 180/pi
* phi(output[i].phi < 0.f) += 360.f;
* \endcode
* This vectorization is called \e horizontal vectorization, because the vector is placed
* horizontally over several objects.
*
* \section ex_polarcoord_data Data Structures
*
* To form the \c x vector from the previously used storage format, one would thus write:
* \code
* float_v x;
* for (int i = 0; i < float_v::Size; ++i) {
* x[i] = input[offset + i].x;
* }
* \endcode
* Notice how the memory access is rather inefficient.
*
* \subsection ex_polarcoord_data_aos Array of Structs (AoS)
*
* The data was originally stored as array of
* structs (\e AoS). Another way to call it is \e interleaved storage. That's because the entries of
* the \c x and \c y vectors are interleaved in memory.
* Let us assume the storage format is given and we cannot change it.
* We would rather not load and store all our vectors entry by entry as this would lead to
* inefficient code, which mainly occupies the load/store ports of the processor. Instead, we can use
* a little helper function %Vc provides to load the data as vectors with subsequent deinterleaving:
* \code
* Vc::float_v x, y;
* Vc::deinterleave(&x, &y, &input[i], Vc::Aligned);
* \endcode
* This pattern can be very efficient if the interleaved data members are always accessed together.
* This optimizes for data locality and thus cache usage.
*
* \subsection ex_polarcoord_data_vectors Interleaved Vectors / Array of Vectorized Structs (AoVS)
*
* If you can change the data structures, it might be a good option to store interleaved vectors:
* \code
* struct CartesianCoordinate
* {
* Vc::float_v x, y;
* };
* CartesianCoordinate input[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size];
* \endcode
* Accessing vectors of \c x and \c y is then as simple as accessing the members of a \c
* CartesianCoordinate object. This can be slightly more efficient than the previous method because
* the deinterleaving step is not required anymore. On the downside your data structure now depends
* on the target architecture, which can be a portability concern.
* In short, the \c sizeof operator returns different values depending on Vc::float_v::Size.
* Thus, you would have to ensure correct conversion to target independent data
* formats for any data exchange (storage, network). (But if you are really careful about portable
* data exchange, you already have to handle endian conversion anyway.)
*
* Note the unfortunate complication of determining the size of the array. In order to fit 1000
* scalar values into the array, the number of vectors times the vector size must be greater or equal
* than 1000. But integer division truncates.
*
* Sadly, there is one last issue with alignment. If the \c CartesianCoordinate object is allocated
* on the stack everything is fine (because the compiler knows about the alignment restrictions of
* \c x and \c y and thus of \c CartesianCoordinate). But if \c CartesianCoordinate is allocated on
* the heap (with \c new or inside an STL container), the correct alignment is not ensured. %Vc provides
* Vc::VectorAlignedBase, which contains the correct reimplementations of the \c new and \c delete operators:
* \code
* struct CartesianCoordinate : public Vc::VectorAlignedBase
* {
* Vc::float_v x, y;
* }
* CartesianCoordinate *input = new CartesianCoordinate[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size];
* \endcode
* To ensure correctly aligned storage with STL containers you can use Vc::Allocator:
* \code
* struct CartesianCoordinate
* {
* Vc::float_v x, y;
* }
* Vc_DECLARE_ALLOCATOR(CartesianCoordinate)
* std::vector<CartesianCoordinate> input((1000 + Vc::float_v::Size - 1) / Vc::float_v::Size);
* \endcode
*
* For a thorough discussion of alignment see \ref intro_alignment.
*
* \subsection ex_polarcoord_data_soa Struct of Arrays (SoA)
*
* A third option is storage in the form of a single struct instance that contains arrays of the
* data members:
* \code
* template<size_t Size> struct CartesianCoordinate
* {
* float x[Size], y[Size];
* }
* CartesianCoordinate<1000> input;
* \endcode
* Now all \c x values are adjacent in memory and thus can easily be loaded and stored as vectors.
* Well, two problems remain:
* 1. The alignment of \c x and \c y is not defined and therefore not guaranteed. Vector loads and
* stores thus must assume unaligned pointers, which is bad for performance. Even worse, if an
* instruction that expects an aligned pointer is executed with an unaligned address the program
* will crash.
* 2. The size of the \c x and \c y arrays is not guaranteed to be large enough to allow the last
* values in the arrays to be loaded/stored as vectors.
*
* %Vc provides the Vc::Memory class to solve both issues:
* \code
* template<size_t Size> struct CartesianCoordinate
* {
* Vc::Memory<float_v, Size> x, y;
* }
* CartesianCoordinate<1000> input;
* \endcode
*
* \section ex_polarcoord_complete The Complete Example
*
* Now that we have covered the background and know what we need - let us take a look at the
* complete example code.
*
* \snippet polarcoord/main.cpp includes
* The example starts with the main include directive to use for %Vc: \c \#include \c <Vc/Vc>.
* The remaining includes are required for terminal output.
* Note that we include Vc::float_v into the global namespace.
* It is not recommended to include the whole %Vc namespace into the global namespace
* except maybe inside a function scope.
*
* \snippet polarcoord/main.cpp memory allocation
* At the start of the program, the input and output memory is allocated.
* Of course, you can abstract these variables into structs/classes for Cartesian and polar
* coordinates.
* The Vc::Memory class can be used to allocate memory on the stack or on the heap.
* In this case the memory is allocated on the stack, since the size of the memory is given at
* compile time.
* The first \c float value of Vc::Memory (e.g. x_mem[0]) is guaranteed to be aligned to the
* natural SIMD vector alignment.
* Also, the size of the allocated memory may be padded at the end to allow access to the last \c
* float value (e.g. x_mem[999]) with a SIMD vector.
* Thus, if this example is compiled for a target with a vector width (\c Vc::float_v::Size) of 16
* entries, the four arrays would internally be allocated as size 1008 (63 vectors with 16 entries =
* 1008 entries).
*
* \snippet polarcoord/main.cpp random init
* Next the x and y values are initialized with random numbers.
* %Vc includes a simple vectorized random number generator.
* The floating point RNGs in %Vc produce values in the range from 0 to 1.
* Thus the value has to be scaled and subtracted to get into the desired range of -1 to 1.
* The iteration over the memory goes from 0 (no surprise) to a value determined by the Vc::Memory
* class. In the case of fixed-size allocation, this number is also available to the compiler as a
* compile time constant. Vc::Memory has two functions to use as upper bound for iterations:
* Vc::Memory::entriesCount and Vc::Memory::vectorsCount. The former returns the same number as was
* used for allocation. The latter returns the number of SIMD vectors that fit into the (padded)
* allocated memory. Thus, if Vc::float_v::Size were 16, \c x_mem.vectorsCount() would expand to 63.
* Inside the loop, the memory i-th vector is then set to a random value.
*
* \warning Please do not use this RNG until you have read its documentation. It may not be as
* random as you need it to be.
*
* \snippet polarcoord/main.cpp conversion
* Finally we arrive at the conversion of the Cartesian coordinates to polar coordinates.
* The for loop is equivalent to the one above.
*
* \note
* Inside the loop we first assign the \c x and \c y values to local variables.
* This is not necessary; but it can help the compiler with optimization. The issue is that when you
* access values from some memory area, the compiler cannot always be sure that the pointers to
* memory do not alias (i.e. point to the same location). Thus, the compiler might rather take the
* safe way out and load the value from memory more often than necessary. By using local variables,
* the compiler has an easy task to prove that the value does not change and can be cached in a
* register. This is a general issue, and not a special issue with SIMD. In this case mainly serves
* to make the following code more readable.
*
* The radius (\c r) is easily calculated as the square root of the sum of squares.
* It is then directly assigned to the right vector in \c r_mem.
*
* \subsection ex_polarcoord_complete_masking Masked Assignment
*
* The \c phi value, on the other hand, is determined as a value between -pi and pi.
* Since we want to have a value between 0 and 360, the value has to be scaled with 180/pi.
* Then, all \c phi values that are less than zero must be modified.
* This is where SIMD code differs significantly from the scalar code you are used to:
* An \c if statement cannot be used directly with a SIMD mask.
* Thus, <tt>if (phi < 0)</tt> would be ill-formed.
* This is obvious once you realize that <tt>phi < 0</tt> is an object that contains multiple boolean values.
* Therefore, the meaning of <tt>if (phi < 0)</tt> is ambiguous.
* You can make your intent clear by using a reduction function for the mask, such as one of:
* \code
* if (all_of(phi < 0)) // ...
* if (any_of(phi < 0)) // ...
* if (none_of(phi < 0)) // ...
* if (some_of(phi < 0)) // ...
* \endcode
*
* In most cases a reduction is useful as a shortcut, but not as a general solution.
* This is where masked assignment (or write-masking) comes into play.
* The idea is to assign only a subset of the values in a SIMD vector and leave the remaining ones
* unchanged.
* The %Vc syntax for this uses a predicate (mask) in parenthesis before the assignment operator.
* Thus, <tt>x(mask) = y;</tt> requests that \c x is assigned the values from \c y where the
* corresponding entries in \c mask are \c true.
* Read it as: "where mask is true, assign y to x".
*
* Therefore, the transformation of \c phi is written as <tt>phi(phi < 0.f) += 360.f</tt>.
* (Note that the assignment operator can be any compound assignment, as well.)
*
* \note
* %Vc also supports an alternative syntax using the \ref Vc::where function, which enables generic
* implementations that work for fundamental arithmetic types as well as the %Vc vector types.
*
* \subsection ex_polarcoord_complete_output Console Output
*
* At the end of the program the results are printed to \c stdout:
* \snippet polarcoord/main.cpp output
*
* The loop iterates over the Vc::Memory objects using Vc::Memory::entriesCount, thus iterating over scalar
* values, not SIMD vectors.
* The code therefore should be clear, as it doesn't use any SIMD functionality.
*
*********************************************************************************
*
* \page ex-finitediff Finite Differences
*
* Finite difference method example
*
* We calculate central differences for a given function and compare it to the analytical solution.
*
* \snippet finitediff/main.cpp includes
* \snippet finitediff/main.cpp constants
* \snippet finitediff/main.cpp functions
* \snippet finitediff/main.cpp cleanup
*
*********************************************************************************
*
* \page ex-matrix Matrix Class
*
*********************************************************************************
*
* \page ex-buddhabrot Buddhabrot
*
*********************************************************************************
*
* \page ex-mandelbrot Mandelbrot
*
* This example draws a colorized Mandelbrot image on screen using Qt4 widgets.
*
* The example uses a simple class to abstract complex numbers. In principle, one could just use
* std::complex, if it would perform well enough. But especially the norm function is very slow for
* scalar float/double. Also, complex multiplication is correctly implemented to handle NaN and
* infinity. This is not required for Mandelbrot as these special cases will not occur.
* Additionally, the provided complex abstraction stores the square of the real and imaginary parts
* to help the compiler in optimizing the code as good as possible.
* \snippet mandelbrot/mandel.cpp MyComplex
*
* Mandelbrot uses the function z = z² + c for iteration.
* \snippet mandelbrot/mandel.cpp P function
*
*********************************************************************************
*
* \page ex-buddhabrot Buddhabrot
*
*/