/**
 * \page examples Examples
 *
 * There are several examples shipping with %Vc. If you have a suggestion for a useful or interesting
 * example, please contact vc@compeng.uni-frankfurt.de.
 *
 * \li \subpage ex-polarcoord
 * This example explains the basic approaches to vectorization of a given problem. It contains a
 * discussion of storage issues.
 * \li \subpage ex-finitediff
 * \li \subpage ex-matrix
 * \li \subpage ex-mandelbrot
 * \li \subpage ex-buddhabrot
 *
 *********************************************************************************
 *
 * \page ex-polarcoord Polar Coordinates
 *
 * The \c polarcoord example generates 1000 random Cartesian 2D coordinates that are then
 * converted to polar coordinates and printed to the terminal.
 * This is a very simple example but shows the concept of vertical versus horizontal
 * vectorization quite nicely.
 *
 * \section ex_polarcoord_background Background
 *
 * In our problem we start with the allocation and random initialization of 1000 Cartesian 2D
 * coordinates. Thus every coordinate consists of two floating-point values (x and y).
 * \code
 * struct CartesianCoordinate
 * {
 *   float x, y;
 * };
 * CartesianCoordinate input[1000];
 * \endcode
 * \image html polarcoord-cartesian.png "Cartesian coordinate"
 *
 * We want to convert them to 1000 polar coordinates.
 * \code
 * struct PolarCoordinate
 * {
 *   float r, phi;
 * };
 * PolarCoordinate output[1000];
 * \endcode
 * \image html polarcoord-polar.png "Polar coordinate"
 *
 * Recall that:
 * \f[
 * r^2 = x^2 + y^2
 * \f]\f[
 * \tan\phi = y/x
 * \f]
 * (One typically uses \c atan2 to calculate \c phi efficiently.)
 *
 * \section ex_polarcoord_vectorization Identify Vectorizable Parts
 *
 * When you look into vectorization of your application/algorithm, the first task is to identify the
 * data parallelism that can be used for vectorization.
 * A scalar implementation of our problem could look like this:
 * \code
 * for (int i = 0; i < ArraySize; ++i) {
 *   const float x = input[i].x;
 *   const float y = input[i].y;
 *   output[i].r = std::sqrt(x * x + y * y);
 *   output[i].phi = std::atan2(y, x) * 57.295780181884765625f; // 180/pi
 *   if (output[i].phi < 0.f) {
 *     output[i].phi += 360.f;
 *   }
 * }
 * \endcode
 * The data parallelism inside the loop is minimal. It basically consists of the two multiplications
 * x * x and y * y, which can be executed in parallel. This kind of parallelism is already exploited
 * by all modern processors via pipelining, which is one form of instruction-level parallelism (ILP).
 * Thus, if one were to put the x and y values into a SIMD vector, the two multiplications could be
 * executed with just a single SIMD instruction. This vectorization is called \e vertical
 * vectorization, because the vector is placed vertically into the object.
 *
 * There is much more data parallelism in this code snippet, though. The iteration steps
 * are all independent, which means that subsequent steps do not depend on results of the preceding steps.
 * Therefore, several steps of the loop can be executed in parallel. This is the most
 * straightforward vectorization strategy for our problem:
 * execute N iterations of the loop in parallel, where N is the number of entries in the SIMD vector.
 * The input values to the loop need to be placed into a vector.
 * Then all intermediate values and results are also vectors. Using the %Vc datatypes a single loop
 * step would then look like this:
 * \code
 * // x and y are of type float_v
 * float_v r = Vc::sqrt(x * x + y * y);
 * float_v phi = Vc::atan2(y, x) * 57.295780181884765625f; // 180/pi
 * phi(phi < 0.f) += 360.f;
 * \endcode
 * This vectorization is called \e horizontal vectorization, because the vector is placed
 * horizontally over several objects.
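 * To make the idea concrete, here is a small self-contained sketch in plain C++ (no %Vc; the chunk
 * width \c W and the function name are made up for illustration) of what horizontal vectorization
 * does: each inner loop corresponds to one SIMD instruction over \c W entries.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t W = 4; // stand-in for float_v::Size

// One "vectorized" loop step: converts W coordinates at once. A SIMD library
// would replace each inner loop by a single vector instruction.
void convert_chunk(const float* x, const float* y, float* r, float* phi) {
    for (std::size_t j = 0; j < W; ++j)
        r[j] = std::sqrt(x[j] * x[j] + y[j] * y[j]);
    for (std::size_t j = 0; j < W; ++j) {
        phi[j] = std::atan2(y[j], x[j]) * 57.29578f; // 180/pi
        if (phi[j] < 0.f)
            phi[j] += 360.f; // becomes a masked assignment in SIMD code
    }
}
```

 * The outer loop then advances by \c W per iteration instead of by 1.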
 *
 * \section ex_polarcoord_data Data Structures
 *
 * To form the \c x vector from the previously used storage format, one would thus write:
 * \code
 * float_v x;
 * for (int i = 0; i < float_v::Size; ++i) {
 *   x[i] = input[offset + i].x;
 * }
 * \endcode
 * Notice how this entry-by-entry memory access is rather inefficient.
 *
 * \subsection ex_polarcoord_data_aos Array of Structs (AoS)
 *
 * The data was originally stored as an array of
 * structs (\e AoS). Another name for this layout is \e interleaved storage, because the entries of
 * the \c x and \c y vectors are interleaved in memory.
 * Let us assume the storage format is given and we cannot change it.
 * We would rather not load and store all our vectors entry by entry, as this would lead to
 * inefficient code that mainly occupies the load/store ports of the processor. Instead, we can use
 * a little helper function %Vc provides to load the data as vectors with subsequent deinterleaving:
 * \code
 * Vc::float_v x, y;
 * Vc::deinterleave(&x, &y, &input[i], Vc::Aligned);
 * \endcode
 * This pattern can be very efficient if the interleaved data members are always accessed together.
 * This optimizes for data locality and thus cache usage.
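 * For illustration, the effect of deinterleaving can be written out in scalar code (this is not
 * Vc's implementation, which uses wide loads and shuffle instructions instead):

```cpp
#include <cassert>
#include <cstddef>

struct CartesianCoordinate { float x, y; };

constexpr std::size_t W = 4; // stand-in for float_v::Size

// Scalar model of what Vc::deinterleave produces: W interleaved (x, y) pairs
// are split into one contiguous block of x values and one of y values.
void deinterleave_scalar(float* x, float* y, const CartesianCoordinate* in) {
    for (std::size_t j = 0; j < W; ++j) {
        x[j] = in[j].x;
        y[j] = in[j].y;
    }
}
```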
 *
 * \subsection ex_polarcoord_data_vectors Interleaved Vectors / Array of Vectorized Structs (AoVS)
 *
 * If you can change the data structures, it might be a good option to store interleaved vectors:
 * \code
 * struct CartesianCoordinate
 * {
 *   Vc::float_v x, y;
 * };
 * CartesianCoordinate input[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size];
 * \endcode
 * Accessing vectors of \c x and \c y is then as simple as accessing the members of a \c
 * CartesianCoordinate object. This can be slightly more efficient than the previous method because
 * the deinterleaving step is not required anymore. On the downside, your data structure now depends
 * on the target architecture, which can be a portability concern.
 * In short, the \c sizeof operator returns different values depending on Vc::float_v::Size.
 * Thus, you would have to ensure correct conversion to target-independent data
 * formats for any data exchange (storage, network). (But if you are really careful about portable
 * data exchange, you already have to handle endian conversion anyway.)
 *
 * Note the unfortunate complication of determining the size of the array. In order to fit 1000
 * scalar values into the array, the number of vectors times the vector size must be greater than
 * or equal to 1000. But integer division truncates, so the expression adds Size - 1 before
 * dividing in order to round up.
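 * The rounding-up division can be factored into a small helper (a sketch for illustration, not
 * part of %Vc's API):

```cpp
#include <cassert>
#include <cstddef>

// Number of SIMD vectors needed to hold a given number of scalar entries:
// adding width - 1 before dividing rounds up instead of truncating.
constexpr std::size_t vectors_needed(std::size_t entries, std::size_t width) {
    return (entries + width - 1) / width;
}
```

 * For example, 1000 floats with a vector width of 16 require 63 vectors (1008 padded entries),
 * whereas plain truncating division would yield 62 and lose 8 entries.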
 *
 * Sadly, there is one last issue with alignment. If the \c CartesianCoordinate object is allocated
 * on the stack, everything is fine (because the compiler knows about the alignment restrictions of
 * \c x and \c y and thus of \c CartesianCoordinate). But if \c CartesianCoordinate is allocated on
 * the heap (with \c new or inside an STL container), the correct alignment is not ensured. %Vc provides
 * Vc::VectorAlignedBase, which contains the correct reimplementations of the \c new and \c delete operators:
 * \code
 * struct CartesianCoordinate : public Vc::VectorAlignedBase
 * {
 *   Vc::float_v x, y;
 * };
 * CartesianCoordinate *input = new CartesianCoordinate[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size];
 * \endcode
 * To ensure correctly aligned storage with STL containers you can use Vc::Allocator:
 * \code
 * struct CartesianCoordinate
 * {
 *   Vc::float_v x, y;
 * };
 * Vc_DECLARE_ALLOCATOR(CartesianCoordinate)
 * std::vector<CartesianCoordinate> input((1000 + Vc::float_v::Size - 1) / Vc::float_v::Size);
 * \endcode
 *
 * For a thorough discussion of alignment see \ref intro_alignment.
 *
 * \subsection ex_polarcoord_data_soa Struct of Arrays (SoA)
 *
 * A third option is storage in the form of a single struct instance that contains arrays of the
 * data members:
 * \code
 * template<size_t Size> struct CartesianCoordinate
 * {
 *   float x[Size], y[Size];
 * };
 * CartesianCoordinate<1000> input;
 * \endcode
 * Now all \c x values are adjacent in memory and thus can easily be loaded and stored as vectors.
 * Still, two problems remain:
 * 1. The alignment of \c x and \c y is not defined and therefore not guaranteed. Vector loads and
 * stores thus must assume unaligned pointers, which is bad for performance. Even worse, if an
 * instruction that expects an aligned pointer is executed with an unaligned address, the program
 * will crash.
 * 2. The size of the \c x and \c y arrays is not guaranteed to be large enough to allow the last
 * values in the arrays to be loaded/stored as vectors.
 *
 * %Vc provides the Vc::Memory class to solve both issues:
 * \code
 * template<size_t Size> struct CartesianCoordinate
 * {
 *   Vc::Memory<Vc::float_v, Size> x, y;
 * };
 * CartesianCoordinate<1000> input;
 * \endcode
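 * To see what Vc::Memory has to take care of, here is a hand-rolled sketch in plain C++ (the width
 * \c W and alignment \c A are assumed values for illustration, not %Vc constants):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t W = 16; // assumed vector width (float_v::Size)
constexpr std::size_t A = 64; // assumed vector alignment in bytes

// SoA storage with both problems fixed by hand: the arrays are padded to a
// multiple of W, and alignas forces SIMD-friendly alignment.
template<std::size_t Size> struct CartesianCoordinateSoA {
    static constexpr std::size_t Padded = (Size + W - 1) / W * W;
    alignas(A) float x[Padded];
    alignas(A) float y[Padded];
};
```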
 *
 * \section ex_polarcoord_complete The Complete Example
 *
 * Now that we have covered the background and know what we need, let us take a look at the
 * complete example code.
 *
 * \snippet polarcoord/main.cpp includes
 * The example starts with the main include directive to use for %Vc: \c \#include \c <Vc/Vc>.
 * The remaining includes are required for terminal output.
 * Note that we import Vc::float_v into the global namespace.
 * It is not recommended to import the whole %Vc namespace into the global namespace,
 * except maybe inside a function scope.
 *
 * \snippet polarcoord/main.cpp memory allocation
 * At the start of the program, the input and output memory is allocated.
 * Of course, you can abstract these variables into structs/classes for Cartesian and polar
 * coordinates.
 * The Vc::Memory class can be used to allocate memory on the stack or on the heap.
 * In this case the memory is allocated on the stack, since the size of the memory is given at
 * compile time.
 * The first \c float value of Vc::Memory (e.g. x_mem[0]) is guaranteed to be aligned to the
 * natural SIMD vector alignment.
 * Also, the size of the allocated memory may be padded at the end to allow access to the last \c
 * float value (e.g. x_mem[999]) with a SIMD vector.
 * Thus, if this example is compiled for a target with a vector width (\c Vc::float_v::Size) of 16
 * entries, the four arrays would internally be allocated with 1008 entries each (63 vectors of 16
 * entries).
 *
 * \snippet polarcoord/main.cpp random init
 * Next the x and y values are initialized with random numbers.
 * %Vc includes a simple vectorized random number generator.
 * The floating-point RNGs in %Vc produce values in the range from 0 to 1.
 * Thus the value has to be scaled and shifted to map it into the desired range of -1 to 1.
 * The iteration over the memory goes from 0 (no surprise) to a value determined by the Vc::Memory
 * class. In the case of fixed-size allocation, this number is also available to the compiler as a
 * compile-time constant. Vc::Memory has two functions to use as upper bound for iterations:
 * Vc::Memory::entriesCount and Vc::Memory::vectorsCount. The former returns the same number as was
 * used for allocation. The latter returns the number of SIMD vectors that fit into the (padded)
 * allocated memory. Thus, if Vc::float_v::Size were 16, \c x_mem.vectorsCount() would return 63.
 * Inside the loop, the i-th vector of the memory is then set to a random value.
 *
 * \warning Please do not use this RNG until you have read its documentation. It may not be as
 * random as you need it to be.
 *
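 * The mapping from the RNG's [0, 1) range to [-1, 1) described above is a plain scale-and-shift;
 * as a scalar sketch (with %Vc the same expression is applied to a whole float_v at once):

```cpp
#include <cassert>

// Map a random value from [0, 1) to [-1, 1): scale by 2, then shift by -1.
float to_symmetric_range(float r01) {
    return r01 * 2.f - 1.f;
}
```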
 * \snippet polarcoord/main.cpp conversion
 * Finally we arrive at the conversion of the Cartesian coordinates to polar coordinates.
 * The for loop is equivalent to the one above.
 *
 * \note
 * Inside the loop we first assign the \c x and \c y values to local variables.
 * This is not necessary, but it can help the compiler with optimization. The issue is that when you
 * access values from some memory area, the compiler cannot always be sure that the pointers to
 * memory do not alias (i.e. point to the same location). Thus, the compiler might rather take the
 * safe way out and load the value from memory more often than necessary. By using local variables,
 * it is easy for the compiler to prove that the value does not change, so it can be cached in a
 * register. This is a general issue and not special to SIMD. In this case it mainly serves
 * to make the following code more readable.
 *
 * The radius (\c r) is easily calculated as the square root of the sum of squares.
 * It is then directly assigned to the right vector in \c r_mem.
 *
 * \subsection ex_polarcoord_complete_masking Masked Assignment
 *
 * The \c phi value, on the other hand, is determined as a value between -pi and pi.
 * Since we want to have a value between 0 and 360, the value has to be scaled with 180/pi.
 * Then, all \c phi values that are less than zero must be modified.
 * This is where SIMD code differs significantly from the scalar code you are used to:
 * An \c if statement cannot be used directly with a SIMD mask.
 * Thus, <tt>if (phi < 0)</tt> would be ill-formed.
 * This is obvious once you realize that <tt>phi < 0</tt> is an object that contains multiple boolean values.
 * Therefore, the meaning of <tt>if (phi < 0)</tt> is ambiguous.
 * You can make your intent clear by using a reduction function for the mask, such as one of:
 * \code
 * if (all_of(phi < 0)) // ...
 * if (any_of(phi < 0)) // ...
 * if (none_of(phi < 0)) // ...
 * if (some_of(phi < 0)) // ...
 * \endcode
 *
 * In most cases a reduction is useful as a shortcut, but not as a general solution.
 * This is where masked assignment (or write-masking) comes into play.
 * The idea is to assign only a subset of the values in a SIMD vector and leave the remaining ones
 * unchanged.
 * The %Vc syntax for this uses a predicate (mask) in parentheses before the assignment operator.
 * Thus, <tt>x(mask) = y;</tt> requests that \c x is assigned the values from \c y where the
 * corresponding entries in \c mask are \c true.
 * Read it as: "where mask is true, assign y to x".
 *
 * Therefore, the transformation of \c phi is written as <tt>phi(phi < 0.f) += 360.f</tt>.
 * (Note that the assignment operator can be any compound assignment, as well.)
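 * As a scalar model (for illustration only; this is not %Vc syntax), masked assignment behaves
 * like this:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t W = 4; // stand-in for float_v::Size

// Scalar model of phi(phi < 0.f) += 360.f: each entry is updated only where
// the corresponding mask entry is true; all other entries stay unchanged.
void masked_add_where_negative(std::array<float, W>& phi, float addend) {
    for (std::size_t j = 0; j < W; ++j) {
        const bool mask = phi[j] < 0.f;
        if (mask)
            phi[j] += addend;
    }
}
```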
 *
 * \note
 * %Vc also supports an alternative syntax using the \ref Vc::where function, which enables generic
 * implementations that work for fundamental arithmetic types as well as the %Vc vector types.
 *
 * \subsection ex_polarcoord_complete_output Console Output
 *
 * At the end of the program the results are printed to \c stdout:
 * \snippet polarcoord/main.cpp output
 *
 * The loop iterates over the Vc::Memory objects using Vc::Memory::entriesCount, thus iterating over scalar
 * values, not SIMD vectors.
 * The code should therefore be clear, as it doesn't use any SIMD functionality.
 *
 *********************************************************************************
 *
 * \page ex-finitediff Finite Differences
 *
 * Finite difference method example.
 *
 * We calculate central differences for a given function and compare them to the analytical solution.
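 * The central-difference scheme itself is simple; as a scalar sketch (the example evaluates it
 * with %Vc vectors), the derivative is approximated by (f(x + h) - f(x - h)) / (2h):

```cpp
#include <cassert>
#include <cmath>

// Central difference: symmetric secant slope around x, with O(h^2) error.
float central_difference(float (*f)(float), float x, float h) {
    return (f(x + h) - f(x - h)) / (2.f * h);
}
```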
 *
 * \snippet finitediff/main.cpp includes
 * \snippet finitediff/main.cpp constants
 * \snippet finitediff/main.cpp functions
 * \snippet finitediff/main.cpp cleanup
 *
 *********************************************************************************
 *
 * \page ex-matrix Matrix Class
 *
 *********************************************************************************
 *
 * \page ex-buddhabrot Buddhabrot
 *
 *********************************************************************************
 *
 * \page ex-mandelbrot Mandelbrot
 *
 * This example draws a colorized Mandelbrot image on screen using Qt4 widgets.
 *
 * The example uses a simple class to abstract complex numbers. In principle, one could just use
 * std::complex, if it performed well enough. But especially the norm function is very slow for
 * scalar float/double. Also, std::complex multiplication is implemented to correctly handle NaN
 * and infinity; this is not required for Mandelbrot, as these special cases will not occur.
 * Additionally, the provided complex abstraction stores the squares of the real and imaginary parts
 * to help the compiler optimize the code as well as possible.
 * \snippet mandelbrot/mandel.cpp MyComplex
 *
 * Mandelbrot uses the function z = z² + c for iteration.
 * \snippet mandelbrot/mandel.cpp P function
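 * As a scalar sketch (hypothetical, not the example's MyComplex), the iteration with cached
 * squares looks like this; the squares feed both the next step and the escape test, which checks
 * whether the squared magnitude exceeds 4:

```cpp
#include <cassert>

// Minimal complex type that caches the squares of re and im: they are needed
// both for the next squaring and for the escape check, so compute them once.
struct Complex {
    float re, im;
    float re2, im2;
};

// One Mandelbrot step: z = z squared plus c.
Complex step(Complex z, float c_re, float c_im) {
    Complex r;
    r.re = z.re2 - z.im2 + c_re;
    r.im = 2.f * z.re * z.im + c_im;
    r.re2 = r.re * r.re;
    r.im2 = r.im * r.im;
    return r;
}

bool escaped(const Complex& z) { return z.re2 + z.im2 > 4.f; }
```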
 *
 */