template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void | parallel_for (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
| [ASYNCHRONOUS] Launch the passed functor in parallel.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void | parallel_for (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
| [ASYNCHRONOUS] Launch the passed functor in parallel.

template<class F , int N, bool simple>
YAKL_INLINE void | parallel_inner (Bounds< N, simple > const &bounds, F const &f, InnerHandler handler)
| Launch the passed functor in parallel in the finest-level parallelism on the device.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void | parallel_outer (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
| [ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void | parallel_outer (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
| [ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

template<class F >
YAKL_INLINE void | single_inner (F const &f, InnerHandler handler)
| Launch the passed functor to only use one of the inner threads (still parallel over outer threads).

Contains the Bounds class and parallel_for() routines using Fortran-style indexing and ordering.
template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
inline void yakl::fortran::parallel_for (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config = LaunchConfig<>())
[ASYNCHRONOUS] Launch the passed functor in parallel.
If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE.
- Parameters
-
str | String label for this parallel_for. This form of parallel_for is highly recommended so that debugging and profiling features can be used when turned on via CPP macros. |
bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. You can also pass a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet; strides must be positive. Why a positive stride? To protect you: if you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. Bounds are ordered so that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Loop bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the striding lands exactly on upper. |
f | The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE. |
config | [Optional] Use a yakl::LaunchConfig object to describe the size of inner-level parallelism (VecLen) or whether this kernel should be executed in serial (B4B) when the CPP macro -DYAKL_B4B is defined. |
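The bounds rules above (inclusive limits, positive strides, a single integer N meaning 1..N, left-most bound as the outer-most loop) can be sketched in plain C++. This is only an illustration of the documented index sets, not YAKL code: `bound_indices` and `nest_order` are hypothetical helper names introduced here, and the serial order shown is just the reference loop nest, since a parallel launch does not promise any execution order.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not part of YAKL): lists the indices visited by an
// inclusive {lower,upper,stride} loop bound.
std::vector<int> bound_indices(int lower, int upper, int stride = 1) {
  assert(stride > 0 && "strides must be positive");
  std::vector<int> idx;
  for (int i = lower; i <= upper; i += stride) idx.push_back(i);
  return idx;
}

// A single integer N is shorthand for the inclusive range {1,N}.
std::vector<int> bound_indices(int n) { return bound_indices(1, n, 1); }

// Serial reference order for two single-integer bounds (n_outer, n_inner):
// the left-most bound is the outer-most (slowest-varying) loop.
std::vector<std::pair<int, int>> nest_order(int n_outer, int n_inner) {
  std::vector<std::pair<int, int>> order;
  for (int j : bound_indices(n_outer))
    for (int i : bound_indices(n_inner))
      order.push_back({j, i});
  return order;
}
```

For example, `bound_indices(2, 9, 3)` yields 2, 5, 8: the upper bound 9 is not reached, so `upper_with_stride` is 8.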
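template<class F , int N, bool simple>
YAKL_INLINE void yakl::fortran::parallel_inner (Bounds< N, simple > const &bounds, F const &f, InnerHandler handler)
Launch the passed functor in parallel in the finest-level parallelism on the device.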
For hierarchical (two-level) parallelism only. Must be called from within a yakl::fortran::parallel_outer call. For CUDA and HIP, for instance, this is "block"-level parallelism spread over threads within a multiprocessor. IMPORTANT: If passing a lambda, it must be declared with [&] and not YAKL_LAMBDA. If passing a functor, the operator() must not be decorated with YAKL_INLINE.
- Parameters
-
bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. You can also pass a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet; strides must be positive. Why a positive stride? To protect you: if you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. Bounds are ordered so that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Loop bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the striding lands exactly on upper. |
f | The functor to be launched in parallel. Lambdas must be declared with [&] and not YAKL_LAMBDA. Functors must not have operator() decorated with YAKL_INLINE. |
handler | yakl::InnerHandler object created by yakl::fortran::parallel_outer. |
template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
inline void yakl::fortran::parallel_outer (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config = LaunchConfig<>())
[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.
For hierarchical (two-level) parallelism only. For CUDA and HIP, for instance, this is "grid"-level parallelism spread over multiprocessors. yakl::fortran::parallel_inner, on the other hand, is "block"-level parallelism spread over threads within a multiprocessor. If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE. IMPORTANT: While the yakl::LaunchConfig parameter is optional, you will very likely want to use it! Otherwise, you're at the mercy of the YAKL_DEFAULT_VECTOR_LEN for a given hardware backend. The yakl::LaunchConfig parameter's template vector length parameter must be at least as large as the inner_size declared by yakl::LaunchConfig::set_inner_size().
Example usage:
int constexpr MAX_INNER_SIZE = 256;
int inner_size = 96;
...
} , LaunchConfig<MAX_INNER_SIZE>().set_inner_size(inner_size) );
IMPORTANT: All code inside yakl::fortran::parallel_outer is run in parallel over both outer and inner parallelism. So code not inside yakl::fortran::parallel_inner will still execute for all inner threads but without any knowledge of inner parallelism indices. If you want to execute only for one inner thread, please use the yakl::fortran::single_inner routine.
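A serial mental model may help with the point above. The sketch below is not YAKL code and does not reflect how any backend is implemented: `SketchHandler`, `sketch_outer`, `sketch_inner`, `sketch_single_inner`, and `run_sketch` are hypothetical names, and the sketch assumes the inner loop has no more iterations than there are inner threads, with inner thread t taking iteration t+1. It only counts how often each kind of statement would run.

```cpp
#include <cassert>

// Hypothetical stand-in for yakl::InnerHandler: records which emulated
// inner thread is running and how many inner threads exist.
struct SketchHandler {
  int thread;      // emulated inner thread id, 0-based
  int inner_size;  // number of emulated inner threads
};

// Emulated outer launch: the body runs once per (outer index, inner thread).
// Outer indices run 1..n_outer, following the Fortran-style default.
template <class F>
void sketch_outer(int n_outer, int inner_size, F const &body) {
  for (int k = 1; k <= n_outer; ++k)
    for (int t = 0; t < inner_size; ++t)
      body(k, SketchHandler{t, inner_size});
}

// Emulated inner loop over 1..n. Assumption of this sketch: n <= inner_size,
// and inner thread t executes iteration t+1.
template <class F>
void sketch_inner(int n, F const &f, SketchHandler h) {
  assert(n <= h.inner_size);
  if (h.thread < n) f(h.thread + 1);
}

// Emulated single_inner: exactly one inner thread runs f.
template <class F>
void sketch_single_inner(F const &f, SketchHandler h) {
  if (h.thread == 0) f();
}

struct Counts { int bare = 0; int inner = 0; int single = 0; };

// Counts executions of a bare statement, an inner-loop body, and a
// single_inner body inside one emulated outer launch.
Counts run_sketch(int n_outer, int inner_size, int n_inner) {
  Counts c;
  sketch_outer(n_outer, inner_size, [&](int k, SketchHandler h) {
    (void)k;
    ++c.bare;  // not inside an inner construct: runs for every inner thread
    sketch_inner(n_inner, [&](int i) { (void)i; ++c.inner; }, h);
    sketch_single_inner([&]() { ++c.single; }, h);
  });
  return c;
}
```

With 4 outer iterations, 8 inner threads, and an inner loop of 5 iterations, the bare statement runs 32 times, the inner body 20 times, and the single_inner body 4 times.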
- Parameters
-
str | String label for this parallel_outer. This form of parallel_outer is highly recommended so that debugging and profiling features can be used when turned on via CPP macros. |
bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. You can also pass a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet; strides must be positive. Why a positive stride? To protect you: if you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. Bounds are ordered so that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Loop bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the striding lands exactly on upper. |
f | The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE. IMPORTANT: The lambda or operator() must accept an additional yakl::InnerHandler object after the loop indices. |
config | [Optional, but HIGHLY ENCOURAGED] Use the VecLen template parameter to define the maximum size of the inner looping. When creating the yakl::LaunchConfig object, use the yakl::LaunchConfig::set_inner_size(int) routine to set the actual size of the inner looping, and ensure that this size is <= VecLen. The optional B4B template parameter tells YAKL to run this kernel in serial when -DYAKL_B4B is defined as a CPP macro. |
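The relationship between the compile-time maximum (VecLen) and the run-time inner size can be sketched with a small stand-in type. `SketchLaunchConfig`, `accepts_inner_size`, and `example_inner_size` are hypothetical and are not yakl::LaunchConfig; the default of 128 is an arbitrary placeholder for YAKL_DEFAULT_VECTOR_LEN, and throwing on an out-of-range size is a choice made for this sketch only, not a statement about what YAKL does.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in (not yakl::LaunchConfig): VecLen is the compile-time
// maximum, inner_size is the run-time value actually requested.
template <int VecLen = 128>
struct SketchLaunchConfig {
  int inner_size = VecLen;

  // Returns the updated config by value so the call can be chained on a
  // temporary object.
  SketchLaunchConfig set_inner_size(int n) {
    if (n < 1 || n > VecLen)
      throw std::invalid_argument("inner_size must be in [1, VecLen]");
    inner_size = n;
    return *this;
  }
};

// True if the requested inner size fits within the template maximum.
template <int VecLen>
bool accepts_inner_size(int n) {
  try {
    SketchLaunchConfig<VecLen>().set_inner_size(n);
    return true;
  } catch (std::invalid_argument const &) {
    return false;
  }
}

// Mirrors the values used in the example usage above.
inline int example_inner_size() {
  int constexpr MAX_INNER_SIZE = 256;
  int inner_size = 96;
  auto cfg = SketchLaunchConfig<MAX_INNER_SIZE>().set_inner_size(inner_size);
  assert(cfg.inner_size <= MAX_INNER_SIZE);
  return cfg.inner_size;
}
```

Keeping the maximum as a template parameter lets a backend size resources at compile time, while the run-time value can still vary per launch as long as it stays within that maximum.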