Contains Bounds class, and parallel_for() routines using Fortran-style indexing and ordering.

Classes
class	Bounds
	Describes a set of Fortran-style tightly-nested loops.

class	Bounds< N, false >
	Describes a set of Fortran-style tightly-nested loops where at least one loop has a lower bound other than `1` or a stride other than `1`.

class	Bounds< N, true >
	Describes a set of Fortran-style tightly-nested loops where all loops have lower bounds of `1` and strides of `1`.

class	LBnd
	Describes a single Fortran-style loop bound (the lower bound defaults to `1`).

Typedefs
template<int N>
using	SimpleBounds = Bounds< N, true >
	Make it easy for the user to specify that all lower bounds are one and all strides are one.

Functions
template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void	parallel_for (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
	[ASYNCHRONOUS] Launch the passed functor in parallel.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void	parallel_for (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
	[ASYNCHRONOUS] Launch the passed functor in parallel.

template<class F , int N, bool simple>
YAKL_INLINE void	parallel_inner (Bounds< N, simple > const &bounds, F const &f, InnerHandler handler)
	Launch the passed functor in parallel in the finest-level parallelism on the device.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void	parallel_outer (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
	[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
void	parallel_outer (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>())
	[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

template<class F >
YAKL_INLINE void	single_inner (F const &f, InnerHandler handler)
	Launch the passed functor to only use one of the inner threads (still parallel over outer threads).

Detailed Description

Contains Bounds class, and parallel_for() routines using Fortran-style indexing and ordering.

Typedef Documentation

◆ SimpleBounds

template<int N>

using yakl::fortran::SimpleBounds = Bounds<N,true>

Make it easy for the user to specify that all lower bounds are one and all strides are one.

Function Documentation

◆ parallel_for() [1/2]

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>

void yakl::fortran::parallel_for	(	Bounds< N, simple > const &	bounds,
		F const &	f,
		LaunchConfig< VecLen, B4B >	config = `LaunchConfig<>()`
	)

inline

[ASYNCHRONOUS] Launch the passed functor in parallel.

Same as the other form of yakl::fortran::parallel_for but without the string label.

◆ parallel_for() [2/2]

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>

void yakl::fortran::parallel_for	(	char const *	str,
		Bounds< N, simple > const &	bounds,
		F const &	f,
		LaunchConfig< VecLen, B4B >	config = `LaunchConfig<>()`
	)

inline

[ASYNCHRONOUS] Launch the passed functor in parallel.

If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE.

Parameters

str	String label for this `parallel_for`. This form of `parallel_for` is highly recommended so that debugging and profiling features can be used when turned on via CPP macros.
bounds	The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested looping. You can also pass a single integer, a `{lower,upper}` pair, or a `{lower,upper,stride}` triplet, ensuring strides are positive. Why a positive stride? To protect you. If you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. The order of the bounds is always such that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound expressed as a single integer, `N`, defaults to a lower bound of `1` and an upper bound of `N`. Loop bounds specified as `{lower,upper}` or `{lower,upper,stride}` are inclusive, meaning the indices will be `{lower,lower+stride,...,upper_with_stride}`, where `upper_with_stride <= upper` depending on whether the index `upper` is exactly reached with the striding.
f	The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE.
config	[Optional] Use a yakl::LaunchConfig object to describe the size of inner-level parallelism (`VecLen`) or whether this kernel should be executed in serial (`B4B`) when the CPP macro `-DYAKL_B4B` is defined.
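
The inclusive-bound rules in the `bounds` description can be checked with a small stand-alone model. The sketch below does not use YAKL: `inclusive_indices` and `trip_count` are hypothetical helper names introduced only for illustration, and the loop runs serially rather than in parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of YAKL): the indices visited by one
// Fortran-style loop bound {lower, upper, stride}. Both ends are inclusive.
std::vector<int> inclusive_indices(int lower, int upper, int stride = 1) {
  assert(stride > 0 && "strides must be positive");
  std::vector<int> idx;
  for (int i = lower; i <= upper; i += stride) idx.push_back(i);
  return idx;
}

// A bound given as a single integer n means lower = 1, upper = n, stride = 1.
std::vector<int> inclusive_indices(int n) { return inclusive_indices(1, n, 1); }

// Number of iterations of one loop bound; zero if upper < lower.
int trip_count(int lower, int upper, int stride = 1) {
  assert(stride > 0 && "strides must be positive");
  return upper < lower ? 0 : (upper - lower) / stride + 1;
}
```

Under these assumptions, `inclusive_indices(1, 10, 3)` visits 1, 4, 7, 10, while `inclusive_indices(0, 10, 4)` visits 0, 4, 8 and never reaches the upper bound of 10, which is the `upper_with_stride <= upper` case described above.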

◆ parallel_inner()

template<class F , int N, bool simple>

YAKL_INLINE void yakl::fortran::parallel_inner	(	Bounds< N, simple > const &	bounds,
		F const &	f,
		InnerHandler	handler
	)

Launch the passed functor in parallel in the finest-level parallelism on the device.

For hierarchical (two-level) parallelism only. Must be called from within a yakl::fortran::parallel_outer call. For CUDA and HIP, for instance, this is "block"-level parallelism spread over threads within a multiprocessor. IMPORTANT: If passing a lambda, it must be decorated with [&] and not YAKL_LAMBDA. If passing a functor, the operator() must not be decorated with YAKL_INLINE.

Parameters

bounds	The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested looping. You can also pass a single integer, a `{lower,upper}` pair, or a `{lower,upper,stride}` triplet, ensuring strides are positive. Why a positive stride? To protect you. If you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. The order of the bounds is always such that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound expressed as a single integer, `N`, defaults to a lower bound of `1` and an upper bound of `N`. Loop bounds specified as `{lower,upper}` or `{lower,upper,stride}` are inclusive, meaning the indices will be `{lower,lower+stride,...,upper_with_stride}`, where `upper_with_stride <= upper` depending on whether the index `upper` is exactly reached with the striding.
f	The functor to be launched in parallel. Lambdas must be decorated with `[&]` and not YAKL_LAMBDA. Functors must not have operator() decorated with YAKL_INLINE.
handler	yakl::InnerHandler object created by yakl::fortran::parallel_outer.
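
The ordering rule in the `bounds` description (left-most is outer-most, right-most is inner-most) can be illustrated by collapsing two nested loops into one linear iteration number and recovering the loop indices from it. This is only a sketch of one common way to do such a mapping, not YAKL's implementation; `LoopBound` and `unflatten` are hypothetical names, and each bound is assumed to have `upper >= lower`.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for one loop bound; this is not YAKL's LBnd class.
struct LoopBound {
  int lower;
  int upper;   // inclusive
  int stride;  // must be positive
  // Number of indices visited, assuming upper >= lower.
  int count() const { return (upper - lower) / stride + 1; }
};

// Map a 0-based linear iteration number onto the indices of two tightly
// nested loops. The left-most bound is the outer loop, so the right-most
// index varies fastest.
std::array<int, 2> unflatten(int linear, LoopBound outer, LoopBound inner) {
  assert(outer.stride > 0 && inner.stride > 0);
  assert(linear >= 0 && linear < outer.count() * inner.count());
  int o = linear / inner.count();  // position within the outer loop
  int i = linear % inner.count();  // position within the inner loop
  return {outer.lower + o * outer.stride, inner.lower + i * inner.stride};
}
```

With an outer bound of `{1,3,1}` and an inner bound of `{0,4,2}`, consecutive linear iterations 0, 1, 2, 3 map to the index pairs (1,0), (1,2), (1,4), (2,0): the inner index cycles through all of its values before the outer index advances.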

◆ parallel_outer() [1/2]

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>

void yakl::fortran::parallel_outer	(	Bounds< N, simple > const &	bounds,
		F const &	f,
		LaunchConfig< VecLen, B4B >	config = `LaunchConfig<>()`
	)

inline

[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

Same as the other form of yakl::fortran::parallel_outer but without the string label.

◆ parallel_outer() [2/2]

template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>

void yakl::fortran::parallel_outer	(	char const *	str,
		Bounds< N, simple > const &	bounds,
		F const &	f,
		LaunchConfig< VecLen, B4B >	config = `LaunchConfig<>()`
	)

inline

[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.

For hierarchical (two-level) parallelism only. For CUDA and HIP, for instance, this is "grid"-level parallelism spread over multiprocessors. yakl::fortran::parallel_inner, on the other hand, is "block"-level parallelism spread over threads within a multiprocessor. If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE. IMPORTANT: While the yakl::LaunchConfig parameter is optional, you will very likely want to use it! Otherwise, you're at the mercy of the YAKL_DEFAULT_VECTOR_LEN for a given hardware backend. The yakl::LaunchConfig template vector length parameter must be at least as large as the inner_size declared by yakl::LaunchConfig::set_inner_size().

Example usage:

int constexpr MAX_INNER_SIZE = 256;  // compile-time upper limit on the inner size
int inner_size = 96;                 // actual inner size; must not exceed MAX_INNER_SIZE
yakl::fortran::parallel_outer( Bounds<2>(nz,{0,ny}) , YAKL_LAMBDA (int k, int j, InnerHandler handler) {
  ...
} , LaunchConfig<MAX_INNER_SIZE>().set_inner_size(inner_size) );

IMPORTANT: All code inside yakl::fortran::parallel_outer is run in parallel over both outer and inner parallelism. So code not inside yakl::fortran::parallel_inner will still execute for all inner threads but without any knowledge of inner parallelism indices. If you want to execute only for one inner thread, please use the yakl::fortran::single_inner routine.

Parameters

str	String label for this `parallel_outer`. This form of `parallel_outer` is highly recommended so that debugging and profiling features can be used when turned on via CPP macros.
bounds	The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested looping. You can also pass a single integer, a `{lower,upper}` pair, or a `{lower,upper,stride}` triplet, ensuring strides are positive. Why a positive stride? To protect you. If you need a negative stride, loop ordering matters, and the loop body is not a trivially parallel operation. The order of the bounds is always such that the left-most is the outer-most loop and the right-most is the inner-most loop. A loop bound expressed as a single integer, `N`, defaults to a lower bound of `1` and an upper bound of `N`. Loop bounds specified as `{lower,upper}` or `{lower,upper,stride}` are inclusive, meaning the indices will be `{lower,lower+stride,...,upper_with_stride}`, where `upper_with_stride <= upper` depending on whether the index `upper` is exactly reached with the striding.
f	The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE. IMPORTANT: The lambda or operator() must accept an additional yakl::InnerHandler object after the loop indices.
config	[Optional, but HIGHLY ENCOURAGED] Use the `VecLen` template parameter to define the maximum size of the inner looping. When creating the yakl::LaunchConfig object, use the `yakl::LaunchConfig::set_inner_size(int)` routine to set the actual size of the inner looping. Ensure the value passed to `set_inner_size` does not exceed `VecLen`. There is also an optional `B4B` template parameter that tells YAKL to run this kernel in serial when `-DYAKL_B4B` is defined as a CPP macro.

◆ single_inner()

template<class F >

YAKL_INLINE void yakl::fortran::single_inner	(	F const &	f,
		InnerHandler	handler
	)

Launch the passed functor to only use one of the inner threads (still parallel over outer threads).

For hierarchical (two-level) parallelism only. Must be called from within a yakl::fortran::parallel_outer call. Most of the time, you will use yakl::fence_inner() before and after yakl::fortran::single_inner.

Parameters

f	The functor to be launched by a single inner thread. Lambdas must be decorated with `[&]` and not YAKL_LAMBDA. Functors must not have operator() decorated with YAKL_INLINE.
handler	yakl::InnerHandler object created by yakl::fortran::parallel_outer.
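
The difference between code placed directly in a yakl::fortran::parallel_outer body, code inside yakl::fortran::parallel_inner, and code inside yakl::fortran::single_inner can be made concrete with a serial model that counts executions. This sketch does not use YAKL and says nothing about how YAKL schedules threads: `InnerModel`, the `model_*` functions, `Counts`, and `run_model` are hypothetical names, inner threads are replayed one after another, and the choice of thread 0 for the single-thread section is an assumption.

```cpp
#include <cassert>

// Hypothetical serial model of two-level parallelism; this is not YAKL code.
struct InnerModel {
  int inner_size;  // number of inner threads per outer iteration
  int thread;      // id of the inner thread currently being replayed
};

// Every outer iteration runs the body once per inner thread.
template <class F>
void model_parallel_outer(int n_outer, int inner_size, F const &body) {
  assert(inner_size > 0);
  for (int k = 1; k <= n_outer; ++k)
    for (int t = 0; t < inner_size; ++t)
      body(k, InnerModel{inner_size, t});
}

// The inner indices 1..n are shared out among the inner threads.
template <class F>
void model_parallel_inner(int n, F const &f, InnerModel h) {
  for (int i = 1 + h.thread; i <= n; i += h.inner_size) f(i);
}

// Only one inner thread (thread 0 in this model) runs the functor.
template <class F>
void model_single_inner(F const &f, InnerModel h) {
  if (h.thread == 0) f();
}

struct Counts { int unguarded = 0; int inner = 0; int single = 0; };

Counts run_model(int n_outer, int inner_size, int n_inner) {
  Counts c;
  model_parallel_outer(n_outer, inner_size, [&](int /*k*/, InnerModel h) {
    ++c.unguarded;  // runs for every inner thread of every outer iteration
    model_parallel_inner(n_inner, [&](int /*i*/) { ++c.inner; }, h);
    model_single_inner([&]() { ++c.single; }, h);
  });
  return c;
}
```

In this model, `run_model(4, 8, 20)` executes the unguarded statement 32 times (4 outer iterations times 8 inner threads), the inner body 80 times (4 times 20), and the single-thread body 4 times. On a real device the inner threads run concurrently, which is why the single-thread section is usually bracketed by yakl::fence_inner() as noted above.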
