| 
| template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>  | 
| void  | parallel_for (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) | 
|   | [ASYNCHRONOUS] Launch the passed functor in parallel.
  | 
|   | 
| template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>  | 
| void  | parallel_for (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) | 
|   | [ASYNCHRONOUS] Launch the passed functor in parallel.
  | 
|   | 
| template<class F , int N, bool simple>  | 
| YAKL_INLINE void  | parallel_inner (Bounds< N, simple > const &bounds, F const &f, InnerHandler handler) | 
|   | Launch the passed functor in parallel in the finest-level parallelism on the device.
  | 
|   | 
| template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>  | 
| void  | parallel_outer (Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) | 
|   | [ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.
  | 
|   | 
| template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>  | 
| void  | parallel_outer (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) | 
|   | [ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device.
  | 
|   | 
| template<class F >  | 
| YAKL_INLINE void  | single_inner (F const &f, InnerHandler handler) | 
|   | Launch the passed functor on only one of the inner threads (still parallel over outer threads).
  | 
|   | 
Contains the Bounds class and the parallel_for() routines that use Fortran-style indexing and ordering. 
 
template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false> 
inline void yakl::fortran::parallel_for (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) 
 
[ASYNCHRONOUS] Launch the passed functor in parallel. 
If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE.
- Parameters
 - 
  
    | str | String label for this parallel_for. This form of parallel_for is highly recommended so that debugging and profiling features can be used when turned on via CPP macros.  | 
    | bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. Each loop bound can be given as a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet. Strides must be positive: a negative stride implies that loop ordering matters, in which case the loop body is not a trivially parallel operation. The left-most bound is always the outer-most loop, and the right-most bound is the inner-most loop. A bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the stride lands exactly on upper.  | 
    | f | The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE.  | 
    | config | [Optional] Use a yakl::LaunchConfig object to describe the size of inner-level parallelism (VecLen) or whether this kernel should be executed in serial (B4B) when the CPP macro -DYAKL_B4B is defined.  | 
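
The inclusive, 1-based bound convention described for the bounds parameter can be illustrated with a small stand-alone sketch. It does not use YAKL itself: `expand_bound` is a hypothetical helper introduced here only to list the indices that a single loop bound would visit, assuming the rules stated above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of YAKL): lists the indices visited by an
// inclusive Fortran-style {lower,upper,stride} loop bound.
std::vector<int> expand_bound(int lower, int upper, int stride = 1) {
  assert(stride > 0);  // the documentation requires positive strides
  std::vector<int> indices;
  for (int i = lower; i <= upper; i += stride) indices.push_back(i);
  return indices;
}

// A single integer N is shorthand for the bound {1,N}.
std::vector<int> expand_bound(int n) { return expand_bound(1, n, 1); }
```

With these definitions, `expand_bound(4)` lists 1, 2, 3, 4, and `expand_bound(1, 10, 4)` lists 1, 5, 9, so upper_with_stride is 9 rather than 10 in the second case.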
  
   
 
 
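template<class F , int N, bool simple> 
YAKL_INLINE void yakl::fortran::parallel_inner (Bounds< N, simple > const &bounds, F const &f, InnerHandler handler) 
 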
Launch the passed functor in parallel in the finest-level parallelism on the device. 
For hierarchical (two-level) parallelism only. Must be called from within a yakl::fortran::parallel_outer call. For CUDA and HIP, for instance, this is "block"-level parallelism spread over threads within a multiprocessor. IMPORTANT: If passing a lambda, it must capture with [&] and must not be decorated with YAKL_LAMBDA. If passing a functor, the operator() must not be decorated with YAKL_INLINE.
- Parameters
 - 
  
    | bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. Each loop bound can be given as a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet. Strides must be positive: a negative stride implies that loop ordering matters, in which case the loop body is not a trivially parallel operation. The left-most bound is always the outer-most loop, and the right-most bound is the inner-most loop. A bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the stride lands exactly on upper.  | 
    | f | The functor to be launched in parallel. Lambdas must capture with [&] and must not be decorated with YAKL_LAMBDA. Functors must not have operator() decorated with YAKL_INLINE.  | 
    | handler | yakl::InnerHandler object created by yakl::fortran::parallel_outer.  | 
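
The ordering rule for multi-dimensional bounds (left-most bound is the outer-most loop, right-most bound is the inner-most loop) corresponds to the serial loop nest sketched below. This is plain C++ rather than YAKL code; `visit_order` is a hypothetical stand-in that records the order a serial equivalent of a two-bound launch would visit, assuming inclusive {lower,upper} pairs with unit stride.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical serial stand-in for a two-dimensional launch; not a YAKL routine.
// Both bounds are inclusive {lower,upper} pairs with unit stride.
std::vector<std::array<int, 2>> visit_order(std::array<int, 2> left,
                                            std::array<int, 2> right) {
  std::vector<std::array<int, 2>> order;
  for (int j = left[0]; j <= left[1]; j++) {       // left-most bound: outer-most loop
    for (int i = right[0]; i <= right[1]; i++) {   // right-most bound: inner-most loop
      order.push_back({j, i});
    }
  }
  return order;
}
```

In a real parallel launch the iterations carry no ordering guarantee; the sketch only shows which functor argument maps to which bound and which index varies fastest in the equivalent serial nest.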
  
   
 
 
template<class F , int N, bool simple, int VecLen = YAKL_DEFAULT_VECTOR_LEN, bool B4B = false> 
inline void yakl::fortran::parallel_outer (char const *str, Bounds< N, simple > const &bounds, F const &f, LaunchConfig< VecLen, B4B > config=LaunchConfig<>()) 
 
[ASYNCHRONOUS] Launch the passed functor in parallel in the coarsest-level parallelism on the device. 
For hierarchical (two-level) parallelism only. For CUDA and HIP, for instance, this is "grid"-level parallelism spread over multiprocessors. yakl::fortran::parallel_inner, on the other hand, is "block"-level parallelism spread over threads within a multiprocessor. If passing a lambda, it must be decorated with YAKL_LAMBDA. If passing a functor, the operator() must be decorated with YAKL_INLINE.
IMPORTANT: While the yakl::LaunchConfig parameter is optional, you will very likely want to use it! Otherwise, you're at the mercy of the YAKL_DEFAULT_VECTOR_LEN for a given hardware backend. The vector length template parameter of yakl::LaunchConfig must be at least as large as the inner size set with yakl::LaunchConfig::set_inner_size().
Example usage: 
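int constexpr MAX_INNER_SIZE = 256;
int inner_size = 96;
// The label, bounds, and index name on the next line are placeholders.
parallel_outer( "label" , Bounds<1>(nz) , YAKL_LAMBDA (int k, InnerHandler handler) {
  ...
} , LaunchConfig<MAX_INNER_SIZE>().set_inner_size(inner_size) );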
IMPORTANT: All code inside yakl::fortran::parallel_outer is run in parallel over both outer and inner parallelism. So code not inside yakl::fortran::parallel_inner will still execute for all inner threads but without any knowledge of inner parallelism indices. If you want to execute only for one inner thread, please use the yakl::fortran::single_inner routine.
- Parameters
 - 
  
    | str | String label for this parallel_outer. This form of parallel_outer is highly recommended so that debugging and profiling features can be used when turned on via CPP macros.  | 
    | bounds | The yakl::fortran::Bounds or yakl::fortran::SimpleBounds object describing the tightly nested loops. Each loop bound can be given as a single integer, a {lower,upper} pair, or a {lower,upper,stride} triplet. Strides must be positive: a negative stride implies that loop ordering matters, in which case the loop body is not a trivially parallel operation. The left-most bound is always the outer-most loop, and the right-most bound is the inner-most loop. A bound given as a single integer N defaults to a lower bound of 1 and an upper bound of N. Bounds given as {lower,upper} or {lower,upper,stride} are inclusive: the indices are {lower,lower+stride,...,upper_with_stride}, where upper_with_stride <= upper, with equality only when the stride lands exactly on upper.  | 
    | f | The functor to be launched in parallel. Lambdas must be decorated with YAKL_LAMBDA. Functors must have operator() decorated with YAKL_INLINE. IMPORTANT: The lambda or operator() must accept an additional yakl::InnerHandler object after the loop indices.  | 
    | config | [Optional, but HIGHLY ENCOURAGED] Use the VecLen template parameter to define the maximum size of the inner looping. When creating the yakl::LaunchConfig object, use the yakl::LaunchConfig::set_inner_size(int) routine to set the actual size of the inner looping, and ensure that this size is <= VecLen. The optional B4B template parameter tells YAKL to run this kernel in serial when -DYAKL_B4B is defined as a CPP macro.  |
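
The execution model described above (every inner thread runs the whole outer body, the inner loop is shared among the inner threads, and a single_inner region runs on only one of them) can be emulated serially to see how often each kind of code executes. The sketch below is plain C++ under that assumption, not YAKL's implementation: `Tally` and `emulate_outer` are hypothetical names, and the strided split of inner iterations across threads is one plausible scheme chosen for illustration.

```cpp
#include <cassert>

// Hypothetical tally of how often each kind of code runs; not a YAKL type.
struct Tally {
  int outside_inner = 0;  // code placed directly in the outer body
  int single = 0;         // code guarded the way single_inner is described
  int inner = 0;          // iterations of a parallel_inner-style loop
};

// Serial emulation of two-level parallelism. max_inner_size plays the role of
// the VecLen template parameter; inner_size plays the role of the value passed
// to set_inner_size().
template <int max_inner_size>
Tally emulate_outer(int n_outer, int inner_size, int n_inner_iters) {
  assert(inner_size <= max_inner_size);  // inner size must fit within VecLen
  Tally tally;
  for (int k = 1; k <= n_outer; k++) {        // outer-level iterations
    for (int t = 0; t < inner_size; t++) {    // inner threads of one outer iteration
      tally.outside_inner++;                  // every inner thread runs this
      if (t == 0) tally.single++;             // only one inner thread runs this
      // Inner loop iterations 1..n_inner_iters are shared among inner threads.
      for (int i = 1 + t; i <= n_inner_iters; i += inner_size) tally.inner++;
    }
  }
  return tally;
}
```

For example, `emulate_outer<256>(10, 96, 200)` counts 960 executions of code outside the inner loop (10 outer iterations times 96 inner threads), 10 executions of the single region, and 2000 inner iterations in total, which is why work that must happen once per outer iteration belongs in single_inner.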