This feature is mainly useful for those developing high-performance kernels.
In TBB you have access to blocked_ranges, which gives you the begin and the end index of the current task partition. This is very practical when you have fully vectorized kernels that operate on a range rather than an index.
In taskflow, the closest thing we can do right now is:
const int step = 16; // some hardcoded step size
taskflow.for_each_index(0, 1000, step, [&] (int i) {
  // clamp the end (std::min from <algorithm>) so the last chunk stops at 1000
  some_vectorized_kernel(i, std::min(i + step, 1000));
});
This works, but it hardcodes the step size (usually the vector width), which also fixes the granularity of the parallelism. If the step is too big, irregular workloads cannot be balanced properly. If the step is too small (like 16 here), we increase function call overhead, reduce ILP inside the kernel, and pollute the cache between successive calls.
The solution is almost there. With something like a guided or static partitioner, we would just need to modify the API to expose the chunk size (or a begin and end index, like TBB).
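To make the request concrete, here is a minimal sketch of the callback shape I have in mind. It is not Taskflow's API: `split_range` and `for_each_range` are hypothetical names, the partitioning is a simple static split, and the chunks run sequentially so the example has no dependencies. The point is only that the partitioner picks the chunk size and the kernel receives a `[begin, end)` pair.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Taskflow: statically split [first, last)
// into roughly `parts` contiguous chunks whose size is a multiple of `align`
// (for example the vector width). Only the final chunk may be shorter.
std::vector<std::pair<int, int>> split_range(int first, int last, int parts, int align) {
  assert(parts > 0 && align > 0);
  std::vector<std::pair<int, int>> chunks;
  if (first >= last) return chunks;
  const int n = last - first;
  int size = (n + parts - 1) / parts;           // ceil(n / parts)
  size = ((size + align - 1) / align) * align;  // round up to a multiple of align
  for (int b = first; b < last;) {
    const int e = b + std::min(size, last - b);  // clamp the final chunk
    chunks.emplace_back(b, e);
    b = e;
  }
  return chunks;
}

// Hypothetical range-based loop: fn receives [begin, end) instead of a single
// index. Chunks run sequentially here to keep the sketch dependency-free; a
// real scheduler would submit each chunk as its own task.
template <typename F>
void for_each_range(int first, int last, int parts, int align, F&& fn) {
  for (const auto& c : split_range(first, last, parts, align)) {
    fn(c.first, c.second);
  }
}
```

With `for_each_range(0, 1000, 4, 16, kernel)`, the kernel would be called on `[0, 256)`, `[256, 512)`, `[512, 768)` and `[768, 1000)`, so the vector width only constrains alignment and no longer dictates the task granularity.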