
GPU TPC: improved GPU TPC track-model decoding #13122

Merged: 4 commits into AliceO2Group:dev, May 27, 2024

Conversation

@cima22 (Contributor) commented May 8, 2024

Hides DMA transfer latencies in the GPU TPC track-model decoding by pipelining data transfers and kernel calls across multiple streams.
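
To illustrate the pattern (not the actual O2 implementation): a minimal sketch in plain CUDA runtime API, with hypothetical names (decodeKernel, decodeBatches), assuming pinned host buffers and non-empty batches. Each batch's host-to-device copy and kernel launch are issued on one of several streams, so the copy for one batch overlaps with the kernel of another and the DMA latency is hidden.

#include <cuda_runtime.h>

// Placeholder for the actual track-model decoding work.
__global__ void decodeKernel(const char* in, char* out, size_t n)
{
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i];
  }
}

void decodeBatches(const char* const hostIn[], char* const devIn[], char* const devOut[],
                   const size_t sizes[], int nBatches, int nStreams, cudaStream_t streams[])
{
  for (int i = 0; i < nBatches; i++) {
    cudaStream_t s = streams[i % nStreams]; // round-robin over the available streams
    // Async copy requires pinned host memory (cudaHostAlloc / cudaHostRegister).
    cudaMemcpyAsync(devIn[i], hostIn[i], sizes[i], cudaMemcpyHostToDevice, s);
    // The kernel waits for its own copy (same stream), but overlaps with the
    // copies and kernels of the other batches running on the other streams.
    int block = 256;
    int grid = (int)((sizes[i] + block - 1) / block); // assumes sizes[i] > 0
    decodeKernel<<<grid, block, 0, s>>>(devIn[i], devOut[i], sizes[i]);
  }
  cudaDeviceSynchronize(); // wait for all batches before consuming the output
}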

@cima22 requested a review from davidrohr as a code owner on May 8, 2024 at 15:55
github-actions bot commented May 8, 2024

REQUEST FOR PRODUCTION RELEASES:
To request that your PR be included in production software, please add the corresponding "async-*" labels to your PR. Add the labels directly (if you have the permissions) or add a comment of the form below (note that labels are separated by a ","; see the example after the list):

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available:
async-2023-pbpb-apass3
async-2023-pbpb-apass4
async-2022-pp-apass6-2023-PbPb-apass2
async-2022-pp-apass4
async-2022-pp-apass4-accepted
async-2022-pp-apass6-2023-PbPb-apass2-accepted
async-2023-pbpb-apass3-accepted
async-2023-pbpb-apass4-accepted
async-2023-pp-apass4
async-2023-pp-apass4-accepted
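
For example, using labels from the list above, the following comment would add the first two labels and remove the third:

+async-label async-2023-pbpb-apass3, async-2023-pbpb-apass4, !async-2023-pp-apass4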

@davidrohr (Collaborator) left a comment

Looks good, just two comments for possible future improvements.

}
mIOPtrs.clustersNative = mClusterNativeAccess.get();
mClusterNativeAccess->clustersLinear = mInputsHost->mPclusterNativeOutput;
mClusterNativeAccess->setOffsetPtrs();

runKernel<GPUTPCDecompressionKernels, GPUTPCDecompressionKernels::step1unattached>(GetGridAuto(inputStream));
unsigned int batchSize = doGPU ? 6 : NSLICES;
for (unsigned int iSlice = 0; iSlice < NSLICES; iSlice = iSlice + batchSize) {
@davidrohr (Collaborator):

You could do an outer OMP loop on the CPU, and set a nested OMP nThreads for the inner loop that is used for the kernel, as done here:

GPUCA_OPENMP(parallel for if(!doGPU && GetProcessingSettings().ompKernels != 1) num_threads(mRec->SetAndGetNestedLoopOmpFactor(!doGPU, GetProcessingSettings().nTPCClustererLanes)))
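
A minimal standalone sketch of that nested-OpenMP pattern, with hypothetical names (processSlices, decodeSlice) and a fixed thread split instead of the framework's SetAndGetNestedLoopOmpFactor: the outer loop distributes slice batches over a few threads, while each batch keeps several threads for the inner per-slice work.

#include <omp.h>
#include <algorithm>

void decodeSlice(int iSlice) { /* per-slice decoding work would go here */ }

void processSlices(int nSlices, int batchSize, int outerThreads)
{
  omp_set_max_active_levels(2); // allow nested parallel regions
  int innerThreads = std::max(1, omp_get_max_threads() / outerThreads);
  // Outer loop over batches of slices, as in the batched loop quoted above.
#pragma omp parallel for num_threads(outerThreads)
  for (int iSlice = 0; iSlice < nSlices; iSlice += batchSize) {
    int end = std::min(iSlice + batchSize, nSlices);
    // Inner loop: each batch still runs its slices with several threads.
#pragma omp parallel for num_threads(innerThreads)
    for (int s = iSlice; s < end; s++) {
      decodeSlice(s);
    }
  }
}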

bool toGPU = true;
runKernel<GPUMemClean16>({GetGridAutoStep(inputStream, RecoStep::TPCDecompression), krnlRunRangeNone, &mEvents->init}, DecompressorShadow.mNativeClustersIndex, NSLICES * GPUCA_ROW_COUNT * sizeof(DecompressorShadow.mNativeClustersIndex[0]));
std::exclusive_scan(cmprClsHost.nTrackClusters, cmprClsHost.nTrackClusters + cmprClsHost.nTracks, Decompressor.mAttachedClustersOffsets, 0u); // computing clusters offsets for first kernel
int nStreams = doGPU ? mRec->NStreams() - 1 : 1;
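
As a side note on the offset computation in the snippet above: std::exclusive_scan turns per-track cluster counts into start offsets. A small self-contained example with made-up counts:

#include <numeric>
#include <cstdio>

int main()
{
  unsigned int nTrackClusters[4] = {3, 5, 2, 4}; // clusters per track (illustrative values)
  unsigned int offsets[4];
  std::exclusive_scan(nTrackClusters, nTrackClusters + 4, offsets, 0u);
  // offsets is now {0, 3, 8, 10}: track i's clusters start at offsets[i]
  for (unsigned int o : offsets) {
    std::printf("%u ", o);
  }
}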
@davidrohr (Collaborator):

The nStreams above could perhaps depend on the data size. For very small inputs, one might want to use fewer than NStreams() - 1 streams, or even just one.
Perhaps you also want std::max(1, NStreams() - 1), just in case NStreams() could be 1 on a GPU model (which is currently not the case).
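
A hypothetical sketch of that heuristic (chooseStreams and the 1 MiB threshold are illustrative assumptions, not O2 code): scale the stream count with the input size, and clamp it so it never drops below one even if NStreams() were 1.

#include <algorithm>
#include <cstddef>

int chooseStreams(std::size_t dataSize, int nStreamsAvailable)
{
  // Assumption: below some bytes-per-stream threshold, extra streams add more
  // launch/synchronization overhead than they hide in transfer latency.
  constexpr std::size_t minBytesPerStream = 1 << 20; // 1 MiB, illustrative value
  int wanted = static_cast<int>(dataSize / minBytesPerStream);
  return std::max(1, std::min(wanted, nStreamsAvailable - 1));
}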

@davidrohr merged commit 6572437 into AliceO2Group:dev on May 27, 2024
14 of 15 checks passed