GPU TPC: improved GPU TPC track-model decoding #13122
Conversation
…reams, unattached clusters input transfer in separate stream
…rs per kernel call
looks good, just two comments for possible future improvements
}
mIOPtrs.clustersNative = mClusterNativeAccess.get();
mClusterNativeAccess->clustersLinear = mInputsHost->mPclusterNativeOutput;
mClusterNativeAccess->setOffsetPtrs();

runKernel<GPUTPCDecompressionKernels, GPUTPCDecompressionKernels::step1unattached>(GetGridAuto(inputStream));
unsigned int batchSize = doGPU ? 6 : NSLICES;
for (unsigned int iSlice = 0; iSlice < NSLICES; iSlice = iSlice + batchSize) {
You could do an outer OMP loop on the CPU, and set a nested OMP nThreads for the inner loop that is used for the kernel, as done here:
GPUCA_OPENMP(parallel for if(!doGPU && GetProcessingSettings().ompKernels != 1) num_threads(mRec->SetAndGetNestedLoopOmpFactor(!doGPU, GetProcessingSettings().nTPCClustererLanes)))
bool toGPU = true;
runKernel<GPUMemClean16>({GetGridAutoStep(inputStream, RecoStep::TPCDecompression), krnlRunRangeNone, &mEvents->init}, DecompressorShadow.mNativeClustersIndex, NSLICES * GPUCA_ROW_COUNT * sizeof(DecompressorShadow.mNativeClustersIndex[0]));
std::exclusive_scan(cmprClsHost.nTrackClusters, cmprClsHost.nTrackClusters + cmprClsHost.nTracks, Decompressor.mAttachedClustersOffsets, 0u); // computing clusters offsets for first kernel
int nStreams = doGPU ? mRec->NStreams() - 1 : 1;
nStreams here could perhaps depend on the data size. For very small cases, one might want to use fewer than NStreams() - 1 streams, or even just one.
Perhaps you also want std::max(1, NStreams() - 1), just in case nStreams could be 1 on a GPU model (which is currently not the case).
Hides DMA transfer latencies in GPU TPC track-model decoding by pipelining data transfers and kernel calls across multiple streams.