Releases · NVIDIA/cccl

23 Apr 21:30

wmaxey

v2.4.0

1c009d2

v2.4.0 Latest


What’s New

We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) into these Thrust algorithms.

CUB

cub::DeviceSelect::UniqueByKey now supports a user-provided equality operator and large problem sizes.
The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach. For example, cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses.
Many CUB algorithms now support CUDA graph capture mode.

libcudacxx

Added a new cuda::ptx namespace with wrappers for inline-PTX instructions.
Added cuda::std::complex specializations for the CUDA types bfloat and half.

What's Changed

Implement remaining ranges iterator concepts and modernize array by @miscco in #627
Fix C++11 support of recently added tests by @ahendriksen in #651
Update CUDA newest to CTK 12.3 by @jrhemstad in #629
Add cuda::ptx::* namespace by @ahendriksen in #574
The test seems to pass just fine by @miscco in #654
Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
Reduce benchmarking time by @gevtushenko in #657
Add CCCL_VERSION and script for updating version by @jrhemstad in #652
Fixes compiler error for extended fp type data gen by @elstehle in #666
fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in #664
Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
Rework our system header approach to be more error proof by @miscco in #661
Project automation - fix sync action and draft setting step by @jarmak-nv in #625
Fix fallback when checking git repo by @wmaxey in #1085
Currently the verbose option does not work because of a typo in the argument handling by @miscco in #1088
Adds virtual shared memory helper and tests by @elstehle in #619
Add cuda::ptx::st_async by @ahendriksen in #1078
Add cuda::ptx::red_async by @ahendriksen in #1080
Remove libcudacxx symlinks by @wmaxey in #1075
Move PTX tests that missed the symlink PR by @wmaxey in #1098
Fix truncation of constant value by @gevtushenko in #1097
Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @ahendriksen in #674
Initial CUB/NVRTC support by @gevtushenko in #1081
Fix cuda::ptx::red.async for int32_t types by @ahendriksen in #1102
Fix local test runs with lit by @miscco in #1108
Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
Do not replace the sccache binary for windows by @miscco in #1115
Test cuda graph capture by @gevtushenko in #1112
Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
Introduce CUB transform reduce by @gevtushenko in #1091
Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
Fix GCC6 / FP8 warning by @gevtushenko in #1130
Fix thrust transform reduce bench by @gevtushenko in #1133
Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in #1132
Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in #1114
Allow filtering P0 benchmarks by @gevtushenko in #1135
Update barrier_arrive_tx.md docs by @gonzalobg in #1147
Update std iterators by @miscco in #672
Fix argument name in windows CI by @miscco in #1145
Fix XFAIL condition for subsumption tests by @miscco in #1144
Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
Use rst in block-scope docs by @gevtushenko in #1150
Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
Fix icc / cub by @gevtushenko in #1152
Abort testing on unsupported dialect flags by @wmaxey in #1158
Run with latest nvbench by @robertmaynard in #583
Set finer-grain workflow permissions by @jrhemstad in #1163
Port device docs to rst by @gevtushenko in #1160
CI log improvements by @jrhemstad in #621
Setup documentation and corresponding github action by @wmaxey in #1118
Update Docs links in README.md by @wmaxey in #1169
Fix GCC 13 by @gevtushenko in #1175
Add missing exit from run-as-coder by @jrhemstad in #1176
Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
Implement math functions for thrust::complex by @miscco in #1178
Use anchors in matrix.yaml by @jrhemstad in #1193
Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
Fix availability of is_constant_evaluated on old MSVC by @miscco in #1180
Enable std::variant for libcu++ by @miscco in #1076
Implement enable_borrowed_range by @miscco in #1196
Reduce thrust benchmarks noise by @gevtushenko in #1203
Prepare more algorithms by @miscco in #1161
Add icc compiler to CI matrix by @jrhemstad in #1159
Unify handling of dialects by @miscco in #1200
Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
Move definitions of execution space macros into cccl by @miscco in #1199
Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in #1197
Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
Fix the example on README.md by @so298 in #1220
Add missing overloads for thrust::pow by @miscco in #1222
Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
Fix examples in reduce docs by @gevtushenko in #1230
Do not benchmark small problem sizes by @gevtushenko in #1243
Implement enable_view by @miscco in #1208
Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in #1245
Fix merge conflict from incoming PR by @miscco in #1250
Disable fast-math for ICC by @miscco in #1252
Fix a typo in thrust-config.cmake by @valgur in #1259
Implement ranges::{c}begin and ranges::{c}end by @miscco in #1256
Switch to entropy-based stopping criterion by @gevtushenko in #1280
Fix a sync bug in stream_ref::wait by @PointKernel in #1238
Silence some static asserts in ptx helpers by @miscco in #1257
Restore docs images...

Contributors

alliepiper, robertmaynard, and 23 other contributors


12 Mar 20:22

wmaxey

v2.3.2

64d3a5f

v2.3.2

What's Changed

[BACKPORT]: Silence some static asserts in ptx helpers (#1257) by @miscco in #1284
[BACKPORT]: Ensure that pair is trivially copyable (#1249) by @miscco in #1292
[BACKPORT]: Properly test internal headers (#1258) by @miscco in #1299
[Backport]: Fix errors when find_package(CCCL) is called twice. (#1157) by @miscco in #1298
[BACKPORT] Fix MSVC issues (#1261) by @miscco in #1297
[backport] thrust/mr: fix the case of reusing a block for a smaller alloc. (#1232) by @griwes in #1317
[BACKPORT]: Fix ptx usage to account for PTX ISA availability (#1359) by @miscco in #1421
Create patch 2.3.2 by @wmaxey in #1530

Full Changelog: v2.3.1...v2.3.2

Contributors

griwes, miscco, and wmaxey


23 Apr 21:29

wmaxey

v2.3.1

299eb62

v2.3.1

What's Changed

[BACKPORT]: Fix bug in stream_ref::wait by @miscco in #1283
Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1286
Create patch 2.3.1 by @wmaxey in #1287

Full Changelog: v2.3.0...v2.3.1

Contributors

miscco and wmaxey


28 Feb 18:36

wmaxey

v2.3.0

c4eda1a

CCCL 2.3.0

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because those headers will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.

TL;DR: You should never see warnings emitted from a CCCL header ever again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
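To make the idea behind namespace wrapping concrete, here is a plain C++ sketch of the general technique. It does not use the actual Thrust/CUB macros: DEMO_NAMESPACE_BEGIN, DEMO_NAMESPACE_END, the demo namespace, and the lib_a/lib_b wrappers are all hypothetical names. Each "library" gets its own copy of the same function under a distinct qualified name, so the two copies have different symbol names and cannot be confused with each other at link or load time.

```cpp
// Hypothetical stand-ins for a header library's namespace macros.
// The wrapper name becomes part of every symbol declared between them.
#define DEMO_NAMESPACE_BEGIN(wrapper) namespace wrapper { namespace demo {
#define DEMO_NAMESPACE_END } }

// First library's private copy of the header-only code.
DEMO_NAMESPACE_BEGIN(lib_a)
inline int version() { return 1; }
DEMO_NAMESPACE_END

// Second library's copy, built from a different version of the same code.
// Without the wrapper, both definitions would be demo::version() and would
// violate the one-definition rule.
DEMO_NAMESPACE_BEGIN(lib_b)
inline int version() { return 2; }
DEMO_NAMESPACE_END
```

Here lib_a::demo::version() returns 1 and lib_b::demo::version() returns 2, even though both come from the same nominal demo::version() source.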