Releases: NVIDIA/cccl
v2.4.0
What’s New
We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
Thrust
As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) in Thrust algorithms.
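To make the large-problem-size point concrete, the following host-only C++ sketch shows the arithmetic behind it: a signed 32-bit offset cannot index more than 2^31 - 1 elements, while a 64-bit offset can. The helper name needs_64bit_offsets is hypothetical and is not part of Thrust or CUB; it only illustrates the limit.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a Thrust/CUB API): returns true when num_items
// cannot be indexed with a signed 32-bit offset.
constexpr bool needs_64bit_offsets(std::uint64_t num_items)
{
  return num_items > static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// 2^31 - 1 elements still fit in a signed 32-bit offset; 2^31 and beyond do not.
static_assert(!needs_64bit_offsets((std::uint64_t{1} << 31) - 1), "fits in 32 bits");
static_assert(needs_64bit_offsets(std::uint64_t{1} << 31), "needs 64 bits");
static_assert(needs_64bit_offsets(std::uint64_t{1} << 32), "needs 64 bits");
```

The checks run at compile time, so the snippet builds with any C++11 compiler and needs no GPU.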
CUB
- cub::DeviceSelect::UniqueByKey now supports a custom equality operator and large problem sizes.
- New cub::DeviceFor family of algorithms goes beyond conventional cub::DeviceFor::ForEach. cub::DeviceFor::ForEachCopy can provide you with additional performance benefits from vectorized memory accesses.
- Many CUB algorithms now support CUDA graph capture mode.
libcudacxx
- Added new cuda::ptx namespace with wrappers for inline-PTX instructions
- cuda::std::complex specializations for CUDA types bfloat and half.
What's Changed
- Implement remaining ranges iterator concepts and modernize array by @miscco in #627
- Fix C++11 support of recently added tests by @ahendriksen in #651
- Update CUDA newest to CTK 12.3 by @jrhemstad in #629
- Add cuda::ptx::* namespace by @ahendriksen in #574
- The test seems to pass just fine by @miscco in #654
- Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
- Reduce benchmarking time by @gevtushenko in #657
- Add CCCL_VERSION and script for updating version by @jrhemstad in #652
- Fixes compiler error for extended fp type data gen by @elstehle in #666
- fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in #664
- Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
- Rework our system header approach to be more error proof by @miscco in #661
- Project automation - fix sync action and draft setting step by @jarmak-nv in #625
- Fix fallback when checking git repo by @wmaxey in #1085
- Currently the verbose option does not work because of a typo in the argument handling by @miscco in #1088
- Adds virtual shared memory helper and tests by @elstehle in #619
- Add cuda::ptx::st_async by @ahendriksen in #1078
- Add cuda::ptx::red_async by @ahendriksen in #1080
- Remove libcudacxx symlinks by @wmaxey in #1075
- Move PTX tests that missed the symlink PR by @wmaxey in #1098
- Fix truncation of constant value by @gevtushenko in #1097
- Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @ahendriksen in #674
- Initial CUB/NVRTC support by @gevtushenko in #1081
- Fix cuda::ptx::red.async for int32_t types by @ahendriksen in #1102
- Fix local test runs with lit by @miscco in #1108
- Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
- Do not replace the sccache binary for windows by @miscco in #1115
- Test cuda graph capture by @gevtushenko in #1112
- Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
- Introduce CUB transform reduce by @gevtushenko in #1091
- Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
- Fix GCC6 / FP8 warning by @gevtushenko in #1130
- Fix thrust transform reduce bench by @gevtushenko in #1133
- Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in #1132
- Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in #1114
- Allow filtering P0 benchmarks by @gevtushenko in #1135
- Update barrier_arrive_tx.md docs by @gonzalobg in #1147
- Update std iterators by @miscco in #672
- Fix argument name in windows CI by @miscco in #1145
- Fix XFAIL condition for subsumption tests by @miscco in #1144
- Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
- Use rst in block-scope docs by @gevtushenko in #1150
- Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
- Fix icc / cub by @gevtushenko in #1152
- Abort testing on unsupported dialect flags by @wmaxey in #1158
- Run with latest nvbench by @robertmaynard in #583
- Set finer-grain workflow permissions by @jrhemstad in #1163
- Port device docs to rst by @gevtushenko in #1160
- CI log improvements by @jrhemstad in #621
- Setup documentation and corresponding github action by @wmaxey in #1118
- Update Docs links in README.md by @wmaxey in #1169
- Fix GCC 13 by @gevtushenko in #1175
- Add missing exit from run-as-coder by @jrhemstad in #1176
- Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
- Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
- Implement math functions for thrust::complex by @miscco in #1178
- Use anchors in matrix.yaml by @jrhemstad in #1193
- Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
- Fix availability of is_constant_evaluated on old MSVC by @miscco in #1180
- Enable std::variant for libcu++ by @miscco in #1076
- Implement enable_borrowed_range by @miscco in #1196
- Reduce thrust benchmarks noise by @gevtushenko in #1203
- Prepare more algorithms by @miscco in #1161
- Add icc compiler to CI matrix by @jrhemstad in #1159
- Unify handling of dialects by @miscco in #1200
- Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
- Move definitions of execution space macros into cccl by @miscco in #1199
- Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in #1197
- Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
- Fix the example on README.md by @so298 in #1220
- Add missing overloads for thrust::pow by @miscco in #1222
- Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
- Fix examples in reduce docs by @gevtushenko in #1230
- Do not benchmark small problem sizes by @gevtushenko in #1243
- Implement enable_view by @miscco in #1208
- Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in #1245
- Fix merge conflict from incoming PR by @miscco in #1250
- Disable fast-math for ICC by @miscco in #1252
- Fix a typo in thrust-config.cmake by @valgur in #1259
- Implement ranges::{c}begin and ranges::{c}end by @miscco in #1256
- Switch to entropy-based stopping criterion by @gevtushenko in #1280
- Fix a sync bug in stream_ref::wait by @PointKernel in #1238
- Silence some static asserts in ptx helpers by @miscco in #1257
- Restore docs images...
v2.3.2
What's Changed
- [BACKPORT]: Silence some static asserts in ptx helpers (#1257) by @miscco in #1284
- [BACKPORT]: Ensure that pair is trivially copyable (#1249) by @miscco in #1292
- [BACKPORT]: Properly test internal headers (#1258) by @miscco in #1299
- [Backport]: Fix errors when find_package(CCCL) is called twice. (#1157) by @miscco in #1298
- [BACKPORT] Fix MSVC issues (#1261) by @miscco in #1297
- [backport] thrust/mr: fix the case of reusing a block for a smaller alloc. (#1232) by @griwes in #1317
- [BACKPORT]: Fix ptx usage to account for PTX ISA availability (#1359) by @miscco in #1421
- Create patch 2.3.2 by @wmaxey in #1530
Full Changelog: v2.3.1...v2.3.2
v2.3.1
CCCL 2.3.0
What’s New
In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
System Headers and Warnings
Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because it will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.
To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.
TL;DR: You should never see warnings emitted from a CCCL header ever again!
Linkage Issues
Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
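The symbol-renaming idea behind THRUST_CUB_WRAPPED_NAMESPACE can be sketched in plain C++: wrapping a library's namespace in a consumer-chosen outer namespace gives each embedded copy distinct qualified names, so two copies no longer collide. Everything in this sketch (DEFINE_MINILIB, minilib, lib_a, lib_b, block_size) is hypothetical and only stands in for the real macro and libraries; it is not how Thrust or CUB are implemented.

```cpp
// Hypothetical stand-in for a header-only library whose namespace can be
// wrapped in a consumer-chosen outer namespace (not the real Thrust/CUB code).
#define DEFINE_MINILIB(WRAPPER, TUNING)             \
  namespace WRAPPER { namespace minilib {           \
    constexpr int block_size() { return TUNING; }   \
  } }

// Two consumers embed their own copy, each built with different settings.
DEFINE_MINILIB(lib_a, 128)
DEFINE_MINILIB(lib_b, 256)

// Each copy has a distinct qualified name, so both coexist without clashing.
static_assert(lib_a::minilib::block_size() == 128, "lib_a sees its own copy");
static_assert(lib_b::minilib::block_size() == 256, "lib_b sees its own copy");
```

Without the outer wrapper, both copies would define the same minilib::block_size symbol with different contents, which is the kind of mismatch that surfaces as hard-to-diagnose failures at run time.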
Thrust
thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.
CUB
Performance improvements of up to 60% for cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
libcudacxx
- The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to various inline PTX functions that expose async memcpy and barrier intrinsics. See #379.
- Added experimental bulk TMA memcpy under <cuda/barrier>
What's Changed
- Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
- Branch/2.2.x by @gevtushenko in #305
- Tune unique by key on A100 by @gevtushenko in #306
- Merge branch/2.2.x to main by @jrhemstad in #308
- Add example cmake project by @jrhemstad in #177
- Adds catch2 tests for reduce-by-key by @elstehle in #311
- Tune scan by key on A100 by @gevtushenko in #325
- Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
- Fix MSVC / CUB tests build by @gevtushenko in #336
- gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
- Add a nvrtc configuration for libcu++ by @miscco in #202
- GH Infra: project automation and issue template fixes by @jarmak-nv in #297
- Tune reduce by key on A100 by @gevtushenko in #346
- Merge commits from 2.2 branch by @miscco in #350
- Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
- Assorted fixes for MSVC 2017 by @miscco in #341
- [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
- Port cub::DeviceScan tests to catch2 by @elstehle in #347
- Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
- Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
- Work around strange gcc bug by @miscco in #363
- Implement iter_swap CPO by @miscco in #332
- Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
- Add clang16 devcontainer and CI job by @miscco in #362
- [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
- [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
- Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
- Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
- Use copy-pr-bot by @ajschmidt8 in #381
- Implement the permutable concept by @miscco in #367
- [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
- Implement identity changes for c++20 by @miscco in #383
- Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
- Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
- Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
- Add commit information to the test output by @miscco in #401
- Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
- Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
- Add support for tests that should fail at runtime by @ahendriksen in #418
- Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
- Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
- [BUG] Ensure that all headers build on their own by @miscco in #200
- Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
- Fix ci-overview example by @gevtushenko in #428
- Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
- Add cuda::device::barrier_arrive_tx by @ahendriksen in #358
- Fix CubDebug by @gevtushenko in #430
- Do not use static member functions to initialize static member variables. by @miscco in #438
- Implement the projected helper struct by @miscco in #385
- Add PTX wrapping functions for TMA features by @ahendriksen in #379
- Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
- Enable lit to determine the compute architectures by @miscco in #447
- Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
- Improve documentation of cuda::barrier by @ahendriksen in #440
- Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
- Remove having two install rules for -header-search.cmake by @robertmaynard in #298
- Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
- Remove C++03 compatibility from unit tests by @Blonck in #378
- [libcu++] Fix use of __ppc64__ by @miscco in #451
- Update the README by @jrhemstad in #291
- [libcu++] Try to avoid gcc miscompilation issues by @miscco in #452
- Consolidate matrix logic into single script/job by @jrhemstad in #361
- Implement the indirectly_comparable concept by @miscco in #445
- Fix compute matrix dropping trailing zeros by @jrhemstad in #466
- Avoid integer promotion warnings with MSVC by @miscco in #460
- Implement ranges comparison objects by @miscco in #464
- Fix CUB/MSVC/RDC tests by @gevtushenko in #469
- Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
- Script for Running CUB Benchmarks by @gevtushenko in #472
- [skip ci] Add list of CCCL users to README by @jrhemstad in #474
- constexpr all the things by @pb-dseifert in #476
- Add Gonzalo/Allard to trustees by @jrhemstad in #482
- Implement the sortable concept by @miscco in #471
- [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
- Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
- Add ci scripts for windows by...
CCCL 2.2.0
(Note that these release notes are not yet finalized. They do not reflect any PRs that were merged to Thrust/CUB/libcudacxx before migrating to the nvidia/cccl repo).
What's Changed
- Add axis for docker builds by @raydouglass in #1
- Docker: Add support for ICPC and NVC++, install newer CMake, and add curl by @brycelelbach in #4
- Update excludes by @raydouglass in #5
- Docker: OS and CUDA upgrades, support for additional configurations by @brycelelbach in #9
- Docker: Add Thrust/CUB documentation toolchain to Ubuntu docker images by @brycelelbach in #15
- Re-enable CentOS images. by @allisonvacanti in #16
- Add sccache to dockerfile by @msadang in #17
- Update base containers. by @allisonvacanti in #18
- Update sccache version by @ajschmidt8 in #19
- Build 11.5.1 containers by @ajschmidt8 in #20
- Add ops-bot.yaml by @jrhemstad in #80
- Monorepo workflow by @jrhemstad in #99
- Add devcontainers by @jrhemstad in #105
- Update the libcu++ submodule by @miscco in #109
- Update libcudacxx again by @miscco in #110
- Remove submodules from CI workflow by @jrhemstad in #115
- Fix CUB CI by @senior-zero in #114
- Fix async scan / counting iterator tests by @senior-zero in #118
- Make sccache work locally by @jrhemstad in #113
- Fix compilation of thrust and cub by @miscco in #120
- Fix segfault in cub::CachingDeviceAllocator by @senior-zero in #119
- Initial GH Infra Setup by @jarmak-nv in #23
- Visualize variant space coverage by @senior-zero in #125
- Fix broken issue templates by @jarmak-nv in #124
- Tune scan by key for SM90 by @senior-zero in #121
- Update PR template to more explicitly prompt for a linked issue closed by the PR by @jrhemstad in #134
- Change component section to more general "area" by @jrhemstad in #132
- Try and fix CI for old CTK by @miscco in #116
- Fix tuple_cat for std:: qualified types by @miscco in #144
- Add ccache to lit invocation by @miscco in #147
- Benchmark batched memcpy by @senior-zero in #136
- Properly query CMAKE_CUDA_COMPILER_LAUNCHER for ccache support by @miscco in #152
- Implement Three-Way Partition Tuning / Benchmark by @senior-zero in #155
- Port three-way partition to use Catch2 by @senior-zero in #156
- Add gcc-6 to the test matrix by @miscco in #160
- Tune reduce / unique by key for SM90 by @senior-zero in #163
- Remove unused folders by @miscco in #145
- Fix documentation of atomic_ref by @miscco in #164
- New iterator traits by @miscco in #158
- Improve implementation of destructible by @miscco in #157
- Build script improvements by @jrhemstad in #149
- Fix icpc / denormals by @senior-zero in #185
- Enable tests by @jrhemstad in #167
- Monorepo by @jrhemstad in #194
- Multi-benchmark tuning by @senior-zero in #208
- Fixes universal_vector test failure on CTK 11.1 & gcc-6 by @elstehle in #209
- Delete several directories for older CI infra. by @wmaxey in #218
- Memory-safe radix sort test by @senior-zero in #222
- [FEA] Implement iter_move CPO by @miscco in #197
- Build cub benchmarks in build_cub.sh by @jrhemstad in #216
- [skip-tests] Do not run tests when skip-tests is part of the latest commit message by @miscco in #224
- Factor out build job logic into a "run-as-coder" reusable workflow. by @jrhemstad in #205
- Fix instances of 'scan' copy-pasted into reduction documentation by @milesvant in #221
- Add clangd to devcontainer by @senior-zero in #225
- Add initial CODEOWNERS file by @jrhemstad in #226
- Attempt to fix codeowners by @jrhemstad in #231
- Make libcudacxx respect CMake options for CUDA archs. by @wmaxey in #235
- Optimize Three-Way Partition by @senior-zero in #228
- [BUG] Rework how we handle feature test macros by @miscco in #195
- Enable use of cudaMemcpyAsync for thrust::copy by @miscco in #211
- Enable additional arguments in build_common.sh by @wmaxey in #236
- [BUG] Properly uglify all qualifiers in product headers by @miscco in #201
- Port cub::Device{Select, Partition} tests to catch2 by @miscco in #229
- Fix CUB tests / MSVC 2022 by @senior-zero in #255
- Ensure that any CMake re-rooting doesn't break our find_file by @miscco in #257
- [BUG] Fix compilation issues with MSVC 2017 by @miscco in #196
- Implement iterator concepts by @miscco in #223
- Tune Histogram on H100 by @senior-zero in #266
- Add WarpExchangeAlgorithm customization for WarpExchange class by @pb-dseifert in #256
- [BUG]: Avoid deprecation warning for std::aligned_storage when building with c++23 by @miscco in #258
- Port cub::DeviceReduce tests to catch2 by @elstehle in #267
- Add support for nvcc-specific matrix. by @jrhemstad in #243
- Fix anchor link to cooperative groups in CUDA programming guide by @wence- in #274
- Fix BibTeX syntax in CITATION.md [skip-tests] by @wence- in #276
- Enforce C++17 for benches by @senior-zero in #275
- Project Automation: Move PR and Linked Issues to In Progress by @jarmak-nv in #170
- Update to 23.08 devcontainers and CUDA 12.2 by @jrhemstad in #270
- [skip-tests] CTK 12.2 tuning image by @senior-zero in #282
- Fix single-thread block reduction by @senior-zero in #287
- Tune Select and Partition on A100 by @senior-zero in #289
- Fix CUB tests / MSVC by @senior-zero in #292
- Allow building CUB tests without cuRand by @senior-zero in #250
- Fixup to CUB build - s/curand/cudart/ by @wmaxey in #301
- Fix OOB in cub::DeviceRunLengthEncode::NonTrivialRuns by @senior-zero in #294
- Tune RLE on A100 by @senior-zero in #295
- Tune scan on A100 by @senior-zero in #302
- Add new CCCL:: CMake targets by @allisonvacanti in #244
- Fix cudacc and nvcc mixup. by @wmaxey in #329
- [skip-tests] Use builtin for destructible concept on MSVC by @miscco in #333
- Fix merge conflict from two inflight PRs by @miscco in #338
New Contributors
- @raydouglass made their first contribution in #1
- @brycelelbach made their first contribution in #4
- @msadang made their first contribution in #17
- @wmaxey made their first contribution in #218
- @milesvant made their first contribution in #221
- @pb-dseifert made their first contribution in #256
- @wence- made their first contribution in #274
Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0