Releases: NVIDIA/cccl

v2.4.0

23 Apr 21:30
1c009d2

What’s New

We are still hard at work in CCCL paying down technical debt, improving infrastructure, and making other simplifications as part of the unification of Thrust, CUB, and libcu++. In addition to various fixes and documentation improvements, this release brings the following notable improvements to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it adds support for large problem sizes (64-bit offsets) to these Thrust algorithms.
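To illustrate why 64-bit offsets matter for large problem sizes, here is a minimal host-only sketch of the arithmetic limit. The helpers `fits_in_32bit_offset`, `truncate_to_32bit`, and `check_offset_examples` are hypothetical names chosen for this example; they are not part of Thrust or CUB and say nothing about how the consolidated kernels are implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a Thrust or CUB API): true when num_items can
// still be indexed with a signed 32-bit offset.
constexpr bool fits_in_32bit_offset(std::uint64_t num_items)
{
  return num_items
      <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: shows what remains of an item count when it is
// squeezed into an unsigned 32-bit integer (the value wraps modulo 2^32).
constexpr std::uint32_t truncate_to_32bit(std::uint64_t num_items)
{
  return static_cast<std::uint32_t>(num_items);
}

// Small self-check that can be called from any program.
inline void check_offset_examples()
{
  assert(fits_in_32bit_offset(1000));
  assert(!fits_in_32bit_offset(3000000000ULL));      // about 3 billion items
  assert(truncate_to_32bit((1ULL << 32) + 5) == 5u); // count silently wraps
}
```

An input of about 3 billion elements exceeds what a signed 32-bit offset can address, which is the class of problem that 64-bit offset support is meant to cover.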

CUB

  • cub::DeviceSelect::UniqueByKey now supports a user-provided equality operator and large problem sizes.
  • The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach. For example, cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses.
  • Many CUB algorithms now support CUDA graph capture mode.

libcudacxx

  • Added the new cuda::ptx namespace with wrappers for inline PTX instructions.
  • Added cuda::std::complex specializations for the CUDA types bfloat and half.

What's Changed


v2.3.2

12 Mar 20:22
64d3a5f

What's Changed

Full Changelog: v2.3.1...v2.3.2

v2.3.1

23 Apr 21:29
299eb62

What's Changed

  • [BACKPORT]: Fix bug in stream_ref::wait by @miscco in #1283
  • Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1286
  • Create patch 2.3.1 by @wmaxey in #1287

Full Changelog: v2.3.0...v2.3.1

CCCL 2.3.0

28 Feb 18:36
c4eda1a

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because those headers will conflict with the CCCL headers in the CUDA Toolkit (CTK). Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.

TL;DR: You should never see warnings emitted from a CCCL header ever again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default, without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
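The idea behind the wrapped-namespace workaround can be sketched in plain C++. The macro `DEMO_DEFINE_ALGO` and the names `lib_a`, `lib_b`, `algo::id`, and `check_wrapped_namespaces` below are hypothetical stand-ins, not the actual Thrust/CUB implementation. The sketch only shows how placing the same code in different outer namespaces yields distinct symbols.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a header-only library body: the same code is
// emitted into whichever outer namespace NS names, so each "library" gets
// its own, differently named symbol.
#define DEMO_DEFINE_ALGO(NS, TAG)              \
  namespace NS { namespace algo {              \
    inline std::string id() { return TAG; }    \
  } }

// Two shared libraries would each pick a different wrapping namespace.
DEMO_DEFINE_ALGO(lib_a, "algo built for lib_a")
DEMO_DEFINE_ALGO(lib_b, "algo built for lib_b")

// Small self-check that can be called from any program.
inline void check_wrapped_namespaces()
{
  // lib_a::algo::id and lib_b::algo::id are separate functions, so one
  // cannot be silently substituted for the other.
  assert(lib_a::algo::id() == "algo built for lib_a");
  assert(lib_b::algo::id() == "algo built for lib_b");
}
```

Because each copy lives under its own outer namespace, neither can be mistaken for the other at link or load time. Every library author has to remember to opt in, which is the discoverability problem described above.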

Thrust

thrust::tuple, thrust::pair, and thrust::complex have been replaced with their cuda::std alternatives. This can be a breaking change, but it should be source compatible.

CUB

The performance of cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey has improved by up to 60% on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.

libcudacxx

  • The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to inline PTX functions covering various async memcpy and barrier intrinsics.
  • #379 - Added experimental bulk TMA memcpy under <cuda/barrier>

What's Changed

  • Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
  • Branch/2.2.x by @gevtushenko in #305
  • Tune unique by key on A100 by @gevtushenko in #306
  • Merge branch/2.2.x to main by @jrhemstad in #308
  • Add example cmake project by @jrhemstad in #177
  • Adds catch2 tests for reduce-by-key by @elstehle in #311
  • Tune scan by key on A100 by @gevtushenko in #325
  • Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
  • Fix MSVC / CUB tests build by @gevtushenko in #336
  • gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
  • Add a nvrtc configuration for libcu++ by @miscco in #202
  • GH Infra: project automation and issue template fixes by @jarmak-nv in #297
  • Tune reduce by key on A100 by @gevtushenko in #346
  • Merge commits from 2.2 branch by @miscco in #350
  • Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
  • Assorted fixes for MSVC 2017 by @miscco in #341
  • [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
  • Port cub::DeviceScan tests to catch2 by @elstehle in #347
  • Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
  • Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
  • Work around strange gcc bug by @miscco in #363
  • Implement iter_swap CPO by @miscco in #332
  • Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
  • Add clang16 devcontainer and CI job by @miscco in #362
  • [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
  • [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
  • Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
  • Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
  • Use copy-pr-bot by @ajschmidt8 in #381
  • Implement the permutable concept by @miscco in #367
  • [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
  • Implement identity changes for c++20 by @miscco in #383
  • Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
  • Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
  • Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
  • Add commit information to the test output by @miscco in #401
  • Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
  • Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
  • Add support for tests that should fail at runtime by @ahendriksen in #418
  • Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
  • Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
  • [BUG] Ensure that all headers build on their own by @miscco in #200
  • Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
  • Fix ci-overview example by @gevtushenko in #428
  • Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
  • Add cuda::device::barrier_arrive tx by @ahendriksen in #358
  • Fix CubDebug by @gevtushenko in #430
  • Do not use static member functions to initialize static member variables. by @miscco in #438
  • Implement the projected helper struct by @miscco in #385
  • Add PTX wrapping functions for TMA features by @ahendriksen in #379
  • Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
  • Enable lit to determine the compute architectures by @miscco in #447
  • Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
  • Improve documentation of cuda::barrier by @ahendriksen in #440
  • Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
  • Remove having two install rules for -header-search.cmake by @robertmaynard in #298
  • Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
  • Remove C++03 compatability from unit tests by @Blonck in #378
  • [libcu++] Fix use of __ppc64__ by @miscco in #451
  • Update the README by @jrhemstad in #291
  • [libcu++] Try to avoid gcc misscompilation issues by @miscco in #452
  • Consolidate matrix logic into single script/job by @jrhemstad in #361
  • Implement the indirectly_comparable concept by @miscco in #445
  • Fix compute matrix dropping trailing zeros by @jrhemstad in #466
  • Avoid integer promotion warnings with MSVC by @miscco in #460
  • Implement ranges comparison objects by @miscco in #464
  • Fix CUB/MSVC/RDC tests by @gevtushenko in #469
  • Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
  • Script for Running CUB Benchmarks by @gevtushenko in #472
  • [skip ci] Add list of CCCL users to README by @jrhemstad in #474
  • constexpr all the things by @pb-dseifert in #476
  • Add Gonzalo/Allard to trustees by @jrhemstad in #482
  • Implement the sortable concept by @miscco in #471
  • [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
  • Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
  • Add ci scripts for windows by...

CCCL 2.2.0

07 Sep 19:09
36f379f

(Note that these release notes are not yet finalized. They do not reflect any PRs that were merged to Thrust/CUB/libcudacxx before the migration to the NVIDIA/cccl repo.)

What's Changed

New Contributors

Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0