Releases: eth-cscs/COSMA

COSMA-v2.6.6

10 May 11:32
a816153

Fix linking against cray-libsci.

COSMA-v2.6.5

21 Mar 10:16
6453f91
  • Fix a bug in the Tiled-MM API.
  • Fix the cmake configuration related to NCCL/RCCL.

COSMA-v2.6.4

08 Mar 12:34
e7ebdf7

Update submodules, minor fixes in cmake.

COSMA-v2.6.3

22 Feb 08:54

Improvements in cmake config. Update to new tiled-mm API.

COSMA-v2.6.2

25 Aug 18:30
fda69fd

This release fixes a bug in find_package(cosma) (cmake).

COSMA-v2.6.1

21 Jul 17:16

This release fixes the issues of COSMA-v2.6.0 coming from resizing the memory pool, as reported here.

2.6.0-fixed

21 Jul 11:29
Pre-release
Fixed a bug with memory pool resizing.

COSMA-v2.6.0

14 Jul 23:18
922e300

This release enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, so that it can efficiently utilize modern multi-GPU systems. This is achieved in two ways:

  • Using NCCL/RCCL libraries: by specifying the -DCOSMA_WITH_NCCL=ON cmake option.
  • Using GPU-aware MPI: by specifying the -DCOSMA_WITH_GPU_AWARE_MPI=ON cmake option, as proposed here.

See README and INSTALL for more info on how to build.

In addition, the following performance improvements have been made:

  • Improved Caching:
    • all NCCL buffers, MPI communicators and NCCL communicators are cached and reused when appropriate.
    • all device memory is cached and reused.
  • Reduced Data Transfers: Tiled-MM, the GPU backend of COSMA, now lets the user leave the resulting matrix C on the GPU. In that case, there is no need to transfer matrix C from device to host, which not only reduces communication but also speeds up the whole CPU->GPU pipeline, as no additional synchronizations are needed. Furthermore, the reduce_scatter operation does not have to wait for C to be transferred back to the host but is invoked immediately with GPU pointers, thus utilizing fast inter-GPU links. This way, there are no unnecessary data transfers between CPU and GPU.
  • All collectives updated: both all-gather and reduce-scatter collectives are improved.
  • Reduced Data Reshuffling: avoids double reshuffling of data, i.e. the data from NCCL/RCCL GPU buffers is copied directly into the right layout, without additional reshuffling.
  • Works for variable blocks: the NCCL/RCCL reduce_scatter operation assumes that all blocks are of the same size and is hence not completely equivalent to MPI_Reduce_scatterv, which we previously used. We pad all the blocks to overcome this issue.
  • Portability: Supports both NVIDIA and AMD GPUs.
  • Tiled-MM: Updated submodule
  • COSTA: Updated submodule
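
The padding idea from the "Works for variable blocks" item can be illustrated with a small, self-contained sketch. It is plain Python rather than COSMA code: the names `pad_blocks`, `reduce_scatter_equal` and `reduce_scatter_variable` are hypothetical, and the equal-block reduce-scatter is simulated in-process instead of calling NCCL/RCCL. The sketch assumes that every block is zero-padded up to the largest block size and that the padding is trimmed again after the reduction.

```python
def pad_blocks(buffer, block_sizes):
    """Copy variable-size blocks into equal-size, zero-padded slots."""
    slot = max(block_sizes)  # every slot is as large as the largest block
    padded = [0] * (slot * len(block_sizes))
    offset = 0
    for i, size in enumerate(block_sizes):
        padded[i * slot:i * slot + size] = buffer[offset:offset + size]
        offset += size
    return padded, slot


def reduce_scatter_equal(send_buffers, slot):
    """Toy stand-in for an equal-block reduce-scatter (sum reduction).

    Rank r receives the element-wise sum of slot r over all ranks.
    """
    n_ranks = len(send_buffers)
    return [
        [sum(buf[r * slot + j] for buf in send_buffers) for j in range(slot)]
        for r in range(n_ranks)
    ]


def reduce_scatter_variable(send_buffers, block_sizes):
    """Emulate a variable-block reduce-scatter on top of the equal-block one."""
    slot = max(block_sizes)
    padded = [pad_blocks(buf, block_sizes)[0] for buf in send_buffers]
    results = reduce_scatter_equal(padded, slot)
    # Drop the padding so that rank r keeps only its block_sizes[r] elements.
    return [res[:size] for res, size in zip(results, block_sizes)]


if __name__ == "__main__":
    # Two ranks; rank 0 owns a block of size 1, rank 1 a block of size 2.
    send = [[1, 2, 3], [10, 20, 30]]
    print(reduce_scatter_variable(send, [1, 2]))  # [[11], [22, 33]]
```

The cost of this approach is the extra memory and bandwidth spent on the padding, which grows with the difference between the largest and the smallest block.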

COSMA-v2.5.1

04 Jun 08:43
ff73093

Fixes the build issue with cmake versions prior to 3.12.2.

COSMA-v2.5.0

26 May 23:11

This version brings the following improvements:

  • [feature] Adds COSMA_DIM_THRESHOLD environment variable to cosma_prefixed_pxgemm.
  • [improvements] Fixes build issues and dependency handling in CMake.
  • [bugfix] Fixes OpenMP race conditions.
  • [bugfix] Resolves the problem with setting devices when running COSMA on multi-GPU systems.
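
As a rough illustration of how an environment variable such as COSMA_DIM_THRESHOLD might drive a dispatch decision, here is a minimal Python sketch. The release note only states that the variable was added; the semantics shown here (use COSMA only when every one of m, n and k reaches the threshold, otherwise fall back to another implementation) and the fallback value of 0 are assumptions, and `read_dim_threshold` and `should_use_cosma` are hypothetical helpers, not part of the COSMA API.

```python
import os


def read_dim_threshold(env=None, fallback=0):
    """Read COSMA_DIM_THRESHOLD as an integer.

    The fallback value is an assumption of this sketch, not COSMA's default.
    """
    env = os.environ if env is None else env
    raw = env.get("COSMA_DIM_THRESHOLD", "")
    try:
        return int(raw)
    except ValueError:
        # Unset or malformed value: use the fallback.
        return fallback


def should_use_cosma(m, n, k, threshold):
    """Assumed rule: use COSMA only if all dimensions reach the threshold."""
    return min(m, n, k) >= threshold


if __name__ == "__main__":
    threshold = read_dim_threshold()
    print(should_use_cosma(1000, 1000, 1000, threshold))
```

In a shell, the variable would be set before launching the application, e.g. with `export COSMA_DIM_THRESHOLD=200`; the value 200 here is only an example.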