
Query: comparison between RAPIDS and Modin #978

Closed · AndreaPi opened this issue Jan 9, 2020 · 2 comments
Labels: question ❓ Questions about Modin

@AndreaPi commented Jan 9, 2020

What are the differences in terms of performance between Modin and RAPIDS (or, if you will, Modin and cuDF)? On which tasks (e.g., statistical computation, file reading, groupby, joins) should one expect Modin to fare better than cuDF, and vice versa? Both aim at an API that is as close as possible to the pandas API. Of course, a core difference is that cuDF, being an NVIDIA project, requires a GPU to work. However, all serious GPU-equipped workstations also have a CPU with many cores (e.g., the NVIDIA DGX-2), so it may still make sense to compare Modin and cuDF, in my opinion. Also, AFAICT, they are the only two projects that strive so hard to be a one-line replacement for pandas (i.e., using the same API).
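For context, a minimal sketch of what that "one-line replacement" looks like in both libraries; the file name and column are hypothetical, and cuDF additionally requires an NVIDIA GPU with the RAPIDS packages installed:

```python
# Plain pandas
import pandas as pd
df = pd.read_csv("data.csv")            # hypothetical file
print(df.groupby("key").mean())

# Modin: the advertised change is a single import line
import modin.pandas as mpd
mdf = mpd.read_csv("data.csv")          # same call, partitioned across CPU cores
print(mdf.groupby("key").mean())

# cuDF: same pandas-style API, but the computation runs on the GPU
import cudf
gdf = cudf.read_csv("data.csv")
print(gdf.groupby("key").mean())
```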

@devin-petersohn (Collaborator) commented:

Thanks for the question @AndreaPi!

TL;DR: Modin is more of a dataframe framework than a dataframe implementation, and in the future it can incorporate cuDF as compute kernels alongside the existing CPU computation.

Modin's architecture is such that different components can be implemented and plugged in as they are developed. This architecture lets us explore new technologies more quickly (e.g., pyarrow Table compute kernels) without having to re-implement the common features. cuDF falls into this category of kernels we could use to run different components of a query. Most of the reason we haven't already done this comes down to engineering time: we are an extremely small team compared to cuDF or Dask, and the abstractions we have in place are there to let a small team stay agile.

In the future, Modin will integrate the cuDF compute kernels alongside the existing kernels to provide a hybrid CPU-GPU approach. Of course, a lot of details will need to be worked out, e.g. when to schedule computation on the GPU. In general, though, I think the GPU should be leveraged if the system has one. GPUs can be limited by the memory available, so a hybrid approach makes the most sense to me.
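As a rough illustration of that pluggable design: Modin already lets you swap the execution engine through its config API, and a cuDF-backed engine would slot in the same way (the GPU engine is hypothetical here; "ray" and "dask" are the engines Modin ships today):

```python
import modin.config as cfg

# Choose the engine that executes the partitioned compute kernels.
# "ray" and "dask" are existing CPU engines; a cuDF/GPU-backed engine is
# hypothetical here and mentioned only to illustrate the abstraction.
cfg.Engine.put("ray")

import modin.pandas as pd

df = pd.read_csv("data.csv")      # hypothetical file
print(df.groupby("key").sum())    # same pandas API regardless of the engine underneath
```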

In general, Modin aims to bring the best parts of all of these systems together by exposing the right abstractions. We are not targeting beating the performance of any specific library, because over time Modin will continue to integrate lower-level libraries and faster systems as they are developed.

Performance benchmarking of dataframes is often flawed because systems tend to favor benchmarks that show them in a good light. cuDF is bottlenecked by GPU memory, so certain benchmarks on hundreds of GB of data would not favor it. Instead of comparing the two, I prefer to integrate their excellent work into a more comprehensive performance story: finishing a query as fast as possible for the user with all available resources.

This probably won't happen for some time, but I think it is an important part of Modin's future.

@devin-petersohn added the question ❓ Questions about Modin label Jan 9, 2020
@AndreaPi (Author) commented:

This makes a lot of sense! Thanks for the answer. Closing the issue.

@devin-petersohn pinned this issue Jan 30, 2020