
Query: comparison between RAPIDS and Modin #978

Closed · AndreaPi opened this issue Jan 9, 2020 · 2 comments
Labels: question ❓ Questions about Modin

@AndreaPi commented Jan 9, 2020

What are the differences in terms of performance between Modin and RAPIDS (or, if you will, Modin and cuDF)? On which tasks (e.g., statistical computation, file reading, groupby, joins) should one expect Modin to fare better than cuDF, and vice versa? Both aim at an API that is as close as possible to the pandas API. Of course, a core difference is that cuDF, being an NVIDIA project, requires a GPU to work. However, all serious GPU-equipped workstations also have a CPU with many cores (e.g., the NVIDIA DGX-2), so it may still make sense to compare Modin and cuDF, in my opinion. Also, AFAICT, they are the only two projects that strive so hard to be a one-line replacement for pandas (i.e., using the same API).
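For context, a minimal sketch of what that "one-line replacement" looks like in both libraries; the file name and column are hypothetical, and cuDF additionally requires an NVIDIA GPU with the RAPIDS packages installed:

```python
# Plain pandas
import pandas as pd
df = pd.read_csv("data.csv")            # hypothetical file
print(df.groupby("key").mean())

# Modin: the advertised change is a single import line
import modin.pandas as mpd
mdf = mpd.read_csv("data.csv")          # same call, partitioned across CPU cores
print(mdf.groupby("key").mean())

# cuDF: same pandas-style API, but the computation runs on the GPU
import cudf
gdf = cudf.read_csv("data.csv")
print(gdf.groupby("key").mean())
```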

@devin-petersohn (Collaborator) commented:

Thanks for the question @AndreaPi!

TL;DR: Modin is more of a dataframe framework than a dataframe implementation, and in the future it can incorporate cuDF as compute kernels alongside the existing CPU computation.

Modin's architecture is such that different components can be implemented and plugged in as they are developed. This architecture lets us explore new technologies more quickly (e.g., pyarrow Table compute kernels) without having to re-implement the common features. cuDF falls into this category of kernels we could use to run different components of a query. Most of the reason we haven't already done this comes down to engineering time: we are an extremely small team compared to cuDF or Dask, and the abstractions we have in place are there to let a small team stay agile.

In the future, Modin will integrate the cuDF compute kernels alongside the existing kernels to provide a hybrid CPU-GPU approach. Of course, a lot of details will need to be worked out, e.g. when to schedule computation on the GPU. In general, though, I think the GPU should be leveraged if the system has one. GPUs can be limited by the memory available, so a hybrid approach makes the most sense to me.
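As a rough illustration of that pluggable design: Modin already lets you swap the execution engine through its config API, and a cuDF-backed engine would slot in the same way (the GPU engine is hypothetical here; "ray" and "dask" are the engines Modin ships today):

```python
import modin.config as cfg

# Choose the engine that executes the partitioned compute kernels.
# "ray" and "dask" are existing CPU engines; a cuDF/GPU-backed engine is
# hypothetical here and mentioned only to illustrate the abstraction.
cfg.Engine.put("ray")

import modin.pandas as pd

df = pd.read_csv("data.csv")      # hypothetical file
print(df.groupby("key").sum())    # same pandas API regardless of the engine underneath
```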

In general, Modin aims to bring the best parts of all of these systems together by exposing the right abstractions. We are not targeting beating the performance of any specific library, because over time Modin will continue to integrate lower-level libraries and faster systems as they are developed.

Performance benchmarking of dataframes is often flawed because systems tend to favor benchmarks that show them in a good light. cuDF is bottlenecked by GPU memory, so certain benchmarks on hundreds of GB of data would not favor it. Instead of comparing the two, I prefer to integrate their excellent work into a more comprehensive performance story: finishing a query as fast as possible for the user with all available resources.

This probably won't happen for some time, but I think it is an important part of Modin's future.

@devin-petersohn added the question ❓ Questions about Modin label Jan 9, 2020
@AndreaPi (Author) commented:

This makes a lot of sense! Thanks for the answer. Closing the issue.

@devin-petersohn pinned this issue Jan 30, 2020