Apache Beam Dataframe and SQL support #535

alxmrs · 2024-02-17T10:45:59Z

Is your feature request related to a problem? Please describe.

I'd like to deploy on GCP Dataflow, Apache Flink, a Kubernetes cluster, etc. with a single Dataframe library.

Describe the solution you'd like

Fugue should support Beam as an execution engine just like Spark or Ray.

Describe alternatives you've considered

There is no other way to deploy on Dataflow. Further, this should benefit Fugue, since Beam support would add many more execution engines "for free" through Beam's runners.

https://beam.apache.org/documentation/runners/capability-matrix/

Additional context

https://beam.apache.org/documentation/dsls/dataframes/overview/

goodwanghan · 2024-02-19T21:28:07Z

Thanks for the suggestion. We may consider Flink in the future; however, I am not sure about Beam.

I am curious: have you seen Beam perform well compared to native Flink and Spark? I personally haven't had much positive experience with Beam, and in my opinion the streaming-first philosophy may be the fundamental problem.

I'd love to hear a different opinion from you.

Thanks!

alxmrs · 2024-02-20T06:14:41Z

Interesting! What are your negative experiences with Beam? I personally have had great experiences with Beam. I haven't used native Spark in a while; I've been primarily using Beam for the last few years.

I have searched around a bit and haven't found benchmarks comparing Beam's Dataframes on Spark vs. native Spark Dataframes. There are other comparisons on the JVM that show that native Spark and Flink are faster than using Beam, but I'm not sure how useful they would be for the Fugue/Python context.

I have two main motivations for an integration with Beam, even if there are separate integrations with Flink:

1. I need to be able to deploy on GCP Dataflow.
2. I need to integrate a dataframes library of some kind with Xarray-Beam (https://github.com/google/xarray-beam). Using Fugue for this would be quite streamlined, but it's not a dealbreaker.

> And the streaming first philosophy may be the fundamental problem in my opinion.

Would you elaborate here? It's a problem with respect to what? I'm happy to share my experience, and this context would help.
