Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pyarrow a required dependency #52509

Closed
phofl opened this issue Apr 7, 2023 · 36 comments · Fixed by #52711
Closed

Make pyarrow a required dependency #52509

phofl opened this issue Apr 7, 2023 · 36 comments · Fixed by #52711
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action

Comments

@phofl
Copy link
Member

phofl commented Apr 7, 2023

Right now we run into a bunch of problems because pyarrow is potentially missing.

This is the next step in supporting arrow fully in pandas. There are some things we can’t implement properly without requiring arrow as dependency.
So I think we should do it soonish, so that we can move forward.

cc @pandas-dev/pandas-core

@phofl phofl changed the title Make pyarrow a required depdendency Make pyarrow a required dependency Apr 7, 2023
@phofl phofl added Dependencies Required and optional dependencies Arrow pyarrow functionality labels Apr 7, 2023
@MarcoGorelli
Copy link
Member

I'd be in favour

I've heard people raise objections because pyarrow is too heavy of a dependency - personally, I'd require it anyway, I don't think we should hold the project back because of this, just bringing it up for completeness

@jreback
Copy link
Contributor

jreback commented Apr 7, 2023

+1 from me - it's time

@datapythonista
Copy link
Member

+1

I think main objections were not to do it yet as it was just fine to have it as a an optional dependency (see #50285). But if having it optional is an impediment.

Maybe PyArrow dependencies aren't ideal, but this should only be a temporary problem if PyArrow packages are expected to become more modular.

@lithomas1
Copy link
Member

Are we planning on using Arrow Cython integration/C API stuff?

If not, requiring should be trivial on packaging side should be trivial (so no objections from me there).

Since from the few comments so far, most people seem to be in favor,
we should really start figuring out the specifics of this, e.g.

  1. What will be the min. version of pyarrow that we will support?
    a) Would it be OK to have certain parts of pandas depend on a higher min. version of pyarrow?
  2. How many version of pyarrow will we commit to supporting(e.g. for numpy we support from 1.20-1.24, which is ~5 versions)?

@jorisvandenbossche
Copy link
Member

As long as it is not actually required (at the moment, it's strictly speaking an optional dependency, as you can perfectly fine use pandas without needing pyarrow, if you don't use any IO method relying on it and if you don't use the arrow-based dtypes), I don't see a good reason to make it an actual hard dependency.

The mentioned reasons that make it "awkward" to have it as an optional dependency don't sound that convincing to me, personally. I think it is quite easy to have helper methods that can check for pyarrow arrays and scalars (I commented on the mentioned PR: https://github.com/pandas-dev/pandas/pull/52076/files#r1160653924). That should already cover the first two points in the top post.

Proper constructor support is not possible right now without causing significant slowdowns

Is there some discussion about somewhere? I don't directly see why this would give a big slowdown.


When choosing to make it a hard requirement now, we have to be aware what the consequences are installation wise. For example, you can currently install pandas from source on most platforms as long as you have a compiler available (apart from that, we only have python dependencies / dependencies that can be installed by python in the same way as pandas). With pyarrow, that would no longer be true, as it cannot be installed from source. A typical example are docker images based on alpine linux (musl), for which there are no wheels at the moment.

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Apr 7, 2023
@jbrockmendel
Copy link
Member

+1

@simonjayhawkins
Copy link
Member

Since from the few comments so far, most people seem to be in favor,

PDEP-1 first revision (scope) #51417 looks like it has moreless been accepted. The new text adds "New required dependency." as "Clear examples for potential PDEPs". However, IIUC this does not explicitly say a PDEP is required.

There is the following... "On the other hand, if an issue becomes controversial, i.e. it generated a significant
discussion, one could suggest opening a PDEP to formalize and document the discussion, making it
easier for the wider community to participate."

While this issue has not (here) "generated a significant discussion" a PDEP would be useful to "making it easier for the wider community to participate"

@MarcoGorelli
Copy link
Member

agree - that, and the time bound on the voting period ensures we'll actually reach a decision rather than leaving the issue hanging around indefinitely. I think it'd be a good candidate

@jbrockmendel
Copy link
Member

Making a PDEP shifts decision-making away from people who actually work on development and towards people willing to throw up walls of text.

We have five +1s, one -1, and one "no objections". Lets do this.

@simonjayhawkins
Copy link
Member

-1. I don't agree that we should dismiss @jorisvandenbossche concerns.

@mroeschke
Copy link
Member

I would be +1 on making pyarrow a required dependency. Though not strictly necessary for default functionality at the moment, I think it would be a great "signal" for pandas to vouch for the arrow format & ecosystem.

I can see this change being significant enough for a PDEP, but as @MarcoGorelli alluded too I expect we could vote on this rather quickly given all the support in this thread already

@simonjayhawkins
Copy link
Member

I can see this change being significant enough for a PDEP, but as @MarcoGorelli alluded too I expect we could vote on this rather quickly given all the support in this thread already

yes. I think @jorisvandenbossche has convincing arguments to counter the issues in the OP and also partly the issues in older issue.

from the PDEP guidelines... "It is not always trivial to know which issue has enough scope to require the full PDEP process.
Some simple API changes have sufficient consensus among the core team, and minimal impact on the
community."

there does appear to be "sufficient consensus among the core team" but since requiring a new dependancy does not satisfy "minimal impact on the community.", I think a PDEP should be required for this change.

@attack68
Copy link
Contributor

Making a PDEP shifts decision-making away from people who actually work on development and towards people willing to throw up walls of text.

I agree with @simonjayhawkins, PDEP-1 seems to indicate this is in scope.
I think PDEPs offer a lot of value, not only for audit but for the transparency of the project.

@mroeschke
Copy link
Member

Just to clarify my comment, I do think it is a good idea that this is a PDEP such that we can lay out the pros/cons and discuss and use voting to make a decision with clarity than let the decision be ambiguous and linger here in the issue.

Post the dev meeting on 2023-04-12, I will draft this PDEP

@datapythonista
Copy link
Member

As the person who wrote PDEP-1, I'd personally wouldn't use it to decide if something requires a PDEP or not. I'm happy with both having this as a PDEP or not. But I think more important than how we interpret PDEP-1 exact sentences is what's useful, and in this case I'm not sure what part of the text of the possible PDEP would be to comment on. Seems like the decision is more of a preference.

Depending on pyarrow makes our life easier. And makes users who 1) want to install pandas from source 2) are not interested in the pyarrow feature, not be able to do it anymore.

I don't think Joris comment is a -1 (maybe that's his position but the comment is just a be aware of the consequences).

So, feels to me that we're all more or less in the same page. It's just a question of each person preference/priority. My vote is a clear +1, advantages of ensuring pyarrow is always available are more valuable than the limitations to the installation that I'm unsure any user will really face. This seems to be the most popular opinion based on above answers.

I'd say, if there is no proper objection let's just move forward. If people think it's still not clear enough what are tha advantages and disadvantages let's write a PDEP, so we're all in the same page, and then make a decision (is the new voting system already in place, or should we vote for that too?).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 12, 2023

I'd say, if there is no proper objection let's just move forward. If people think it's still not clear enough what are tha advantages and disadvantages let's write a PDEP, so we're all in the same page, and then make a decision (is the new voting system already in place, or should we vote for that too?).

The other reason to write a PDEP is that it makes it possible to get user feedback. Since requiring pyarrow will impact users, we should see if our user community has an opinion on the matter.

@datapythonista
Copy link
Member

Do you think a PDEP has a much more visibility than this issue? We have them listed in the roadmap, I guess maybe a bit, but wouldn't an email to pandas-dev about this be a better option if we want more feedback?

I'm surely happy with a PDEP, the only thing is the time it takes to write it. But I see that a PDEP/PR adds a lot of value when feedback about the specific text written in the PDEP is expected. And in this case, assuming everybody has already an understanding and an opinion about requiring pyartow, I don't expect so many useful comments for particular parts of the text, so discussing in normal issue comments seem as fine.

The blocker here seems to be how we make the final decision (votes...) and I don't think a PDEP is going to help, so I'd address that directly.

@lithomas1
Copy link
Member

I don't think this should be a PDEP. It would make this process way more drawn out than it should be, and there's really not a lot to put in the PDEP (we would really just be talked about min. version of pyarrow, how many versions to support, etc., basically the stuff I listed above ).

I missed the dev meeting where this was agreed upon, but I think I would much rather have a PDEP explaining the rationale behind adding support for pyarrow to everything in general (which would explain this by proxy).

@lithomas1
Copy link
Member

lithomas1 commented Apr 12, 2023

When choosing to make it a hard requirement now, we have to be aware what the consequences are installation wise. For example, you can currently install pandas from source on most platforms as long as you have a compiler available (apart from that, we only have python dependencies / dependencies that can be installed by python in the same way as pandas). With pyarrow, that would no longer be true, as it cannot be installed from source. A typical example are docker images based on alpine linux (musl), for which there are no wheels at the moment.

FWIW, there are no numpy musllinux wheels either.

While numpy can be installed with no external (non-Python) deps other than the C compiler, I don't think anyone would actually install numpy like that, since you'll be missing things like the BLAS.

(I would probably guess that people are building a numpy wheel for themselves that is compiled with the BLAS, and then installing that).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 12, 2023

Do you think a PDEP has a much more visibility than this issue?

The PDEP would have a discussion period, and be broadcast to the pandas user community as "open for discussion".

We agreed at the dev meeting today that @mroeschke will draft a PDEP. We can have discussion there, and then formalize the vote.

I have a concern about package bloat. Adding pyarrow increases the sizes of containers used in Kubernetes that are dependent on pandas. Not necessarily a reason to not require pyarrow, but I think these issues can be discussed in the PDEP better than here, and gets the team used to using PDEP's to discuss issues like these.

@jreback
Copy link
Contributor

jreback commented Apr 12, 2023

I have a concern about package bloat

if u r using pandas you almost always are using parquet so likely you have pyarrow installed anyways

@WillAyd
Copy link
Member

WillAyd commented Apr 12, 2023

Ah the age old problem - to include or not to include? As things stand today, it doesn't appear (according to pypistats, for as good as that can be) that pyarrow and pandas are always bundled together. And from our user polls I vaguely recall excel / html being pretty popular formats, maybe moreso than parquet. So I am a bit wary of using "likeliness of bundled usage" as a guiding factor

As things stand now I am -1 on the issue and agree this should be a PDEP, for the following reasons:

  1. We haven't really defined required. I assume you mean as a runtime dependency, but why not as a build time dependency?
  2. I don't understand the problem we are solving; we manage a lot of optional runtime dependencies and it isn't that clear in this issue what makes this harder than normal
  3. This has conceptual conflict with the Meson PDEP; one of the selling points of that was to expand our potential platforms via cross-compiling, but here it seems like we are OK to limit the platforms we can be built on
  4. I thought our goal was to become backend agnostic, i.e. there may be a future where numpy is optional and you can pick either backend

There is a lot of nuance to this discussion and I agree a PDEP would make it easier to communicate and collaborate versus a huge wall of responses in this issue. Not everything needs to be a PDEP so definitely want to be cognizant of using it as a blocker to progress, but at the same time we just launched that process so I think we should be heavy on PDEPs the next few months, as frustrating as that can be

@jbrockmendel
Copy link
Member

jbrockmendel commented Apr 13, 2023

I thought our goal was to become backend agnostic, i.e. there may be a future where numpy is optional and you can pick either backend

On today's call there was a mostly-joking suggestion of making numpy optional. If someone finds a reasonable path to get there that would be neat (in large part bc making the numpy import lazy would be awesome). Until then, I think we should shy away from the term "backend", as I've seen multiple discussions (maybe reddit, HN, not sure) where people are confused by the proliferation of dtype options. I don't have links on hand, but I recall the word "backend" coming up frequently in these threads.

We haven't really defined required. I assume you mean as a runtime dependency, but why not as a build time dependency?

I've been assuming run-time.

Other misc thoughts:

The OP referenced gymnastics-free isinstance checks for construction. I'm also eager to have maybe_convert_objects infer Decimals to decimal[pyarrow] dtype (which would make it easier to address #52011, #49598, #40685, #39098, #25168, #47544), and similarly for time (#22720) and date objects (#39703, #24109, #32473, #46582) and nested types (#40652, #35176), and other (#20899). (And I'm very skeptical of having internal functions behave differently based on the presence of pyarrow).

If we don't require pyarrow, I'd like to make the import lazy.

It would also be nice to prune down to Just One String Dtype, and AFAICT pyarrow is the obvious choice there.

@phofl
Copy link
Member Author

phofl commented Apr 13, 2023

Agreed by @jbrockmendel, we have to be careful communication wise. There is a lot of confusion out there regarding our arrow dtypes.

We haven't really defined required. I assume you mean as a runtime dependency, but why not as a build time dependency?

I've been assuming run-time.

Sorry for not clarifying, yes run-time for now. Build-dependency would be nice at some point I guess, but there are a couple of hurdles to clear to get there.

The OP referenced gymnastics-free isinstance checks for construction. I'm also eager to have maybe_convert_objects infer Decimals to decimal[pyarrow] dtype (which would make it easier to address #52011, #40685, #39098, #25168), and similarly for time (#22720) and date objects (https://github.com/pandas-dev/pandas/issues/39703)`. (And I'm very skeptical of having internal functions behave differently based on the presence of pyarrow).

Having something like bytes inferring as "bytes[pyarrow]" would be nice as well. Also nested data structures (but support for those is poor right now).

When choosing to make it a hard requirement now, we have to be aware what the consequences are installation wise.

This certainly isn't ideal, but it's not a blocker imo, especially if there are no NumPy wheels either.

@bashtage
Copy link
Contributor

Another consideration of hard dependencies is their speed in releasing wheels on new platforms. I just checked and from what I can see pyarrow only released the first wheels for 3.11 in late November, which was a full month after pandas had a release with 3.11 support baked in. Building pyarrow on some platforms is very hard, at least compared to building pandas and its dependencies (incl NumPy which is easy to configure for OpenBlas even on Windows, and can be trivially compiled without Blas if not needed).

@phofl
Copy link
Member Author

phofl commented Apr 13, 2023

Quick search brought the following issue:

https://issues.apache.org/jira/browse/ARROW-17487

This looks more like a timing problem not like a general Python 3.11 support problem (like numba has right now)

@simonjayhawkins
Copy link
Member

It would also be nice to prune down to Just One String Dtype, and AFAICT pyarrow is the obvious choice there.

Thanks @jbrockmendel for some further justification as only the arguments in the OP at first did not appear strong enough and had IMO been countered by @jorisvandenbossche.

Maybe the choice of/ future plans for string dtype warrants a detailed discussion (separate issue) and consensus what to do here? As I could potentially see this being a strong justification. (If we were to somehow able to remove object dtype from pandas)

part of the text of the possible PDEP would be to comment on. Seems like the decision is more of a preference.

My thinking was that we should not let preferences sway a (hasty) decision and have clear documented justification. Hence +1 for a PDEP.

My vote is a clear +1, advantages of ensuring pyarrow is always available are more valuable than the limitations to the installation that I'm unsure any user will really face.

Fair point. A more objective, maybe quantative, definition of "more valuable" would surely strengthen the argument in favor?

For instance @WillAyd mentioned pypistats and my thinking, related to PDEP-9, is how many packages (and how important to the ecosystem) would say just use pandas to provide interoperability and not functionality?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 13, 2023

Going into more detail on my comment above (#52509 (comment)) about "consequences installation wise".

Installation platform support
Nowadays the python packaging situation has improved a lot with wheels and conda, and so for the use cases that are covered by those, there is not really a problem (pyarrow currently has wheels for the three main platforms, for python versions 3.7 to 3.11, and both manylinux x86_64 and aarch64, and both macos x86_64 and arm).

But if, for some reason, there is no wheel available, a pip install pyarrow does not work out of the box. This is in contrast with numpy and pandas, where the only requirement is that you have a compiler available on the system, and then you can do pip install pandas typically fine.
I think in general the portability of pyarrow is quite good (although I am no expert here), and so it is possible to build and install it on a variety of platforms, but the issue is that you can't do this with a pip install, but need to manually install a bunch of C++ dependencies and build Arrow C++.

Some example cases I currently think of (but there are quitely more) for which there are no wheels and installation from source is needed:

  • Alpine linux which is often used in docker containers
  • WASM, and so the usage of pandas in pyodide, pyscript, etc
  • In testing "bleeding edge" environments, for example testing with the next Python (we have a python-dev CI build for when a next Python version is upcoming, where we pip install some basic dependencies from source. Pyarrow is currently not included in that build). Pyarrow has in the past also been quite a bit slower in providing wheels for a new released python version compared to numpy and pandas.
  • (PyPy, although pandas support for PyPy is currently broken anyway)

And then there is also a part of the world that doesn't use pip/pypi/wheels (corporate environments with their own index, people using other (system) package managers (eg various linux package managers that have numpy/pandas but potentially not pyarrow), ..). But not going into detail here, because that's something I don't know much about.

Installation size
Doing a quick test on my own laptop (Linux/Ubuntu, using Python 3.11, and using the current latest versions of the packages). Installing pandas and pyarrow using pip from wheels, and checking the unpacked size of the packages in the site-packages directory: both numpy and pandas are each around 70MB, and pyarrow is 120MB. So that means that the size of a pandas installation (although without other optional dependencies, and also not counting python itself) would almost double.

(as a comparison, polars (which also includes an arrow implementation) is 55MB)

I don't know to what extent this is an actual problem (for example, is this nowadays still an issue for AWS Lambda?), but at least something to be aware of. It's also something where we would like to improve pyarrow, and there are some preliminary efforts to make the wheel smaller / split off optional components in separate wheels.

@jbrockmendel
Copy link
Member

Having something like bytes inferring as "bytes[pyarrow]" would be nice as well. Also nested data structures (but support for those is poor right now).

xref #52429

@fangchenli
Copy link
Member

  1. I thought our goal was to become backend agnostic, i.e. there may be a future where numpy is optional and you can pick either backend

Then adding a Jax backend would be awesome.

@adrinjalali
Copy link

We recently enabled jupyter lite on our examples on the scikit-learn side, and by default we're including pandas. Looking at the size of the wheels for pyarrow, they're somewhere between 20-35MB based on the platform. That's a significant increase for users who would be loading the example in their browser.

We also have been having trouble with the latest pandas when pip installed in a mamba environment, since conda-lock fails in this case, and that's because pandas=2.0 requires tzdata, and the env ends up with two tzdata. It's not a pandas bug per say, but it complicates workflows (we're sticking with an older pandas for now for that env). That is to say even a small dependency like tzdata which should be present in most environments anyway, complicates things.

Pandas is a library on which, other libraries depend, and any dependency added here adds to the dependency tree to those environments, and pyarrow is by far not a trivial dependency.

All of that is to say, from my perspective, it would be a pity if pyarrow becomes a mandatory dependency.

@MarcoGorelli
Copy link
Member

thanks @adrinjalali for sharing your perspective! the conversation has mostly moved to #52711, I'd suggest commenting there for visibility

@alippai
Copy link
Contributor

alippai commented Apr 30, 2023

Maybe something like "pyarrow compatible interface is runtime required" would make more sense? Eg using the Go Arrow implementation, Arrow Rust or Arrow2 Rust implementation as a drop in replacement could help with some of the issues (platform support, binary sizes). There is already an Array API and dataframe API. Specifying a "ColumnStorage API" which different backends can implement doesn't sound bad (while we can also mark the different Arrow backends so polars and other tools can have assumptions about the data layout over the more generic ColumnStorage API)

My understanding was that it is the design goal of Arrow to set a few reasonable assumptions (columnar storage, masked arrays) and provide a data format + data contract.

Pyarrow provides definitely more than this: datasets, pyarrow.compute, pyarrow.parquet. Perhaps Pandas should start relying on the core functionality, but not the rest? This would require a pyarrow-core package ofc, but that would be smaller and easier to make installable by pip install pyarrow-core if no wheels are available

@jreback
Copy link
Contributor

jreback commented Apr 30, 2023

Part of the rationale for using pyarrow as a runtime dependency is to reduce pandas core development burdens - sure using just a core part of arrow might be nice but it's not that helpful in addressing one of the most important reasons to have a required dependency. The point is to take advantage of the eco system w/o have even more in built functionality in pandas.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 1, 2023

Part of the rationale for using pyarrow as a runtime dependency is to reduce pandas core development burdens

In the PDEP #52711 (where we really should be having this discussion), I have asked here to get a better understanding of that burden (see #52711 (review))

@jjerphan
Copy link

jjerphan commented Aug 18, 2023

(My comment was moved to #54466 (comment).)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.