Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: pd.read_excel with table parameter #38937

Open
samukweku opened this issue Jan 4, 2021 · 19 comments
Open

ENH: pd.read_excel with table parameter #38937

samukweku opened this issue Jan 4, 2021 · 19 comments
Assignees
Labels
Enhancement IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action

Comments

@samukweku
Copy link
Contributor

samukweku commented Jan 4, 2021

Is your feature request related to a problem?

Excel has tables , that makes data managing within Excel easier and offers some other features.

At the moment, pandas cannot access those tables, it simply reads in all the data. I think it would be helpful if a table parameter was added to pd.read_excel to capture table(s) defined in a sheet.

Sample Data:

Attached is a sample excel file: 016-MSPTDA-Excel.xlsx

With the current implementation of read_excel, we cannot select specific tables (dSalesReps, dProduct, dCategory, or dSupplier). Even if we read the Tables sheet, it becomes quite hard to separate the data into individual tables.

Describe the solution you'd like

pd.read_excel(io=filename, sheet=sheetname, table=tablename, ...)

The table parameter can have a default of None, in which case the entire sheet is read in; if however table is not None, then the table or lists of tables only are read in. The openpyxl library would be used to implement this.

API breaking implications

I am not aware of any API breaking implications

Describe alternatives you've considered

It could be done outside Pandas, where you read in the data first through openpyxl, before passing it to Pandas. I wrote a blog post about it; I feel, however, that it may be more convenient to provide that same functionality within Pandas and let the user worry less about the abstractions.

Additional context

I did check to see if this has been raised before, but did not find any. If it already has, kindly point me to that issue, and I will gladly close this.

Thanks.

@samukweku samukweku added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 4, 2021
@lithomas1 lithomas1 added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@shraddhazpy
Copy link

Hi, if @samukweku is not working on it and this is up for grabs, I'd would want to contribute.

@samukweku
Copy link
Contributor Author

@shraddhazpy pls go ahead

@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Aug 14, 2021
@scotscotmcc
Copy link

hi, @shraddhazpy, are you working on this? This is functionality I would definitely appreciate. I haven't contributed to any public projects before, so it may be too big for me to tackle, but I would be interested in helping if I can.

@samukweku
Copy link
Contributor Author

@scotscotmcc I think U can just type take and it will be assigned to u

@scotscotmcc
Copy link

I'm stalling a bit on taking this; I did some playing around with the code yesterday, but today couldn't get the development environment with docker set up...

That said, I want to raise a question around how a table_name parameter would function with the existing sheet_name. I can think of three different reasonable options.

  1. Require sheet_name and table_name, grab the named table(s) from the named sheet (this is what was in the opening post).
  2. If table_name is used, then don't require sheet_name at all, and just read the table(s).
  3. If both are used, then read the entire sheet(s) and separately read the entire table(s). (my favorite)

Excel forces each table in a workbook to have a unique name, and so any table listed in the parameter could only exist once. Therefore, we don't need to know the sheet_name when the function is ran, we can find it, and it means one less thing for the user. Providing sheet_name would mean we don't have to search every sheet to have the name of the table, but it would also mean we would have an issue if the table is one a different sheet.

Requiring sheet_name when using table_name would also make it more difficult to read multiple tables that are on different sheets. You could require that, if you wanted to read multiple tables, you provide a list for sheet_name and a list for table_name, and the nth element of each list are pairs, but that seems bad...

Further, if you wanted to read all of Sheet1 and also Table2 that is within Sheet2 (but don't want to read Sheet2), it would be great to be able to do both of these at the time. To that end, allowing a new table_name to actually not really interact with the sheet_name parameter would allow you to do just that. Providing sheet_name reads sheets, and providing table_name reads tables, and they do the same thing regardless of what is up with the other.

@scotscotmcc
Copy link

I think that this enhancement is / should be dependent on this other to allow for passing engine_kwargs to pd.read_excel(). #43053

Currently, pd.read_excel() with engine=openpyxl (default if filetype is .xlsx) will default to setting the read_only = True in the openpyxl.load_workbook() call. However, you can't get table information out of the workbook object in openpyxl then; you need to have read_only = False.

The linked enhancement is to allow passing explicit arguments into the engine. That framework could then be used for this enhancement to change the read_only argument. The alternative would be that this enhancement creates a separate process to change the engine arguments, which seems clunky, inefficient, and cumbersome.

@samukweku
Copy link
Contributor Author

if this helps, there is an implementation for reading xlsx tables here; hopefully it can be done within Pandas (and one less library for users to install 😁

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@Meadiocre
Copy link

Hello @samukweku and @mroeschke,

My project team is looking for Pandas enhancement features for our grad school semester long project. We saw this task and would like to contribute if possible!

@samukweku
Copy link
Contributor Author

Hi @Meadiocre type take and it will be assigned to you

@iangainey
Copy link
Contributor

iangainey commented Feb 3, 2024

take
I'm in his project team, so thank you!

@iangainey
Copy link
Contributor

After looking into how this would be implemented, I had a couple of thoughts/wanted to clarify some desired behavior:
While this can currently be done by using openpyxl or pyjanitor, the intent is for this to be a more straightforward pandas user experience to be able to specify tables to read in data from without the more complex method in openpyxl or having to include pyjanitor-correct?
As the read_excel function has some features that are already dependent on openpyxl, I just wanted to clarify if the preference is to implement this without any dependency on openpyxl, or to just make it more clear & easy to use by adding the parameter to the read_Excel function with the dependency on openpyxl.

The other question has to do with a post by @scotscotmcc. To reiterate what they said, for if a table name and a sheet name are specified, either these options could be done:

  1. Only look for the table on the sheet name (If specified, otherwise search all sheets). This would result in faster reading of the table in files with many sheets, so not every sheet's tables has to be checked. But, it may be more "clunky" for a user if they mess up what sheet they put the table in, or in cases they want to read a specific sheet in as well as a table on a different sheet this would not be possible.
  2. Look for the tables in any sheet, no matter if sheet name is specified. Therefore the user could chose any sheet(s) they want to read in, and any tables and exactly those would be read in, even if those do not line up with what sheets the tables are on. This would cause any reading in of a table to search every sheet, but allowing the user more versatility in data selection.

I'm planning on proceeding with #2 unless there's any feedback otherwise with better suggestions. Additionally, I'm planning on pursuing the route of being dependent on openpyxl as the read_excel function already has many parts that are, just making it an easier use to read in tables with this function instead of using more complex methods through openpyxl or including pyjanitor, again unless there's any better feedback.

@samukweku
Copy link
Contributor Author

samukweku commented Feb 16, 2024

@iangainey hi. I wrote the pyjanitor implementation, because at the time there wasnt any in pandas. It would be nice if this was done within Pandas, and not dependent on pyjanitor. instead the dependency should directly be on openpyxl(at the moment, they are the only library offering this option). it'd give the pyjanitor team a valid reason to deprecate the function (we try not to replicate anything that Pandas already covers well).

On the other part, I'd suggest we go for option 1. The onus should be on the user to specify the correct sheet where the table name is stored. If the user does not provide a table name, then all tables from all sheets can be read in, or all tables from the specified sheet(s) can be read in.

as always, we can lean on the guidance of the pandas dev team here - @mroeschke @phofl @scotscotmcc

@scotscotmcc
Copy link

scotscotmcc commented Feb 16, 2024

Thanks for taking this on, @iangainey. As just a regular-old pandas user (and I use excel a bunch), I don't like option 1. As I said in my earlier post on it, excel requires the table name to be unique, and so passing in a table name uniquely and sufficiently identifies what the user is trying to read. Requiring the sheet name as well feels like the user would be repeating themself (at least that's how I would feel when using it).

With option 1, there is the advantage of not having to loop through sheets, since excel doesn't index the tables but just lists them within sheets. This advantage doesn't seem all that great, though, I don't think. At worst, it requires looping through sheets and the tables within those sheets, but it doesn't require looping through cells (which could be real onerous).

To raise another question here now I'm thinking of base off of @samukweku's comment, though - what if the user wants to read all tables in the workbook? Currently, read_excel(...,sheet_name=None) will read in all sheets. That seems like a functionality that could be replicated (and would make sense to replicate) with reading tables. That is, a user could pass a string, a list, or None and they would function similarly to sheet_name, and so read_excel(...,table_name=None) would read in all tables..

@samukweku
Copy link
Contributor Author

@scotscotmcc ahh that makes sense then if table name is unique. Totally missed that. Option 1 isn't needed then

@iangainey
Copy link
Contributor

I've been making a lot of progress on this, and have a come up with some questions:
To explain current progress of what the implementation looks like:

  • table_name can be a string, a list of strings, or None (which reads all tables)
  • If table_name is not passed, no tables will be read and the output will be a dictionary of dataframes of the sheets (this is currently how pandas does this, this wasn't changed)
  • If table_name is passed, the output will be a dictionary of dataframes of the tables
  • If both table_name and sheet_name are passed, the result will be a dictionary of 2 nested dictionaries, one for sheets and one for tables. The sheets dictionary will have dataframes of the sheets, and the tables dictionary will have the dataframes of the tables
  • I did the above forms of output so that if you don't pass table_name, nothing about the output will change (existing code using read_excel will still work fine, shouldn't break anything). This seemed like the most clear to use way of returning the output, any issues on this or suggestions of better ways to do it?

Now, the main question:
Currently with sheets, you can specify the index of the sheet to be read. I have not implemented this for tables. Is this desired functionality?
If so, I run into some issues with determining what to set table_name to if it is not passed in. Currently, I set it to 0 and then no tables are read in/returned. None is already used for a case in which to read all tables.
Sheets approaches this by just setting it to 0 if sheet_name is not passed, which by default reads in the first sheet. Since I don't want to break existing uses of read_excel, I don't want to return the firs table if table_name is not passed. I'm not sure how to implement this so that I can use 0 for an index, but also set table_name to something that will prevent any tables from being read? My only thought was set it to -1, check if it's -1 and if so don't do any table reading/output, but that seems like an unconventional process.

Otherwise, I should have a PR up soon to see if there's any issues or comments on things that could or should be improved.
Thoughts, @samukweku @scotscotmcc @mroeschke ?

@samukweku
Copy link
Contributor Author

As @scotscotmcc mentioned, sheets isn't required since table names are unique, so it can/should be ignored

@iangainey
Copy link
Contributor

iangainey commented Mar 16, 2024

I think I'm misunderstanding what your referencing. Sheet names as ignored in regards to what tables are read in, you don't have to specify a sheet to read tables. But, my question is more that I had to format the output differently if you pass in sheet names and table names, and wondering if reading table by index is desired @samukweku

@samukweku
Copy link
Contributor Author

@iangainey my bad. I don't think the nested approach for sheet names dictionary is necessary, a dictionary of table names and mapping sounds ok

@scotscotmcc
Copy link

I think it is fine to not read tables by index. I think that it is more straightforward for sheets to read by index since, in excel, there is already a single index of sheets at the workbook level. For tables (i.e. listobjects), they live within sheets and each sheet has its own list of them. One could then just sort of make a single list of them where the tables in sheet1 are first, then the tables in sheet2, etc. If I was trying to use this index system to grab a table out of excel, though, it seems like I'd be doing more work than just using the name (unless it is the first table).

All that said, I'm thinking a lot about my own use cases, where I have named tables that I have control over, and so I can read them by name just fine. If someone else is getting an excel file from someone else and trying to read it, and it has inconsistent table names, I could see it working less well for them. Even still, they could read all the tables and then filter the returning dictionary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants