BUG: read_html does not properly structure some html table elements (possible rowspan or colspan issues) #58461

Open
3 tasks done
jowens opened this issue Apr 28, 2024 · 9 comments
Labels: Bug; IO HTML (read_html, to_html, Styler.apply, Styler.applymap); Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@jowens

jowens commented Apr 28, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

#!/usr/bin/env python3

import pandas as pd
import re
import requests
from io import StringIO

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 500)

html = requests.get(
    "https://en.wikipedia.org/wiki/Template:AMD_Radeon_Pro_V_series"
).text

df = pd.read_html(
    StringIO(html),
    match=re.compile("Radeon Pro V620"),
)
print(df)

Issue Description

In the last two rows of the ingested table, entries toward the right side land in the wrong columns. I checked the HTML source, and although it uses some complex rowspan and colspan attributes, it appears to be properly constructed.

[Two screenshots: a Cursor window and the Wikipedia page "Template:AMD Radeon Pro V series"]

I acknowledge that I'm using a slightly older pandas than the latest release, but I looked through recent issues on, and check-ins to, read_html, and I don't believe this has been reported or fixed.

Expected Behavior

I expect the column called "Memory / L3 Cache" to only be populated in the last row.

I expect the two power entries in the last two rows to be placed in the "TDP" column.

Most of the right side of the bottom two rows is misplaced.

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0.dev0+gb7d2f7ffc.d20240415
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@jowens added the Bug and Needs Triage labels Apr 28, 2024
@samukweku
Contributor

@jowens If you use another tool for the extraction other than pandas, do you get a different result?

@attack68 changed the title from "BUG:" to "BUG: read_html does not properly structure some elements in the DataFrame" Apr 28, 2024
@attack68 added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) label Apr 28, 2024
@attack68 changed the title from "BUG: read_html does not properly structure some elements in the DataFrame" to "BUG: read_html does not properly structure some elements (possible rowspan or colspan issues)" Apr 28, 2024
@attack68
Contributor

I didn't implement any of this and haven't checked the implementation, but my guess is that:

a) reading a grid-based table is straightforward.
b) accounting for rowspan or colspan separately is an extension of a) which is not too difficult.
c) accounting for simultaneous crossover of rowspan and colspan is really difficult and needs initial passes or very specific structuring. It probably isn't tested.

Can anyone confirm?
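
To make case c) concrete, here is a minimal sketch of one way to expand cells that carry both a rowspan and a colspan into a rectangular grid. It is not the pandas implementation and has not been checked against it. The function name `expand_spans` and the `(text, rowspan, colspan)` input format are assumptions made for illustration only.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    Hypothetical helper for illustration; spanned text is repeated in
    every slot the cell covers.
    """
    grid = []
    carried = {}  # column -> (text, rows still to fill) from cells spanning down
    for row in rows:
        out = {}           # column -> text for the current row
        next_carried = {}  # state handed to the following row
        # Slots held by cells that started in an earlier row come first.
        for col, (text, remaining) in carried.items():
            out[col] = text
            if remaining > 1:
                next_carried[col] = (text, remaining - 1)
        col = 0
        for text, rowspan, colspan in row:
            while col in out:  # skip slots already taken from above
                col += 1
            for _ in range(colspan):
                out[col] = text
                if rowspan > 1:
                    next_carried[col] = (text, rowspan - 1)
                col += 1
        width = max(out) + 1 if out else 0
        grid.append([out.get(c, "") for c in range(width)])
        carried = next_carried
    return grid


# "A" spans two rows and two columns at once.
rows = [
    [("A", 2, 2), ("B", 1, 1)],
    [("C", 1, 1)],
    [("D", 1, 1), ("E", 1, 1), ("F", 1, 1)],
]
print(expand_spans(rows))
# [['A', 'A', 'B'], ['A', 'A', 'C'], ['D', 'E', 'F']]
```

Tracking the carried slots per column means a cell with both attributes is handled in a single pass. Whether an existing parser does it this way is exactly the open question above.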

@jowens
Author

jowens commented Apr 28, 2024

> If you use another tool for the extraction other than pandas, do you get a different result?

Suggestion for that other tool? I'm happy to try.

@samukweku
Contributor

@jowens A quick search on Google turns up this html-extractor; I haven't used it, though (caveat). I asked the earlier question to see whether there is a tool that handles this correctly, so that we can compare against it. It seems @attack68 has looked into your question further and may have figured out the possible bug?

@jowens
Author

jowens commented Apr 28, 2024

Just for posterity, here's the specific Wikipedia revision we're discussing here, in case it gets edited:

https://en.wikipedia.org/w/index.php?title=Template:AMD_Radeon_Pro_V_series&oldid=1220301074

and here's a gist where I extracted everything between <table> and </table>:

https://gist.github.com/jowens/8e42fa17a5af4bc16284cfab56ef1473
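
One way to audit the span attributes in an extracted table like this, independently of pandas, is to check that every row occupies the same number of grid slots once rowspans from earlier rows are counted. The sketch below uses only the standard library. The names `SpanAudit` and `row_widths` are hypothetical, and it assumes a single non-nested table whose span attributes are plain integers.

```python
from html.parser import HTMLParser


class SpanAudit(HTMLParser):
    """Collect (rowspan, colspan) for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Missing or empty attributes default to 1 (assumes integer values).
            self.rows[-1].append(
                (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            )


def row_widths(rows):
    """Slots per row: own colspans plus slots held by rowspans from above."""
    widths = []
    carried = []  # one counter per held slot: rows it still occupies
    for row in rows:
        widths.append(len(carried) + sum(cs for _, cs in row))
        carried = [r - 1 for r in carried if r > 1]
        for rs, cs in row:
            if rs > 1:
                carried.extend([rs - 1] * cs)
    return widths


sample = (
    "<table><tr><th>A</th><th>B</th><th>C</th></tr>"
    "<tr><td rowspan='2' colspan='2'>x</td><td>1</td></tr>"
    "<tr><td>2</td></tr></table>"
)
audit = SpanAudit()
audit.feed(sample)
print(row_widths(audit.rows))  # [3, 3, 3]
```

If the widths differ between rows, the markup itself is not rectangular. If they all match, the markup is consistent and the discrepancy more likely lies in how the parser places the cells.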

@jowens
Author

jowens commented Apr 28, 2024

html_table_extractor has similar behavior (same errors). Here's a quick test:

https://gist.github.com/jowens/bd15b42accaa20e9c403af89719a5256

(which just has the table manually in the source code).

Here's the last line of the output, which corresponds to what's in the issue description.

['Radeon Pro V620(Navi\xa021)[10][11]\n', 'Nov 4, 2021\n', 'RDNA 2TSMC\xa0N7\n', '26.8×109520 mm2\n', '4608:288:128:7272 CU\n', '18252200\n', '525.6633.6\n', '233.6281.6\n', '33,63840,550\n', '16,81920,275\n', '1,0511,267\n', '32\n', '512\n', 'GDDR6256-bit\n', '128 MB\n', '16000\n', 'PCIe\xa04.0×16\n', '—\n', '300\xa0W\n']

@jowens
Author

jowens commented Apr 28, 2024

FWIW I just tested with read_html's flavor="lxml" and flavor="bs4" and they both returned identical results.

@jowens changed the title from "BUG: read_html does not properly structure some elements (possible rowspan or colspan issues)" to "BUG: read_html does not properly structure some html table elements (possible rowspan or colspan issues)" Apr 28, 2024
@attack68
Contributor

So is the summary here that all the tools you have tested for parsing this table, including pandas, return the same results, and that those results are all incorrect?

@jowens
Author

jowens commented Apr 29, 2024

Well, two tools (pandas and html_table_extractor), and those two tools return consistent but incorrect results, where incorrect is compared to how a web browser renders it.

Since these two tools (appear to) use different code to parse the table's cells, rowspans, and colspans, it seems possible that web browsers interpret the table differently than these two tools do. I looked at Chrome, Firefox, and Safari, each of which (I think) uses a different back end (Chromium/Gecko/WebKit). Web browsers are surely more forgiving of HTML errors.
