BUG: `read_html` does not properly structure some html table elements (possible `rowspan` or `colspan` issues) #58461

jowens · 2024-04-28T03:33:07Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

#!/usr/bin/env python3

import pandas as pd
import re
import requests
from io import StringIO

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 500)

html = requests.get(
    "https://en.wikipedia.org/wiki/Template:AMD_Radeon_Pro_V_series"
).text

# read_html returns a list of DataFrames, one per table matching the pattern
dfs = pd.read_html(
    StringIO(html),
    match=re.compile("Radeon Pro V620"),
)
print(dfs)

Issue Description

In the last two rows of the parsed table, entries toward the right-hand side land in the wrong columns. I checked the HTML source, and although it uses some complex rowspan and colspan combinations, it appears to be properly constructed.

[Screenshot: parsed read_html output]

[Screenshot: the table as rendered on Wikipedia]

I acknowledge that I'm using a slightly older pandas (2.2.1) than the latest release, but I looked through recent issues and check-ins related to read_html, and I don't believe this has been fixed or reported.

Expected Behavior

I expect the column called "Memory / L3 Cache" to only be populated in the last row.

I expect the two power entries in the last two rows to be placed in the "TDP" column.

Most of the right side of the bottom two rows is misplaced.

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0.dev0+gb7d2f7ffc.d20240415
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


samukweku · 2024-04-28T11:00:46Z

@jowens If you use another tool for the extraction other than pandas, do you get a different result?

attack68 · 2024-04-28T11:32:37Z

I didn't implement any of this and haven't checked the implementation, but my guess is that:

a) reading a grid-based table is straightforward.
b) accounting for rowspan or colspan separately is an extension to a) which is not too difficult.
c) accounting for simultaneous cross-over of rowspan and colspan is really difficult and needs initial passes or very specific structuring. It probably isn't tested.

Can anyone confirm?
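
For reference, here is a minimal sketch of how case c) can be handled with a single pass over an "occupied slot" map. It uses only the standard library and is not pandas' implementation; `CellCollector` and `expand_grid` are hypothetical names, and the sample table is made up for illustration. It assumes one flat table with no nesting.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect [text, rowspan, colspan] for each cell of one flat table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data


def expand_grid(rows):
    """Place every cell into a dense grid, repeating spanned text in each slot."""
    occupied = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            # Claim the full rowspan x colspan rectangle for this cell.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.setdefault((r + dr, c + dc), text.strip())
            c += colspan
    n_cols = max((col for _, col in occupied), default=-1) + 1
    # Rowspans that run past the last <tr> are clipped to the rows present.
    return [[occupied.get((r, c)) for c in range(n_cols)] for r in range(len(rows))]


# Made-up table: one cell spans two rows AND two columns at once.
sample = """
<table>
<tr><th rowspan="2">Model</th><th colspan="2">Clock</th><th rowspan="2">TDP</th></tr>
<tr><th>Base</th><th>Boost</th></tr>
<tr><td>A</td><td colspan="2" rowspan="2">n/a</td><td>100 W</td></tr>
<tr><td>B</td><td>300 W</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(sample)
grid = expand_grid(collector.rows)
for row in grid:
    print(row)
```

In this sketch the "n/a" block is repeated in both of the last two rows, so "300 W" still lands in the TDP column. Whether the Wikipedia table trips on this case or on something else is not established here.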

jowens · 2024-04-28T14:16:22Z

> If you use another tool for the extraction other than pandas, do you get a different result?

Suggestion for that other tool? I'm happy to try.

samukweku · 2024-04-28T15:35:11Z

@jowens A quick search on Google turns up this html-extractor; I haven't used it, though (caveat). I asked the earlier question to see whether there is a tool that does it right that we can compare against. It seems @attack68 has looked into your question more and may have figured out the possible bug?

jowens · 2024-04-28T16:34:31Z

Just for posterity, here's the specific Wikipedia revision we're discussing here, in case it gets edited:

https://en.wikipedia.org/w/index.php?title=Template:AMD_Radeon_Pro_V_series&oldid=1220301074

and here's a gist where I extracted everything between <table> and </table>:

https://gist.github.com/jowens/8e42fa17a5af4bc16284cfab56ef1473

jowens · 2024-04-28T16:52:57Z

html_table_extractor has similar behavior (same errors). Here's a quick test:

https://gist.github.com/jowens/bd15b42accaa20e9c403af89719a5256

(which just has the table manually in the source code).

Here's the last line of the output, which corresponds to what's in the issue description.

['Radeon Pro V620(Navi\xa021)[10][11]\n', 'Nov 4, 2021\n', 'RDNA 2TSMC\xa0N7\n', '26.8×109520 mm2\n', '4608:288:128:7272 CU\n', '18252200\n', '525.6633.6\n', '233.6281.6\n', '33,63840,550\n', '16,81920,275\n', '1,0511,267\n', '32\n', '512\n', 'GDDR6256-bit\n', '128 MB\n', '16000\n', 'PCIe\xa04.0×16\n', '—\n', '300\xa0W\n']

jowens · 2024-04-28T17:00:14Z

FWIW I just tested with read_html's flavor="lxml" and flavor="bs4" and they both returned identical results.

attack68 · 2024-04-29T14:23:54Z

So is the summary here that all the tools you have tested for parsing this table, including pandas, return the same results, and that those results are all incorrect?

jowens · 2024-04-29T14:43:57Z

Well, two tools (pandas and html_table_extractor), and those two tools return consistent but incorrect results, where incorrect is compared to how a web browser renders it.

Since these two tools both (appear to) have different code that parses the table's cells / rowspans / colspans, it seems like a possibility that web browsers (I looked at Chrome/Firefox/Safari, each of which [I think] uses a different back end [Chromium/Gecko/WebKit]) might interpret the table differently than these two tools. Web browsers are surely more forgiving of HTML errors.
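
One way to test whether the markup itself is ambiguous is to audit the spans directly. The sketch below (standard library only; `SpanReader` and `audit_spans` are hypothetical names, and the sample table is made up) reports how many slots each row fills and which slots are claimed by more than one cell. As far as I know, the HTML table model treats overlapping cells as an error, and implementations may recover from it differently, so any conflict reported here would be a candidate explanation for browsers and parsers disagreeing.

```python
from html.parser import HTMLParser


class SpanReader(HTMLParser):
    """Record (rowspan, colspan) for every cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self.rows[-1].append((int(a.get("rowspan") or 1), int(a.get("colspan") or 1)))


def audit_spans(html):
    """Return (row_widths, conflicts) for one flat table.

    row_widths[r] is the number of slots filled in row r; conflicts lists
    the (row, col) slots claimed by more than one cell.
    """
    reader = SpanReader()
    reader.feed(html)
    claimed = set()
    conflicts = []
    for r, cells in enumerate(reader.rows):
        c = 0
        for rowspan, colspan in cells:
            # Skip slots already covered by a rowspan from an earlier row.
            while (r, c) in claimed:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    slot = (r + dr, c + dc)
                    if slot in claimed:
                        conflicts.append(slot)  # two cells overlap here
                    claimed.add(slot)
            c += colspan
    widths = [sum(1 for row, _ in claimed if row == r) for r in range(len(reader.rows))]
    return widths, conflicts


# Made-up table with one short row and one overlapping cell.
sample = """
<table>
<tr><th rowspan="2">A</th><th colspan="2">B</th></tr>
<tr><td>1</td></tr>
<tr><td>x</td><td rowspan="2">y</td><td>z</td></tr>
<tr><td colspan="3">w</td></tr>
</table>
"""

widths, conflicts = audit_spans(sample)
print("slots filled per row:", widths)
print("overlapping slots:", conflicts)
```

Running the same audit on the table extracted in the gist above would show whether every row fills the same number of slots and whether any spans collide.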

jowens added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 28, 2024

attack68 changed the title ~~BUG:~~ BUG: read_html does not properly structure some elements in the DataFrame Apr 28, 2024

attack68 added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Apr 28, 2024

attack68 changed the title ~~BUG: read_html does not properly structure some elements in the DataFrame~~ BUG: read_html does not properly structure some elements (possible rowspan or colspan issues) Apr 28, 2024

jowens mentioned this issue Apr 28, 2024

table from Wikipedia that doesn't parse yuanxu-li/html-table-extractor#23

Closed

jowens changed the title ~~BUG: read_html does not properly structure some elements (possible rowspan or colspan issues)~~ BUG: read_html does not properly structure some html table elements (possible rowspan or colspan issues) Apr 28, 2024
