Unicode error with non-ascii Infobox boxterm #138

Open
baerbock opened this issue Feb 13, 2019 · 2 comments

@baerbock

I would like to extract the infobox of the Bulgarian Railway Line No. 1 article.

import wptools
page = wptools.page('Железопътна линия 1 (България)', lang='bg')
page.get_parse()
page.data['ЖП линия']

which fails.

Are infoboxes not detected if they have an unusual name (ЖП линия)?

@siznax
Owner

siznax commented Feb 14, 2019

Thanks for trying wptools @baerbock!

We should have support for this via the boxterm keyword argument (boxterm='ЖП линия'), which the help text documents:

>>> help(wptools.page)
class WPToolsPage(wptools.restbase.WPToolsRESTBase, wptools.wikidata.WPToolsWikidata, wptools.core.WPTools)
 |  WPtools Page class, derived from wptools.core
 |
 |  Method resolution order:
 |      WPToolsPage
 |      wptools.restbase.WPToolsRESTBase
 |      wptools.wikidata.WPToolsWikidata
 |      wptools.core.WPTools
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, *args, **kwargs)
 |      Returns a WPToolsPage object
 |
 |      Gets a random title without arguments
 |
 |      Optional positional {params}:
 |      - [title]: <str> Mediawiki page title, file, category, etc.
 |
 |      Optional keyword {params}:
 |      - [boxterm]: <str> Infobox title name or substring
 |      - [endpoint]: <str> alternative API endpoint (default=/w/api.php)
 |      - [lang]: <str> Mediawiki language code (default=en)
 |      - [pageid]: <int> Mediawiki pageid
 |      - [variant]: <str> Mediawiki language variant
 |      - [wiki]: <str> alternative wiki site (default=wikipedia.org)
 |      - [wikibase]: <str> Wikidata database ID (e.g. 'Q1')
 |
 |      Optional keyword {flags}:
 |      - [silent]: <bool> do not echo page data if True
 |      - [skip]: <list> skip actions in this list
 |      - [verbose]: <bool> verbose output to stderr if True
 ...

but that currently raises a UnicodeDecodeError in this case:

>>> page = wptools.page('Железопътна_линия_1_(България)', lang='bg', boxterm='ЖП линия')
>>> page.get()
bg.wikipedia.org (query) Железопътна_линия_1_(Б...
bg.wikipedia.org (parse) 596059
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wptools/page.py", line 522, in get
    self.get_parse(False, proxy, timeout)
  File "wptools/page.py", line 603, in get_parse
    self._get('parse', show, proxy, timeout)
  File "wptools/core.py", line 183, in _get
    self._set_data(action)
  File "wptools/page.py", line 204, in _set_data
    self._set_parse_data()
  File "wptools/page.py", line 255, in _set_parse_data
    infobox = utils.get_infobox(parsetree, boxterm)
  File "wptools/utils.py", line 37, in get_infobox
    if title and boxterm in title:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

I'll have to take a closer look at what's causing this.
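
A possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the `__builtin__.object` entry in the help output suggests this session ran under Python 2, where `boxterm in title` implicitly decodes a byte string as ASCII when the other operand is unicode. Byte 0xd0 is the first UTF-8 byte of "Ж", which would match the error message. The sketch below shows one way to compare text with text; `to_text` and `boxterm_matches` are hypothetical helpers for illustration and are not part of wptools.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers (not wptools code): normalize both operands to text
# before the substring test, so no implicit ASCII decode can happen.


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):  # on Python 2, bytes is the plain str type
        return value.decode(encoding)
    return value


def boxterm_matches(title, boxterm):
    """Return True if boxterm occurs in title, comparing text to text."""
    if not title or not boxterm:
        return False
    return to_text(boxterm) in to_text(title)


if __name__ == "__main__":
    # A byte-string boxterm against a text title, as in the failing call.
    print(boxterm_matches(u"ЖП линия", u"ЖП линия".encode("utf-8")))
```

Whether the byte string enters through the `boxterm` argument or through the parsed template title would still need to be checked in `utils.get_infobox`.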

siznax self-assigned this Feb 14, 2019
siznax added the Bug label Feb 14, 2019
@siznax
Owner

siznax commented Mar 4, 2019

Sorry for the delay here. Hope to get to this soon... 😃

siznax changed the title from "Infobox with non-latin name" to "Unicode error with non-ascii Infobox boxterm" Mar 5, 2020