Releases: lorey/mlscraper

1.0.0rc3

24 Jun 14:04
Pre-release
  • improved training performance by 10x (again) by trying to generate scrapers for highly similar matches first
  • added the first pseudo-class CSS selectors by implementing nth-child, e.g. div a:nth-child(1)
  • added child selector generation, e.g. .user-box > a
  • added attribute-based CSS selectors, e.g. a[itemprop="user"]
  • added automated tests for GitHub profile pages
  • added lazy hashing for node elements
  • extended text matching to also include parent elements that contain the same text
  • fixed a bug where searching for values resulted in image dimensions being matched
  • fixed a bug where text that did not exactly match the provided sample was selected anyway
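
The three selector kinds added in this release (nth-child, child, and attribute-based) can be illustrated with a minimal sketch. This is not mlscraper's implementation: candidate_selectors is a hypothetical helper, the markup is made up, and the standard library's ElementTree stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for illustration.
MARKUP = """
<div class="user-box">
  <a itemprop="user" href="/lorey">lorey</a>
  <a href="/lorey/repos">repos</a>
</div>
""".strip()


def candidate_selectors(root, target):
    """Return illustrative CSS selector candidates for `target` (hypothetical helper)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(target)

    candidates = [target.tag]

    # Attribute-based selectors, e.g. a[itemprop="user"].
    for name, value in target.attrib.items():
        if name != "class":
            candidates.append(f'{target.tag}[{name}="{value}"]')

    if parent is not None:
        # :nth-child is 1-based and counts all element siblings.
        position = list(parent).index(target) + 1
        candidates.append(f"{target.tag}:nth-child({position})")

        # Child selectors built from the parent's classes, e.g. .user-box > a.
        for css_class in parent.get("class", "").split():
            candidates.append(f".{css_class} > {target.tag}")

    return candidates


root = ET.fromstring(MARKUP)
first_link = root.find("a")
for selector in candidate_selectors(root, first_link):
    print(selector)
```

A real generator would additionally need to check which candidates select exactly the sampled nodes; the sketch only enumerates the selector shapes.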

1.0.0rc2

21 Jun 19:39
Pre-release
  • fixed a bug where text inside a tag was only selected if not enclosed by whitespace
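
As a rough illustration of the behaviour this fix describes, the sketch below compares a node's text with a sample after trimming surrounding whitespace. The function text_matches is a hypothetical name, not part of mlscraper, and the real matching logic may differ.

```python
def text_matches(node_text: str, sample: str) -> bool:
    """Hypothetical helper: compare texts while ignoring surrounding whitespace."""
    # Trim both sides so "<p>\n  Albert Einstein\n</p>" still matches the sample,
    # but require the remaining text to be identical.
    return node_text.strip() == sample.strip()


print(text_matches("\n  Albert Einstein\n", "Albert Einstein"))
print(text_matches("Albert Einstein, physicist", "Albert Einstein"))
```

The first call matches because only the enclosing whitespace differs; the second does not, since the remaining text is not identical to the sample.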

1.0.0rc1

21 Jun 16:02
Pre-release

mlscraper has been rewritten from the ground up and is now easier to use, more flexible, and faster than ever. This is the first release candidate for the upcoming 1.0 version. Feel free to try it out with pip install --pre mlscraper.

  • scrapers can extract arbitrary data structures (lists, dicts, lists of dicts and even lists of lists)
  • depending on the page, one example might be enough to train a scraper
  • the generation of CSS selectors has been overhauled and is now more efficient