Large file lookups are slow #228

Open
forkdog opened this issue Aug 22, 2022 · 5 comments
@forkdog

forkdog commented Aug 22, 2022

I have a large HTML file, about 13 MB, and it takes far too long to find the elements I need to modify. Is there a faster way to find them?

import Foundation
import SwiftSoup

// Load and parse the whole file once.
let html = try String(contentsOf: url, encoding: .utf8)
let document = try SwiftSoup.parse(html)
let fragmentIds: [String] = [......] // about a thousand IDs
for fragmentID in fragmentIds {
	// Each select call searches the entire document for this id.
	let links = try document.select("[id=\(fragmentID)]")
	if links.count > 0 {
		// Insert an <a> linking to the fragment before the first match.
		let link = try document.createElement("a")
		try link.attr("href", fragmentID)
		try link.appendText(fragmentID)
		try links.get(0).before(link)
	}
}
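
One likely contributor to the slowness, offered as an assumption and not a measured result: every `select` call walks the whole document, so a thousand IDs mean a thousand full traversals of a 13 MB tree. A minimal sketch of an alternative is to collect all elements that have an `id` in a single pass, index them in a dictionary, and then look each fragment ID up directly. The helper name `insertFragmentLinks` is hypothetical, and the dictionary does an exact, case-sensitive match on the id, which may differ slightly from how the attribute selector compares values.

```swift
import SwiftSoup

/// Hypothetical helper: inserts an <a> before the first element matching each
/// id in `fragmentIDs`. Returns the number of links inserted.
@discardableResult
func insertFragmentLinks(in document: Document, fragmentIDs: [String]) throws -> Int {
    // Single traversal: gather every element that carries an id attribute.
    var elementsByID: [String: Element] = [:]
    for element in try document.select("[id]").array() {
        let id = element.id()
        // Keep only the first element per id, like links.get(0) in the loop above.
        if elementsByID[id] == nil {
            elementsByID[id] = element
        }
    }

    var inserted = 0
    for fragmentID in fragmentIDs {
        // Dictionary lookup instead of another full-document select.
        guard let target = elementsByID[fragmentID] else { continue }
        let link = try document.createElement("a")
        try link.attr("href", fragmentID)
        try link.appendText(fragmentID)
        try target.before(link)
        inserted += 1
    }
    return inserted
}
```

The inserted `<a>` elements have no `id`, so the index stays valid while the document is being modified. Whether this is fast enough for a file of this size would still need to be measured, since the initial parse is untouched.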
@aehlke
Contributor

aehlke commented Aug 31, 2023

Try Fuzi.

@boehs

boehs commented Dec 15, 2023

As ridiculous as it sounds, we're switching to Rust over FFI, using UniFFI and the scraper crate (built on html5ever). Our reasons:

  1. We weren't confident in Fuzi's CSS selector support
  2. We did not want to rewrite again
  3. An android version is in the cards so shared code would be nice

In preliminary tests on large pages with lots of parsing, this approach outperforms SwiftSoup by about 15 times, and that is without any concurrency (we had leaned heavily on concurrency to mitigate SwiftSoup's speed). The jury is still out on small pages. The conversion is very much incremental (800 ms of parsing was very much an emergency).

@aehlke
Contributor

aehlke commented Dec 17, 2023

Amazing. If you wrap that into an SPM package, please do share.

@boehs

boehs commented Dec 22, 2023

Setup was convoluted and poorly documented, so I've written a tutorial on how we set up UniFFI here. Note that instead of a Swift wrapper for the scraper crate, all the business logic lives in Rust, so unfortunately it's not generalizable to a package. This is because Rust is cool and FFI has some overhead. I can say that FFI has been a joy to use. It's a miracle how well it works once configured: there's absolutely no indication that what you're calling is a Rust function. A couple of limitations to be aware of, though:

  1. Previously we were using Double and Int types, and UniFFI maps to types like Int32 and Float. All you need is a conversion (or just realizing you don't need a Double anyway and changing everything to Float).
  2. Structs are mapped as well, but that's been annoying because I haven't found a way to take advantage of automatic Codable implementations, so right now that's manual. Issue here: How do I automatically implement protocols/extensions/conformance in Swift? mozilla/uniffi-rs#1905
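
As a rough illustration of both limitations, here is a minimal sketch in plain Swift. `PageStats`, its fields, and the `make` factory are hypothetical stand-ins for a UniFFI-generated record, not our actual types. The manual Codable part reflects the fact that Swift only synthesizes Codable in an extension when the extension sits in the same file as the type declaration, which is not the case when the type lives in generated bindings.

```swift
import Foundation

// Hypothetical stand-in for a UniFFI-generated record. The real one would live
// in the generated bindings file with whatever fields the Rust struct declares.
struct PageStats {
    var linkCount: Int32
    var score: Float
}

// Limitation 1: convert at the boundary between app types and bridged types.
extension PageStats {
    static func make(linkCount: Int, score: Double) -> PageStats? {
        // Int32(exactly:) returns nil instead of trapping when the value does not fit.
        guard let narrowed = Int32(exactly: linkCount) else { return nil }
        // Float(_:) rounds the Double to the nearest representable Float.
        return PageStats(linkCount: narrowed, score: Float(score))
    }
}

// Limitation 2: Codable conformance written by hand.
extension PageStats: Codable {
    private enum CodingKeys: String, CodingKey {
        case linkCount, score
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        self.init(
            linkCount: try container.decode(Int32.self, forKey: .linkCount),
            score: try container.decode(Float.self, forKey: .score)
        )
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(linkCount, forKey: .linkCount)
        try container.encode(score, forKey: .score)
    }
}
```

With this in place, `JSONEncoder().encode(stats)` and `JSONDecoder().decode(PageStats.self, from: data)` work as they would for a synthesized conformance; the cost is keeping the extension in sync with the Rust struct by hand.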

Overall I do not regret it.

Edit: Just to substantiate these claims

[two images]

This is largely verbatim copy-and-paste code, except that the SwiftSoup version runs many operations asynchronously. The result is a 14.34x difference.

@aehlke
Contributor

aehlke commented Mar 19, 2024

Thanks for sharing the writeup.

I also came across https://github.com/antoniusnaumann/cargo-swift which looks promising

lol-html is also interesting, not just for parsing but also for transforming: https://shadowfacts.net/2022/swift-rust/
