Add an experimental csv module exposing a streaming csv parser #3743
Conversation
Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.
```go
	"go.k6.io/k6/js/modules"
)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
```
My personal opinion: if it doesn't pollute the code/API a lot, then I'd go for it; in theory it doesn't sound like something very complex, while it might be very handy for certain use cases.
```go
)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields)?
```
My first reaction is: why not do so by default? But I guess we should/could take a look at similar parser libraries/tools and get inspiration from there (see what looks to be more common).
PS: Do you mean lines with at least one value missing, or those lines completely empty? 🤔
I think I raised this question as I reviewed Papa Parse's `parse` method config, which has a `skipEmptyLines` option: https://www.papaparse.com/docs
And I wondered whether we should have the same option 👍🏻 I've made a very quick and shallow pass on Node and Deno, and I don't see them having such an option (but I might have missed it).
> PS: Do you mean lines with at least one value missing, or those lines completely empty? 🤔
I meant specifically empty lines. The default behavior of Go's CSV parser is to not make assumptions on the number of fields per record, but there's an option to enforce it. Maybe it would be worth adding an option to the module's parser too for users to tell us how many fields per record to expect, and error if we find a different value in any of the records?
Regarding `skipEmptyLines`, it turns out that skipping them is the default behavior of the Go CSV parser, and it looks like it can't be overridden, so that sorts it out 👍🏻
```go
// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields)?
// TODO: Should we have an option to skip lines based on a specific predicate func(linenum, fields) bool?
```
Again my opinion; I'd say that'd be a nice-to-have, but not critical for a first iteration, so I'd keep that idea for a second iteration.
I talked with Pawel; he agreed with you, and mentioned that our current scope is good enough for 90% of the use cases, and we could do it in the future if we still want to 👍🏻
```go
// implementation details, but keep it public so that we can access it
// from other modules that would want to leverage its implementation of
// io.Reader and io.Seeker.
Impl file `js:"-"`
```
Nit: I think in general it's discouraged to expose an attribute whose type is private as part of the public surface of the package (`fs` in this case).
I agree, this is an abstraction leak. The reason I went with that as a start was that I found the `Read`/`ExportedRead` approach we had initially taken somewhat inelegant (at least it tickled my sense of "nice" code). But in hindsight it's probably the best compromise indeed.
The `Impl` approach, which is explicitly not exposed to users but to other k6 libs and modules, started from the rationale that the underlying `file` struct already implements `ReadSeeker`, and I might as well leverage that. But because I couldn't find a way to retrieve the underlying `file` implementation struct using `ExportTo` or by doing it manually, I went with a public `Impl` instead.
I'll give our initial approach another shot and try to make it look okay 👍🏻
Would it be feasible to define it as an `io.ReadSeeker`? 🤔
If possible, even if that requires a couple of tricks internally (because of `stat()`), maybe that's a decent way to expose it to other k6 libs and modules.
Do you mean something like this?

```go
type File struct {
	// ...
	ReadSeeker io.ReadSeeker `js:"-"`
	// ...
}
```
If so, it's something I considered, and I think it could work (I'd have to double-check) 🙇‍♂️
Something like that, yeah. I think it won't work as-is because of `stat()`, as I said, but perhaps we can find a workaround that lives only internally/privately in the package/implementation, with the benefit of keeping the public surface clean.
Let me see what I can do 🙇♂️
Alright, I've pushed a commit, which exposes the behavior as an interface instead. I like it much more, indeed.
I've created a `Stater` interface, made `file` implement it, and incorporated it into a `ReadSeekStater` interface, which I use to expose the underlying behavior instead of `file` itself. Let me know what you think 👍🏻
It looks much cleaner now, at the very least! 👌🏻
Thanks for giving form to what we started during the Crococon 💟
I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀
I'm not sure how far we are from being able to publish this as an experimental module, but I guess the feedback and usage we collect from users during its experimental stage is what will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.
I would expect to use the parser mostly within two scenarios:
- Getting one line per iteration
- Getting one line per VU
The current example, iterating over the batch during a single iteration, doesn't sound representative of the most common use case.
Am I missing something?
I think these are reasonable use cases, right!
Indeed, I think I mentioned at least one of them during the ideation phase at Crococon.
You're both right 🙂
If I summarize:
- "As a user, I want to create a parser, and every executed iteration gets the next line, and advances the line "head" to the next one for the next iteration (regardless of the VU executing the iteration)"
- "As a user, I want to create a parser, and when the VU code calls the `next` method, the next `line_number % vu_number` is returned to it"
Do you agree with these statements? 🙂
I think both call for rather different designs, and off the top of my head, it might not be easy to support both in an efficient manner. I'm gonna look into it and see what I come up with 🙇‍♂️
Yeah! I guess we could leave the second one for the future, if we manage to let different VUs use different parsers and leave the work to the test author. I know it's not ideal, and for certain situations where the number of VUs varies over time it might be more difficult, though.
In general I'd say: let's get an MVP and ship it! And, imho, that MVP should have at least one mechanism that doesn't imply having a new parser for each iteration, but one shared across iterations/VUs somehow (whichever we think has the best balance between popularity and feasibility). Once we have that, I'd vote for merging it, then iterating with more strategies.
Per-VU:

```javascript
const vuID = exec.vu.idInTest;

let parser = new csv.Parser(file, {
  fromLine: vuID, // potentially with the modulo
  toLine: vuID, // potentially with the modulo
});

export default async function () {
  const { done, value } = await parser.next();
  console.log(value);
}
```

Per-iteration:

```javascript
let parser = new csv.Parser(file);

export default async function () {
  const { done, value } = await parser.next();
  console.log(value);
}
```
They should already work; we just have to add a unit test for them.
Yeah, for the "per-VU" case, I guess there could be multiple strategies:
- Enabling an option to skip rows, e.g. only process the rows that are a multiple of some number.
- Suggesting that users, if they know the number of rows in advance, split the readers accordingly.
Re per-vu: it's interesting. I think we had completely different assumptions/understandings regarding the per-VU strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a VU number, whereas I had in mind that for 16 VUs, VU 1 would receive lines 1, 17, 33, etc.
> Re per-vu: It's interesting. I think we had completely different assumptions/understandings regarding the per-vu strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a vu number, whereas I had in mind that for 16 VU, vu 1 would receive line 1, 17, 33, etc.
Aren't these two the strategies I mentioned above? 🤔 Or did I miss any?
Glad to see we're aligned, then! Also on when to write comments, hahaha :)
```javascript
})();

export default async function () {
  let parser = new csv.Parser(file, {
```
For context @joanlopez and @codebien: I just found out that currently, because we don't have support for `await` in the init context, the only way (with the current design) to instantiate the parser from the init context is the following:
```javascript
let file;
let parser;

(async function () {
  file = await open('data.csv');
  parser = new csv.Parser(file, {
    delimiter: ',',
    skipFirstLine: true,
    fromLine: 3,
    toLine: 13,
  });
})();
```
Which works, but adds a workaround to the workaround :-/
An alternative I would consider would be to have the parser support receiving a file path and opening the file under the hood (which might involve exposing the currently private `openImpl` function from the `fs` module to other modules such as this one).
What do you think? 🦖
> Which works, but adds a workaround to the workaround :-/
It doesn't sound like a problem of this scope/pull request. The problem is `async` in the init context, and at some point we have to resolve it (we are already on it with ESM).
You're probably right indeed 👍🏻 🙇🏻
Posting here a summary of the use cases we discussed privately and that we'd like the module to tackle:
What?
This PR is a cleaned-up version of the CSV streaming parser we hacked during Crococon. It is aimed at addressing #2976.
It is in a preliminary state and aims at getting input and iterating on the design collaboratively.
What's not there yet

- … (since `csv.Parser.next` returns an iterator-like object with a `done` property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option), would we indeed want to have a dedicated method/API for that?

Why?

Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue encountered by users is that JS tends to be rather slow when performing parsing operations. Hence, we are leveraging the `fs` module constructs and the asynchronous APIs introduced in Goja over the last year to implement a Go-based "high-performance" streaming CSV parser.

Checklist

- I have run the linter locally (`make lint`) and all checks pass.
- I have run tests locally (`make tests`) and all tests pass.

Related PR(s)/Issue(s)

#2976