feat(invariant): fuzz with values from events and return values #7666
Conversation
With the latest PR changes, foundry can catch the DSChief bug in about 30 seconds (500 runs / depth 500, missing it roughly 2 out of 10 times) and never missed it in ~110 seconds (2000 runs with a depth of 500).
The proposed solution is to:
A further improvement to consider is making samples more efficient by collecting and applying them per type. However, this introduces code complexity, as we would have to decode results / events with the proper target ABI and also update fuzz strategies to take the fuzzed types into account. @mds1 @klkvr would love to hear your thoughts on this approach, thanks
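A minimal sketch of the per-type sample idea, with hypothetical names and `u64` stand-ins for decoded ABI values (the real foundry code works with fully decoded ABI values): collected return/event values are bucketed per type tag with a cap, and later fuzz runs can draw from those buckets.

```rust
use std::collections::HashMap;

/// Toy model of a per-type sample dictionary. All names here are
/// illustrative, not foundry's actual implementation.
pub struct SampleDict {
    /// Samples keyed by a type tag, e.g. "uint256" or "address".
    samples: HashMap<String, Vec<u64>>,
    /// Upper bound on samples kept per type (the PR uses the run depth).
    limit: usize,
}

impl SampleDict {
    pub fn new(limit: usize) -> Self {
        Self { samples: HashMap::new(), limit }
    }

    /// Collect a decoded return value or event value for later reuse,
    /// dropping it if the per-type bucket is already full.
    pub fn insert_sample_value(&mut self, ty: &str, value: u64) {
        let bucket = self.samples.entry(ty.to_string()).or_default();
        if bucket.len() < self.limit {
            bucket.push(value);
        }
    }

    /// Pick a previously collected sample of the requested type, if any.
    pub fn sample(&self, ty: &str, idx: usize) -> Option<u64> {
        self.samples
            .get(ty)
            .and_then(|b| b.get(idx % b.len().max(1)).copied())
    }
}
```

With `limit` set to the run depth, each call in a run can contribute at most one value per type before the bucket saturates, which is the worst-case reasoning discussed later in this thread.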
…ing. Decode results and persist per types. Use typed samples when fuzzing from state.
Force-pushed from 76d57d0 to 316b1a5
I went ahead and committed a change to collect and apply samples per result type, and with this approach I was able to reproduce the DSChief bug consistently in preliminary testing, with the same 500 runs / 500 depth.
this makes a lot of sense, the next step should probably be to improve the way we handle logs and storage
only have a couple comments
```rust
if rand::thread_rng().gen_range(0..100) < 50 {
    typed_samples
```
I believe we should use prop_perturb here as the randomness source
I reorganized this part in 58fc5a9; not sure we still need perturb, please let me know what you think
I see we are still using rand::thread_rng(), which I believe causes proptest runs with the same seed to generate different outputs, so perhaps we still need prop_perturb to ensure deterministic randomness?
Makes sense. I added a slightly different way using (0..100).prop_flat_map(Just) in d00c192, please check.
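To illustrate why the reviewer pushes for seed-derived randomness: if the 50% coin flip comes from `rand::thread_rng()`, two runs with the same proptest seed can diverge, whereas any randomness derived from the test seed (which is what routing the decision through the strategy via `prop_perturb` or `prop_flat_map` achieves) replays identically. A toy seeded generator, not proptest's actual RNG, makes the point:

```rust
/// Minimal linear congruential generator standing in for a seeded RNG.
/// Same seed in, same decision sequence out, which is the reproducibility
/// property `rand::thread_rng()` does not provide.
struct Lcg(u64);

impl Lcg {
    fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    fn next(&mut self) -> u64 {
        // Knuth's MMIX constants.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }

    /// The "use a collected sample 50% of the time" coin flip from the
    /// PR, but driven by the seeded generator instead of thread_rng.
    fn use_sample(&mut self) -> bool {
        self.next() % 100 < 50
    }
}
```

Two generators built from the same seed produce identical flip sequences, so a failing run can be replayed from its seed alone.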
```rust
// Decode result and collect samples to be used in subsequent fuzz runs.
if !result.is_empty() {
    if let Ok(decoded_result) = func.abi_decode_output(result, true) {
        dict.insert_sample_value(decoded_result, run_depth);
```
What's the motivation behind using run_depth as a limit for values of a certain type?
We need a limit for samples to remain relevant: if it is too low, we're going to use the same values / repeat the same test too many times; at the same time, if it grows too big, samples won't be exercised enough to reveal failures due to potential dependencies.
The test depth is a limit I came up with to make sure that, in the extreme case where we run with a test depth of n for a fuzz target containing n functions with one return value each, we collect one sample from each function.
We could also introduce a new config option for this to replace the default behavior.
> The test depth is a limit I came up with to make sure that, in the extreme case where we run with a test depth of n for a fuzz target containing n functions with one return value each, we collect one sample from each function.
Is this assumption true given how fuzz targets and selectors are chosen? i.e. I don't think there's a guarantee you cover all n functions with a depth of n, especially if there's more than one contract. Also related to #2986.
But regardless, I'll echo my thoughts in #7666 (comment) — performance-wise (catching DSChief) this PR looks really good, and I think it's better to ship as-is and revisit this limit once we have benchmarks, than to block this PR on deciding a good value. Since before benchmarks we're just taking our best guess at what sensible values are anyway :)
> Is this assumption true given how fuzz targets and selectors are chosen? i.e. I don't think there's a guarantee you cover all n functions with a depth of n, especially if there's >1 contract. Also related to #2986
Indeed, that is not a guarantee, just the ideal case; as you say, it can be improved as we go. The only concern is that if this limit is too low, other relevant data collected won't be exercised enough.
I added logs decoding in d9d8619 and will track storage handling in a follow-up PR (to keep the scope limited for this one), if you're OK with that.
```
@@ -247,7 +249,17 @@ impl<'a> InvariantExecutor<'a> {
let mut state_changeset =
    call_result.state_changeset.to_owned().expect("no changesets");

collect_data(&mut state_changeset, sender, &call_result, &fuzz_state);
if !&call_result.reverted {
```
I don't think there's any valuable data we could collect from a reverted call, hence adding this; please let me know if I'm missing something. (A further improvement when fail-on-revert is set to false would be to remove reverted calls from the final sequence, which should improve shrinking performance.)
Agreed that we don't need to collect data from a reverted call
👍, will follow up with a PR to exclude reverted calls from the final sequence when running with fail-on-revert = false; it should improve the shrinking phase a lot.
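That follow-up idea can be sketched as a simple filter over the candidate counterexample sequence (the `CallResult` shape here is hypothetical, not foundry's actual type): reverted calls cannot have contributed state changes, so dropping them gives the shrinker a shorter sequence to start from.

```rust
/// Toy stand-in for one call's outcome in an invariant run.
/// This shape is illustrative only.
#[derive(Clone, Debug, PartialEq)]
struct CallResult {
    selector: &'static str,
    reverted: bool,
}

/// Drop reverted calls from a candidate counterexample sequence.
/// Only meaningful with fail-on-revert = false, where reverted calls
/// are tolerated during the run but add nothing to the failure.
fn prune_reverted(sequence: Vec<CallResult>) -> Vec<CallResult> {
    sequence.into_iter().filter(|c| !c.reverted).collect()
}
```

Shrinking cost grows with sequence length, so pruning before the shrinking phase starts is where the speedup would come from.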
In general the approach described here makes a lot of sense to me, definitely supportive. Regarding:
I think your defaults seem pretty reasonable, so we should avoid increasing UX complexity or spending too much time debating parameter defaults until we have confidence that the changes matter or that the added complexity is worth it. We can gain that confidence once we have good benchmarks to compare against. So my suggestion would be to stick with what we have here, but document all the defaults/assumptions somewhere (I'm indifferent as to where); that way we can easily revisit these decisions and adjust based on data later.
…to call details and use it to generate counterexample
Force-pushed from 942f2b7 to 55fd876
```rust
        (sender, contract)
    })
})
.prop_map(|(sender, call_details)| BasicTxDetails::new(sender, call_details))
```
Please validate this addition; I don't see any issue with it, but confirmation would be great.
Sorry, could you give a TLDR on this change? Are we just transforming the type here?
Yeah, it's just a code-cleanup-related change: it creates a BasicTxDetails, which is now a struct (it used to be a type alias), introduced in 55fd876.
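For context, the cleanup replaces a tuple alias with a named struct along these lines (field and type names are illustrative, not necessarily foundry's exact definitions):

```rust
// Before: a bare tuple alias, roughly
//   type BasicTxDetails = (Address, CallDetails);
// After: a named struct with a constructor, so call sites read
//   BasicTxDetails::new(sender, call_details)
// instead of building an anonymous tuple.

type Address = [u8; 20];

#[derive(Clone, Debug, PartialEq)]
struct CallDetails {
    target: Address,
    calldata: Vec<u8>,
}

#[derive(Clone, Debug, PartialEq)]
struct BasicTxDetails {
    sender: Address,
    call_details: CallDetails,
}

impl BasicTxDetails {
    fn new(sender: Address, call_details: CallDetails) -> Self {
        Self { sender, call_details }
    }
}
```

Named fields make downstream code (e.g. counterexample printing) self-documenting compared to `.0` / `.1` tuple accesses, at the cost of the small `prop_map` shown above.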
got it, seems reasonable to me but deferring to @klkvr since it's an implementation question :)
Force-pushed from 7968802 to 481971a
lgtm
lgtm
Motivation
Ref #51
https://forum.openzeppelin.com/t/using-automatic-analysis-tools-with-makerdao-contracts/1021/2
Solution
Tests
```shell
forge test --mt invariant_check_dschief_with_return_value
forge test --mt invariant_check_dschief_with_event
```