
[WIP] Use an epsilon independent of dt when comparing times #1057

Open · wants to merge 4 commits into base: master
Conversation

mstimberg
Member

I'm not super happy with this fix, but I did not see any other obvious solution.

As a bit of context: in Brian, we are making our life hard by being very flexible on everything related to clocks. Users can use multiple clocks with varying dt's (which do not have to be multiples of each other), dt's can change between runs, and runs do not need to express times in multiples of dt (e.g. with the default dt of 0.1ms, you can run a network twice for 0.05ms and it will work fine, running everything for a single time step). Since this entails comparing floating point time values (instead of just integer time steps), we have to deal with floating point inaccuracies and cannot simply do exact comparisons to figure out whether two clocks with differing dt's should be executed within the same time step, etc.

Our previous solution was to consider two times as equal iff they differ by less than 0.01% of dt. The problem is that 0.01% of dt can be a meaningful amount of time if dt is very long. In #1054, a run_regularly operation had a very long dt, and a run had a runtime of less than 0.01% of this dt before doing a long run. From the point of view of the run_regularly operation, the start and end time of the short run were identical, leading to weird behaviour (which differed between standalone and runtime...).

The fix I went for is to simply use an absolute difference when comparing times, i.e. 10ns (which is 0.01% of the default dt). This has the advantage of being easy to explain and should avoid issues when having big differences in dt. It is still far away from reasonably chosen dt's, and if someone wants to simulate stuff in the nanosecond range for some reason, they could always change Clock.epsilon_dt manually.
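For concreteness, here is a minimal sketch of the comparison rule described above; the function name and structure are illustrative only, not Brian's actual internals:

```python
EPSILON = 10e-9  # 10 ns, i.e. 0.01% of the default dt of 0.1 ms

def times_equal(t1, t2, epsilon=EPSILON):
    # Two clock times count as the same instant if they differ by less
    # than a fixed absolute amount, independent of either clock's dt.
    return abs(t1 - t2) < epsilon

print(times_equal(0.1e-3, 0.1e-3 + 1e-12))  # True: 1 ps apart, well below 10 ns
print(times_equal(0.0, 1e-6))               # False: 1 us apart
```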

Fixes #1054 (@felix11h, I tested with the code you provided, but please give it a go with your actual code as well)

@thesamovar
Member

Will look in more detail later but quick comment.

This solution makes me feel a bit uncomfortable because it's hard-coding a relevant time scale. What about, as an alternative, taking 0.01% of the minimum dt across all clocks in the Network?

@mstimberg
Member Author

This solution makes me feel a bit uncomfortable because it's hard coding a relevant time scale.

I thought so before as well, but I now wonder whether this isn't a feature rather than a bug... The problem I see with your alternative is that it potentially makes the behaviour of one object depend on the presence of another object. Say I have a few things running on a very slow clock and a few others on a fast clock (i.e. the situation that triggered #1054). With the "minimum dt"-based solution, you'd again trigger the bug if you removed/deactivated the fast objects, e.g. for debugging. I find this more difficult to explain than "you have to manually change Clock.epsilon_dt if you want to use nanosecond-scale time steps"...

But then, all this is about extreme corner cases, and I'm happy to think this over a bit more before merging.

@thesamovar
Member

That's a fair point, but I'm still uncomfortable with the global time scale. The reason for having an epsilon relative to dt is that it relates more or less to the floating point equality test abs(x-y)/x < epsilon, which is lost if you have a single absolute time scale (i.e. you just use abs(x-y) < epsilon as your test).
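In illustrative Python, the two styles of test being contrasted here are:

```python
def equal_relative(x, y, eps=1e-4):
    # Scale-invariant, like the classic floating-point equality test
    # (assumes x != 0, as in the formula quoted above).
    return abs(x - y) / x < eps

def equal_absolute(x, y, eps=10e-9):
    # Fixed absolute tolerance, which bakes a time scale into the test.
    return abs(x - y) < eps
```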

The choice of epsilon also puts an upper bound on how long the simulated duration can be. With dt = 0.1ms we want to convert times into 32-bit signed ints, if I remember correctly, which means our max duration is about 60 hours, whereas if we have a hardcoded epsilon based essentially on a dt of 0.01ms, that drops to 6 hours. Admittedly, not that many people run simulations for that much simulated time - but some do, I guess!

Can you describe the origin of the bug in #1054 in more detail? Maybe there's another solution here.

@mstimberg
Member Author

The choice of epsilon also puts an upper bound on how long the simulated duration can be. With dt = 0.1ms we want to convert times into 32-bit signed ints, if I remember correctly, which means our max duration is about 60 hours, whereas if we have a hardcoded epsilon based essentially on a dt of 0.01ms, that drops to 6 hours.

I don't think this is the case. The epsilon is only relevant when comparing two times; we don't use it to convert anything into integers (also: why 0.01ms? In the current PR, epsilon is 10ns = 0.00001ms). I think in practice there are only differences from a dt-based solution in situations where the "expected behaviour" is rather unclear. I can only come up with very contrived examples, but if you had a clock with dt=10*second and one with dt=10.001*second, then the timestep at 10s would execute both clocks under a dt-based rule (because the two clocks' t are less than 0.01% of the smallest dt apart), but not under the 10ns-based rule. You'd get the opposite effect with dt's smaller than 0.1ms.
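Roughly, that contrived example plays out like this (illustrative numbers, not Brian code):

```python
t1, t2 = 10.0, 10.001   # next tick of each clock, in seconds
dt_min = 10.0           # smaller of the two dt's

# dt-based rule: tolerance is 0.01% of the smallest dt, i.e. 1 ms here
print(abs(t1 - t2) < 1e-4 * dt_min)  # True -> the clocks tick together

# fixed rule from this PR: tolerance is always 10 ns
print(abs(t1 - t2) < 10e-9)          # False -> the clocks tick separately
```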

Can you describe the origin of the bug in #1054 in more detail? Maybe there's another solution here.

The details are actually quite tricky because our system for handling the clocks is quite convoluted. I'd say the main issue is the following:
At the beginning of a run, we calculate the "end timestep" for each clock. E.g. for a clock with dt = 0.1ms and a run time of 100ms, we'd set the end step to 1000. As always, there are rounding issues here. We first try to round with respect to dt and see whether the solution is good enough, i.e. (until now) less than 0.01% of dt away from the requested run time. If this is not the case, we use ceiling instead of rounding. The idea here is that e.g. for an object with dt = 50*ms, we do want it to run at 0ms, even if the run only has a length of 10ms (i.e. the end timestep should be 1 instead of 0, even though 0*dt is closer to the requested run time than 1*dt).

If the dt is 100000s (as in #1054), and we request a run of 1s, then the end timestep will be calculated to be 0, because it is "good enough" with respect to the requested time.
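A rough sketch of that logic, with the numbers from above (hypothetical code, not Brian's actual implementation):

```python
import math

def end_timestep(run_time, dt, epsilon=1e-4):
    # First try plain rounding and accept the result if it lands within
    # epsilon*dt (i.e. 0.01% of dt) of the requested run time...
    rounded = round(run_time / dt)
    if abs(rounded * dt - run_time) < epsilon * dt:
        return int(rounded)
    # ...otherwise take the ceiling, so that e.g. a 10 ms run with
    # dt = 50 ms still executes the step at t = 0.
    return int(math.ceil(run_time / dt))

print(end_timestep(100e-3, 0.1e-3))  # 1000: the normal case
print(end_timestep(10e-3, 50e-3))    # 1: the object runs once, at t=0
print(end_timestep(1.0, 100000.0))   # 0: the bug -- rounding is "good enough"
```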

This is the main issue; later in the run process, similar rounding issues then make the simulation either not run at all or run things too often.

Just to make sure there is no misunderstanding: your proposed solution would definitely fix the bug, I'm just worried about introducing hard-to-debug issues where the behaviour of one object depends on the presence of another object.

@felix11h

felix11h commented Mar 1, 2019

Fixes #1054 (@felix11h, I tested with the code you provided, but please give it a go with your actual code as well)

Thank you for addressing this so quickly! I cloned the branch and can confirm that this also fixes the behaviour in my full network.

In the meantime, however, I discovered a new and related(?) problem that does not seem to be fixed by the suggested changes. Even when the run_regularly statement is removed in #1054, simply increasing the simulation time T2 to 500000*second seems to produce similar behaviour.

@mstimberg
Member Author

@felix11h uh, so this is indeed not completely fixed, thanks for making us aware! I'm away the coming week, but I'll have a closer look at this when I'm back.

Also cast integers to 64 bit instead of 32 bit
@mstimberg
Member Author

I had another look at this; the latest issue that @felix11h mentioned was actually unrelated. When you set T2=500000*second, this leads to a number of time steps that is bigger than a 32-bit integer can hold (with dt=0.1ms), and the standalone code cast this value to a 32-bit int before storing it in a 64-bit int... I fixed this, but I'm having a hard time coming up with a test – the problem only occurs when actually running the code, so a successful test would mean running >2^31 time steps 😒
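To see the magnitude involved (illustrative arithmetic, not Brian code):

```python
dt = 0.1e-3            # 0.1 ms, in seconds
T2 = 500000.0          # seconds
n_steps = round(T2 / dt)
print(n_steps)                  # 5000000000
print(n_steps > 2**31 - 1)      # True: does not fit a signed 32-bit int
# A C-style cast to int32 would silently wrap around to:
print((n_steps + 2**31) % 2**32 - 2**31)  # 705032704
```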

@thesamovar
Member

Can you hack the clock jumping forward to time step 2^31-5 or something like that?

@mstimberg
Member Author

Can you hack the clock jumping forward to time step 2^31-5 or something like that?

Yes, I think that should work, I'll give it a try.
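A minimal sketch of such a test, assuming runtime mode where the network time can be assigned directly (as the discussion implies; the standalone variant needs the device.insert_code workaround mentioned in the next comment):

```python
from brian2 import *

defaultclock.dt = 0.1*ms
group = NeuronGroup(1, 'v : 1')
net = Network(group)
# Jump the clock to just below the 32-bit timestep limit; if any timestep
# arithmetic were still done with 32-bit ints, the next steps would overflow.
net.t = (2**31 - 5) * defaultclock.dt
net.run(10 * defaultclock.dt)
assert net.t > 2**31 * defaultclock.dt
```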

@mstimberg
Member Author

It is a little bit ugly (I had to use device.insert_code to set network.t in standalone), but the test works now.

All tests pass on my machine. @felix11h, can you give it another try as well?

Of course this still leaves open the question of whether the fix I implemented (a fixed epsilon instead of a relative epsilon) is a good one.

I also just realized that the user can change the epsilon rather easily for runtime (by setting Clock.epsilon_dt), but that this is not taken into account for standalone, where the value is hardcoded. For consistency, I'm going to change it (this would apply to a relative epsilon in the same way).

@mstimberg
Member Author

@thesamovar What do you think? Do you still prefer your solution of using the smallest dt as a reference, or do you have any other idea? (For reference: #1057 (comment))

@thesamovar
Member

I think I still slightly prefer my solution because it doesn't force a natural timescale, so maybe we should get comments from others. @romainbrette what do you think? Issue summarised in brief below so that you don't need to read the whole thread (Marcel, feel free to correct me if I've misrepresented it).

We have a bug where objects with a very large value of dt don't work correctly, because the code for checking whether two things happen in the same timestep uses a tolerance relative to dt (I think 0.01% of dt?) in order to work with floating point values, and having two objects with very different dt's introduces issues (because what seems small to one object can be large to the other). A large dt does make sense for some objects, e.g. a network_operation that only runs infrequently. Marcel proposes a global epsilon value that is used for these "same timestep" calculations, which can be changed by the user. My alternative was using the minimum value of dt across all objects in the Network, because this is time scale invariant. Marcel's problem with this is that it introduces a potentially hard-to-debug situation where the behaviour of one object depends on the inclusion or not of a potentially unrelated object. I can see the point here, but I also feel like having a dependence on a (to the user) unknown global time scale could also be confusing.

Any thoughts?

@mstimberg
Member Author

Would be happy to hear @felix11h's opinion as well!
I'd say we postpone the merge until after the 2.2.2 release (which I'm planning to do today); there is still some need for discussion, and potentially there are corner cases that we did not think of.

@vigneswaran-chandrasekaran
Member

Hello everyone, I read the whole thread and I think I understood the issue we're facing. I too feel that hardcoding the epsilon and forcing the user to set a manual epsilon value for the special cases is not the best choice, but it solves the issue for almost all reasonable values. Last week in class, my professor said that comparing two numbers with a fixed absolute tolerance epsilon breaks down at small scales (e.g. abs(0.0000000000012 - 0.0000000000013) < 0.000000001 will give unwanted output). To overcome this problem we can use abs(a-b) < epsilon*abs(a) && abs(a-b) < epsilon*abs(b) instead of abs(a-b) < epsilon. I hope this can avoid the manual setting. Regarding issue #1054: I guess that when we compare two times, viewing their difference on their relative scale would be effective and reduce rounding issues. So would requiring the end time to be within 0.01% of dt of the requested run time in a relative sense, instead of an absolute 0.01% of dt, solve the problem? The bottom line, as I see it, is that comparing, taking differences and rounding should be done on the scale of the variables considered rather than on a fixed global scale.
Please correct me if I'm wrong. Thanks in advance :)
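For illustration, the suggested symmetric relative test versus a fixed absolute epsilon (hypothetical Python, names chosen for clarity):

```python
def close_symmetric(a, b, eps=1e-4):
    # The difference must be small relative to *both* operands,
    # avoiding both division and the asymmetry of abs(a-b)/a < eps.
    return abs(a - b) < eps * abs(a) and abs(a - b) < eps * abs(b)

# At picosecond scales, a fixed absolute epsilon stops discriminating:
print(abs(1.2e-12 - 1.3e-12) < 1e-9)      # True, although the values differ by ~8%
print(close_symmetric(1.2e-12, 1.3e-12))  # False, as intended
```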

@felix11h

felix11h commented Apr 1, 2019

@mstimberg Thank you for your updates on this issue! I tested the branch with your changes, and my full simulation now runs without issue (of course it will take a while to complete). I don't believe I can adequately comment on the best implementation. From the user standpoint, however, I am not worried about the hardcoded 10*ns, especially since (if I understand you correctly) this value could be set by the user if needed.

@mstimberg mstimberg changed the title [MRG] Use an epsilon independent of dt when comparing times [WIP] Use an epsilon independent of dt when comparing times Nov 14, 2019
@mstimberg
Member Author

I removed the [MRG] label for now since we haven't fully agreed on the best solution yet. Would be great to find a consensus, though.
