
Produce multiple random numbers efficiently #66

Open
wants to merge 3 commits into master

Conversation

OlivierSohn
Contributor

This is probably not mergeable as-is, because it's not in the spirit of the current API, but 'foldMUniforms' could be used to implement a new fold-like function in the Variate class, allowing N random values to be created and consumed by a monadic accumulating function.

Note that creating N numbers with 'foldMUniforms' requires N+2 reads and writes to the state vector, whereas creating N numbers with 'uniform' requires 3*N reads and writes. Benchmarks on my application show a speed-up, because random number generation is a bottleneck for me.
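To make the read/write accounting concrete, here is a minimal, self-contained sketch of the idea (hypothetical code, not the actual PR): `lcgStep` is a toy stand-in for the real MWC update, and an `IORef` stands in for the mutable state vector. The point is structural: one read up front, one write at the end, and N generated values in between, instead of a read/update/write round trip per number.

```haskell
import Data.IORef
import Data.Word (Word32)

-- Toy generator step standing in for the real MWC update
-- (a simple LCG with the classic Numerical Recipes constants).
lcgStep :: Word32 -> Word32
lcgStep s = 1664525 * s + 1013904223

-- Hypothetical fold: read the generator state once, feed n fresh numbers
-- to a monadic accumulator, and write the state back once at the end.
foldMUniforms :: Int -> (a -> Word32 -> IO a) -> a -> IORef Word32 -> IO a
foldMUniforms n f a0 ref = do
  s0 <- readIORef ref                          -- one read up front
  let go 0 acc s = writeIORef ref s >> pure acc  -- one write at the end
      go k acc s = do
        let s' = lcgStep s                     -- advance the state
        acc' <- f acc s'                       -- consume the fresh number
        go (k - 1) acc' s'
  go n a0 s0
```

For example, summing 1000 numbers would be `foldMUniforms 1000 (\acc w -> pure (acc + w)) 0 ref`, touching the `IORef` only twice regardless of the count.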

@Shimuuar
Collaborator

Shimuuar commented Apr 9, 2018

Essentially it's about avoiding two stores into the array that are then read back immediately. On the surface it looks really similar to deforestation, but I don't know whether it's possible to implement via rewrite rules. At the very least it seems hard.

On a completely unrelated note, I thought about another micro-optimization: replacing unboxed arrays with arrays from primitive. Indexing should be faster there, since those don't support slicing.

@OlivierSohn
Contributor Author

Thanks, I wasn't aware of the indexing overhead. I've never used primitive arrays, but I'll try them out when I have some time!

Deforestation is also a new subject for me, but I feel it would be complicated (and lead to complicated code?) to explain to the compiler how to fuse two operations. I might be wrong, though, and would definitely like to see how this could be done :)

@OlivierSohn
Contributor Author

OlivierSohn commented Apr 9, 2018

I see a 2.5% performance boost when using a Storable vector. Interesting!

@OlivierSohn
Contributor Author

OlivierSohn commented Apr 9, 2018

Note that it is a breaking change, since Seed now holds a Storable vector instead of an Unboxed one. Edit: not anymore (see below).

@Shimuuar
Collaborator

Indexing of unboxed arrays has an inherent slowdown because of slicing support: it's done as arr[off + i] instead of arr[i], so you pay an extra addition on every access. Actually, I had arrays from primitive in mind. I'll try to compare the performance of unboxed/storable/primitive.
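As an illustration of that extra addition, here is a toy model (not the real representation: a Haskell list stands in for the underlying ByteArray, and the types are made up). A sliceable vector must carry an offset into its payload, so every index goes through arr[off + i]; a non-sliceable array like primitive's PrimArray can index directly.

```haskell
import Data.Word (Word32)

-- Stand-in for a sliceable unboxed vector: an offset plus the payload.
data Sliced = Sliced Int [Word32]

-- Slicing-aware indexing pays an extra addition on every access: arr[off + i].
indexSliced :: Sliced -> Int -> Word32
indexSliced (Sliced off xs) i = xs !! (off + i)

-- A non-sliceable array (like PrimArray) indexes directly: arr[i].
indexDirect :: [Word32] -> Int -> Word32
indexDirect xs i = xs !! i
```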

Another approach to reducing the number of reads/writes to the state vector is to turn the generator into a monad over Word32 -> Word32 -> (# a, Word32, Word32 #) and let the GHC optimizer deal with it. It would be interesting to compare performance, but it's a complete API breakage. Maybe that's a good thing.
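A boxed-tuple sketch of what such a monad could look like (hypothetical names; the real version would return the unboxed tuple (# a, Word32, Word32 #) so GHC can keep everything in registers). The two Word32s threaded through every bind play the role of the index and carry:

```haskell
import Data.Word (Word32)

-- State-threading generator monad: each action receives the index and
-- carry and returns a result plus the updated index and carry.
newtype GenM a = GenM { runGenM :: Word32 -> Word32 -> (a, Word32, Word32) }

instance Functor GenM where
  fmap f (GenM m) = GenM $ \i c ->
    let (a, i', c') = m i c in (f a, i', c')

instance Applicative GenM where
  pure a = GenM $ \i c -> (a, i, c)
  GenM mf <*> GenM ma = GenM $ \i c ->
    let (f, i1, c1) = mf i c
        (a, i2, c2) = ma i1 c1
    in (f a, i2, c2)

instance Monad GenM where
  GenM m >>= k = GenM $ \i c ->
    let (a, i1, c1) = m i c in runGenM (k a) i1 c1

-- A made-up step so the sketch runs: returns i + c and advances both.
step :: GenM Word32
step = GenM $ \i c -> (i + c, i + 1, c + 1)
```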

@OlivierSohn
Contributor Author

OlivierSohn commented Apr 11, 2018

I tried arrays from primitive; I see a slight speedup, but it's unclear whether it's real or just noise...

I kept the Seed type as it was before, i.e. with an unboxed vector, so it is no longer a breaking change.

@Shimuuar
Collaborator

I finally got time to work on PRs.

I cherry-picked the changes to the Gen representation and ran benchmarks. Switching to primitive arrays improved performance by ~5% on my computer. A free 5% is nothing to sneeze at, but all this meddling with low-level arrays makes me nervous. I'm going to recheck everything and maybe write some tests; then I'll push them to master.

I also measured the generation time for Word32 and Word64. If we say that the time for Word32 is read + update + write, and for Word64 it's read + 2·update + write (read and write correspond to the index and carry), then both update and (read+write)/2 take about 4 ns. We have a big performance problem on our hands: we force GHC to store and re-read variables needlessly. The fold-based approach solves it for a special case. At the very least, we need to expose enough internals in a safe manner so that people can write such special-purpose folds when they need them.

@OlivierSohn
Contributor Author

I agree.

Another thing to consider (and maybe document for users) is that with the Gen representation change the array is pinned, so the GC won't be able to move it. That may or may not be a good thing, depending on the application, I guess...

@Shimuuar
Collaborator

I finally got around to measuring the impact of the different vector variants. Here is the distribution of run times:

[plot: distribution of run times for each vector variant]

Unboxed, Primitive, and PrimArray from primitive perform identically, and Storable is about 5% slower, probably because of the extra pointer chase through ForeignPtr. I thought a primitive vector would lead to faster builds, but unboxed turned out to be slightly faster: 15.9 s vs 16.1 s. So the current vector backend is likely optimal.

I'll get to the foldMUniforms tomorrow.

@Shimuuar
Collaborator

I rebased the PR onto current master. First of all, benchmarks: it does provide a nice ~25% speedup over replicateM_ n (uniform gen).

What isn't so good: it provides a very specific primitive, namely iterating a function N times. Could it be generalized? One thing that comes to mind is unfolds. Maybe there's something that could usefully generalize both?

There's the obvious option of turning the passing of i & c into a monad, but that turns the 25% speedup into a 10% slowdown, presumably from the state-carrying boilerplate. The loop should be generated inside the function so that GHC can optimize it well.
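For contrast, here is a self-contained sketch of the loop-inside-the-function shape (with a toy `lcgStep` standing in for the MWC update, not the library's code). GHC typically compiles the strict tail-recursive `go` into a tight loop with the state and accumulator kept in registers, with no per-iteration boxing of i & c:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word32)

-- Toy generator step standing in for the real MWC update.
lcgStep :: Word32 -> Word32
lcgStep s = 1664525 * s + 1013904223

-- Sum n freshly generated numbers starting from state s0.
-- The loop lives inside the function, so the state never round-trips
-- through memory between iterations.
sumN :: Int -> Word32 -> Word32
sumN n s0 = go n s0 0
  where
    go 0 _  !acc = acc
    go k !s !acc = let s' = lcgStep s in go (k - 1) s' (acc + s')
```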

@OlivierSohn
Contributor Author

Hello @Shimuuar, reading this MR brings back memories from when I was developing my little console game, which was a lot of fun! In the meantime I have moved on to other personal projects (music, convolution reverbs, etc.), so I won't pursue the initial goal of merging this MR, but I hope it will be useful, some parts of it at least!
Cheers,
Olivier

@Shimuuar
Collaborator

Well, a 25% speedup is nothing to sneer at :). I think I'll release 0.15 without this PR and start updating statistics. Then I'll revisit this PR; maybe I'll get some ideas in the meantime.
