
diy::Master::exchange() error when nblock < nproc #29

Open
mrzv opened this issue May 23, 2015 · 4 comments


mrzv commented May 23, 2015

Issue by Hadrien Croubois
Monday Mar 02, 2015 at 21:02 GMT


When the total number of blocks is smaller than the number of processes running in parallel, calling diy::Master::exchange() gives an error:

```
[Archteryx:13682] *** Process received signal ***
[Archteryx:13682] Signal: Floating point exception (8)
[Archteryx:13682] Signal code: Integer divide-by-zero (1)
```

Is that a known problem? I'll try to have a look at where it comes from.

@mrzv mrzv added the bug label May 23, 2015

mrzv commented May 23, 2015

Comment by Dmitriy Morozov
Monday Mar 02, 2015 at 21:38 GMT


I've never run it in this regime, so it's not a known problem. But it's not difficult to guess what may be going wrong. There is a division by size() in flush(). That will definitely fail, since size() is 0 on a process with no blocks. There may be other problems.

I should note that this is not a regime I've thought about before. Other things might be failing as well.


mrzv commented May 23, 2015

Comment by Hadrien Croubois
Monday Mar 02, 2015 at 21:46 GMT


I understand that's not something you might expect, but you might still run into it in specific cases:

  1. a large number of nodes for pipeline work
  2. low throughput
  3. applications where computational complexity is low compared to communication cost
  • you end up with large blocks being deployed on a small portion of your nodes

> I should note that this is not a regime I've thought about before. Other things might be failing as well.

That's what I'm looking at


mrzv commented May 23, 2015

Comment by Dmitriy Morozov
Monday Mar 02, 2015 at 21:54 GMT


Oh, I don't question the usefulness of the setting. Just acknowledging that I haven't thought about it before.

I'll fix this problem when I get a chance. Meanwhile, you can see if something simple, like setting out_queues_limit = 0 in flush() when size() == 0, solves it for you.


mrzv commented May 23, 2015

Comment by Dmitriy Morozov
Tuesday Mar 03, 2015 at 01:09 GMT


e63cbea might fix the immediate problem (in the way I described), but there are deeper problems with the design in this case. (DIY collectives break, as does the pattern of loading the data via collective IO during decomposition.) I need to think through this to figure out the best solution.
