Stuck in group construction. #231

Open
Steamgjk opened this issue Mar 20, 2022 · 4 comments

Comments

Steamgjk commented Mar 20, 2022

I am trying to run simple_replicated_objects, so I created 3 VMs on Google Cloud.
I am using the derecho.cfg from the demos/json_cfgs path. Here are my modifications:
(1) Changed local_ip to each VM's own IP; local_ids are 0, 1, and 2 respectively
(2) Changed leader_ip to my VM-0 IP (local_id=0 is the leader)
(3) provider = sockets
(4) domain = ens4 (this is the NIC name on all VMs)
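
For reference, the modifications above correspond to a derecho.cfg fragment roughly like this (a sketch for node 0; section and key names as in the stock derecho.cfg, IP addresses are placeholders):

```ini
[DERECHO]
# (1)/(2): this node's identity and the leader's address (VM-0)
leader_ip = 10.0.0.10
local_id = 0
local_ip = 10.0.0.10

[RDMA]
# (3)/(4): libfabric provider and the NIC to bind to
provider = sockets
domain = ens4
```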
[screenshot]

However, after I launch the 3 VMs, they get stuck constructing the group. Below is the leader VM's console log.
We can see the other 2 VMs have successfully connected to the leader, so the IP-related settings should be correct. I am not sure what is wrong with the config (I suspect the json_layout, but I am not sure).
I attach the three cfg files for reference, and would really appreciate it if you could provide some help. Thanks!

[screenshot: leader VM console log]

derecho-0(leader).cfg.txt
derecho-1.cfg.txt
derecho-2.cfg.txt

songweijia (Contributor) commented

Hi Steamgjk, thank you for trying out Derecho. I checked your configuration file and found two issues. The major one is the json_layout configuration: your current layout needs 6 nodes to start the service, which is why the system keeps waiting after you start three. If you don't mind overlapping the Foo and Bar subgroups, you can do the following:

json_layout = '
[
    {
        "type_alias":   "Foo",
        "layout":       [
                            {
                                "min_nodes_by_shard": ["3"],
                                "max_nodes_by_shard": ["3"],
                                "reserved_node_ids_by_shard":[["0","1","2"]],
                                "delivery_modes_by_shard": ["Ordered"],
                                "profiles_by_shard": ["VCS"]
                            }
                        ]
    },
    {
        "type_alias":   "Bar",
        "layout":       [
                            {
                                "min_nodes_by_shard": ["3"],
                                "max_nodes_by_shard": ["3"],
                                "reserved_node_ids_by_shard":[["0","1","2"]],
                                "delivery_modes_by_shard": ["Ordered"],
                                "profiles_by_shard": ["DEFAULT"]
                            }
                        ]
    }
]'

The above setting forces the Foo and Bar subgroups to overlap on the same three nodes.

The minor issue is the provider setting. As libfabric is deprecating the sockets provider, we suggest using the tcp provider.
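
Concretely, that is a one-line change in the [RDMA] section of each node's cfg (a sketch, assuming the stock derecho.cfg key names):

```ini
[RDMA]
provider = tcp
domain = ens4
```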

Steamgjk (Author) commented Mar 21, 2022

Hi, @songweijia

It still does not seem to work in my 3-VM cluster. I have updated the 3 cfg files for the 2 issues (see the attached zip file), but it is still stuck.

[screenshots]

cfg.zip

songweijia (Contributor) commented

I just realized that you were using simple_replicated_objects instead of simple_replicated_objects_json. The former specifies its layout through the programmatic DefaultSubgroupAllocator API, which needs 6 nodes; it will NOT use the json_layout configuration in that case. To use the json layout, you have to use the latter.

So, you can either try with 6 nodes, or use my suggested json layout with simple_replicated_objects_json.

Steamgjk (Author) commented Mar 21, 2022

@songweijia
With simple_replicated_objects_json, the group construction no longer blocks.
But after I set up node-0 (the leader) and node-1 and then launch node-2, node-1 crashes.

I then checked by commenting out code step by step, and I noticed the Foo part is okay. After I comment out the Bar part

for(const uint32_t bar_subgroup_index : my_bar_subgroups) {
    int32_t my_bar_shard = group.get_my_shard<Bar>(bar_subgroup_index);
    std::vector<node_id_t> bar_members = group.get_subgroup_members<Bar>(bar_subgroup_index)[my_bar_shard];
    uint32_t rank_in_bar = derecho::index_of(bar_members, my_id);
    Replicated<Bar>& bar_rpc_handle = group.get_subgroup<Bar>(bar_subgroup_index);
    if(rank_in_bar == 0) {
        cout << "Appending to Bar." << endl;
        derecho::rpc::QueryResults<void> void_future = bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 0...");
        derecho::rpc::QueryResults<void>::ReplyMap& sent_nodes = void_future.get();
        cout << "Append delivered to nodes: ";
        for(const node_id_t& node : sent_nodes) {
            cout << node << " ";
        }
        cout << endl;
    } else if(rank_in_bar == 1) {
        cout << "Appending to Bar" << endl;
        bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 1...");
        //Send to node rank 2 in shard 0 of the same Foo subgroup index as this Bar subgroup
        node_id_t p2p_target = group.get_subgroup_members<Foo>(bar_subgroup_index)[0][2];
        cout << "Reading Foo's state from node " << p2p_target << endl;
        ExternalCaller<Foo>& p2p_foo_handle = group.get_nonmember_subgroup<Foo>();
        derecho::rpc::QueryResults<int> foo_results = p2p_foo_handle.p2p_send<RPC_NAME(read_state)>(p2p_target);
        int response = foo_results.get().get(p2p_target);
        cout << " Response: " << response << endl;
    } else if(rank_in_bar == 2) {
        bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 2...");
        cout << "Printing log from Bar" << endl;
        derecho::rpc::QueryResults<std::string> bar_results = bar_rpc_handle.ordered_send<RPC_NAME(print)>();
        for(auto& reply_pair : bar_results.get()) {
            cout << "Node " << reply_pair.first << " says the log is: " << reply_pair.second.get() << endl;
        }
        cout << "Clearing Bar's log" << endl;
        derecho::rpc::QueryResults<void> void_future = bar_rpc_handle.ordered_send<RPC_NAME(clear)>();
    }
}

then the cluster can run and I can see the printed "Node ... says ..." logs.

Then, I comment out the Foo part

for(const uint32_t foo_subgroup_index : my_foo_subgroups) {
    int32_t my_foo_shard = group.get_my_shard<Foo>(foo_subgroup_index);
    std::vector<node_id_t> shard_members = group.get_subgroup_members<Foo>(foo_subgroup_index)[my_foo_shard];
    uint32_t rank_in_foo = derecho::index_of(shard_members, my_id);
    Replicated<Foo>& foo_rpc_handle = group.get_subgroup<Foo>(foo_subgroup_index);
    //Each member within the shard sends a different multicast
    if(rank_in_foo == 0) {
        int new_value = 1;
        cout << "Changing Foo's state to " << new_value << endl;
        derecho::rpc::QueryResults<bool> results = foo_rpc_handle.ordered_send<RPC_NAME(change_state)>(new_value);
        decltype(results)::ReplyMap& replies = results.get();
        cout << "Got a reply map!" << endl;
        for(auto& reply_pair : replies) {
            cout << "Reply from node " << reply_pair.first << " was " << std::boolalpha << reply_pair.second.get() << endl;
        }
        cout << "Reading Foo's state just to allow node 1's message to be delivered" << endl;
        foo_rpc_handle.ordered_send<RPC_NAME(read_state)>();
    } else if(rank_in_foo == 1) {
        int new_value = 3;
        cout << "Changing Foo's state to " << new_value << endl;
        derecho::rpc::QueryResults<bool> results = foo_rpc_handle.ordered_send<RPC_NAME(change_state)>(new_value);
        decltype(results)::ReplyMap& replies = results.get();
        cout << "Got a reply map!" << endl;
        for(auto& reply_pair : replies) {
            cout << "Reply from node " << reply_pair.first << " was " << std::boolalpha << reply_pair.second.get() << endl;
        }
    } else if(rank_in_foo == 2) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        cout << "Reading Foo's state from the group" << endl;
        derecho::rpc::QueryResults<int> foo_results = foo_rpc_handle.ordered_send<RPC_NAME(read_state)>();
        for(auto& reply_pair : foo_results.get()) {
            cout << "Node " << reply_pair.first << " says the state is: " << reply_pair.second.get() << endl;
        }
    }
}

and keep only the Bar part.

This time the problem occurs again: after I launch node-0 and node-1 and then launch node-2, node-1 crashes.

Then, I continue by commenting out

} else if(rank_in_bar == 1) {
    cout << "Appending to Bar" << endl;
    bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 1...");
    //Send to node rank 2 in shard 0 of the same Foo subgroup index as this Bar subgroup
    node_id_t p2p_target = group.get_subgroup_members<Foo>(bar_subgroup_index)[0][2];
    cout << "Reading Foo's state from node " << p2p_target << endl;
    ExternalCaller<Foo>& p2p_foo_handle = group.get_nonmember_subgroup<Foo>();
    derecho::rpc::QueryResults<int> foo_results = p2p_foo_handle.p2p_send<RPC_NAME(read_state)>(p2p_target);
    int response = foo_results.get().get(p2p_target);
    cout << " Response: " << response << endl;
} else if(rank_in_bar == 2) {
    bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 2...");
    cout << "Printing log from Bar" << endl;
    derecho::rpc::QueryResults<std::string> bar_results = bar_rpc_handle.ordered_send<RPC_NAME(print)>();
    for(auto& reply_pair : bar_results.get()) {
        cout << "Node " << reply_pair.first << " says the log is: " << reply_pair.second.get() << endl;
    }
    cout << "Clearing Bar's log" << endl;
    derecho::rpc::QueryResults<void> void_future = bar_rpc_handle.ordered_send<RPC_NAME(clear)>();
}

so this time only node-0 does void_future, node-1 and node-2 do not read, and the 3 nodes are fine.

But if I only comment out

} else if(rank_in_bar == 2) {
    bar_rpc_handle.ordered_send<RPC_NAME(append)>("Write from 2...");
    cout << "Printing log from Bar" << endl;
    derecho::rpc::QueryResults<std::string> bar_results = bar_rpc_handle.ordered_send<RPC_NAME(print)>();
    for(auto& reply_pair : bar_results.get()) {
        cout << "Node " << reply_pair.first << " says the log is: " << reply_pair.second.get() << endl;
    }
    cout << "Clearing Bar's log" << endl;
    derecho::rpc::QueryResults<void> void_future = bar_rpc_handle.ordered_send<RPC_NAME(clear)>();
}

then node-0 does void_future and node-1 reads; node-1 still crashes after the three nodes finish constructing the group.

node-1-log.txt
