Describe the bug
Default/updated concurrency recovery settings (node_concurrent_recoveries, node_initial_primaries_recoveries) are not being honored and have no effect on the recovery speed for clusters with batch mode enabled.
This is happening because of the way we allocate unassigned shards in a batch. For a batch:
1. AllocationDeciders run for all shards in the batch at once.
2. For all eligible shards in the batch, shard state is updated to INITIALIZING on the assigned node.
The relevant code is in OpenSearch/server/src/main/java/org/opensearch/gateway/BaseGatewayShardAllocator.java, lines 89 to 113 in da3ab92; the tail of that block:
```java
                // no need to keep iterating the unassigned shards, if we don't have anything in decision map
                break;
            }
        } catch (Exception e) {
            logger.error("Failed to execute decision for shard {} while initializing {}", shard, e);
            throw e;
        }
    }
}
```
Because the decider execution and the shard-status update do not happen together for each shard, the cluster state doesn't change while the deciders run over the unassigned shards. ThrottlingAllocationDecider reads the cluster state to decide whether a shard recovery can be started on a node by comparing the ongoing recoveries on that node with the configured recovery settings (node_concurrent_recoveries, node_initial_primaries_recoveries). So, when we run the allocation decision together for all shards in a batch, the decider doesn't account for the decisions already made for the other shards in the batch, and we end up initializing all shards at once.
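For contrast, if each shard's routing state were updated before the next decider run, the per-node recovery counters would already reflect shards initialized earlier in the same pass, and the throttle would engage. A minimal, hypothetical sketch of that per-shard pattern (decideAllocation is a stand-in name; initializeShard and UNAVAILABLE_EXPECTED_SHARD_SIZE mirror the real routing API, but this is not the actual allocator code):

```java
import java.util.List;

import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.AllocateUnassignedDecision;
import org.opensearch.cluster.routing.allocation.AllocationDecision;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;

// Hypothetical sketch, NOT the actual BaseGatewayShardAllocator code.
abstract class PerShardAllocatorSketch {

    // Stand-in for the real per-shard decider invocation (runs AllocationDeciders).
    abstract AllocateUnassignedDecision decideAllocation(RoutingAllocation allocation, ShardRouting shard);

    void allocatePerShard(RoutingAllocation allocation, List<ShardRouting> unassignedShards) {
        for (ShardRouting shard : unassignedShards) {
            AllocateUnassignedDecision decision = decideAllocation(allocation, shard);
            if (decision.getAllocationDecision() == AllocationDecision.YES) {
                // Initializing immediately bumps the node's in-recovery counters, so
                // ThrottlingAllocationDecider sees this recovery when deciding the next shard.
                allocation.routingNodes().initializeShard(
                    shard,
                    decision.getTargetNode().getId(),
                    null, // no existing allocation id for a fresh store recovery
                    ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE,
                    allocation.changes()
                );
            }
        }
    }
}
```

In the batch path, by contrast, all decisions are computed against the same unchanged routing state, which is why every decider invocation in the logs below reports primaries in recovery [0].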
Logs indicating the same (note that every decision reports primaries in recovery [0], even though all five shards were allowed onto the same node within a few milliseconds):
[2024-05-16T03:14:58,926][DEBUG][o.o.c.r.a.d.ThrottlingAllocationDecider] [0f25c84a727840d82cb5a1c7c7bf368f] ThrottlingAllocationDecider decision, throttle: [false] primary recovery limit [4], primaries in recovery [0] invoked for [[search-1140][35], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2024-05-16T03:13:47.416Z], delayed=false, allocation_status[deciders_throttled]]] on node [{854244060761cf88c3b0c7ad6356b50b}{-wlVUFUHTHSJI6gwE2SPEw}{D_jWEub0TzewygP5S4af-Q}{10.212.77.62}{10.212.77.62:9300}{dir}]
[2024-05-16T03:14:58,927][DEBUG][o.o.c.r.a.d.ThrottlingAllocationDecider] [0f25c84a727840d82cb5a1c7c7bf368f] ThrottlingAllocationDecider decision, throttle: [false] primary recovery limit [4], primaries in recovery [0] invoked for [[search-1962][15], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2024-05-16T03:13:47.404Z], delayed=false, allocation_status[deciders_throttled]]] on node [{854244060761cf88c3b0c7ad6356b50b}{-wlVUFUHTHSJI6gwE2SPEw}{D_jWEub0TzewygP5S4af-Q}{10.212.77.62}{10.212.77.62:9300}{dir}]
[2024-05-16T03:14:58,928][DEBUG][o.o.c.r.a.d.ThrottlingAllocationDecider] [0f25c84a727840d82cb5a1c7c7bf368f] ThrottlingAllocationDecider decision, throttle: [false] primary recovery limit [4], primaries in recovery [0] invoked for [[test_latency_219][1], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2024-05-16T03:13:47.399Z], delayed=false, allocation_status[deciders_throttled]]] on node [{854244060761cf88c3b0c7ad6356b50b}{-wlVUFUHTHSJI6gwE2SPEw}{D_jWEub0TzewygP5S4af-Q}{10.212.77.62}{10.212.77.62:9300}{dir}]
[2024-05-16T03:14:58,930][DEBUG][o.o.c.r.a.d.ThrottlingAllocationDecider] [0f25c84a727840d82cb5a1c7c7bf368f] ThrottlingAllocationDecider decision, throttle: [false] primary recovery limit [4], primaries in recovery [0] invoked for [[search-1058][5], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2024-05-16T03:13:47.403Z], delayed=false, allocation_status[deciders_throttled]]] on node [{854244060761cf88c3b0c7ad6356b50b}{-wlVUFUHTHSJI6gwE2SPEw}{D_jWEub0TzewygP5S4af-Q}{10.212.77.62}{10.212.77.62:9300}{dir}]
[2024-05-16T03:14:58,932][DEBUG][o.o.c.r.a.d.ThrottlingAllocationDecider] [0f25c84a727840d82cb5a1c7c7bf368f] ThrottlingAllocationDecider decision, throttle: [false] primary recovery limit [4], primaries in recovery [0] invoked for [[search-1062][4], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2024-05-16T03:13:47.401Z], delayed=false, allocation_status[deciders_throttled]]] on node [{854244060761cf88c3b0c7ad6356b50b}{-wlVUFUHTHSJI6gwE2SPEw}{D_jWEub0TzewygP5S4af-Q}{10.212.77.62}{10.212.77.62:9300}{dir}]
Related component
Cluster Manager
To Reproduce
1. Add the below log line at ThrottlingAllocationDecider.java#L179:
```java
logger.debug(
    "ThrottlingAllocationDecider decision, throttle: [{}] primary recovery limit [{}],"
        + " primaries in recovery [{}] invoked for [{}] on node [{}]",
    primariesInRecovery >= primariesInitialRecoveries,
    primariesInitialRecoveries,
    primariesInRecovery,
    shardRouting,
    node.node()
);
```
2. Launch a cluster with batch mode enabled and with more shards than the concurrent recoveries value (a sample configuration is sketched after these steps).
3. Restart the cluster and check the added log to see whether throttling is working as expected.
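For step 2, the relevant limits and the batch-mode flag can be supplied as node settings. A sketch under the assumption that the setting keys match the OpenSearch 2.x documentation (in particular, verify the batch-mode key cluster.allocator.existing_shards_allocator.batch_enabled against your version):

```java
import org.opensearch.common.settings.Settings;

// Sketch of reproduction settings; key names assumed from OpenSearch 2.x docs,
// verify against your version before use.
class ReproSettingsSketch {
    static Settings reproSettings() {
        return Settings.builder()
            .put("cluster.allocator.existing_shards_allocator.batch_enabled", true) // enable batch-mode allocation
            .put("cluster.routing.allocation.node_concurrent_recoveries", 4)        // per-node cap on concurrent recoveries
            .put("cluster.routing.allocation.node_initial_primaries_recoveries", 4) // per-node cap on initial primary recoveries
            .build();
    }
}
```

With more unassigned primaries per node than the configured limit, a correctly throttling decider should leave the excess shards unassigned (deciders_throttled) rather than initializing them all at once.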
Expected behavior
The number of ongoing shard recoveries on a node should adhere to the node concurrent recovery settings (node_concurrent_recoveries, node_initial_primaries_recoveries).
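Concretely, the invariant is on the per-node counters that ThrottlingAllocationDecider consults. A hedged sketch of checking it (RoutingNode.shardsWithState and the ShardRouting accessors as found in OpenSearch 2.x; the filter approximates what the decider's "primaries in recovery" figure tracks):

```java
import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRoutingState;

// Sketch: no node should exceed the configured initial-primaries limit.
class RecoveryLimitCheckSketch {
    static void assertRecoveryLimits(ClusterState state, int primariesInitialRecoveries) {
        for (RoutingNode node : state.getRoutingNodes()) {
            long initializingPrimaries = node.shardsWithState(ShardRoutingState.INITIALIZING)
                .stream()
                .filter(s -> s.primary() && s.relocatingNodeId() == null) // initial recovery, not a relocation target
                .count();
            assert initializingPrimaries <= primariesInitialRecoveries
                : "node " + node.nodeId() + " has " + initializingPrimaries + " initializing primaries";
        }
    }
}
```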
Additional Details
OpenSearch Version: 2.14