Ops Agent Failed to restart - Intermittent #1598

Open
abhigupta1207 opened this issue Jan 25, 2024 · 2 comments

NOTE: To get the best support experience for bug fixes, please go to https://cloud.google.com/support-hub and follow the instructions. In comparison, bug reports filed in this repo only have best-effort support and do not have guaranteed response / resolution SLOs.

Describe the bug
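On new Windows VMs created by MIG autoscaling, the startup script's Restart-Service call for the Ops Agent intermittently fails because the google-cloud-ops-agent-opentelemetry-collector service cannot be stopped.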

To Reproduce
Steps to reproduce the behavior:

  1. Create a custom image with Ops Agent (2.45.0@1) baked in alongside a third-party application.
  2. Create an instance template with a startup script that copies custom Ops Agent config files into place and restarts the Ops Agent.
  3. Create a MIG with autoscaling enabled. When autoscaling creates new VMs, the Ops Agent fails to restart on some of them.

Expected behavior
The Ops Agent should restart reliably on every new VM, because our autoscaling is based on custom Prometheus metrics collected by the Ops Agent.

Environment (please complete the following information):

  • VM distro / OS: windows-server-2022-dc
  • Ops Agent version: 2.45.0@1
  • Ops Agent configuration: prometheus receiver
  • Ops Agent log:

INFO 2024-01-16T14:04:35.492451600Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: Restarting Ops Agent to Load the Config

INFO 2024-01-16T14:04:35.692183900Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: Restart-Service : Service 'Google Cloud Ops Agent (google-cloud-ops-agent)' cannot be stopped due to the following

INFO 2024-01-16T14:04:35.727365100Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: error: Cannot stop google-cloud-ops-agent-opentelemetry-collector service on computer '.'.

INFO 2024-01-16T14:04:35.743086300Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: At C:\UNICOM\Scripts\Specialize-IntelligenceGoogleOpsAgent.ps1:54 char:5

INFO 2024-01-16T14:04:35.774301700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + Restart-Service google-cloud-ops-agent -Force

INFO 2024-01-16T14:04:35.806250700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

INFO 2024-01-16T14:04:35.837901700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + CategoryInfo : CloseError: (System.ServiceProcess.ServiceController:ServiceController) [Restart-Service

INFO 2024-01-16T14:04:35.870150600Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: ], ServiceCommandException

INFO 2024-01-16T14:04:35.896522Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + FullyQualifiedErrorId : CouldNotStopService,Microsoft.PowerShell.Commands.RestartServiceCommand

INFO 2024-01-16T14:04:35.931813400Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1:

Additional context
Google Case 49102632

jefferbrecht (Member) commented Jan 25, 2024

We would need to see your startup scripts in order to troubleshoot this effectively.

That being said, Restart-Service is notoriously fragile if it races with a service that's already in the process of starting up, which may be the case here. You can try inserting the following line before any Restart-Service to let it complete startup before attempting to restart:

(Get-Service google-cloud-ops-agent*).WaitForStatus('Running', '00:03:00')

The '00:03:00' is a timeout so that it doesn't block forever (e.g. in case there's a bad config or some other problem preventing normal startup); adjust this timeout as you need.
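
To make the timeout behavior concrete, here is a minimal sketch of the wait-then-restart pattern described above. The function name `Restart-OpsAgentSafely` and its parameters are hypothetical and not part of the Ops Agent; the sketch only assumes the standard `Get-Service`, `Restart-Service`, and `ServiceController.WaitForStatus` calls. Note that the timeout is applied to each matching service in turn, so the total wait can exceed a single timeout.

```powershell
# Hypothetical helper (name and parameters are illustrative, not an Ops Agent API).
function Restart-OpsAgentSafely {
    param(
        [string]$ServicePattern = 'google-cloud-ops-agent*',
        [string]$ServiceName = 'google-cloud-ops-agent',
        [TimeSpan]$Timeout = [TimeSpan]::FromMinutes(3)
    )
    try {
        $services = @(Get-Service -Name $ServicePattern -ErrorAction Stop)
        if ($services.Count -eq 0) {
            Write-Warning "No services match '$ServicePattern'."
            return $false
        }
        foreach ($svc in $services) {
            # Blocks until this service is Running; throws if the timeout elapses first.
            $svc.WaitForStatus('Running', $Timeout)
        }
        # Only restart once every matching service has finished starting.
        Restart-Service -Name $ServiceName -Force -ErrorAction Stop
        return $true
    }
    catch {
        # Reached on a wait timeout or a failed restart.
        Write-Warning "Ops Agent restart skipped or failed: $_"
        return $false
    }
}
```

A startup script could call `Restart-OpsAgentSafely` after copying the config and log or retry when it returns `$false`, rather than letting the script continue silently.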

Since you're already creating a custom image anyway, another thing you can try instead is change the service startup type for all Ops Agent services to Manual. This should prevent the startup race and give your script a chance to copy in the configs before it starts up. Then you would call Start-Service instead of Restart-Service after copying the configs.
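
The Manual-startup alternative could be sketched as two steps: one run while baking the image, one run from the startup script. The function names and the `$SourceConfig` parameter are hypothetical, and the default config path is an assumption about where the Ops Agent reads its user config on Windows; verify it against your installation. Whether starting the main service also brings up the sub-services under Manual startup is not confirmed in this thread, so the sketch prints the resulting service states for checking.

```powershell
# Hypothetical helpers; names and the default target path are assumptions.

# Run once while building the custom image.
function Set-OpsAgentManualStart {
    Get-Service -Name 'google-cloud-ops-agent*' | Set-Service -StartupType Manual
}

# Run from the instance startup script.
function Start-OpsAgentWithConfig {
    param(
        [Parameter(Mandatory)][string]$SourceConfig,
        # Assumed default location of the Ops Agent user config on Windows.
        [string]$TargetConfig = 'C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml'
    )
    # Stop here if the copy fails, so the agent never starts with a stale config.
    Copy-Item -Path $SourceConfig -Destination $TargetConfig -Force -ErrorAction Stop
    # The agent has not auto-started, so Start-Service replaces Restart-Service.
    Start-Service -Name 'google-cloud-ops-agent' -ErrorAction Stop
    # Show the resulting state of all Ops Agent services for verification.
    Get-Service -Name 'google-cloud-ops-agent*' | Select-Object Name, Status, StartType
}
```

Because nothing starts the agent before the script runs, there is no in-progress startup for the script to race with.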

abhigupta1207 (Author) commented:

@jefferbrecht Thanks for getting back, we will try the below solution and will update you.

Since you're already creating a custom image anyway, another thing you can try instead is change the service startup type for all Ops Agent services to Manual. This should prevent the startup race and give your script a chance to copy in the configs before it starts up. Then you would call Start-Service instead of Restart-Service after copying the configs.
