
Add EventBridge Scheduler mode for policy lambdas #9273

Merged

merged 24 commits into cloud-custodian:main from periodic_eb_scheduler May 20, 2024

Conversation

scolebrook (Contributor)

Add an optional EB Scheduler implementation for periodic mode when the scheduler_role mode property is present. EB Scheduler uses an execution role to invoke a Lambda function instead of a Lambda permission (mentioned in lambda.rst with a link to the AWS docs). All three schedule types are supported, along with timezone (defaults to Etc/UTC), start_time, end_time, and group_name (defaults to default) properties. If specified, the schedule group must already exist.

Use of EB Scheduler requires additional actions for the deployment identity (covered in deployment.rst).

Converting an existing periodic policy from the EB Rule style to the EB Scheduler style causes the old rule to be deleted if it exists. Similarly, converting in the other direction causes the schedule to be deleted if it exists. This requires checking for the schedule in all groups, since the group the schedule was in is no longer known at that point.

These conversion checks result in extra API calls in the test_mu.py suite, so placebo_data.zip has been updated.

EB Scheduler allows 1,000,000 schedules per region, with more available upon request. EventBridge allows only 300 rules per region (as few as 100 in some regions), with more upon request. Moving periodic policies to EB Scheduler frees rule capacity for other kinds of rules, such as CloudTrail rules.
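For illustration, a schedule-mode policy might look roughly like the following sketch, expressed in the dict form c7n tests pass to load_policy. The mode type and property names (scheduler_role, timezone, start_time, end_time, group_name) are taken from this description and the thread below; treat the exact schema as an assumption, not the merged definition.

# Sketch of a policy using the EB Scheduler-backed mode; property names
# follow the PR description and may differ from the merged schema.
policy = {
    'name': 'offhours-stop',
    'resource': 'aws.ec2',
    'mode': {
        'type': 'schedule',                # EB Scheduler-backed mode
        'schedule': 'cron(0 22 * * ? *)',  # rate/cron/at are all supported
        'scheduler_role': 'arn:aws:iam::123456789012:role/custodian-scheduler',
        'timezone': 'America/New_York',    # defaults to Etc/UTC
        'group_name': 'custodian',         # defaults to "default"; must already exist
        'role': 'arn:aws:iam::123456789012:role/custodian-lambda',
    },
}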

scolebrook marked this pull request as draft February 6, 2024 03:14
scolebrook marked this pull request as ready for review February 6, 2024 15:48
scolebrook (Contributor, Author) commented Feb 6, 2024

@kapilt Looks like the ubuntu-latest/3.10 test is getting a 503 error when trying to upload to codecov. Most likely a "them" problem: they've just updated their status page to indicate they're having an issue. But once it clears, I can't retry the job myself.

scolebrook (Contributor, Author)

Addresses periodic mode in #7989

c7n/mu.py Outdated
# check if schedule exists for func.event_name and remove it to clean up
# when switching between scheduled event types
# must check all groups
EventBridgeScheduleSource.remove_from_groups(self.session.client('scheduler'), func)
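For context, a minimal sketch of what that cross-group check amounts to, using the boto3 scheduler client (illustrative only, not the PR's actual implementation):

import boto3

def remove_from_groups(client, schedule_name):
    # Sketch: scan every schedule group for a schedule with this name and
    # delete it. This per-group scan is the API cost the review comment
    # below suggests avoiding by tracking state in a tag.
    for page in client.get_paginator('list_schedule_groups').paginate():
        for group in page['ScheduleGroups']:
            try:
                client.delete_schedule(Name=schedule_name, GroupName=group['Name'])
            except client.exceptions.ResourceNotFoundException:
                continue  # no schedule by that name in this group

# usage: remove_from_groups(boto3.client('scheduler'), func.event_name)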
kapilt (Collaborator)

Since this is new behavior, we can store state on the lambda via a tag and avoid the extra API calls. That will also help clean up some of the test data set being modified.

scolebrook (Contributor, Author)

So to clarify: you're suggesting having a tag on the policy lambda function to store the fact that a schedule is used for periodic mode instead of a rule? Rule-based periodic mode and all other kinds of modes would not have the tag (unchanged from current behavior), so we'd just be checking for tag presence and adding/removing the tag when moving from one kind of periodic mode to the other. Am I understanding your suggestion correctly?

scolebrook (Contributor, Author)

Perhaps another approach would be to leave periodic mode untouched and have a new periodic-schedule mode.

kapilt (Collaborator)

A new mode would be useful to avoid extraneous API calls on migration, though I think you'll still want to use tag state to avoid iterating over the schedule groups. For migration purposes we could document a custodian policy that does the transition cleanup, albeit we'll also need support for schedule groups and schedules as resources for that.

scolebrook (Contributor, Author)

So: a new mode (periodic-schedule) to avoid all the extra work of modifying and testing periodic to support two methods of implementation depending on which properties are provided in the policy, plus either an additional custodian tag on the policy lambda to record the schedule group name (even if it's default), or perhaps another item in the custodian-info tag. I'm more concerned about the risk of breaking other things that depend on custodian-info containing only two colon-separated properties; I don't have enough visibility to know whether appending to custodian-info is a bad idea. @kapilt What do you think?

As for migration, if you are converting an existing policy from periodic to periodic-schedule, mu.py would remove the CWE rule, if I'm reading the code correctly, so that shouldn't be a problem. Schedule groups as resources would be useful for tagging enforcement, though. Another project for another day :-)!

kapilt (Collaborator)

For migration, I think we would do a follow-up PR to support additional resources for schedules and schedule groups, and then update the docs with guidance and sample policies that can do the cleanup. I'm not clear that the periodic to periodic-schedule conversion would get the CWE removal.

scolebrook (Contributor, Author)

Added a little to mugc.py so it can remove a schedule when one exists. This required adding an auto tag (custodian-schedule) to hold the group name, since mugc works with the function, not the policy. Also added a note to the lambda.rst doc pointing out that, when switching modes, a simple process is to comment out the policy, run mugc to remove all the pieces, then remove the comments and run custodian run again to redeploy.
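A rough sketch of what that mugc cleanup amounts to (a hypothetical helper; only the custodian-schedule tag holding the group name is from the comment above, the rest is illustrative):

import boto3

def remove_schedule(session, schedule_name, func_arn):
    # The custodian-schedule tag on the function records the schedule group
    # name, so mugc can delete the schedule directly instead of trawling
    # through every group.
    tags = session.client('lambda').list_tags(Resource=func_arn).get('Tags', {})
    group_name = tags.get('custodian-schedule')
    if not group_name:
        return  # not a schedule-mode function
    scheduler = session.client('scheduler')
    try:
        scheduler.delete_schedule(Name=schedule_name, GroupName=group_name)
    except scheduler.exceptions.ResourceNotFoundException:
        pass  # already gone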

c7n/policy.py Outdated (resolved)
scolebrook marked this pull request as draft March 6, 2024 05:01
scolebrook marked this pull request as ready for review March 6, 2024 15:08
scolebrook changed the title from "Add EventBridge Scheduler as an option for periodic mode policies" to "Add EventBridge Scheduler mode for policy lambdas" Mar 8, 2024
kapilt (Collaborator) commented Mar 21, 2024

Thanks for the changes; overall this looks good. It does feel like there is some extraneous recording data, though, with respect to the volume of JSON in the PR. I haven't dived deep enough to understand whether that's just due to extraneous functions in the recording or what the underlying cause is.

scolebrook (Contributor, Author)

@kapilt Found that some of the test data files you saw were from before the pivot from modifying periodic to creating the new schedule mode. Removed them and cleaned up some other leftovers from that pivot. Also added a new policy test to cover the custodian-schedule tag on schedule mode policy lambdas.

There are still lots of test data files for the two new mu tests. test_eb_schedule creates a policy, then checks the schedule; it then modifies the policy and checks the schedule twice, for two different kinds of changes. Similarly, test_pause_resume_sched_policy creates the policy, then disables the schedule before re-enabling it. This was all necessary to get good coverage. If you see anything unneeded in the two test cases that would let us simplify, let me know.
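For reference, the first test's shape is roughly the following sketch, following test_mu.py's placebo conventions; the names, mode properties, and assertions are illustrative rather than the PR's actual code:

def test_eb_schedule(self):
    factory = self.replay_flight_data('test_eb_schedule')
    p = self.load_policy(
        {'name': 'sched-pol',
         'resource': 'aws.ec2',
         'mode': {'type': 'schedule',
                  'schedule': 'rate(1 day)',
                  'scheduler_role': 'arn:aws:iam::123456789012:role/sched'}},
        session_factory=factory)
    mu.LambdaManager(factory).publish(mu.PolicyLambda(p))
    # the modify-and-republish steps described above replay further recorded calls
    client = factory().client('scheduler')
    schedule = client.get_schedule(Name='custodian-sched-pol', GroupName='default')
    self.assertEqual(schedule['State'], 'ENABLED')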

c7n/mu.py Outdated
# check for schedules with same func.event_name in any other group in case
# group_name has been changed
# there should only be one schedule across all groups for each policy lambda
self.remove_from_groups(self.client, func, params['GroupName'])
kapilt (Collaborator)

It feels like the group change should be computed as a delta between the tag on the actual function and the configured function, which would alleviate having to make API calls trawling through all the schedule groups to check for the function's schedule.

scolebrook (Contributor, Author)

@kapilt Thanks for spotting that. That was from before I added the tag with the group name in it (needed for mugc). The problem is that the function is changed first (actually deployed to AWS), and then the events get processed. So by the time we're here in add, the tag with the group name has already been changed to the new group name.

About the only way I can see to make this work would be to update _create_or_update in the LambdaManager class in mu.py (line 457). It gets the config of an existing function as part of its process, if the function actually exists. If it does, we could read the group name out of the tag before it's replaced and return that alongside the existing values that method returns. Something like old_group_name, which would default to None for a new function or for a function with any mode other than schedule.

Then the question is how to get that value into the add method for the EventBridgeScheduleSource class.

_create_or_update is called by LambdaManager.publish(), which then gets all the events at line 398 with a call to func.get_events(). This is where an instance of EventBridgeScheduleSource is created. Rather than adding old_group_name as an argument to the add method (which would mess with all the other subclasses of AWSEventBase), it's probably best to pass the old group name in when creating an EventBridgeScheduleSource (line 956). That way none of the other classes need modification, and the EventBridgeScheduleSource would have the old group name, or None if there isn't one. It could then check whether the old group name differs from the new one and remove the schedule in the old group.

It takes a lot of little tweaks to get the value of that tag all the way to where remove_from_groups happens, but it's probably the cleanest way to do it. It would remove all those extra API calls and leave just one call to remove a schedule, made only when we already know it's there and it's no longer wanted.

There's a lot to unpack there. But let me know if that is an approach you think would be acceptable.

kapilt (Collaborator)

Haven't had a chance to read fully (traveling today), but I was thinking we could encapsulate this via a provision event/facade abstraction carrying the old lambda state and the current lambda config/object, which the lambda manager could propagate to event provisioners on publish. That would provide the needed context, which the current interface lacks, to avoid the API calls.

scolebrook (Contributor, Author)

That's basically what my long verbiage was attempting to describe. :-D

I've been working on an implementation of this. This is how it currently works.

In LambdaManager, publish calls _create_or_update, which gets the current config of an existing function (if it's not a new function) before updating it. That config is returned to publish alongside the values already being returned, and the old group name is extracted from the tag, if it exists, with a regex. publish now has the previous function config available for any other purpose in the future.

publish then calls func.get_events, where func is a PolicyLambda or LambdaFunction instance, and iterates through the returned list of event objects. The PolicyLambda variety is where the event source objects get created, so we pass the old group name (or None) through to get_events, and if we're creating an EventBridgeScheduleSource we pass it on in the init and store it as a property. We can then delete an old schedule when needed during the call to the add method (see the sketch below).
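A minimal sketch of that flow (names and signatures are approximations, not the merged code):

# In LambdaManager.publish(), after _create_or_update() returns the previous
# function config, pull the old group name out of the custodian-schedule tag
# and hand it to get_events():
#
#     old_group_name = None
#     if existing:
#         old_group_name = extract_group(existing)  # hypothetical helper
#     for e in func.get_events(self.session_factory, old_group_name):
#         e.add(func)

class EventBridgeScheduleSource:
    def __init__(self, data, session, old_group_name=None):
        self.data = data
        self.session = session
        self.old_group_name = old_group_name  # None for new functions

    def add(self, func):
        group = self.data.get('group_name', 'default')
        if self.old_group_name and self.old_group_name != group:
            # The one remaining cleanup call: the schedule moved groups,
            # so delete it from the group it used to live in.
            self.session.client('scheduler').delete_schedule(
                Name=func.event_name, GroupName=self.old_group_name)
        # ... then create or update the schedule in `group` ...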

I've tried this out and it works without needing to modify any other tests. It results in a decrease of about 15% in the number of test JSON files for the two new tests (mostly in test_eb_schedule).

I'll push a commit shortly with this implementation for you to have a look at when you get time. Thanks for your help refining this feature. Safe travels.

scolebrook (Contributor, Author)

@kapilt Any additional changes needed on this?

kapilt assigned and then unassigned kapilt May 18, 2024
c7n/mu.py Outdated
for e in func.get_events(self.session_factory):
    # process any previous lambda configuration
    old_group_name = None
    if existing:
kapilt (Collaborator)

This logic should really be in the event implementation rather than hardcoding a particular event's pattern into the manager. The notion is to pass both the current and previous state of the function, so the event can do whatever specific logic it needs. As is, this just violates encapsulation.

kapilt (Collaborator)

I'll work on it.

kapilt (Collaborator) left a review:

lgtm

kapilt merged commit d459a38 into cloud-custodian:main May 20, 2024
22 checks passed
scolebrook (Contributor, Author)

@kapilt Thanks for your help pushing this across the finish line.

kapilt (Collaborator) commented May 20, 2024

@scolebrook no problem, thanks for working on it and making it happen.

scolebrook deleted the periodic_eb_scheduler branch June 12, 2024 00:30