Add EventBridge Scheduler mode for policy lambdas #9273
Conversation
@kapilt Looks like the ubuntu-latest/3.10 test is getting a 503 error when trying to upload to codecov. Most likely a "them" problem. They've just updated their status page to indicate they're having a problem, but when it clears, I can't retry the job.

Addresses periodic mode in #7989
c7n/mu.py (outdated)

```python
# check if schedule exists for func.event_name and remove it to clean up
# when switching between scheduled event types
# must check all groups
EventBridgeScheduleSource.remove_from_groups(self.session.client('scheduler'), func)
```
Since this is new behavior, we can avoid the API call if we store state on the lambda via a tag. That will also help clean up some of the test data set being modified.
So to clarify, you're suggesting having a tag on the policy lambda function to record the fact that a schedule is used for periodic mode instead of a rule? Rule-based periodic mode and all other kinds of modes would not have the tag (unchanged from current), so we'd just check for tag presence and add/remove the tag when moving from one kind of periodic mode to the other. Am I understanding your suggestion correctly?
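The tag-state idea could be sketched as a pure decision helper that never touches the scheduler API (a sketch only; the tag name and function are hypothetical, not from the PR):

```python
# Hypothetical marker tag recording that the deployed lambda's periodic
# trigger is an EventBridge schedule rather than an EventBridge rule.
SCHEDULE_MARKER_TAG = "custodian-schedule-mode"

def needs_schedule_cleanup(function_tags, deploying_with_schedule):
    """Decide from tag state alone whether a stale schedule must be
    removed: True only when the deployed function used a schedule but
    the new deployment will not. No scheduler API call is needed to
    answer this question."""
    return SCHEDULE_MARKER_TAG in function_tags and not deploying_with_schedule
```

This keeps the common case (no mode switch) free of extra API calls, which is the point of the suggestion.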
Perhaps another approach would be to leave periodic mode untouched and have a new periodic-schedule mode.
A new mode would be useful to avoid extraneous API calls on the migration, though I think you'll still want to use tag state to avoid iterating the scheduling groups. For migration purposes I think we could document using a custodian policy to do transition cleanup, albeit we'll also need support for scheduling groups and schedules as resources for that.
So: a new mode (`periodic-schedule`) to avoid all the extra work of modifying and testing `periodic` to support two methods of implementation depending on which properties are provided in the policy, and either an additional custodian tag on the policy lambda to record the schedule group name (even if default), or perhaps another item in the `custodian-info` tag. I'm more concerned about the risk of breaking other things that depend on `custodian-info` containing only two `:`-separated properties. I don't have enough visibility to know if appending to `custodian-info` is a bad idea. @kapilt What do you think?

As for migration, if you are converting an existing policy from `periodic` to `periodic-schedule`, mu.py would remove the CWE rule if I'm reading the code correctly, so that shouldn't be a problem. Schedule groups as resources would be useful for tagging enforcement though. Another project for another day :-)!
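The `custodian-info` compatibility risk can be made concrete: whether appending a third field breaks a consumer depends entirely on how that consumer splits the value. A sketch (the tag layouts shown are illustrative, not custodian's actual format):

```python
def parse_strict(value):
    # Breaks if a third ':'-separated field is ever appended:
    # the two-target unpack raises ValueError on three fields.
    mode, version = value.split(":")
    return mode, version

def parse_tolerant(value):
    # Survives appended fields: only the first ':' is significant,
    # everything after it rides along in the second element.
    mode, rest = value.split(":", 1)
    return mode, rest
```

Consumers written like `parse_strict` are the ones the comment above worries about; auditing for that pattern would answer the "is appending a bad idea" question.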
For migration, I think we would do a follow-up PR to support additional resources for schedules/schedule groups, and then update the docs with guidance on some sample policies that can do the cleanup. I'm not clear the periodic to periodic-schedule conversion would get the CWE removal.
Added a little to mugc.py so it can remove a schedule when one exists. This required adding an auto tag (`custodian-schedule`) to hold the group name, since mugc works with the function, not the policy. Added a note to the lambda.rst doc pointing out that commenting out a policy, running mugc to remove all the bits, then removing the comments and running `custodian run` again to redeploy is a simple process when switching.
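The mugc-side cleanup described above might be sketched like this, assuming the `custodian-schedule` tag value is simply the group name (the client is any object with a boto3-scheduler-like `delete_schedule` method; everything else here is illustrative):

```python
def remove_schedule_for_function(client, function_name, function_tags):
    """Remove a function's schedule using only tag state; returns True
    if a delete was issued. mugc sees the function, not the policy, so
    the group name has to be recoverable from a function tag."""
    group_name = function_tags.get("custodian-schedule")
    if group_name is None:
        # function never had a scheduler-based trigger; nothing to do
        return False
    client.delete_schedule(Name=function_name, GroupName=group_name)
    return True
```

One tag read, at most one delete call, and no trawling of schedule groups.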
… used in `periodic` mode
Thanks for the changes, overall looks good. It does feel like there is some extraneous recording data though, wrt the volume of JSON in the PR. I haven't dived deep enough to understand if that's just due to extraneous functions in the recording or what the underlying cause is.

@kapilt Found that some of the test data files you saw were from before the pivot away from modifying `periodic`. Still lots of test data files in the two new mu tests.
c7n/mu.py (outdated)

```python
# check for schedules with same func.event_name in any other group in case
# group_name has been changed
# there should only be one schedule across all groups for each policy lambda
self.remove_from_groups(self.client, func, params['GroupName'])
```
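For context, the "check all groups" cleanup under discussion could look roughly like this (a sketch with boto3-scheduler-shaped calls; the real mu.py implementation may differ):

```python
def remove_from_groups(client, schedule_name, keep_group=None):
    """Delete any schedule matching schedule_name in every group except
    keep_group. Note the cost the review flags: one list call per
    schedule group, plus a delete per match."""
    for group in client.list_schedule_groups()["ScheduleGroups"]:
        group_name = group["Name"]
        if group_name == keep_group:
            continue
        schedules = client.list_schedules(
            GroupName=group_name, NamePrefix=schedule_name)["Schedules"]
        for schedule in schedules:
            if schedule["Name"] == schedule_name:
                client.delete_schedule(Name=schedule_name, GroupName=group_name)
```

The API-call volume grows with the number of schedule groups in the account, which is what the tag-delta approach discussed below this is meant to avoid.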
It feels like the group change should be a delta of the tag on the actual function vs the configured function, which would alleviate having to do API calls to trawl through all schedule groups to check for the function schedule.
@kapilt Thanks for spotting that. That was from before I added the tag with the group name in it (needed for mugc). The problem is that the function is changed first (actually deployed to AWS), then the events get processed. So by the time we're here in `add`, the tag with the group name has already been changed to the new group name.

About the only way I can see to make this work would be to update `_create_or_update` in the `LambdaManager` class in mu.py (line 457). It gets the config of an existing function as part of its process, if the function actually exists. If it does, we could read the group name out of the tag before it's replaced and return that alongside the existing values that method returns. Something like `old_group_name`, which would default to None for a new function or for a function with any mode other than `schedule`.

Then the question is how to get that value into the `add` method of the `EventBridgeScheduleSource` class.

`_create_or_update` is called by `LambdaManager.publish()`, which then gets all the events at line 398 with a call to `func.get_events()`. This is where an instance of the `EventBridgeScheduleSource` is created. Rather than add `old_group_name` as an argument to the `add` method (which would mess with all the other subclasses of `AWSEventBase`), it's probably best to pass the old group name in when creating an `EventBridgeScheduleSource` (line 956). That way none of the other classes need modification, and the `EventBridgeScheduleSource` would have the old group name, or None if there isn't one. It could then see if the old group name differs from the new one and remove the schedule in the old group.

Lots of little tweaks to get the value of that tag all the way to where `remove_from_groups` happens, but it's probably the cleanest way to do it. It would remove all those extra API calls and leave just one call to remove a schedule, made only when we already know it's there AND it's no longer wanted.

There's a lot to unpack there. But let me know if that is an approach you think would be acceptable.
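The plumbing proposed above boils down to constructor injection. A minimal sketch, with the class and argument names as illustrative stand-ins for the mu.py originals:

```python
class ScheduleSource:
    """Stand-in for EventBridgeScheduleSource: the previous group name
    is injected at construction, so add() needs at most one targeted
    delete instead of trawling every schedule group."""

    def __init__(self, client, old_group_name=None):
        self.client = client
        self.old_group_name = old_group_name  # None for new functions

    def add(self, schedule_name, group_name):
        # Delete only when a schedule is known to exist in a different
        # (old) group and is therefore no longer wanted there.
        if self.old_group_name and self.old_group_name != group_name:
            self.client.delete_schedule(
                Name=schedule_name, GroupName=self.old_group_name)
        # ... create/update the schedule in group_name as before ...
```

None of the other `AWSEventBase` subclasses would need to change, which matches the rationale in the comment above.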
Haven't had a chance to read fully, traveling today, but I was thinking we could encapsulate this via a provision event/facade abstraction carrying the old lambda state and the current lambda config/object, which the lambda manager would propagate to event provisioners on publish. That would provide the needed context to avoid the API calls, which the current interface lacks.
That's basically what my long verbiage was attempting to describe. :-D

I've been working on an implementation of this. This is how it currently works.

In `LambdaManager`, `publish` calls `_create_or_update`, which gets the current config if it exists (not a new function) before updating the function. That config is returned to `publish` alongside the existing values going back, and the old group name is extracted from the tag, if it exists, with a regex. `publish` now has the previous function config available for any other purpose in the future.

`publish` then calls `func.get_events`, where func is a `PolicyLambda` or `LambdaFunction` instance, and iterates through the returned list of event objects. The `PolicyLambda` variety is where the event source objects get created. So we pass the old group name (or None) through to `get_events`, and if we're creating an `EventBridgeScheduleSource` we pass it on in the init and store it as a property. We can now delete an old schedule when needed during the call to the `add` method.

I've tried this out and it works without needing to modify any other tests. It results in a decrease of about 15% in the number of test JSON files for the two new tests (mostly in `test_eb_scheduler`).

I'll push a commit shortly with this implementation for you to have a look at when you get time. Thanks for your help refining this feature. Safe travels.
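The extraction step described above might look like this (the `group=` layout inside the tag value is a guess for illustration; the PR's actual tag format and regex may differ):

```python
import re

def extract_old_group_name(previous_config):
    """Pull the schedule group name out of the previous function
    config. Returns None for a new function (no previous config) or a
    function whose mode never wrote the tag."""
    if previous_config is None:
        return None
    tag_value = previous_config.get("Tags", {}).get("custodian-schedule", "")
    match = re.search(r"group=([\w.-]+)", tag_value)
    return match.group(1) if match else None
```

The None result then flows through `get_events` into the `EventBridgeScheduleSource` init unchanged, so new functions skip the delete path entirely.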
@kapilt Any additional changes needed on this?
c7n/mu.py (outdated)

```python
for e in func.get_events(self.session_factory):
    # process any previous lambda configuration
    old_group_name = None
    if existing:
```
This logic should really be in the event implementation, rather than hardcoding a particular event's pattern into the manager. The notion is to pass both the current and previous state of the function so the event can do whatever specific logic it needs. As is, this just violates encapsulation.
I'll work on it.
lgtm
@kapilt Thanks for your help pushing this across the finish line.

@scolebrook no problem, thanks for working on it and making it happen.
Add optional EB Scheduler for periodic mode if the `scheduler_role` mode property is present. EB Scheduler uses an execution role to invoke a Lambda function instead of a permission (mentioned in `lambda.rst` with a link to the AWS docs). All three schedule types are supported, with `timezone` (defaults to `Etc/UTC`), `start_time`, `end_time` and `group_name` (defaults to `default`) properties. If specified, the schedule group must already exist.

Use of EB Scheduler requires additional actions for the deployment identity (covered in `deployment.rst`).

Converting an existing periodic policy from EB Rule style to EB Scheduler style causes the old rule to be deleted if it exists. Similarly, converting in the other direction causes the schedule to be deleted if it exists. This requires checking for the schedule in all groups (since the group the schedule was in is lost at that point).

These conversion checks have resulted in extra API calls for the `test_mu.py` suite, so the `placebo_data.zip` has been updated.

EB Scheduler allows 1,000,000 schedules per region with more upon request. EventBridge allows only 300 rules per region (as low as 100 in some regions) with more upon request. By moving periodic policies to EB Scheduler, more rule capacity is freed for other types of rules, like CloudTrail.