Refactor allocation calculation #2554
base: develop
Conversation
Signed-off-by: r2k1 <yokree@gmail.com>
Nice! Let's get some of the other @opencost/opencost-maintainers to weigh in.
Need to do a bit more review, and definitely welcome insight from other contributors here. I do generally like the idea of separating the query and processing steps, so the idea is sound. We just want to ensure that we stay as semantically equivalent as possible.
pkg/costmodel/allocation.go (Outdated)
```go
func (cm *CostModel) execAllPromQueries(queries []promQuery, time time.Time) error {
	ctx := prom.NewNamedContext(cm.PrometheusClient, prom.AllocationContextName)

	// Run all queries concurrently
	errCh := make(chan error, len(queries))
	for _, q := range queries {
		go func(q string, out *[]*prom.QueryResult) {
			var err error
			*out, _, err = ctx.QuerySync(q, time)
			errCh <- err
		}(q.query, q.out)
	}

	// Collect errors from the queries
	errs := make([]error, 0, len(queries))
	for range queries {
		if err := <-errCh; err != nil {
			log.Errorf("CostModel.ComputeAllocation: query context error %s", err)
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

Surrounding context in ComputeAllocation from the same diff:

```go
// TODO:CLEANUP remove "max batch" idea and clusterStart/End
err := cm.buildPodMap(window, resolution, env.GetETLMaxPrometheusQueryDuration(), podMap, clusterStart, clusterEnd, ingestPodUID, podUIDKeyMap)
if err != nil {
	log.Errorf("CostModel.ComputeAllocation: failed to build pod map: %s", err.Error())
}

// (2) Run and apply remaining queries
```
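For orientation, here is a hypothetical call site (not from the PR) showing how the `out` pointers are meant to be used; the query variables are placeholder PromQL strings, not existing identifiers:

```go
// Hypothetical call site: each query writes its results into the slice
// that the corresponding out pointer refers to.
var (
	podCPUResults []*prom.QueryResult
	podRAMResults []*prom.QueryResult
)
queries := []promQuery{
	{query: queryPodCPU, out: &podCPUResults}, // queryPodCPU is a placeholder PromQL string
	{query: queryPodRAM, out: &podRAMResults}, // queryPodRAM is a placeholder PromQL string
}
if err := cm.execAllPromQueries(queries, queryTime); err != nil {
	return err
}
// podCPUResults and podRAMResults are now populated
```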
Concerning this segment of code:
- The code this replaced was already executing the queries concurrently, and correctly awaiting and collecting results/errors. There is an internal error collection mechanism on the prometheus context that handles HTTP request errors as well as errors and warnings coming from prometheus (lines 177 to 178 in 3c75f52):

```go
// report all warnings, request, and parse errors (nils will be ignored)
ctx.errorCollector.Report(query, warnings, requestError, results.Error)
```

- This change appears to be mimicking what was already occurring, except the key difference is that a sync request is being made inside a goroutine instead of just calling `QueryAtTime`, which is already an async request. If you would like to remove the unrolled code in favor of a loop, please follow the exact semantics that existed before. This is off the cuff, but something similar to the following:
```go
// execute the queries concurrently
for _, q := range queries {
	q.result = ctx.QueryAtTime(q.query, time)
}

// await results for each query
for _, q := range queries {
	*q.out, _ = q.result.Await()
}

// use the ctx to retrieve the error collection if it exists
if ctx.HasErrors() {
	for _, err := range ctx.Errors() {
		log.Errorf("CostModel.ComputeAllocation: query context error %s", err)
	}
	return ctx.ErrorCollection()
}

return nil
```
- The ErrorCollection error type is important for upstream projects, just want to make sure it's not removed (opencost/pkg/costmodel/allocation.go, lines 594 to 600 in 3c75f52):

```go
if ctx.HasErrors() {
	for _, err := range ctx.Errors() {
		log.Errorf("CostModel.ComputeAllocation: query context error %s", err)
	}
	return allocSet, nil, ctx.ErrorCollection()
}
```

- A quick note about prometheus query concurrency: opencost uses a max-concurrency-throttled prometheus client which limits the total concurrent outbound queries. This avoids the prometheus memory bloat that can occur when making many concurrent query requests. See the env option (opencost/pkg/env/costmodelenv.go, lines 492 to 494 in 3c75f52) and the illustrative sketch after this list:

```go
func GetMaxQueryConcurrency() int {
	return env.GetInt(MaxQueryConcurrencyEnvVar, 5)
}
```
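To make the throttling idea concrete, here is a minimal, generic sketch (not OpenCost's actual implementation) of bounding in-flight queries with a buffered-channel semaphore; `runThrottled`, `maxConcurrency`, and `runQuery` are illustrative names, with `maxConcurrency` standing in for the value of `GetMaxQueryConcurrency()`:

```go
// Illustrative only, not OpenCost's implementation. Bound the number of
// in-flight queries with a buffered channel used as a semaphore; the
// capacity plays the role of GetMaxQueryConcurrency(). runQuery is a
// hypothetical stand-in for one Prometheus HTTP query. Uses errors.Join
// from the standard library (Go 1.20+).
func runThrottled(queries []string, maxConcurrency int, runQuery func(string) error) error {
	sem := make(chan struct{}, maxConcurrency)
	errCh := make(chan error, len(queries))

	for _, q := range queries {
		sem <- struct{}{} // acquire a slot; blocks while maxConcurrency queries are in flight
		go func(q string) {
			defer func() { <-sem }() // release the slot when the query returns
			errCh <- runQuery(q)
		}(q)
	}

	// collect one result per query
	var errs []error
	for range queries {
		if err := <-errCh; err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```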
Reverted to the original approach of async execution. I had been struggling to figure out how to apply it when I began, but after a few iterations and your suggestion it's clear now.
@mbolt35 While we are here, do you know why there is a custom implementation of the prometheus client that performs HTTP response parsing? Why isn't the original prometheus client used?
I should've been more specific. We are wrapping the prometheus.Client with concurrent query control logic as well as response handling for rate-limiting errors that come from managed prometheus. There is also a query file logger hook and metric tracking for concurrent query state. A simplified sketch of the wrapping idea follows the struct below.
Lines 113 to 129 in 3c75f52:

```go
//--------------------------------------------------------------------------
//  RateLimitedPrometheusClient
//--------------------------------------------------------------------------

// RateLimitedPrometheusClient is a prometheus client which limits the total number of
// concurrent outbound requests allowed at a given moment.
type RateLimitedPrometheusClient struct {
	id                string
	client            prometheus.Client
	auth              *ClientAuth
	queue             collections.BlockingQueue[*workRequest]
	decorator         QueryParamsDecorator
	rateLimitRetry    *RateLimitRetryOpts
	outbound          atomic.Int32
	fileLogger        *golog.Logger
	headerXScopeOrgId string
}
```
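For illustration, here is a heavily simplified sketch of the wrapping idea against the upstream `prometheus.Client` interface. This is not the actual `RateLimitedPrometheusClient` (which also covers the queue, retry, decorator, logging, and metric concerns above); `throttledClient` and `slots` are invented names:

```go
import (
	"context"
	"net/http"
	"net/url"
	"sync/atomic"

	prometheus "github.com/prometheus/client_golang/api"
)

// throttledClient is a simplified, invented example (not the real
// RateLimitedPrometheusClient): it decorates a prometheus.Client so that
// Do blocks until a concurrency slot is free, tracks the number of
// outbound requests, and then delegates to the inner client.
type throttledClient struct {
	inner    prometheus.Client
	slots    chan struct{} // capacity = max concurrent outbound requests
	outbound atomic.Int32
}

func (c *throttledClient) URL(ep string, args map[string]string) *url.URL {
	return c.inner.URL(ep, args)
}

func (c *throttledClient) Do(ctx context.Context, req *http.Request) (*http.Response, []byte, error) {
	select {
	case c.slots <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return nil, nil, ctx.Err()
	}
	defer func() { <-c.slots }() // release the slot

	c.outbound.Add(1)
	defer c.outbound.Add(-1)

	return c.inner.Do(ctx, req)
}
```

Because the wrapper satisfies the same interface, callers don't need to know they are talking to a throttled client.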
```go
type promQuery struct {
	out   *[]*prom.QueryResult
	query string
}
```
To accomplish the looped queries, you'll likely want to add a `result prom.QueryResultsChan` field.
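Putting the pieces together, here is a sketch of how the looped version could read once that field is added. The `QueryAtTime`/`Await` signatures are inferred from the discussion above, so treat them as assumptions to verify against the prom package; note that `queries` holds pointers so the `result` assignment inside the range loop persists:

```go
type promQuery struct {
	out    *[]*prom.QueryResult
	query  string
	result prom.QueryResultsChan // channel returned by the async QueryAtTime call
}

// Sketch only: issue every query asynchronously, then await each result.
// Errors are gathered by the context's internal error collector, as in the
// reviewer's snippet above.
func (cm *CostModel) execAllPromQueries(queries []*promQuery, time time.Time) error {
	ctx := prom.NewNamedContext(cm.PrometheusClient, prom.AllocationContextName)

	// execute the queries concurrently
	for _, q := range queries {
		q.result = ctx.QueryAtTime(q.query, time)
	}

	// await results for each query
	for _, q := range queries {
		*q.out, _ = q.result.Await()
	}

	// surface the collected errors, preserving the ErrorCollection type
	if ctx.HasErrors() {
		for _, err := range ctx.Errors() {
			log.Errorf("CostModel.ComputeAllocation: query context error %s", err)
		}
		return ctx.ErrorCollection()
	}
	return nil
}
```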
In general, I agree that this code, like many places in the code, could benefit from a careful clean up. The tough thing about that, and the reason we haven't done it, is that we don't have good tests. I would love to see more robust testing here, e.g., before/after compared over a period of time on a dynamic cluster, or significantly more complex unit test cases. Open to ideas.

In lieu of having good tests, we have gone with the conservative approach of "don't change it if it isn't broken." I'm still supportive of that approach, but understand that others may want to make changes here.

As far as a review goes, this one would require time that I'm not going to be able to offer until mid April. If this is still open then, I'll check back in and see how I can help.
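On the before/after testing idea, here is a rough sketch of what an equivalence test could look like; `computeAllocationOld`, `computeAllocationNew`, and `loadPromFixture` are hypothetical helpers, not existing functions, and the sketch assumes the standard `testing` and `reflect` packages:

```go
// Sketch of a before/after equivalence test. computeAllocationOld,
// computeAllocationNew, and loadPromFixture are hypothetical helpers: the
// first two would wrap the pre- and post-refactor code paths, the last
// would replay recorded Prometheus responses for a window of time.
func TestAllocationRefactorEquivalence(t *testing.T) {
	fixtures := []string{
		"testdata/small-cluster-1h.json",
		"testdata/dynamic-cluster-24h.json",
	}

	for _, f := range fixtures {
		t.Run(f, func(t *testing.T) {
			promData := loadPromFixture(t, f)

			oldSet, err := computeAllocationOld(promData)
			if err != nil {
				t.Fatalf("old path failed: %v", err)
			}
			newSet, err := computeAllocationNew(promData)
			if err != nil {
				t.Fatalf("new path failed: %v", err)
			}

			// The refactor is meant to be semantically equivalent, so the
			// resulting allocation sets should match exactly.
			if !reflect.DeepEqual(oldSet, newSet) {
				t.Errorf("allocation sets differ for fixture %s", f)
			}
		})
	}
}
```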
+1 I support this change. Test cases look good, maybe a bit more testing? I will take a closer look.
Quality Gate passed.
This pull request has been marked as stale because it has been open for 90 days with no activity. Please remove the stale label or comment, or this pull request will be closed in 5 days.
@r2k1 I think you still want to merge this, right?
What does this PR change?
Does this PR relate to any other PRs?
No
How will this PR impact users?
How was this PR tested?
Added unit tests. Ran a few endpoints locally without noticeable issues.
Does this PR require changes to documentation?
No
Have you labeled this PR and its corresponding Issue as "next release" if it should be part of the next OpenCost release? If not, why not?
I don't understand the meaning of the label. Are there PRs that aren't "next release"?