Memory leak when actor metrics are enabled #745

Open

antoniomacri opened this issue Oct 18, 2022 · 4 comments

@antoniomacri

Describe the bug
Enabling actor metrics:

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

causes a memory leak when actors are stopped.

To Reproduce
Run the following main:

package main

import (
	"fmt"
	"github.com/asynkron/protoactor-go/actor"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel/metric/global"
	"log"
	"math/rand"
	"net/http"
	_ "net/http/pprof"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for i := 0; i < 20_000; i++ {
		message := &MyData{id: uuid.New().String(), value: randomString(100_000)}
		props := actor.PropsFromProducer(func() actor.Actor {
			return &myActor{data: make(map[string]*MyData)}
		})
		pid, err := system.Root.SpawnNamed(props, message.id)
		if err != nil && err != actor.ErrNameExists {
			fmt.Printf("Error while spawning actor=%v err=%v\n", pid, err)
		}
		system.Root.Send(pid, message)
		if err := system.Root.PoisonFuture(pid).Wait(); err != nil {
			fmt.Printf("Error while terminating actor=%v err=%v\n", pid, err)
		}
		time.Sleep(5 * time.Millisecond)
	}
	fmt.Println("-- Done.")

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
}

type myActor struct {
	data map[string]*MyData
}

type MyData struct {
	id    string
	value string
}

func (a *myActor) Receive(ctx actor.Context) {
	switch msg := ctx.Message().(type) {
	case *MyData:
		a.data[msg.id] = msg
	}
}

func randomString(length int) string {
	const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	b := make([]byte, length)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

A few GB of RAM are retained even though the actors are stopped immediately after sending the data message.
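
The repro already exposes pprof on localhost:6060, so the retained allocations can be inspected with go tool pprof http://localhost:6060/debug/pprof/heap. For an in-process view, here is a minimal sketch (not part of the original repro; it assumes "runtime" is added to the imports) that logs heap usage while the loop runs:

	go func() {
		// Periodically print heap statistics so the retained memory is
		// visible without attaching pprof.
		var m runtime.MemStats
		for {
			runtime.ReadMemStats(&m)
			log.Printf("HeapInuse=%d MiB NumGC=%d", m.HeapInuse/1024/1024, m.NumGC)
			time.Sleep(5 * time.Second)
		}
	}()

With metrics enabled, HeapInuse should keep climbing across iterations, mirroring the behaviour described above; with the plain actor.NewActorSystem() it stays roughly flat.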

By replacing

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

with

	system := actor.NewActorSystem()

the memory leak disappears.

Expected behavior
The memory is released when the actors are stopped.

Additional context
In props.go the default spawner contains this code:

		ctx := newActorContext(actorSystem, props, parentContext.Self())
		mb := props.produceMailbox()

		// prepare the mailbox number counter
		if ctx.actorSystem.Config.MetricsProvider != nil {
			sysMetrics, ok := ctx.actorSystem.Extensions.Get(extensionId).(*Metrics)
			if ok && sysMetrics.enabled {
				if instruments := sysMetrics.metrics.Get(metrics.InternalActorMetrics); instruments != nil {
					sysMetrics.PrepareMailboxLengthGauge()
					meter := global.Meter(metrics.LibName)
					if err := meter.RegisterCallback([]instrument.Asynchronous{instruments.ActorMailboxLength}, func(goCtx context.Context) {
						instruments.ActorMailboxLength.Observe(goCtx, int64(mb.UserMessageCount()), sysMetrics.CommonLabels(ctx)...)
					}); err != nil {
						err = fmt.Errorf("failed to instrument Actor Mailbox, %w", err)
						plog.Error(err.Error(), log.Error(err))
					}
				}
			}
		}

Specifically, the closure func(goCtx context.Context) captures mb and ctx, and it is never unregistered after the actor is stopped.

This seems to be the root cause of the memory leak.
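
As a standalone illustration of the pattern (independent of protoactor-go and OpenTelemetry; every name below is made up for the example), a callback list that is only ever appended to pins every value captured by its closures for the lifetime of the process:

package main

import "fmt"

// callbacks stands in for the meter's internal callback list: it lives for
// the whole process and is never pruned.
var callbacks []func()

// mailbox stands in for a per-actor mailbox holding buffered messages.
type mailbox struct{ buf [1 << 20]byte }

func register(mb *mailbox) {
	// The closure captures mb; as long as it sits in the global slice,
	// mb cannot be garbage collected, even after the "actor" that owned
	// it has stopped.
	callbacks = append(callbacks, func() { _ = len(mb.buf) })
}

func main() {
	for i := 0; i < 1000; i++ {
		register(&mailbox{}) // roughly 1 GiB retained in total, never released
	}
	fmt.Println("registered", len(callbacks), "callbacks")
}

In the spawner above the same thing happens once per spawned actor: each RegisterCallback call adds a closure over mb and ctx that nothing ever removes.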

@antoniomacri
Author

Also, it's not clear to me how that metric (ActorMailboxLength) is supposed to work.
Shouldn't it sum the mailbox lengths of all living actors?
Instead it seems to overwrite the gauge value with the mailbox length of whichever actor's callback ran last.

@antoniomacri
Author

I'm keen to help with this issue.

Any hint on how to fix the leak and correctly implement the mailbox metric?

@rogeralsing
Collaborator

Thanks for the detailed report.
I have to dive into this to see what the memory issue could be.

@antoniomacri
Author

It should be due to the registered callback for each actor, which keeps references to its mailbox and context via the closure.

So changing the callback logic should also fix the leak.

I'm thinking about registering a single callback in the OpenTelemetry Meter. The Go implementation of the Meter doesn't provide a way to unregister a callback (although this seems to be mandated by the spec...). Therefore this would have to be handled explicitly. WDYT?
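
One possible shape for that, sketched under the assumption that the Metrics extension keeps its own registry of live mailboxes (every name below is hypothetical, not the actual protoactor-go API): the spawner registers each mailbox, actor termination deregisters it, and a single callback, registered once per ActorSystem, reads the registry and feeds ActorMailboxLength.Observe.

package metrics // hypothetical placement, next to the existing Metrics extension

import "sync"

// MailboxRegistry would be owned by the Metrics extension.
type MailboxRegistry struct {
	mailboxes sync.Map // actor PID (string) -> func() int returning mb.UserMessageCount()
}

// Register would be called by the spawner instead of meter.RegisterCallback.
func (r *MailboxRegistry) Register(pid string, count func() int) {
	r.mailboxes.Store(pid, count)
}

// Deregister would be called when the actor terminates, dropping the only
// long-lived reference to its mailbox so it can be garbage collected.
func (r *MailboxRegistry) Deregister(pid string) {
	r.mailboxes.Delete(pid)
}

// Sum is what the single meter callback would report; per-actor labels could
// instead be kept by observing each entry individually inside the callback.
func (r *MailboxRegistry) Sum() int64 {
	var total int64
	r.mailboxes.Range(func(_, v any) bool {
		total += int64(v.(func() int)())
		return true
	})
	return total
}

This keeps exactly one long-lived callback per ActorSystem, so nothing per-actor outlives its actor, and it also gives the gauge well-defined semantics (an aggregate over living actors, which matches the earlier question about ActorMailboxLength).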

Thanks for the response.

antoniomacri added a commit to antoniomacri/protoactor-go that referenced this issue Jan 5, 2023