[SPARK-48298][Core] Add TCP mode to StatsD sink #46604
base: master
Conversation
Hi @cloud-fan @dongjoon-hyun @yaooqinn Could you take a look when you get a chance, or perhaps pull other people in for review? Thank you 🙇
```scala
private object DataSender {
  def get(host: String, port: Int, protocol: DataSenderType, connTimeoutMs: Int): DataSender = {
    val ds = protocol match {
      case TCP => new TCPDataSender(host, port, connTimeoutMs)
```
I'm not using TCP long connections here. The consideration is:
- For metrics we do not heavily use the connection: it is usually a single shot of a few metrics every period of a few seconds, 10s by default. It is very costly, even overwhelming, for the StatsD server side to keep a large number of long-lived connections, given that in production environments we usually have a ton of Spark applications/drivers/executors.
- The drawback to consider is that with a short TCP connection, the port cannot be reused for a short while due to TCP TIME_WAIT. This is a tradeoff, and it somewhat depends on what the `interval` is.
- There is also a cost to initializing and destroying a TCP connection.
- Simplicity. To maintain a long connection, we may have to handle a lot more failure scenarios, e.g., detecting and reconnecting after a server-side glitch or transient network issues, and possibly a retry mechanism, perhaps with backoff and jitter, etc. Not sure if this complexity is expected, though, given that the current UDP mode is just fire-and-forget.
- Other considerations: a short-lived TCP connection is better for load balancing, automatic failover, etc. when, let's say, the StatsD service is hosted by multiple hosts behind a load balancer (e.g., a DNS, or pods behind a Service in Kubernetes). That can be another long story in itself.

So I don't see an ideal solution for all scenarios (this is probably why we are offering both UDP and TCP modes); we might provide both options (short-lived and long-lived) for TCP mode in case it's worth it.
Please feel free to discuss if you have different opinions.
What changes were proposed in this pull request?
A new trait `DataSender` is added in this PR to provide a unified data sender interface for the StatsdReporter. Under this trait we have implementations for UDP mode (`UDPDataSender`, which keeps the existing UDP-based implementation) and TCP mode (`TCPDataSender`). Which mode/sender to use is configurable, and the default mode is still UDP.

The unit tests are refactored to add TCP mode testing (context provided by `withSocketAndSinkTCP`) in addition to the existing tests for UDP mode (renamed from `withSocketAndSink` to `withSocketAndSinkUDP`).
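To make the shape of the change concrete, here is a minimal, self-contained sketch of what such a trait hierarchy could look like. This is an illustrative reconstruction, not the PR's actual code: the `DataSender`, `UDPDataSender`, and `TCPDataSender` names and the `get` factory signature come from the PR, but the method bodies are assumptions (the TCP sender opens a short-lived connection per send, matching the discussion above):

```scala
import java.io.OutputStreamWriter
import java.net.{DatagramPacket, DatagramSocket, InetAddress, InetSocketAddress, Socket}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical protocol enum; the PR's actual DataSenderType may differ.
sealed trait DataSenderType
case object UDP extends DataSenderType
case object TCP extends DataSenderType

trait DataSender {
  def send(data: String): Unit
  def close(): Unit
}

// Fire-and-forget UDP, mirroring the pre-existing StatsdReporter behavior.
class UDPDataSender(host: String, port: Int) extends DataSender {
  private val socket = new DatagramSocket()
  private val address = InetAddress.getByName(host)

  override def send(data: String): Unit = {
    val bytes = data.getBytes(UTF_8)
    socket.send(new DatagramPacket(bytes, bytes.length, address, port))
  }

  override def close(): Unit = socket.close()
}

// Short-lived TCP connection per send, as discussed in the review thread.
class TCPDataSender(host: String, port: Int, connTimeoutMs: Int) extends DataSender {
  override def send(data: String): Unit = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), connTimeoutMs)
      val out = new OutputStreamWriter(socket.getOutputStream, UTF_8)
      out.write(data + "\n")
      out.flush()
    } finally {
      socket.close()
    }
  }

  override def close(): Unit = ()
}

object DataSender {
  def get(host: String, port: Int, protocol: DataSenderType, connTimeoutMs: Int): DataSender = {
    protocol match {
      case TCP => new TCPDataSender(host, port, connTimeoutMs)
      case UDP => new UDPDataSender(host, port)
    }
  }
}
```

The reporter would then call `DataSender.get(...)` once with the configured protocol and use the resulting sender uniformly, so the UDP/TCP distinction stays out of the reporting logic.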
)Why are the changes needed?
As mentioned in the Jira ticket: https://issues.apache.org/jira/browse/SPARK-48298
Currently, the StatsdSink in Spark supports UDP mode only, which is the default mode of StatsD. However, in real production environments, we often find that a more reliable transmission of metrics is needed to avoid metrics loss in high-traffic systems, and also provide more flexibility in network configurations.
TCP mode is already supported by StatsD (https://github.com/statsd/statsd/blob/master/docs/server.md), by Prometheus' statsd_exporter (https://github.com/prometheus/statsd_exporter), and also by many other StatsD-based metrics proxies/receivers.
Does this PR introduce any user-facing change?
Yes.
The following new config options are added to `conf/metrics.properties.template`:
- `*.sink.statsd.protocol`
- `*.sink.statsd.connTimeoutMs`
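For illustration, a configuration enabling TCP mode might look like the fragment below. The `class`, `host`, and `port` keys are the pre-existing StatsD sink options; the host/port values and the exact accepted value for `protocol` (assumed here to be `tcp`/`udp`) are placeholders, not confirmed by this description:

```properties
# Enable the StatsD sink for all instances (existing options).
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125

# New options from this PR: transport protocol and TCP connect timeout.
*.sink.statsd.protocol=tcp
*.sink.statsd.connTimeoutMs=1000
```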
A new error condition is defined in error-conditions.json for protocol configuration error.
How was this patch tested?
Added/Refactored unit tests. Also manually tested with metric configurations sending metrics to a Netcat server in TCP and UDP modes.
Was this patch authored or co-authored using generative AI tooling?
No.