# Collection Benchmarking

This directory contains utilities to benchmark the GMP collection stack on GKE clusters.

## Spinup

Make sure that your gcloud CLI is set up properly.
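For example, you can authenticate and inspect the active configuration like this (`my-project` is a placeholder for your own project ID):

```bash
# Authenticate, if not done already.
gcloud auth login

# Inspect the active configuration (account, project, default zone, ...).
gcloud config list

# Set the project to run benchmarks in, if needed.
gcloud config set project my-project
```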

### Create Cluster

Define the cluster name, location, and scale:

```bash
BASE_DIR=$(git rev-parse --show-toplevel)
PROJECT_ID=$(gcloud config get-value core/project)
ZONE=us-central1-b # recommended for benchmarks
CLUSTER="gmp-bench-$USER"
NODE_COUNT=5
NODE_TYPE=e2-medium
```

Create the cluster:

```bash
gcloud container clusters create "$CLUSTER" --zone "$ZONE" \
    --machine-type="$NODE_TYPE" --num-nodes="$NODE_COUNT" \
    --workload-pool="$PROJECT_ID.svc.id.goog" &&
gcloud container clusters get-credentials "$CLUSTER" --zone "$ZONE"
```
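Once both commands have succeeded, verify that kubectl points at the new cluster and that all nodes registered:

```bash
# The current context should reference the benchmark cluster.
kubectl config current-context

# All $NODE_COUNT nodes should show up as Ready.
kubectl get nodes
```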

### Build Container Images

While the cluster is being created, we can build the container images for the benchmark. You can repeat the steps in this section to update the benchmark setup after code changes.

#### Repository: prometheus-engine

Build the container images from the current head of the repository:

pushd "$BASE_DIR" &&
make cloudbuild
popd
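Once the build finishes, you can check that the images arrived in your project's registry (assuming `make cloudbuild` pushes to `gcr.io/$PROJECT_ID`; the exact image paths depend on the Makefile):

```bash
# List images pushed to the project registry.
gcloud container images list --repository="gcr.io/$PROJECT_ID"
```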

#### Repository: prometheus

Make sure that you have the prometheus repository checked out in the same parent directory as the prometheus-engine repository.

Then build the Prometheus container image, including any changes to the libraries it vendors from gmp-collector:

```bash
PROMETHEUS_IMAGE_TAG=$(date "+bench_%Y%m%d_%H%M")
PROMETHEUS_IMAGE="gcr.io/$PROJECT_ID/prometheus:$PROMETHEUS_IMAGE_TAG"

pushd "$BASE_DIR/../prometheus" &&
make promu &&
go mod vendor &&
promu crossbuild -p linux/amd64 &&
gcloud builds submit --tag "$PROMETHEUS_IMAGE" &&
popd
```
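After the build completes, you can confirm that the new tag exists:

```bash
# Show the most recent tags of the benchmark Prometheus image.
gcloud container images list-tags "gcr.io/$PROJECT_ID/prometheus" --limit=5
```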

### Deploy

Deploy the base monitoring stack:

kubectl apply -f "$BASE_DIR/manifests/setup.yaml" &&
sleep 3 &&
kubectl apply -f "$BASE_DIR/manifests/operator.yaml"

Next, deploy the monitoring configuration for the example workload. You may rerun this step as needed after changing it.

kubectl apply -f "$BASE_DIR/examples/pod-monitoring.yaml" 

Lastly, we run the operator locally. Doing that instead of deploying it inside the cluster doesn't affect any behavior but allows for quicker iteration.
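The command below expects `$RELOADER_IMAGE` to point at the config-reloader image produced by `make cloudbuild` earlier. The exact path and tag depend on your build, so the following assignment is only a hypothetical example:

```bash
# Hypothetical example; substitute the path and tag your cloudbuild run actually produced.
RELOADER_IMAGE="gcr.io/$PROJECT_ID/gmp/config-reloader:<tag>"
```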

```bash
go run "$BASE_DIR"/cmd/operator/*.go \
  --project-id="$PROJECT_ID" \
  --cluster="$CLUSTER" \
  --image-collector="$PROMETHEUS_IMAGE" \
  --image-config-reloader="$RELOADER_IMAGE" \
  --priority-class=gmp-critical
```

You may terminate the operator, rebuild images as needed by following the steps above, and start it again to deploy the new versions.

## Teardown

To tear down the setup, simply delete the cluster:

```bash
gcloud container clusters delete "$CLUSTER" --zone "$ZONE"
```

## Evaluation

Go to the Cloud Monitoring Metrics Explorer for your project and check whether all targets are being scraped via the following MQL query (substitute the $CLUSTER name manually):

```
fetch prometheus_target
| metric 'prometheus.googleapis.com/up/gauge'
| filter (resource.cluster == '$CLUSTER')
| group_by [resource.job], [sum(val())]
```
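The same query can also be issued programmatically through the Cloud Monitoring API's timeSeries:query endpoint. A minimal sketch using curl with an access token from gcloud:

```bash
# Run the MQL query above through the Cloud Monitoring API.
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"fetch prometheus_target | metric 'prometheus.googleapis.com/up/gauge' | filter resource.cluster == '$CLUSTER' | group_by [resource.job], [sum(val())]\"}" \
  "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries:query"
```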

Further interesting cluster-wide queries are:

```
# Number of active streams by job.
fetch prometheus_target
| metric 'prometheus.googleapis.com/scrape_samples_scraped/gauge'
| filter resource.cluster == '$CLUSTER'
| group_by [resource.job], [sum(val())]

# Total number of scraped Prometheus samples per second.
fetch prometheus_target
| metric 'prometheus.googleapis.com/prometheus_tsdb_head_samples_appended_total/counter'
| filter resource.cluster == '$CLUSTER'
| align rate(1m)
| every 1m
| group_by [], [sum(val())]
```

If no metrics show up, directly connect to one of the collector pods and inspect the "Targets", "Configuration", or "Service Discovery" pages in the Prometheus UI for further debugging.

```bash
COLLECTOR_POD=$(kubectl -n gmp-system get pod -l "app.kubernetes.io/name=collector" -o name | head -n 1)
kubectl -n gmp-system port-forward --address 0.0.0.0 "$COLLECTOR_POD" 19090
```
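While the port-forward is running, the collector's Prometheus UI should be reachable at http://localhost:19090, e.g. the targets page at http://localhost:19090/targets.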

For inspecting resource usage, Prometheus node_exporter metrics are available for node-wide resource consumption as well as cAdvisor metrics for container-level resource usage. You can either query them through MQL for the entire cluster, or in a collector's Prometheus UI for an individual node.

Some interesting PromQL queries:

```
# Percentage of total node CPU in use.
1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))

# CPU usage (fraction of a core) by container.
sum by(container) (rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[2m]))

# Memory usage by container.
sum by(container) (container_memory_usage_bytes{container!="", container!="POD"})

# Number of actively scraped Prometheus time series.
sum by(job) (scrape_samples_scraped)

# Rate at which Prometheus samples are scraped.
rate(prometheus_tsdb_head_samples_appended_total[2m])

# Rate at which GCM samples are exported. This is expected to be lower as histogram series
# map to a single GCM distribution.
rate(gcm_export_samples_exported_total[2m])

# Rate at which samples are dropped in the collector because they cannot be exported fast enough.
rate(gcm_export_samples_dropped_total[2m])
```
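These queries can be run in the collector's Prometheus UI, or against its HTTP API through the port-forward from above. A minimal sketch:

```bash
# Query the number of actively scraped series per job via the Prometheus HTTP API.
curl -s "http://localhost:19090/api/v1/query" \
  --data-urlencode "query=sum by(job) (scrape_samples_scraped)"
```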