This directory contains utilities to benchmark the GMP collection stack on GKE clusters.
Make sure that your gcloud CLI is set up properly.
Define the cluster name, location, and scale:
BASE_DIR=$(git rev-parse --show-toplevel)
PROJECT_ID=$(gcloud config get-value core/project)
ZONE=us-central1-b # recommended for benchmarks
CLUSTER="gmp-bench-$USER"
NODE_COUNT=5
NODE_TYPE=e2-medium
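Before creating the cluster, it can help to confirm that none of these variables came out empty; `PROJECT_ID`, for example, is empty when no default project is configured in gcloud. The following is a minimal sketch; the `require_vars` helper is hypothetical and not part of this repository:

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_vars BASE_DIR PROJECT_ID ZONE CLUSTER NODE_COUNT NODE_TYPE` returns a non-zero status and names each variable that still needs a value.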
Create the cluster:
gcloud container clusters create "$CLUSTER" --zone "$ZONE" \
--machine-type="$NODE_TYPE" --num-nodes="$NODE_COUNT" \
--workload-pool="$PROJECT_ID.svc.id.goog" &&
gcloud container clusters get-credentials "$CLUSTER" --zone "$ZONE"
While the cluster is being created, build the container images for the benchmark. You can repeat the steps in this section to update the benchmark setup after code changes.
Build the container images from the current head of the repository:
pushd "$BASE_DIR" &&
make cloudbuild
popd
Make sure that you have the prometheus repository checked out in the same parent directory as the prometheus-engine repository.
Then build the Prometheus container image, including any changes to the libraries it uses from gmp-collector:
PROMETHEUS_IMAGE_TAG=$(date "+bench_%Y%m%d_%H%M")
PROMETHEUS_IMAGE="gcr.io/$PROJECT_ID/prometheus:$PROMETHEUS_IMAGE_TAG"
pushd "$BASE_DIR/../prometheus" &&
make promu &&
go mod vendor &&
promu crossbuild -p linux/amd64 &&
gcloud builds submit --tag "$PROMETHEUS_IMAGE" &&
popd
Deploy the base monitoring stack:
kubectl apply -f "$BASE_DIR/manifests/setup.yaml" &&
sleep 3 &&
kubectl apply -f "$BASE_DIR/manifests/operator.yaml"
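A fixed pause can be too short on a slow cluster. One alternative is to retry the second `kubectl apply` until it succeeds. The sketch below assumes that a failed apply is safe to repeat; the `retry` helper is hypothetical and not part of this repository:

```shell
# Hypothetical helper: run a command up to $1 times,
# sleeping $2 seconds between attempts.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

It could be used as `retry 5 3 kubectl apply -f "$BASE_DIR/manifests/operator.yaml"`, which tries at most five times with three seconds between attempts.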
Next, define the size of the example workload and deploy it. You may rerun this step as needed to change the size.
kubectl apply -f "$BASE_DIR/examples/pod-monitoring.yaml"
Lastly, run the operator locally. Running it locally instead of deploying it inside the cluster does not affect its behavior but makes iteration quicker.
# RELOADER_IMAGE is not set by the steps above; set it before running.
go run "$BASE_DIR"/cmd/operator/*.go \
--project-id="$PROJECT_ID" \
--cluster="$CLUSTER" \
--image-collector="$PROMETHEUS_IMAGE" \
--image-config-reloader="$RELOADER_IMAGE" \
--priority-class=gmp-critical
You may terminate the operator, rebuild images as needed by following the steps above, and start it again to deploy the new versions.
To tear down the setup, simply delete the cluster:
gcloud container clusters delete "$CLUSTER" --zone "$ZONE"
Go to the Cloud Monitoring Metrics Explorer for your project and check whether all targets are being scraped via the following MQL query (substitute the $CLUSTER name manually):
fetch prometheus_target
| metric 'prometheus.googleapis.com/up/gauge'
| filter (resource.cluster == '$CLUSTER')
| group_by [resource.job], [sum(val())]
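To avoid editing the query by hand, the shell can fill in the cluster name and print the finished text for pasting into the query editor. This is a small sketch; the `up_query` function name is hypothetical:

```shell
# Hypothetical helper: print the "up" MQL query with the cluster name filled in.
up_query() {
  cluster=$1
  # The unquoted heredoc delimiter lets the shell expand $cluster;
  # everything else in the query is passed through literally.
  cat <<EOF
fetch prometheus_target
| metric 'prometheus.googleapis.com/up/gauge'
| filter (resource.cluster == '$cluster')
| group_by [resource.job], [sum(val())]
EOF
}
```

Running `up_query "$CLUSTER"` prints the four-line query with the cluster name substituted.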
Further interesting cluster-wide queries are:
# Number of active streams by job.
fetch prometheus_target
| metric 'prometheus.googleapis.com/scrape_samples_scraped/gauge'
| filter resource.cluster == '$CLUSTER'
| group_by [resource.job], [sum(val())]
# Total number of scraped Prometheus samples per second.
fetch prometheus_target
| metric 'prometheus.googleapis.com/prometheus_tsdb_head_samples_appended_total/counter'
| filter resource.cluster == '$CLUSTER'
| align rate(1m)
| every 1m
| group_by [], [sum(val())]
If no metrics show up, connect directly to one of the collector pods and inspect the "Targets", "Configuration", or "Service Discovery" pages in the Prometheus UI for further debugging.
COLLECTOR_POD=$(kubectl -n gmp-system get pod -l "app.kubernetes.io/name=collector" -o name | head -n 1)
kubectl -n gmp-system port-forward --address 0.0.0.0 $COLLECTOR_POD 19090
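While the port-forward is running, target health can also be summarized from a second terminal. The sketch below assumes that the collector serves the standard Prometheus HTTP API (`/api/v1/targets`) on the forwarded port and returns compact JSON; the `targets_health` helper is hypothetical:

```shell
# Hypothetical helper: count active scrape targets by health value
# (for example "up" or "down") via the Prometheus HTTP API.
targets_health() {
  port=${1:-19090}
  curl -s "http://localhost:$port/api/v1/targets?state=active" |
    grep -o '"health":"[a-z]*"' |
    sort | uniq -c
}
```

Calling `targets_health` prints one line per health value with the number of targets in that state.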
To inspect resource usage, the setup provides Prometheus node_exporter metrics for node-wide resource consumption as well as cAdvisor metrics for container-level resource usage. You can query them either through MQL for the entire cluster or in the collector's Prometheus UI for an individual node.
Some interesting PromQL queries:
# Percentage of total node CPU in use.
1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))
# CPU usage (fraction of a core) by container.
sum by(container) (rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[2m]))
# Memory usage by container.
sum by(container) (container_memory_usage_bytes{container!="", container!="POD"})
# Number of actively scraped Prometheus time series.
sum by(job) (scrape_samples_scraped)
# Rate at which Prometheus samples are scraped.
rate(prometheus_tsdb_head_samples_appended_total[2m])
# Rate at which GCM samples are exported. This is expected to be lower because all series of a
# histogram map to a single GCM distribution.
rate(gcm_export_samples_exported_total[2m])
# Rate at which samples are dropped in the collector because they cannot be exported fast enough.
rate(gcm_export_samples_dropped_total[2m])
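These queries can also be run from the command line while the port-forward to a collector pod is active. The sketch below assumes that the collector exposes the standard Prometheus instant-query endpoint (`/api/v1/query`) on the forwarded port; the `promql` helper and the `PROM_PORT` variable are hypothetical:

```shell
# Hypothetical helper: run an instant PromQL query against the forwarded collector.
# -G sends the data as URL query parameters; --data-urlencode escapes the expression.
promql() {
  curl -s -G "http://localhost:${PROM_PORT:-19090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `promql 'rate(gcm_export_samples_dropped_total[2m])'` returns the JSON result for the drop rate on that node's collector.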