Selectively Forward Prometheus Metrics From Agents to AMP

February 21, 2024

In this blog post, I explain how I reduced the cost of AMP (Amazon Managed Service for Prometheus) while still supporting the visualizations in AMG (Amazon Managed Grafana). The article covers cost optimization for AMP only, and focuses on one specific aspect of the AMP bill.

Prologue

EVERY CENT COUNTS.

Background

Currently, the tianzhui.cloud website has several panels visualized on an AMG dashboard.

With the default configuration, AMP has generated considerable cost over roughly the past 7 days.

For more information regarding the setup of Observability, refer to the previous blog post Integrate AWS EKS with AMP using Self-managed Collector and AMG.

Billing Analysis

Over roughly the past 7 days, AMP has generated the following billing items:
- 700M metric samples (datapoints) ingested, which corresponds to a charge of US$65.
- 0.1 GB-month of storage, which falls within the AMP free tier.
- 17M query samples processed, which falls within the AMP free tier.

Note
For a detailed explanation of the AMP pricing model, refer to Amazon Managed Service for Prometheus pricing.
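As a rough sanity check on the ingestion line item, the sketch below multiplies the sample count by a per-sample rate. The rate of US$0.00000009 per sample is taken from the lineItem/UnblendedRate column of the billing export shown later in this post; the function name is illustrative, and other regions or pricing tiers may use a different rate.

```python
def ingestion_cost_usd(samples: int, rate_per_sample: float = 0.00000009) -> float:
    """Estimate the AMP ingestion charge for a number of ingested samples.

    rate_per_sample defaults to the unblended rate seen in this post's
    billing export; treat it as an assumption for any other account.
    """
    return samples * rate_per_sample


# Roughly 700M samples, as reported in the billing items above.
estimate = ingestion_cost_usd(700_000_000)
print(f"Estimated ingestion cost: US${estimate:.2f}")
```

With the rounded figure of 700M samples, the estimate comes to about US$63, which is in line with the roughly US$65 billed.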

The figures above make it clear that the metric samples forwarded from the Prometheus agent to the AMP workspace are the single item driving the AMP cost. In the next section, I explain how I investigated and reduced this cost.

Cost Optimization Methodology

Define the Cost Optimization Strategy

From an AMG end user's perspective, Prometheus only needs to collect the metrics that I am interested in, namely those visualized in Grafana. In other words, the metrics forwarded to AMP only need to support the queries that Grafana runs.

In the future, if new panels are added to the Grafana dashboard, the corresponding metrics will also be forwarded from the Prometheus agent to AMP.

Define the Necessary Metrics on the AMP Side

To enumerate the metrics that the Grafana queries need, I analyzed the following PromQL query statements:

Panel Name	PromQL Query Statement	Metric Used
Node CPU Utilization	`100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`	node_cpu_seconds_total
Network Bandwidth Usage	`sum(rate(container_network_receive_bytes_total{pod!="",namespace="example"}[5m])) by (pod)` `+ sum(rate(container_network_transmit_bytes_total{pod!="",namespace="example"}[5m])) by (pod)`	container_network_receive_bytes_total container_network_transmit_bytes_total
Node Memory Utilization	`(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100`	node_memory_MemTotal_bytes node_memory_MemAvailable_bytes
Pod Restart Count	`sum(kube_pod_container_status_restarts_total) by (namespace, pod)`	kube_pod_container_status_restarts_total
Pods Running State by Namespace	`count(kube_pod_status_phase{phase="Running"}) by (namespace)`	kube_pod_status_phase
CPU Usage by Namespace	`sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace)`	container_cpu_usage_seconds_total
Disk Utilization by Persistent Volume	`sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_used_bytes)`	kubelet_volume_stats_used_bytes
Memory Usage by Namespace	`sum(container_memory_usage_bytes{container!="",container!="POD"}) by (namespace)`	container_memory_usage_bytes
Number of Deployments per Namespace	`count(kube_deployment_created) by (namespace)`	kube_deployment_created

To sum up, AMG only needs the following 11 metrics on the AMP side:

node_cpu_seconds_total
container_network_receive_bytes_total
container_network_transmit_bytes_total
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
kube_pod_container_status_restarts_total
kube_pod_status_phase
container_cpu_usage_seconds_total
kubelet_volume_stats_used_bytes
container_memory_usage_bytes
kube_deployment_created
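A list like this can be turned into an anchored regular expression mechanically. The following is a minimal sketch; the names REQUIRED_METRICS and build_keep_regex are hypothetical helpers for illustration, and no escaping is applied on the assumption that metric names contain only letters, digits, underscores, and colons.

```python
import re

# The 11 metric names listed above.
REQUIRED_METRICS = [
    "node_cpu_seconds_total",
    "container_network_receive_bytes_total",
    "container_network_transmit_bytes_total",
    "node_memory_MemTotal_bytes",
    "node_memory_MemAvailable_bytes",
    "kube_pod_container_status_restarts_total",
    "kube_pod_status_phase",
    "container_cpu_usage_seconds_total",
    "kubelet_volume_stats_used_bytes",
    "container_memory_usage_bytes",
    "kube_deployment_created",
]


def build_keep_regex(metric_names):
    """Join metric names into one anchored alternation: ^(a|b|c)$."""
    return "^(" + "|".join(metric_names) + ")$"


keep_regex = build_keep_regex(REQUIRED_METRICS)
print(keep_regex)

# Quick self-check: exact names match, near-misses do not.
print(bool(re.fullmatch(keep_regex, "kube_pod_status_phase")))
print(bool(re.fullmatch(keep_regex, "kube_pod_status_phase_extra")))
```

Generating the expression from a list keeps it easier to review when panels are added or removed later.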

(Optional) Confirm the Metric Name

To double-check that the metric names above are correct, open the Grafana web page (not the AMP console) and list all metrics available on the Prometheus server. Grafana's Explore feature can be used for this:
1. Open Grafana and log in.
2. Navigate to Explore: Use the compass icon on the left sidebar to open the Explore section.

3. Select Your Prometheus Data Source: At the top, use the dropdown to select the Prometheus data source you have configured in Grafana.
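The same check can also be scripted. The sketch below assumes the list of metric names has already been fetched from the Prometheus HTTP API endpoint /api/v1/label/__name__/values (how the request is sent and authenticated is out of scope here) and only parses the JSON response. The helper name missing_metrics and the sample payload are hypothetical.

```python
import json


def missing_metrics(response_text, required):
    """Return the required metric names that are absent from the API response.

    response_text is the JSON body returned by
    GET /api/v1/label/__name__/values.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus API did not report success")
    available = set(payload["data"])
    return sorted(set(required) - available)


# Hypothetical response body, shortened for illustration.
sample_response = json.dumps(
    {"status": "success", "data": ["node_cpu_seconds_total", "up"]}
)
print(missing_metrics(
    sample_response,
    ["node_cpu_seconds_total", "kube_pod_status_phase"],
))
```

An empty result would mean that every required metric name exists on the server.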

The following configuration:

...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total)$'
          source_labels: [__name__]
        - action: drop
          regex: '.*'
          source_labels: [__name__]
...

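is not equivalent to the configuration below. Relabel rules are applied in order, so the trailing drop rule with the regex '.*' also matches the series that the keep rule has just let through, and nothing would be forwarded. The keep action on its own already discards every series whose name does not match the regex, so no additional drop rule is needed: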

...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total)$'
          source_labels: [__name__]
...
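To illustrate how a single keep rule filters series by name, the sketch below imitates the keep action in plain Python. It is a simplified model and not Prometheus code: it assumes the regex has to match the entire metric name, and the scraped names other than the two in the regex are only examples.

```python
import re

# Same expression as in the write_relabel_configs snippet above.
KEEP_REGEX = r"^(node_cpu_seconds_total|container_network_receive_bytes_total)$"


def apply_keep(metric_names, regex):
    """Keep only the names that the regex matches in full; discard the rest."""
    pattern = re.compile(regex)
    return [name for name in metric_names if pattern.fullmatch(name)]


# Example metric names as they might arrive from a scrape.
scraped = [
    "node_cpu_seconds_total",
    "node_load1",
    "container_network_receive_bytes_total",
    "up",
]
print(apply_keep(scraped, KEEP_REGEX))
```

Only the two names listed in the regex survive; everything else is discarded before remote write.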

Conclusion

With the configuration below (Helm values file), the Prometheus agent forwards to AMP only the metrics that the Grafana dashboard currently uses.

...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total|container_network_transmit_bytes_total|node_memory_MemTotal_bytes|node_memory_MemAvailable_bytes|kube_pod_container_status_restarts_total|kube_pod_status_phase|kubelet_volume_stats_used_bytes|container_memory_usage_bytes|kube_deployment_created|container_cpu_usage_seconds_total)$'
          source_labels: [__name__]
...

lineItem/UsageStartDate	lineItem/UsageEndDate	lineItem/ProductCode	lineItem/UsageType	lineItem/UsageAmount	lineItem/UnblendedRate	lineItem/UnblendedCost
2024-02-21T00:00:00Z	2024-02-21T01:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17841644	0.00000009	1.60574796
2024-02-21T02:00:00Z	2024-02-21T03:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17845427	0.00000009	1.60608843
2024-02-21T04:00:00Z	2024-02-21T05:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17296534	0.00000009	1.55668806
2024-02-21T06:00:00Z	2024-02-21T07:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17512803	0.00000009	1.57615227
2024-02-21T08:00:00Z	2024-02-21T09:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17802462	0.00000009	1.60222158
2024-02-21T10:00:00Z	2024-02-21T11:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17798373	0.00000009	1.60185357
2024-02-21T12:00:00Z	2024-02-21T13:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	17783600	0.00000009	1.600524
2024-02-21T14:00:00Z	2024-02-21T15:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	10111983	0.00000009	0.91007847
2024-02-21T16:00:00Z	2024-02-21T17:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	196540	0.00000009	0.0176886
2024-02-21T18:00:00Z	2024-02-21T19:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	223142	0.00000009	0.02008278
2024-02-21T20:00:00Z	2024-02-21T21:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	224352	0.00000009	0.02019168
2024-02-21T22:00:00Z	2024-02-21T23:00:00Z	AmazonPrometheus	USW2-AMP:MetricSampleCount	223545	0.00000009	0.02011905

The USW2-AMP:MetricSampleCount billing item has been reduced by 98.7%, and the monthly cost of this billing category has dropped from US$581 to US$7.
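The reduction can be reproduced approximately from the export above. The sketch below averages the line items before and after the change and leaves out the transitional 14:00 row. The monthly projection uses twelve line items per day (the number of rows the export shows for this day) and a 30.4-day month; both values are assumptions of this sketch, not figures stated in the post.

```python
# (usage amount in samples, unblended cost in US$), copied from the export above.
BEFORE = [
    (17841644, 1.60574796),
    (17845427, 1.60608843),
    (17296534, 1.55668806),
    (17512803, 1.57615227),
    (17802462, 1.60222158),
    (17798373, 1.60185357),
    (17783600, 1.600524),
]
AFTER = [
    (196540, 0.0176886),
    (223142, 0.02008278),
    (224352, 0.02019168),
    (223545, 0.02011905),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def reduction_percent(before, after):
    """Percentage drop in the average sample count per line item."""
    before_avg = mean(samples for samples, _ in before)
    after_avg = mean(samples for samples, _ in after)
    return (1 - after_avg / before_avg) * 100


def monthly_cost(items, items_per_day=12, days_per_month=30.4):
    """Project a monthly cost from the average cost per line item (assumed scaling)."""
    return mean(cost for _, cost in items) * items_per_day * days_per_month


print(f"Reduction: {reduction_percent(BEFORE, AFTER):.1f}%")
print(f"Monthly before: US${monthly_cost(BEFORE):.0f}")
print(f"Monthly after:  US${monthly_cost(AFTER):.0f}")
```

Under these assumptions the reduction comes out just under 98.8%, and the monthly projections land close to the US$581 and US$7 quoted above.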

Costs ($) graph with hourly granularity.

Usage graph (Metric Datapoints) with hourly granularity.

Category: container Tags: AMP Prometheus AMG Grafana Observability PromQL public

Sky Cone