Selectively Forward Prometheus Metrics From Agents to AMP

2024年02月21日


In this blog post, I will explain how the AMP cost is reduced while supporting the AMG visualization functionality. This article only focuses on the cost optimization topic regarding AMP, and only focus on reducing the cost from a specific aspect of the AMP related bills.

Prologue

EVERY CENT COUNTS.


Background

Currently, the tianzhui.cloud web-site have several panels visualized using AMG dashboard.




With the default configurations, the AMP service has generated considerable cloud expenditure in the past 7 days (approximately).

For more information regarding the setup of Observability, refer to the previous blog post Integrate AWS EKS with AMP using Self-managed Collector and AMG.


Billing Analysis

It is observed that the AMP has generated the following billing items in the past 7 days (approximately).
- 700M metric datapoints / samples, which corresponds to a billing amount of US$65.
- 0.1 GB per month of storage, which corresponds to the AMP free tier pricing model.
- 17M metric datapoints /query sample processed, which corresponds to the AMP free tier pricing model.

Note
For detailed explanation of the AMP pricing model, refer to Amazon Managed Service for Prometheus pricing.

From the above cost statements, it could be obviously identified that the metric datapoints / samples forwarded from the Prometheus agent to the AMP workspace is the main and single item that forms the AMP related cost. In the next section, I will explain how I deep dived and reduced this cost.


Cost Optimization Methodology

Define the Cost Optimization Strategy

From an AMG end-user's perspective, the requirements towards Prometheus is only to collect the metrics that I'm interested which are used for visualization using Grafana. In other words, the requirements towards the metrics forwarded to the AMP is only to support the queries run by Grafana.

In the future, if new panels are added in the Grafana dashboard, the corresponding metrics will be additionaly forwarded from the Prometheus agent to the AMP.

Define the Necessary Metrics on the AMP Side

To enumerate all the metrics that are needed for the Grafana queries in PromQL, below PromQL query statements have been analyzed:
Panel Name PromQL Query Statement Metric Used
Node CPU Utilization
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
node_cpu_seconds_total
Network Bandwidth Usage
sum(rate(container_network_receive_bytes_total{pod!="",namespace="example"}[5m])) by (pod)
+ sum(rate(container_network_transmit_bytes_total{pod!="",namespace="example"}[5m])) by (pod)
container_network_receive_bytes_total
container_network_transmit_bytes_total
Node Memory Utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
Pod Restart Count
sum(kube_pod_container_status_restarts_total) by (namespace, pod)
kube_pod_container_status_restarts_total
Pods Running State by Namespace
count(kube_pod_status_phase{phase="Running"}) by (namespace)
kube_pod_status_phase
CPU Usage by Namespace
sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace)
container_cpu_usage_seconds_total
Disk Utilization by Persistent Volume
sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_used_bytes)
kubelet_volume_stats_used_bytes
Memory Usage by Namespace
sum(container_memory_usage_bytes{container!="",container!="POD"}) by (namespace)
container_memory_usage_bytes
Number of Deployments per Namespace count(kube_deployment_created) by (namespace) kube_deployment_created

To sum up, AMG only needs the following 11 metrics on the AMP side:
  • node_cpu_seconds_total
  • container_network_receive_bytes_total
  • container_network_transmit_bytes_total
  • node_memory_MemTotal_bytes
  • node_memory_MemAvailable_bytes
  • kube_pod_container_status_restarts_total
  • kube_pod_status_phase
  • container_cpu_usage_seconds_total
  • kubelet_volume_stats_used_bytes
  • container_memory_usage_bytes
  • kube_deployment_created

(Optional) Confirm the Metric Name
To double confirm if the above metric names are correct, go to Grafana web page (not the AMP console), list all metrics that are on the Prometheus server.
To list all metrics that are available on a Prometheus server from Grafana, you can leverage the Grafana's Explore feature.
1. Open Grafana and log in.
2. Navigate to Explore: Use the compass icon on the left sidebar to open the Explore section.


3. Select Your Prometheus Data Source: At the top, use the dropdown to select the Prometheus data source you have configured in Grafana.


Below configuration:
...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total)$'
          source_labels: [__name__]
        - action: drop
          regex: '.*'
          source_labels: [__name__]
...
has the same result as this configuration:
...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total)$'
          source_labels: [__name__]
...


Conclusion

With below configuration (Helm value file), only the metrics that are currently used by the Grafana dashboard are forwarded to the AMP by Prometheus agent.
...
serverFiles:
  prometheus.yml:
    remote_write:
    - write_relabel_configs:
        - action: keep
          regex: '^(node_cpu_seconds_total|container_network_receive_bytes_total|container_network_transmit_bytes_total|node_memory_MemTotal_bytes|node_memory_MemAvailable_bytes|kube_pod_container_status_restarts_total|kube_pod_status_phase|kubelet_volume_stats_used_bytes|container_memory_usage_bytes|kube_deployment_created|container_cpu_usage_seconds_total)$'
          source_labels: [__name__]
...

lineItem/UsageStartDate lineItem/UsageEndDate lineItem/ProductCode lineItem/UsageType lineItem/UsageAmount lineItem/UnblendedRate lineItem/UnblendedCost
2024-02-21T00:00:00Z 2024-02-21T01:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17841644 0.00000009 1.60574796
2024-02-21T02:00:00Z 2024-02-21T03:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17845427 0.00000009 1.60608843
2024-02-21T04:00:00Z 2024-02-21T05:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17296534 0.00000009 1.55668806
2024-02-21T06:00:00Z 2024-02-21T07:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17512803 0.00000009 1.57615227
2024-02-21T08:00:00Z 2024-02-21T09:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17802462 0.00000009 1.60222158
2024-02-21T10:00:00Z 2024-02-21T11:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17798373 0.00000009 1.60185357
2024-02-21T12:00:00Z 2024-02-21T13:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 17783600 0.00000009 1.600524
2024-02-21T14:00:00Z 2024-02-21T15:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 10111983 0.00000009 0.91007847
2024-02-21T16:00:00Z 2024-02-21T17:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 196540 0.00000009 0.0176886
2024-02-21T18:00:00Z 2024-02-21T19:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 223142 0.00000009 0.02008278
2024-02-21T20:00:00Z 2024-02-21T21:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 224352 0.00000009 0.02019168
2024-02-21T22:00:00Z 2024-02-21T23:00:00Z AmazonPrometheus USW2-AMP:MetricSampleCount 223545 0.00000009 0.02011905
The billing item regarding USW2-AMP:MetricSampleCount has been reduced by 98.7%, and the monthly cost of this billing category has been reduced from US$ 581 to US$ 7.

Costs ($) graph with hourly granularity.


Usage graph (Metric Datapoints) with hourly granularity.

-


Upvote


Downvote