Troubleshooting a KubeCost Pod Creation Issue

2024-01-05


This article records how I troubleshot an issue in which the KubeCost Pods could not be created. For the detailed process of deploying KubeCost, refer to my previous blog post: Deliver better Insights of your cloud bills using KubeCost and AWS CUR.

Symptom

The KubeCost Pods were stuck in either Pending or ContainerCreating status, and their containers never started successfully.

Investigation

The investigation starts with observing the problematic status of the KubeCost Pods.

$ kubectl get po -n kubecost
NAME                                         READY   STATUS              RESTARTS   AGE
kubecost-cost-analyzer-69dd6cb7c8-hbl4b      0/2     ContainerCreating   0          9s
kubecost-prometheus-server-fd678dff7-x9zjn   0/1     ContainerCreating   0          14h

To get more information about the KubeCost Pods, check their events by describing the Pods.
$ kubectl describe po -n kubecost -l app=cost-analyzer
$ kubectl describe po -n kubecost -l app=prometheus
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/6 nodes are available: 1 node(s) had volume node affinity conflict, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  25s   karpenter
...
Or, one may observe the error messages below:
...
Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Normal   Scheduled           77s                 default-scheduler        Successfully assigned kubecost/kubecost-cost-analyzer-7bdb9fbc86-lszx4 to ip-10-0-x-yy.us-west-2.compute.internal
  Warning  FailedAttachVolume  47s (x25 over 76s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-f7725059-d313-4192-9c79-d85fb5a658e9" : rpc error: code = Internal desc = Could not attach volume "vol-0f83***3443" to node "i-06fbxxx1a9e": error listing AWS instances: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
  status code: 403, request id: xxxx
Or:
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  60s   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition


$ kubectl describe po -n kubecost -l app=cost-analyzer
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  81s   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

The error messages indicate that a service account could not assume an IAM role. They also show that there is an issue with the PVC.
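
When the relevant events are spread across several Pods, it can help to list every warning in the namespace in one place. The following is a minimal sketch, assuming kubectl is already pointed at the cluster; the function name list_warning_events is made up for illustration, and the namespace defaults to the one used in this post.

```shell
# List only Warning events in a namespace, sorted by last-seen time.
# The function name is illustrative; $1 is the namespace (default: kubecost).
list_warning_events() {
  kubectl get events -n "${1:-kubecost}" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

Calling list_warning_events without arguments queries the kubecost namespace; pass another namespace (for example kube-system) to look at the CSI driver's own events.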

Checking the PVC shows that its status is Pending.
$ kubectl get pvc kubecost-cost-analyzer -n kubecost
NAME                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
kubecost-cost-analyzer   Pending                                      gp2            17m

To get more information about the PVC, check its events by describing the PVC.
$ kubectl describe pvc kubecost-cost-analyzer -n kubecost
...
Events:
  Type     Reason                Age   From                                                                                      Message
  ----     ------                ----  ----                                                                                      -------
  Normal   Provisioning          17m   ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_d08f087d-e186-4d44-8ccb-a4a6c13ecbd4  External provisioner is provisioning volume for claim "kubecost/kubecost-cost-analyzer"
  Normal   WaitForFirstConsumer  17m   persistentvolume-controller                                                               waiting for first consumer to be created before binding
  Warning  ProvisioningFailed    17m   ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-b2c50706-b61d-4424-a03e-c6b0595fce03": could not create volume in EC2: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
           status code: 403, request id: xxxx
...
  Normal   ExternalProvisioning  2m33s (x62 over 17m)   persistentvolume-controller                                                               Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          2m15s (x12 over 17m)   ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  External provisioner is provisioning volume for claim "kubecost/kubecost-cost-analyzer"
  Warning  ProvisioningFailed    2m15s (x3 over 8m43s)  ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  (combined from similar events): failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-b2c50706-b61d-4424-a03e-c6b0595fce03": could not create volume in EC2: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
  status code: 403, request id: xxxx

From here, I can infer that it is the Amazon EBS CSI Driver's service account that could not assume the role specified for the corresponding EKS add-on.
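
One way to confirm this is to look at which IAM role the driver's service account is annotated with, and then read that role's trust policy. The sketch below assumes the service account is ebs-csi-controller-sa in the kube-system namespace and that the AWS CLI has permission to call iam:GetRole; both function names are made up for illustration.

```shell
# Print the IAM role ARN annotated on the EBS CSI controller's service account.
# Assumes the service account is ebs-csi-controller-sa in kube-system.
csi_role_arn() {
  kubectl get serviceaccount ebs-csi-controller-sa -n kube-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Print the trust policy (assume-role policy document) of an IAM role.
# $1 is the role name, i.e. the last path segment of the role ARN.
show_trust_policy() {
  aws iam get-role --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' --output json
}
```

For example, after storing the output of csi_role_arn in a variable named arn, running show_trust_policy "${arn##*/}" prints the trust policy to compare against the service account name.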

Solution

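Previously, the trust policy of the IAM role for the Amazon EBS CSI Driver EKS add-on was misconfigured as shown below. The sub condition ends in a wildcard (ebs-csi-*), but StringEquals compares values literally and does not expand wildcards, so the condition never matches the actual service account. This misconfiguration had not caused any visible impact until now, so I had not noticed it.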
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:aud": "sts.amazonaws.com",
                    "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:sub": "system:serviceaccount:kube-system:ebs-csi-*"
                }
            }
        }
    ]
}
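
The difference between literal and wildcard comparison can be illustrated in plain shell. This is only a rough analogy, not how IAM evaluates policies: string_equals stands in for the StringEquals operator and string_like for the wildcard-aware StringLike operator, and both function names are made up for this sketch.

```shell
# Literal comparison, analogous to StringEquals: "*" is an ordinary character.
string_equals() {
  [ "$1" = "$2" ]
}

# Glob comparison, loosely analogous to StringLike: "*" matches any run of characters.
string_like() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

sub="system:serviceaccount:kube-system:ebs-csi-controller-sa"
pattern="system:serviceaccount:kube-system:ebs-csi-*"

if string_equals "$sub" "$pattern"; then echo "StringEquals: match"; else echo "StringEquals: no match"; fi
if string_like "$sub" "$pattern"; then echo "StringLike: match"; else echo "StringLike: no match"; fi
```

With the literal comparison the wildcard pattern does not match the real service account name, which is consistent with the AccessDenied errors above.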
Correct the trust policy as follows:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:aud": "sts.amazonaws.com",
                    "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
                }
            }
        }
    ]
}
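
If the policy is edited from the command line rather than the console, the update could look roughly like the sketch below. It assumes the corrected JSON is saved in a local file and that the controller runs as a Deployment named ebs-csi-controller in kube-system (inferred from the Pod name in the events above); the function names, the role name, and the file name are placeholders.

```shell
# Replace the trust policy of an IAM role with the JSON document in a local file.
# $1: role name (placeholder), $2: path to the corrected trust policy JSON.
apply_trust_policy() {
  aws iam update-assume-role-policy \
    --role-name "$1" \
    --policy-document "file://$2"
}

# Optionally restart the controller so it retries with fresh credentials right away.
# The Deployment name and namespace are assumptions; this step may not be necessary.
restart_ebs_csi_controller() {
  kubectl rollout restart deployment ebs-csi-controller -n kube-system
}
```

For example, apply_trust_policy MyEbsCsiRole trust-policy.json would update a hypothetical role named MyEbsCsiRole from a file named trust-policy.json.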

After the trust policy is corrected, the PVC is bound and both KubeCost Pods are running.
$ kubectl get pvc kubecost-cost-analyzer -n kubecost
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
kubecost-cost-analyzer   Bound    pvc-b2xxxx06-xxxx-xxxx-xxxx-c6b0xxxxce03   32Gi       RWO            gp2            20m

$ kubectl get po -n kubecost
NAME                                         READY   STATUS    RESTARTS   AGE
kubecost-cost-analyzer-5cb6fd4f9d-t6d94      2/2     Running   0          8m3s
kubecost-prometheus-server-fd678dff7-tq6wp   1/1     Running   0          8m3s


Category: container Tags: public
