Troubleshooting a KubeCost Pod Creation Issue
January 5, 2024
This article records how I troubleshot an issue in which the KubeCost Pods could not be created. For the detailed process of deploying KubeCost, refer to my previous blog post: Deliver better Insights of your cloud bills using KubeCost and AWS CUR.
Symptom
The KubeCost Pods and their containers were stuck in either Pending or ContainerCreating status and were never created successfully.
Investigation
The investigation starts with the problematic status of the KubeCost Pods.
$ kubectl get po -n kubecost
NAME                                         READY   STATUS              RESTARTS   AGE
kubecost-cost-analyzer-69dd6cb7c8-hbl4b      0/2     ContainerCreating   0          9s
kubecost-prometheus-server-fd678dff7-x9zjn   0/1     ContainerCreating   0          14h
For more information about the KubeCost Pods, describe them and check their events.
$ kubectl describe po -n kubecost -l app=cost-analyzer
$ kubectl describe po -n kubecost -l app=prometheus
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/6 nodes are available: 1 node(s) had volume node affinity conflict, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  25s   karpenter          ...
Alternatively, the events may show the following error:
...
Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Normal   Scheduled           77s                 default-scheduler        Successfully assigned kubecost/kubecost-cost-analyzer-7bdb9fbc86-lszx4 to ip-10-0-x-yy.us-west-2.compute.internal
  Warning  FailedAttachVolume  47s (x25 over 76s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-f7725059-d313-4192-9c79-d85fb5a658e9" : rpc error: code = Internal desc = Could not attach volume "vol-0f83***3443" to node "i-06fbxxx1a9e": error listing AWS instances: WebIdentityErr: failed to retrieve credentials caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity status code: 403, request id: xxxx
Or:
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  60s   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
$ kubectl describe po -n kubecost -l app=cost-analyzer
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  81s   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
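Describing Pods one by one gets tedious when several of them are stuck. As a minimal sketch, the Warning events of the whole namespace can be listed in a single call. The helper name `warning_events` is hypothetical; `--field-selector` and `--sort-by` are standard `kubectl get` options.

```shell
# List only Warning events in a namespace, sorted by last occurrence.
# warning_events is a hypothetical helper name; the namespace defaults to kubecost.
warning_events() {
  kubectl get events -n "${1:-kubecost}" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

Running `warning_events kubecost` should surface the same FailedScheduling and FailedAttachVolume events without needing to know the Pod labels.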
The error messages indicate that a service account could not assume an IAM role, and that there is a problem with the PVC.
Checking the PVC shows that it is in Pending status.
$ kubectl get pvc kubecost-cost-analyzer -n kubecost
NAME                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
kubecost-cost-analyzer   Pending                                      gp2            17m
For more information about the PVC, describe it and check its events.
$ kubectl describe pvc kubecost-cost-analyzer -n kubecost
...
Events:
  Type     Reason                Age                    From  Message
  ----     ------                ----                   ----  -------
  Normal   Provisioning          17m                    ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_d08f087d-e186-4d44-8ccb-a4a6c13ecbd4  External provisioner is provisioning volume for claim "kubecost/kubecost-cost-analyzer"
  Normal   WaitForFirstConsumer  17m                    persistentvolume-controller  waiting for first consumer to be created before binding
  Warning  ProvisioningFailed    17m                    ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-b2c50706-b61d-4424-a03e-c6b0595fce03": could not create volume in EC2: WebIdentityErr: failed to retrieve credentials caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity status code: 403, request id: xxxx
  ...
  Normal   ExternalProvisioning  2m33s (x62 over 17m)   persistentvolume-controller  Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          2m15s (x12 over 17m)   ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  External provisioner is provisioning volume for claim "kubecost/kubecost-cost-analyzer"
  Warning  ProvisioningFailed    2m15s (x3 over 8m43s)  ebs.csi.aws.com_ebs-csi-controller-6fcb897cfc-79lgh_7a11f7bb-513e-447e-90e5-b3aec6ce8b21  (combined from similar events): failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-b2c50706-b61d-4424-a03e-c6b0595fce03": could not create volume in EC2: WebIdentityErr: failed to retrieve credentials caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity status code: 403, request id: xxxx
The failing events all come from the ebs.csi.aws.com provisioner, so I inferred that it was the AWS EBS CSI Driver's service account that could not assume the IAM role specified for the corresponding EKS add-on.
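One way to check this is to read the role ARN annotated on the driver's service account and then print that role's trust policy. The sketch below assumes the service account is `ebs-csi-controller-sa` in `kube-system` and that the role is attached through the `eks.amazonaws.com/role-arn` annotation; the three helper names are hypothetical.

```shell
# Print the IAM role ARN annotated on a service account.
# $1 = namespace, $2 = service account name
sa_role_arn() {
  kubectl get serviceaccount "$2" -n "$1" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Extract the role name (the part after the last "/") from a role ARN.
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print the trust policy (assume-role policy document) of a role.
# $1 = role name
show_trust_policy() {
  aws iam get-role --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' --output json
}
```

For example, `show_trust_policy "$(role_name_from_arn "$(sa_role_arn kube-system ebs-csi-controller-sa)")"` prints the policy to compare against the service account name.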
Solution
Previously, the trust policy of the IAM role for the Amazon EBS CSI Driver EKS add-on was misconfigured as shown below. The wildcard in the sub condition is incorrect: StringEquals compares values exactly and does not expand "*", so the condition never matches the real service account name. Until now this misconfiguration had not affected any application, so I had not noticed it.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:aud": "sts.amazonaws.com",
          "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:sub": "system:serviceaccount:kube-system:ebs-csi-*"
        }
      }
    }
  ]
}
Correct the trust policy to be:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:aud": "sts.amazonaws.com",
          "oidc.eks.us-west-2.amazonaws.com/id/BFDB***D49F:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
        }
      }
    }
  ]
}
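The difference between the two policies comes down to how the sub condition is compared. The following sketch imitates that comparison in shell: `string_equals` is an exact match, and `string_like` is a rough stand-in for IAM's StringLike operator, which does honor wildcards. Both function names are hypothetical, and shell patterns are only an approximation of IAM matching (shell also treats `[...]` specially).

```shell
# Exact comparison, as StringEquals does: "*" is just a literal character.
string_equals() {
  [ "$1" = "$2" ]
}

# Approximation of StringLike: "*" and "?" in the pattern ($1) act as wildcards.
string_like() {
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# The sub claim presented by the EBS CSI controller's service account.
sub='system:serviceaccount:kube-system:ebs-csi-controller-sa'

# Wildcard value under an exact comparison: never matches.
string_equals 'system:serviceaccount:kube-system:ebs-csi-*' "$sub" \
  && echo "match" || echo "no match"

# Exact service account name under an exact comparison: matches.
string_equals 'system:serviceaccount:kube-system:ebs-csi-controller-sa' "$sub" \
  && echo "match" || echo "no match"

# Wildcard value under a wildcard-aware comparison: matches.
string_like 'system:serviceaccount:kube-system:ebs-csi-*' "$sub" \
  && echo "match" || echo "no match"
```

The first call prints "no match" and the other two print "match". Either fix works in principle: use the exact service account name with StringEquals, as done here, or keep the wildcard and switch the operator to StringLike.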
After the trust policy is corrected, the PVC is bound and the Pods are running.
$ kubectl get pvc kubecost-cost-analyzer -n kubecost
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
kubecost-cost-analyzer   Bound    pvc-b2xxxx06-xxxx-xxxx-xxxx-c6b0xxxxce03   32Gi       RWO            gp2            20m
$ kubectl get po -n kubecost
NAME                                         READY   STATUS    RESTARTS   AGE
kubecost-cost-analyzer-5cb6fd4f9d-t6d94      2/2     Running   0          8m3s
kubecost-prometheus-server-fd678dff7-tq6wp   1/1     Running   0          8m3s
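The fix and the verification can also be scripted. This is a sketch under a few assumptions: the corrected policy is saved in a local file, the role name is supplied by the caller, and kubectl is recent enough (v1.23 or later) to support `--for=jsonpath` in `kubectl wait`. The helper names and the 300s timeout are my own choices.

```shell
# Replace the trust policy of a role with the contents of a local JSON file.
# $1 = role name, $2 = path to the policy file
apply_trust_policy() {
  aws iam update-assume-role-policy \
    --role-name "$1" \
    --policy-document "file://$2"
}

# Block until the KubeCost PVC is Bound and the cost-analyzer Pod is Ready.
# $1 = namespace (defaults to kubecost)
wait_for_kubecost() {
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    pvc/kubecost-cost-analyzer -n "${1:-kubecost}" --timeout=300s &&
  kubectl wait --for=condition=Ready pod -l app=cost-analyzer \
    -n "${1:-kubecost}" --timeout=300s
}
```

A typical run would be `apply_trust_policy <role-name> trust-policy.json && wait_for_kubecost`, where the role name is whatever role the EBS CSI Driver add-on uses in your cluster.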