废柴阿尤 · updated 4 months ago · © Copyright belongs to the author; do not repost without permission.

Background: during a routine update of the SPT platform, I found several pods stuck in the Terminating state for as long as 10 minutes. As a result, the updated pods sat in Pending for a long time and could not be scheduled. After force-deleting the stuck pods with --force and --grace-period=0, the newly created pods were still the old version, and the new-version pods remained Pending. This situation usually comes down to one of the following:

Node failure: the node hosting the pod may be unreachable, so Kubernetes cannot communicate with it.
Persistent volume not unmounted: a persistent volume used by the pod failed to unmount.
PreStop hook: the pod's PreStop hook did not complete successfully, or timed out.

The first thing to check is whether the pod's node has become unreachable. My SSH login to the node failed, and logging in through the bastion host with username and password was also refused. At that point it was all but certain that the node was unreachable. The immediate priority, however, was to restore normal access to the cluster (the tech chat group was about to blow up). First, mark the node unschedulable:

    root@spt-kubernetes-master:~# kubectl cordon spt-kubernetes-vice-1
    node/spt-kubernetes-vice-1 cordoned
    root@spt-kubernetes-master:~# kubectl get nodes
    NAME                         STATUS                     ROLES                  AGE    VERSION
    spt-elasticsearch-data       Ready                      <none>                 399d   v1.20.0
    spt-elasticsearch-logstash   Ready                      <none>                 399d   v1.20.0
    spt-elasticsearch-master     Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-common        Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-common-1      Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-common-2      Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-common-3      Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-master        Ready                      control-plane,master   399d   v1.20.0
    spt-kubernetes-vice          Ready                      <none>                 399d   v1.20.0
    spt-kubernetes-vice-1        Ready,SchedulingDisabled   <none>                 399d   v1.20.0
    spt-mongod-iot               Ready                      <none>                 399d   v1.20.0
    spt-mongos-iot               Ready                      <none>                 399d   v1.20.0
    spt-rabbitmq-service         Ready                      <none>                 399d   v1.20.0

Next, delete all pods on the spt-kubernetes-vice-1 node (the command below only covers the current namespace).

PS: why not use drain to evict the pods gracefully? As mentioned above, the old-version pods were stuck in Terminating (occupying their slots without doing any work), which kept the new pods in Pending.

    kubectl get pod -owide | grep spt-kubernetes-vice-1 | awk '{print $1}' | xargs -I {} kubectl delete pod {} --force

That temporarily resolved the access problem. Next came the unreachable node itself. I reported the problem to the resource administrators; they tried to log in through the console (and could not get in) and saw it continuously reporting insufficient heap memory. With SSH unavailable there was nothing else I could do, so I asked the administrators to reboot the machine.

After the reboot I could log in again, but once I re-enabled scheduling on the node a new problem appeared:

    root@spt-kubernetes-master:~# kubectl get pod -owide | grep spt-kubernetes-vice-1
    aiot-socialservicesystem-cfcbdf66d-rlkr5     0/1   ContainerCreating   0   35s   <none>   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialunitordersystem-ccc9c8cd9-f2skc   0/1   ContainerCreating   0   25s   <none>   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialunitordersystem-ccc9c8cd9-pd72z   0/1   ContainerCreating   0   31s   <none>   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialunitordersystem-ccc9c8cd9-zb8tk   0/1   ContainerCreating   0   31s   <none>   spt-kubernetes-vice-1   <none>   <none>

The new pods could not be created for a long time. Checking the pod Events showed that flannel on the spt-kubernetes-vice-1 node was reporting errors:

    root@spt-kubernetes-master:~# kubectl describe pod aiot-socialservicesystem-cfcbdf66d-rlkr5
    ................ (output omitted) ................
    Events:
      Type     Reason                  Age                    From               Message
      ----     ------                  ----                   ----               -------
      Normal   Scheduled               45s                    default-scheduler  Successfully assigned default/aiot-socialservicesystem-cfcbdf66d-rlkr5 to spt-kubernetes-vice-1
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "181894f81c178d0fc035f3f006d2cbceae2e0c7138d29855d5dea9114358e218" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "de3021a7e42dbbe3ef3027ee5914a5207f4d20c3042725eb7af442c956460fe3" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "6e03439c752d5637bde6d721d56d767c0f4a676399f0d942a4989a484fe23a3b" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4a81190b56284c60183c16898164700c3c137b13bfe8028ae779ee695d38599e" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "95c60ea50e74de80abf83b03433123a27ab956b5409bbc4561f3c83a2840ea82" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cab1da3814483b17143c51c5d1de178499933248c752725f9bf55a80ee745927" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bbde6e26a0c3e4e741d7120b4083d4e223faa3b92925420c989d7bf2eec14d78" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3c64dc61f7861938bb6f9a250c6621cbb4ede1425a6f4703d4a5f9dbdf445660" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Warning  FailedCreatePodSandBox  7h59m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a440a77b4c87deb042b00166496c9b8197d340a9a135286b6860004b78ccbf4e" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
      Normal   SandboxChanged          7h59m (x12 over 7h59m)  kubelet            Pod sandbox changed, it will be killed and re-created.
      Warning  FailedCreatePodSandBox  7h59m (x4 over 7h59m)   kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d6f92c7d062b0682a6a12ff1f728659eb9b259eec921850ccfa15a41456a8128" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory

Inspecting flannel showed that the pod's exit code was 137 (128 + 9, meaning the container was killed with SIGKILL), so it had exited abnormally; kube-proxy had exited abnormally as well.

    root@spt-kubernetes-master:~# kubectl get pod -n kube-system -owide | grep spt-kubernetes-vice-1
    kube-flannel-ds-wwrfb   0/1   Completed   1   356d   10.172.14.6   spt-kubernetes-vice-1   <none>   <none>
    kube-proxy-c9jpv        0/1   Error       2   399d   10.172.14.6   spt-kubernetes-vice-1

This was probably caused by the earlier OOM on the node. Both pods are managed by DaemonSets, so simply deleting them and letting them be recreated is enough.

    root@spt-kubernetes-master:~# kubectl -n kube-system get pod -owide | grep spt-kubernetes-vice-1 | awk '{print $1}' | xargs -I {} kubectl delete pod -n kube-system {} --force
    root@spt-kubernetes-master:~# kubectl get pod -n kube-system -owide | grep spt-kubernetes-vice-1
    kube-flannel-ds-79mvm   1/1   Running   1   5m58s
    kube-proxy-qkbj6        1/1   Running   0   29s

Checking again, new pods are now being created one after another on the spt-kubernetes-vice-1 node.

    root@spt-kubernetes-master:~# kubectl get pod -owide | grep spt-kubernetes-vice-1
    aiot-socialservicesystem-cfcbdf66d-jgzq8     0/1   Running             0   2s    10.244.7.122   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialservicesystem-cfcbdf66d-rlkr5     1/1   Running             2   14m   10.244.7.120   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialservicesystem-cfcbdf66d-xwp7w     0/1   Running             0   6s    10.244.7.121   spt-kubernetes-vice-1   <none>   <none>
    aiot-socialunitordersystem-ccc9c8cd9-8k7p6   0/1   ContainerCreating   0   0s    <none>         spt-kubernetes-vice-1   <none>   <none>
    aiot-socialunitordersystem-ccc9c8cd9-t65gy   0/1   Running             0   26s   10.244.7.123   spt-kubernetes-vice-1   <none>   <none>
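The grep-based cleanup used in this post matches the node name anywhere in the output line (so spt-kubernetes-vice would also match spt-kubernetes-vice-1) and only looks at one namespace. As a rough sketch rather than the author's method, the same step could select pods by node with a field selector and only touch those actually shown as Terminating. The function names list_terminating_pods and force_delete_terminating_pods are hypothetical, and the sketch assumes the default kubectl get pods -A column layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Sketch only: the function names are made up for illustration.
# Assumes the default column order of `kubectl get pods -A --no-headers`:
# NAMESPACE NAME READY STATUS RESTARTS AGE

# Print "namespace name" for every pod on the given node whose STATUS
# column reads Terminating.
list_terminating_pods() {
    node="$1"
    kubectl get pods -A --no-headers --field-selector "spec.nodeName=${node}" |
        awk '$4 == "Terminating" { print $1, $2 }'
}

# Force-delete each pod reported by list_terminating_pods.
# Stdin is redirected so kubectl cannot consume the loop's input.
force_delete_terminating_pods() {
    node="$1"
    list_terminating_pods "$node" | while read -r ns name; do
        kubectl delete pod "$name" -n "$ns" --force --grace-period=0 </dev/null
    done
}
```

Run list_terminating_pods spt-kubernetes-vice-1 first to review what would be removed, then force_delete_terminating_pods spt-kubernetes-vice-1. Force deletion only removes the pod object from the API server; if the node is still unreachable, the containers may keep running on it until the node is rebooted or recovers.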