SPT platform (社消平台): k8s node loss leaves pods unschedulable

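Background: during a routine update of the SPT platform, I found several pods stuck in the Terminating state for as long as 10 minutes, which left the updated pods in Pending, unable to be scheduled. After force-deleting the stuck pods with --force --grace-period=0, the newly created pods were still the old version, and the new-version pods stayed in Pending. This situation usually comes down to one of the following causes: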

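  • Node failure: the node hosting the pod may have become unreachable, so Kubernetes cannot communicate with it.
  • Persistent volume not unmounted: a persistent volume used by the pod failed to unmount.
  • PreStop hook: the pod's PreStop hook did not complete successfully or timed out.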
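
As a rough first triage for the list above, it can help to see which nodes the stuck pods sit on: if they cluster on one node, node failure becomes the prime suspect. The helper below is only a sketch and not part of the original procedure. `terminating_by_node` is a hypothetical name, and it assumes the `kubectl get pod -o wide` column layout seen in this post's v1.20 output (STATUS in column 3, NODE in column 7).

```shell
# Reads `kubectl get pod -o wide` output on stdin and prints
# "<count> <node>" for every node that hosts Terminating pods,
# busiest node first.
# Assumption: STATUS is the 3rd column and NODE the 7th.
terminating_by_node() {
  awk '$3 == "Terminating" { count[$7]++ }
       END { for (node in count) print count[node], node }' | sort -rn
}

# Intended use (not executed here):
#   kubectl get pod -o wide | terminating_by_node
```

A node that dominates this list is worth checking over ssh or the console before looking at volumes or hooks.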

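The first thing to establish is whether the node hosting the pods has become unreachable. When I tried to log in remotely over ssh, I got the following:

[Screenshot of the ssh login prompt; image not preserved]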

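Logging in through the bastion host with a username and password was also refused. At that point it was all but certain that the node was unreachable. The immediate priority, though, was keeping the cluster accessible (the tech group chat was about to explode). First, mark the node unschedulable: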

root@spt-kubernetes-master:~# kubectl cordon spt-kubernetes-vice-1
node/spt-kubernetes-vice-1 cordoned
root@spt-kubernetes-master:~# kubectl get nodes
NAME                         STATUS                     ROLES                  AGE    VERSION
spt-elasticsearch-data       Ready                      <none>                 399d   v1.20.0
spt-elasticsearch-logstash   Ready                      <none>                 399d   v1.20.0
spt-elasticsearch-master     Ready                      <none>                 399d   v1.20.0
spt-kubernetes-common        Ready                      <none>                 399d   v1.20.0
spt-kubernetes-common-1      Ready                      <none>                 399d   v1.20.0
spt-kubernetes-common-2      Ready                      <none>                 399d   v1.20.0
spt-kubernetes-common-3      Ready                      <none>                 399d   v1.20.0
spt-kubernetes-master        Ready                      control-plane,master   399d   v1.20.0
spt-kubernetes-vice          Ready                      <none>                 399d   v1.20.0
spt-kubernetes-vice-1        Ready,SchedulingDisabled   <none>                 399d   v1.20.0
spt-mongod-iot               Ready                      <none>                 399d   v1.20.0
spt-mongos-iot               Ready                      <none>                 399d   v1.20.0
spt-rabbitmq-service         Ready                      <none>                 399d   v1.20.0
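
A cordoned node stays unschedulable until someone runs `kubectl uncordon` on it, so it is easy to forget one after an incident. The following is a minimal sketch for listing such nodes. `cordoned_nodes` is a hypothetical helper name, and it assumes the `kubectl get nodes` layout shown above, with STATUS in the second column.

```shell
# Reads `kubectl get nodes` output on stdin and prints the names of
# nodes whose STATUS column contains "SchedulingDisabled".
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Intended use (not executed here):
#   kubectl get nodes | cordoned_nodes
#   kubectl uncordon spt-kubernetes-vice-1   # once the node is healthy again
```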

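Delete all pods on the spt-kubernetes-vice-1 node:

PS: why not use drain to evict the pods gracefully? For the reason already given: the old-version pods were stuck in Terminating (holding their slots without doing any work), which kept the new pods in Pending.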

kubectl get pod -owide | grep spt-kubernetes-vice-1 | awk '{print $1}'| xargs -I {} kubectl delete pod {} --force
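
One caveat with the pipeline above: grep matches the node name anywhere on the line, so filtering for a shorter name such as spt-kubernetes-vice would also pick up pods on spt-kubernetes-vice-1. A sketch of an exact-match variant follows. `pods_on_node` is a hypothetical helper, and it assumes NODE is the 7th column of `kubectl get pod -o wide`, as in this post's output.

```shell
# Reads `kubectl get pod -o wide` output on stdin and prints the names
# of pods whose NODE column is exactly the given node name.
pods_on_node() {
  node="$1"
  awk -v node="$node" 'NR > 1 && $7 == node { print $1 }'
}

# Intended use (not executed here):
#   kubectl get pod -o wide | pods_on_node spt-kubernetes-vice-1 \
#     | xargs -I {} kubectl delete pod {} --force --grace-period=0
#
# A server-side alternative that avoids text parsing altogether:
#   kubectl get pod --field-selector spec.nodeName=spt-kubernetes-vice-1 -o name
```

Matching on the column rather than the whole line also avoids hitting a pod whose own name happens to contain the node name.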

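That temporarily resolved the access outage. Next up was the unreachable node itself. I reported the problem to the resource administrators, who tried to log in through the console (without success) and saw it repeatedly reporting insufficient heap memory. With ssh unavailable there was nothing else I could do, so the resource administrators had to reboot the machine for me.

[Screenshot; image not preserved]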

After the reboot I could log in again, but once I re-enabled scheduling on the node, a new problem appeared:

root@spt-kubernetes-master:~# kubectl get pod -owide | grep spt-kubernetes-vice-1
aiot-socialservicesystem-cfcbdf66d-rlkr5                  0/1     ContainerCreating   0          35s     <none>          spt-kubernetes-vice-1        <none>           <none>
aiot-socialunitordersystem-ccc9c8cd9-f2skc                0/1     ContainerCreating   0          25s     <none>          spt-kubernetes-vice-1        <none>           <none>
aiot-socialunitordersystem-ccc9c8cd9-pd72z                0/1     ContainerCreating   0          31s     <none>          spt-kubernetes-vice-1        <none>           <none>
aiot-socialunitordersystem-ccc9c8cd9-zb8tk                0/1     ContainerCreating   0          31s     <none>          spt-kubernetes-vice-1        <none>           <none>

The new pods could not be created for a long time. Checking the pod Events showed that flannel on the spt-kubernetes-vice-1 node was the source of the error:

root@spt-kubernetes-master:~# kubectl describe pod aiot-socialservicesystem-cfcbdf66d-rlkr5
................ (output omitted) ................
Events:
  Type     Reason                  Age                     From               Message
  ----     ------                  ----                    ----               -------
  Normal   Scheduled               45s                     default-scheduler  Successfully assigned default/aiot-socialservicesystem-cfcbdf66d-rlkr5 to spt-kubernetes-vice-1
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "181894f81c178d0fc035f3f006d2cbceae2e0c7138d29855d5dea9114358e218" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "de3021a7e42dbbe3ef3027ee5914a5207f4d20c3042725eb7af442c956460fe3" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "6e03439c752d5637bde6d721d56d767c0f4a676399f0d942a4989a484fe23a3b" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4a81190b56284c60183c16898164700c3c137b13bfe8028ae779ee695d38599e" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "95c60ea50e74de80abf83b03433123a27ab956b5409bbc4561f3c83a2840ea82" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cab1da3814483b17143c51c5d1de178499933248c752725f9bf55a80ee745927" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bbde6e26a0c3e4e741d7120b4083d4e223faa3b92925420c989d7bf2eec14d78" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3c64dc61f7861938bb6f9a250c6621cbb4ede1425a6f4703d4a5f9dbdf445660" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  7h59m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "a440a77b4c87deb042b00166496c9b8197d340a9a135286b6860004b78ccbf4e" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          7h59m (x12 over 7h59m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  7h59m (x4 over 7h59m)   kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d6f92c7d062b0682a6a12ff1f728659eb9b259eec921850ccfa15a41456a8128" network for pod "aiot-socialservicesystem-cfcbdf66d-rlkr5": networkPlugin cni failed to set up pod "aiot-socialservicesystem-cfcbdf66d-rlkr5_default" network: open /run/flannel/subnet.env: no such file or directory
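
The events above repeat the same failure with a different sandbox ID each time, which makes the actual cause hard to spot. As a sketch (the helper name `sandbox_failure_causes` is hypothetical), the distinct trailing error can be pulled out of `kubectl describe pod` output by keeping only the text after the last "network: " on each FailedCreatePodSandBox line:

```shell
# Reads `kubectl describe pod` output on stdin and prints each distinct
# root-cause message found at the end of FailedCreatePodSandBox events.
# sed's leading ".*" is greedy, so everything up to the LAST "network: "
# on the line is stripped.
sandbox_failure_causes() {
  grep 'FailedCreatePodSandBox' | sed -n 's/.*network: //p' | sort -u
}

# Intended use (not executed here):
#   kubectl describe pod aiot-socialservicesystem-cfcbdf66d-rlkr5 | sandbox_failure_causes
```

For the events shown here this collapses to a single line pointing at the missing /run/flannel/subnet.env file.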

Checking flannel showed that its pod had exited with status code 137, an abnormal exit (kube-proxy had also exited abnormally):

root@spt-kubernetes-master:~# kubectl get pod -n kube-system -owide | grep spt-kubernetes-vice-1
kube-flannel-ds-wwrfb                           0/1     Completed   1          356d   10.172.14.6     spt-kubernetes-vice-1        <none>           <none>
kube-proxy-c9jpv                                0/1     Error       2          399d   10.172.14.6     spt-kubernetes-vice-1
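
On reading the 137: by the usual shell and container-runtime convention, an exit code above 128 means the process was terminated by a signal, with code = 128 + signal number. So 137 - 128 = 9, which is SIGKILL, the signal the kernel OOM killer sends. The small sketch below just performs that arithmetic; `describe_exit_code` is a hypothetical helper name.

```shell
# Interprets a container exit code using the "128 + signal" convention.
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Intended use (not executed here); the jsonpath assumes the first
# container of the pod and a recorded last termination state:
#   describe_exit_code "$(kubectl -n kube-system get pod kube-flannel-ds-wwrfb \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```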

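This was probably caused by the node's earlier OOM. Both pods appear to be managed by DaemonSets, so deleting them is enough: they are recreated automatically.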

root@spt-kubernetes-master:~# kubectl -n kube-system get pod -owide | grep spt-kubernetes-vice-1 | awk '{print $1}'| xargs -I {} kubectl delete pod -n kube-system {} --force
root@spt-kubernetes-master:~# kubectl get pod -n kube-system -owide | grep spt-kubernetes-vice-1
kube-flannel-ds-79mvm                           1/1     Running   1          5m58s
kube-proxy-qkbj6                                1/1     Running   0          29s
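
Once flannel is running again it writes /run/flannel/subnet.env on the node, the file the earlier CNI error complained about. As a hedged sketch, the check below can be run on the node itself to confirm the file is back and to read the subnet assigned to that node. `flannel_subnet` is a hypothetical helper, and the optional path argument exists only so the function can be pointed at another file.

```shell
# Prints the FLANNEL_SUBNET value from flannel's env file, or fails
# with a message on stderr if the file is missing or unreadable.
flannel_subnet() {
  env_file="${1:-/run/flannel/subnet.env}"
  if [ ! -r "$env_file" ]; then
    echo "missing or unreadable: $env_file" >&2
    return 1
  fi
  sed -n 's/^FLANNEL_SUBNET=//p' "$env_file"
}

# Intended use on the affected node (not executed here):
#   flannel_subnet
```

If the function fails, the flannel pod on that node has most likely not started cleanly yet.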

Checking again, new pods are now being created on the spt-kubernetes-vice-1 node one after another:

root@spt-kubernetes-master:~# kubectl get pod -owide | grep spt-kubernetes-vice-1
aiot-socialservicesystem-cfcbdf66d-jgzq8                  0/1     Running             0          2s      10.244.7.122    spt-kubernetes-vice-1        <none>           <none>
aiot-socialservicesystem-cfcbdf66d-rlkr5                  1/1     Running             2          14m     10.244.7.120    spt-kubernetes-vice-1        <none>           <none>
aiot-socialservicesystem-cfcbdf66d-xwp7w                  0/1     Running             0          6s      10.244.7.121    spt-kubernetes-vice-1        <none>           <none>
aiot-socialunitordersystem-ccc9c8cd9-8k7p6                0/1     ContainerCreating   0          0s      <none>          spt-kubernetes-vice-1        <none>           <none>
aiot-socialunitordersystem-ccc9c8cd9-t65gy                0/1     Running             0          26s     10.244.7.123    spt-kubernetes-vice-1        <none>           <none>