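Written 2021-12. Context: K8s 1.23 is GA, and 1.24 is about to remove dockershim. This article focuses on day-to-day cluster operations: upgrades, node isolation, an API quick reference, and a troubleshooting checklist.
1. kubectl command quick reference
1.1 Resource operations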
| # List (-o wide adds node and IP columns, -A covers all namespaces)
kubectl get all -o wide -A
kubectl get pods -n test
kubectl get nodes -o wide
# Describe (detailed state and events)
kubectl describe pod <pod-name> -n <ns>
kubectl describe node <node-name>
# Logs
kubectl logs -f <pod-name> -n <ns>
kubectl logs -f <pod-name> -c <container-name> -n <ns> # multi-container Pod
kubectl logs --previous <pod-name> -n <ns> # previous (crashed) instance
# Shell into a container
kubectl exec -it <pod-name> -n <ns> -- sh
kubectl exec -it <pod-name> -c <container> -n <ns> -- sh
# Delete
kubectl delete pod <pod-name> -n <ns>
kubectl delete -f deployment.yaml
kubectl delete ns <ns-name>
# Force-delete a Pod stuck in Terminating
kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force
# Edit (changes the live object in the cluster directly)
kubectl edit deployment <name> -n <ns>
kubectl edit cm <configmap> -n <ns>
# apply / replace (replace --force deletes the object and recreates it)
kubectl apply -f deployment.yaml
kubectl replace -f deployment.yaml --force
|
1.2 The three troubleshooting staples
| # describe: read the Events, the most useful of the three
kubectl describe pod <pod> -n <ns>
# What to look for:
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Normal Scheduled 60s default-scheduler Successfully assigned ...
# Normal Pulling 58s kubelet Pulling image "nginx:1.25"
# Warning Failed 30s kubelet Failed to pull image: ... timeout
# logs: read what the container printed
kubectl logs -f <pod> -n <ns>
kubectl logs --previous <pod> -n <ns> # previous instance
# exec: get inside the container
kubectl exec -it <pod> -n <ns> -- sh
|
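When the Events table is long, it can help to cut it down to the Warning rows. The following is a minimal sketch rather than part of the original workflow: `warning_events` is a hypothetical helper name, and it assumes the usual `kubectl describe` layout in which an `Events:` line precedes the table.

```shell
# Print only the Warning rows from the Events section of `kubectl describe`
# output read on stdin. Lines before "Events:" are ignored.
warning_events() {
  awk '/^Events:/ { in_events = 1; next }
       in_events && $1 == "Warning"'
}
```

Typical use would be `kubectl describe pod <pod> -n <ns> | warning_events`. If you want the filtering done by the API server instead, `kubectl get events -n <ns> --field-selector type=Warning` is an alternative.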
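2. Version upgrades
Key principle: Kubernetes ships roughly three minor releases a year (the cadence dropped from four to three starting with 1.22), so production clusters should not chase the newest one. Patch upgrades (the z in x.y.z) are low risk; minor upgrades (the y) need care.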
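To make the patch-versus-minor distinction concrete, here is a small sketch that classifies a planned upgrade from two version strings. `upgrade_kind` is a hypothetical helper, and it assumes plain `X.Y.Z` versions (an optional leading `v`, no pre-release suffixes).

```shell
# Classify an upgrade between two plain X.Y.Z versions.
# Prints one of: patch, minor, skip, same, downgrade, major.
# The body runs in a subshell, so the variables stay local to the function.
upgrade_kind() (
  from=${1#v}; to=${2#v}
  fmaj=${from%%.*}; rest=${from#*.}; fmin=${rest%%.*}; fpat=${rest#*.}
  tmaj=${to%%.*};   rest=${to#*.};   tmin=${rest%%.*}; tpat=${rest#*.}
  step=$((tmin - fmin))
  if   [ "$fmaj" -ne "$tmaj" ]; then echo major
  elif [ "$step" -lt 0 ];       then echo downgrade
  elif [ "$step" -gt 1 ];       then echo skip      # skips a minor version
  elif [ "$step" -eq 1 ];       then echo minor
  elif [ "$tpat" -gt "$fpat" ]; then echo patch
  elif [ "$tpat" -eq "$fpat" ]; then echo same
  else                               echo downgrade
  fi
)
```

For example, `upgrade_kind 1.28.5 1.28.8` reports a patch upgrade, while `upgrade_kind 1.26.0 1.28.0` reports `skip`, which is a signal to go through 1.27 first.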
2.1 Pre-upgrade checklist
| Check | Command |
|---|---|
| Current version | kubectl version |
| Node count | kubectl get nodes |
| Running workloads | kubectl get pods -A |
| Back up etcd | etcdctl snapshot save /tmp/etcd.db |
| Back up /etc/kubernetes | cp -r /etc/kubernetes /etc/kubernetes.bak |
| Back up certificates | cp -r /etc/kubernetes/pki /etc/kubernetes/pki.bak |
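The `cp -r` commands in the table write to a fixed `.bak` path, so a second run copies into the existing backup instead of replacing it. One way around that is a timestamped copy. This is a sketch under assumptions: `backup_dir` is a hypothetical helper, and `cp -a` (GNU or BSD cp) is used so that ownership and file modes of the key files are preserved.

```shell
# Copy a directory to "<dir>.bak-<timestamp>" and print the backup path.
# Fails if the source is not a directory or the target already exists.
backup_dir() (
  src=${1%/}
  dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
  if [ ! -d "$src" ]; then echo "not a directory: $src" >&2; exit 1; fi
  if [ -e "$dest" ]; then echo "already exists: $dest" >&2; exit 1; fi
  cp -a "$src" "$dest" && echo "$dest"
)
```

Calling `backup_dir /etc/kubernetes` before each upgrade leaves one dated copy per run, which also makes it obvious which backup belongs to which change.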
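2.2 Upgrade procedure (patch release, e.g. 1.28.5 → 1.28.8)
Steps:
- Upgrade the master nodes (the first one first; verify it before touching the others)
- Upgrade the worker nodes (one at a time)
- Verify
On master1: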
| # 1. Back up the current binaries
mkdir -p /usr/local/bin/bak
mv /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} /usr/local/bin/bak
# 2. Download the new release
wget https://dl.k8s.io/v1.28.8/kubernetes-server-linux-amd64.tar.gz
# (in production, pull it from an internal file server instead)
# 3. Extract into /usr/local/bin
tar -xf kubernetes-server-linux-amd64.tar.gz \
  --strip-components=3 -C /usr/local/bin \
  kubernetes/server/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy}
# 4. Restart the components
systemctl restart kube-apiserver.service
systemctl restart kube-controller-manager.service
systemctl restart kube-scheduler.service
systemctl restart kubelet.service
systemctl restart kube-proxy.service
# 5. Verify
kubectl version
kubectl get nodes
|
Other master nodes (master2 / master3):
| # Upgrade the remaining masters in one loop (run from master1 once it has been verified)
for NODE in master2 master3; do
  echo "===== $NODE ====="
  ssh "$NODE" "mkdir -p /usr/local/bin/bak && mv /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} /usr/local/bin/bak"
  scp /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} "$NODE":/usr/local/bin/
  ssh "$NODE" "systemctl restart kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy"
done
|
Worker nodes:
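| # Workers only run kubelet + kube-proxy
for NODE in worker1 worker2 worker3; do
  echo "===== $NODE ====="
  # cordon + drain first (see the next section)
  kubectl cordon "$NODE"
  # --delete-emptydir-data replaces --delete-local-data, deprecated since kubectl 1.20
  kubectl drain "$NODE" --delete-emptydir-data --ignore-daemonsets --force
  # Upgrade
  ssh "$NODE" "mkdir -p /usr/local/bin/bak && mv /usr/local/bin/kube{let,-proxy} /usr/local/bin/bak"
  scp /usr/local/bin/kube{let,-proxy} "$NODE":/usr/local/bin/
  ssh "$NODE" "systemctl restart kubelet kube-proxy"
  # Give the kubelet time to re-register, check that the node is Ready, then uncordon
  sleep 30
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=120s
  kubectl uncordon "$NODE"
done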
|
2.3 Post-upgrade verification
| # Every node should report the same version
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# master1 Ready control-plane 50d v1.28.8
# master2 Ready control-plane 50d v1.28.8
# master3 Ready control-plane 50d v1.28.8
# worker1 Ready <none> 30d v1.28.8
# worker2 Ready <none> 30d v1.28.8
# Cluster health (componentstatuses is deprecated since 1.19; /readyz is the current check)
kubectl get cs
kubectl get --raw='/readyz?verbose'
kubectl get pods -A | grep -v Running
|
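Reading the node table by eye gets tedious once the cluster grows. As a rough sketch, the check can be scripted by feeding `kubectl get nodes --no-headers` into a filter. `check_nodes` is a hypothetical helper; it assumes the default column order NAME, STATUS, ROLES, AGE, VERSION.

```shell
# stdin: output of `kubectl get nodes --no-headers`
# $1:    the version every node is expected to run, e.g. v1.28.8
# Prints "<name> <status> <version>" for each node that is not exactly "Ready"
# or runs a different version; the exit status is non-zero if any was printed.
check_nodes() {
  awk -v want="$1" '
    $2 != "Ready" || $5 != want { print $1, $2, $5; bad = 1 }
    END { exit bad + 0 }'
}
```

Used as `kubectl get nodes --no-headers | check_nodes v1.28.8`. A node that is still cordoned shows up too, because its status reads `Ready,SchedulingDisabled`.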
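3. Node isolation: cordon / drain / uncordon
kubectl provides three commands that move a node in and out of scheduling:
| Command | Effect | Evicts Pods? |
|---|---|---|
| kubectl cordon <node> | Marks the node unschedulable | No |
| kubectl drain <node> | cordon + evicts the Pods on the node (DaemonSet-managed Pods stay) | Yes |
| kubectl uncordon <node> | Marks the node schedulable again | No |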
3.1 Standard procedure before maintenance
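| # 1. cordon: stop new Pods from being scheduled onto the node
kubectl cordon master1
# node/master1 cordoned
# 2. drain: evict the Pods that are already running
kubectl drain master1 --delete-emptydir-data --ignore-daemonsets --force
# --delete-emptydir-data: also evict Pods that use emptyDir volumes (their data is lost);
#   it replaces --delete-local-data, deprecated since kubectl 1.20
# --ignore-daemonsets: skip DaemonSet-managed Pods (kube-proxy / calico), which drain cannot evict
# --force: also delete Pods that no controller manages (they are not recreated elsewhere)
# 3. Maintenance: kernel upgrade / network changes / etcd restart
...
# 4. uncordon: put the node back into service
kubectl uncordon master1
# node/master1 uncordoned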
|
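`kubectl drain --force` also deletes Pods that no controller manages, and those are not recreated on another node. Before draining, it can be worth listing them. The sketch below is one possible way to do that: `unmanaged_pods` is a hypothetical helper, and it relies on `custom-columns` printing `<none>` when a Pod has no `ownerReferences`.

```shell
# stdin: lines of "<namespace> <pod> <owner-kind>", for example produced by
#   kubectl get pods -A --no-headers --field-selector spec.nodeName=worker2 \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
# Prints "<namespace>/<pod>" for every Pod without an owner.
unmanaged_pods() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

An empty result means every Pod on the node has an owner (ReplicaSet, StatefulSet, DaemonSet, Job and so on) and will either be rescheduled or is handled by `--ignore-daemonsets`.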
3.2 Listing all Pods on a node
| # All Pods on one node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker2
# Bulk delete (for leftovers after a drain); --no-headers keeps the header row out of xargs
kubectl get pods --all-namespaces --no-headers --field-selector spec.nodeName=worker2 | \
  grep -v "kube-system" | awk '{print $1,$2}' | \
  xargs -n2 kubectl delete pod -n
|
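4. Force-deleting a stuck Pod
Symptom: after kubectl delete pod xxx, the Pod stays in Terminating (for more than 5 minutes).
Causes:
- A container in the Pod does not respond to SIGTERM
- A finalizer on the Pod is stuck
- The kubelet cannot communicate with the apiserver
Fix: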
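| # Force delete (skips the graceful termination period; this only removes the API object,
# the container may keep running until the kubelet cleans it up)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
# Still stuck? Check for finalizers
kubectl get pod <pod> -n <ns> -o yaml | grep -A3 finalizers
# Remove the finalizers, for example with kubectl edit
kubectl edit pod <pod> -n <ns>
# delete the metadata.finalizers field and save
# Delete again
kubectl delete pod <pod> -n <ns> --force --grace-period=0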
|
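Removing finalizers does not have to go through an interactive editor. A non-interactive variant, useful in scripts, is a merge patch that sets the field to `null`. This is a sketch: `strip_finalizers` is a hypothetical wrapper name, and skipping a finalizer also skips whatever cleanup it was guarding, so treat it as a last resort.

```shell
# Remove all finalizers from a Pod with a JSON merge patch.
# $1: pod name, $2: namespace
strip_finalizers() {
  kubectl patch pod "$1" -n "$2" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

Usage would be `strip_finalizers <pod> <ns>`, followed by the delete command if the Pod does not disappear on its own.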
5. Backup and restore
5.1 etcd backup
| # Run on each etcd node (or on one host that can reach every etcd member);
# snapshot save talks to exactly one endpoint per invocation
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ssl/etcd-ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem
# Verify
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-snapshot-20231215.db --write-out=table
|
Scheduled backup (crontab):
| 0 2 * * * /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
|
| #!/bin/bash
# /usr/local/bin/etcd-backup.sh
set -euo pipefail  # stop before pruning if the snapshot fails
BACKUP_DIR=/var/backups/etcd
mkdir -p "$BACKUP_DIR"
# % only needs a backslash inside a crontab line, not inside a script
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ssl/etcd-ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem
# Keep 7 days of snapshots
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +7 -delete
|
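A scheduled job can fail quietly for weeks, so it is worth checking that a recent snapshot actually exists. The following sketch does that. `backup_is_fresh` is a hypothetical helper; the `etcd-*.db` naming follows the backup script, and `find -mmin` is assumed to be available (GNU and BSD find have it).

```shell
# Succeed only if the directory holds at least one non-empty etcd-*.db file
# modified within the last N hours.
# $1: backup directory, $2: maximum age in hours (default 26: daily job plus slack)
backup_is_fresh() (
  dir=$1
  max_hours=${2:-26}
  recent=$(find "$dir" -name 'etcd-*.db' -type f -size +0 \
             -mmin "-$((max_hours * 60))" | head -n 1)
  [ -n "$recent" ]
)
```

Something like `backup_is_fresh /var/backups/etcd || echo "etcd backup is stale"` could then run from monitoring or from a second cron entry.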
5.2 etcd restore
| # 1. Stop the apiserver on every master (so no new data is written)
for NODE in master1 master2 master3; do
  ssh "$NODE" "systemctl stop kube-apiserver"
done
# 2. On each etcd node: stop etcd, then restore from the same snapshot file
#    (use that node's own --name and peer URL)
systemctl stop etcd
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --name master1 \
  --initial-cluster master1=https://192.168.139.133:2380,... \
  --initial-advertise-peer-urls https://192.168.139.133:2380 \
  --data-dir /var/lib/etcd-restore
# 3. Point etcd at the new data-dir
# ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
# in that config file, change data-dir: /var/lib/etcd → /var/lib/etcd-restore
# 4. Start etcd
systemctl start etcd
# 5. Start the apiserver
for NODE in master1 master2 master3; do
  ssh "$NODE" "systemctl start kube-apiserver"
done
# 6. Verify
kubectl get nodes
|
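6. Changing a master node's IP
When a master node's IP changes (for example after a data-center migration), you need to:
- Replace every occurrence of the IP under /etc/kubernetes
- Replace the IP in /etc/etcd/etcd.config.yml
- Update the kube-apiserver systemd unit
- Reissue the certificates (the new IP must be in each certificate's hostname list)
- Restart all components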
| # 1. Replace the IP in the config files (dots escaped so they match literally)
oldip=192.168.0.121
newip=192.168.133.121
oldre=${oldip//./\\.}
grep -rl "$oldre" /etc/kubernetes /etc/etcd/etcd.config.yml
find /etc/kubernetes -type f -exec sed -i "s/$oldre/$newip/g" {} +
sed -i "s/$oldre/$newip/g" /etc/etcd/etcd.config.yml
# 2. Reissue all certificates (with the new IP)
cd /data/softs/pki
# -initca creates a brand-new CA, which invalidates every certificate signed by the old one;
# skip this line to keep the existing CA
cfssl gencert -initca etcd-ca-csr.json | cfssljson -bare /etc/etcd/ssl/etcd-ca
cfssl gencert -ca=/etc/etcd/ssl/etcd-ca.pem -ca-key=/etc/etcd/ssl/etcd-ca-key.pem \
  -config=ca-config.json \
  -hostname=127.0.0.1,master1,$newip \
  -profile=kubernetes etcd-csr.json | cfssljson -bare /etc/etcd/ssl/etcd
# ... issue apiserver / front-proxy / controller-manager / scheduler / admin / kube-proxy the same way
# 3. Regenerate the kubeconfig files
kubectl config set-cluster kubernetes --certificate-authority=/etc/kubernetes/pki/ca.pem \
  --embed-certs=true --server=https://$newip:6443 \
  --kubeconfig=/etc/kubernetes/admin.kubeconfig
# ... redo controller-manager / scheduler / kube-proxy / bootstrap as well
# 4. Restart (etcd first, since its config changed too)
systemctl daemon-reload
systemctl restart etcd
systemctl restart kube-apiserver kube-controller-manager kube-scheduler
systemctl restart kubelet kube-proxy
|
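A plain `s/old/new/g` substitution treats each dot in the address as "any character" and also rewrites longer addresses that merely start with the old one. The sketch below shows a stricter replacement. `replace_ip` is a hypothetical helper, and it assumes GNU sed (for `-i` without a suffix and for the `\b` word-boundary anchor).

```shell
# Replace one IPv4 address with another in the given files.
# Dots in the old address are escaped so they match literally, and \b keeps
# the match from firing inside a longer number such as 192.168.0.1211.
# $1: old IP, $2: new IP, remaining arguments: files to edit in place
replace_ip() (
  old=$1; new=$2; shift 2
  pattern=$(printf '%s' "$old" | sed 's/\./\\./g')
  sed -i "s/\\b${pattern}\\b/${new}/g" "$@"
)
```

For instance, `replace_ip 192.168.0.121 192.168.133.121 /etc/etcd/etcd.config.yml` changes `https://192.168.0.121:2380` but leaves a line containing `192.168.0.1211` alone.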
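7. Troubleshooting checklist
7.1 Pod troubleshooting
| Symptom | Command to check | Likely cause |
|---|---|---|
| Pending | kubectl describe pod, read Events | Insufficient resources / node selector mismatch / PVC not bound |
| ImagePullBackOff | kubectl describe pod | Image pull failed (bad private-registry credentials / wrong image name / network problem) |
| CrashLoopBackOff | kubectl logs --previous | Container exits right after starting (bad application config / OOMKilled) |
| ContainerCreating | kubectl describe pod + kubectl get events | Volume mount failed / CNI not ready |
| Error | kubectl logs | Startup command failed / bad config file |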
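The table can be turned into a small lookup that prints the first command to run for a given Pod status. This is only an illustrative sketch: `first_check` is a hypothetical helper, it covers just the statuses listed in the table, and it falls back to `kubectl describe pod` for anything else.

```shell
# Print the first diagnostic command for a Pod in the given status.
# $1: status as shown by `kubectl get pods`, $2: pod name, $3: namespace
first_check() {
  case "$1" in
    CrashLoopBackOff) echo "kubectl logs --previous $2 -n $3" ;;
    Error)            echo "kubectl logs $2 -n $3" ;;
    Pending|ImagePullBackOff|ContainerCreating)
                      echo "kubectl describe pod $2 -n $3" ;;
    *)                echo "kubectl describe pod $2 -n $3" ;;
  esac
}
```

It prints the command rather than running it, so the output can be reviewed or pasted into a terminal.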
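7.2 Node troubleshooting
| Symptom | Command to check | Likely cause |
|---|---|---|
| NotReady | kubectl describe node | kubelet not running / network unreachable / disk full |
| Memory pressure | kubectl describe node | Node is low on memory; node-pressure eviction removes Pods |
7.3 Network troubleshooting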
| # Containers in the same Pod (they share one network namespace, so use localhost)
kubectl exec -it pod1 -c c1 -- curl http://localhost:<port>
# Pods on the same node
kubectl exec -it pod1 -- ping <pod2-ip>
# Pods on different nodes (ping a Pod that runs on another node)
kubectl exec -it pod1 -- ping <pod-ip-on-node2>
# Service DNS resolution
kubectl exec -it pod1 -- nslookup kubernetes.default
# Service access
kubectl exec -it pod1 -- curl http://<service-ip>:<port>
|
7.4 Performance problems
| # Resource usage (kubectl top needs metrics-server)
kubectl top node
kubectl top pod -A
# Busiest Pods
kubectl top pod -A --sort-by=memory | head -10
kubectl top pod -A --sort-by=cpu | head -10
# Why is a Pod Pending?
kubectl get pods -A | grep Pending
kubectl describe pod <pending-pod> -n <ns>
# kube-proxy mode (iptables vs ipvs); this ConfigMap exists on kubeadm-built clusters
kubectl get cm kube-proxy -n kube-system -o yaml | grep mode
|
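Grepping for a single status misses everything else that is unhealthy. A quick overview of how many Pods are in each status can be sketched as follows. `pods_by_status` is a hypothetical helper, and it assumes the `-A` column order NAMESPACE, NAME, READY, STATUS.

```shell
# stdin: output of `kubectl get pods -A --no-headers`
# Prints "<count> <status>" lines, most frequent status first.
pods_by_status() {
  awk '{ count[$4]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Run as `kubectl get pods -A --no-headers | pods_by_status`; anything other than Running and Completed in the result is a candidate for `kubectl describe pod`.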
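8. Common pitfalls
- Skipping minor versions: upgrade one minor version at a time (1.27 → 1.28 → 1.29); do not jump from 1.26 straight to 1.28
- Forgetting to uncordon after cordon: the node stays unschedulable until someone uncordons it
- Running drain without --ignore-daemonsets: drain stops with an error as soon as it finds DaemonSet-managed Pods (calico / kube-proxy), so the node is never fully drained
- No scheduled etcd backups: nothing to restore from when disaster strikes
- Missing hostname in a certificate: kubectl fails with x509 errors after a node migration
- Stale kubeconfig: kubectl still connects to the old IP after a node migration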
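9. Prerequisites / next steps
Prerequisites:
Next:
- K8s resource limits and probes (2022-03-15): Pod tuning
- One-click deployment with kubeadm (2022-06-15): production deployment in practice
References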