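Upgrade vs. Uninstall
The two most complex operations in a K8s cluster's lifecycle:
- Upgrade: a patch release (1.28.5 → 1.28.8) is a near drop-in binary swap; a minor release (1.28 → 1.29) requires reviewing changes to component flags
- Uninstall: remove every K8s file, systemd unit, container runtime, and certificate from all masters and workers
Applies to: upgrading 1.28.5 → 1.28.8 / uninstalling K8s 1.28.5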
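1. Upgrade strategy
1.1 Patch vs. minor vs. major
- Patch upgrade (1.28.x → 1.28.y): swap the binaries and restart the components
- Minor upgrade (1.28 → 1.29): check for deprecated APIs (e.g. policy/v1beta1), CRD schema changes, and component flag changes
- Major upgrade (1.x → 2.x): not officially supported; a reinstall is required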
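The three upgrade classes above can be told apart mechanically from the two version strings. The following is a minimal sketch, not part of the original procedure: `classify_upgrade` is a hypothetical helper name, and it assumes plain `MAJOR.MINOR.PATCH` versions with an optional leading `v` (no pre-release suffixes).

```bash
# classify_upgrade CURRENT TARGET
# Prints one of: patch, minor, major, same, downgrade.
# The body runs in a subshell, so IFS and the variables stay local.
classify_upgrade() (
  cur=${1#v}
  tgt=${2#v}
  set -f   # no globbing while splitting on dots
  IFS=.
  set -- $cur
  c_major=${1:-0} c_minor=${2:-0} c_patch=${3:-0}
  set -- $tgt
  t_major=${1:-0} t_minor=${2:-0} t_patch=${3:-0}

  # The first component that differs decides the class.
  level=same
  if [ "$t_major" -ne "$c_major" ]; then
    level=major old=$c_major new=$t_major
  elif [ "$t_minor" -ne "$c_minor" ]; then
    level=minor old=$c_minor new=$t_minor
  elif [ "$t_patch" -ne "$c_patch" ]; then
    level=patch old=$c_patch new=$t_patch
  fi

  if [ "$level" = same ]; then
    echo same
  elif [ "$new" -gt "$old" ]; then
    echo "$level"
  else
    echo downgrade
  fi
)
```

For the versions used in this article, `classify_upgrade 1.28.5 1.28.8` prints `patch`, which is the only class the rest of the procedure is meant for.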
1.2 Upgrade order
```
master1 → master2 → master3 → worker1 → worker2 → ... → workerN
```
Upgrade etcd first, then the K8s control-plane components (apiserver → controller-manager → scheduler), and the workers last.
1.3 Current environment
| Component | Current | Target |
|---|---|---|
| etcd | 3.5.11 | 3.5.13 |
| kubernetes-server | 1.28.5 | 1.28.8 |
| coredns | 1.29.0 | 1.29.0 (unchanged) |
2. Upgrade etcd
```bash
# Back up
mkdir -p /data/etcd-bak
etcdctl snapshot save /data/etcd-bak/etcd-snapshot-$(date +%Y%m%d).db

# Unpack the new release
cd /data/softs
tar -xf etcd-v3.5.13-linux-amd64.tar.gz

# master1: stop etcd, keep the old binaries, install the new ones, start again
systemctl stop etcd.service
mv /usr/local/bin/etcd* /data/etcd-bak/
mv etcd-v3.5.13-linux-amd64/{etcd,etcdctl,etcdutl} /usr/local/bin/
systemctl start etcd.service

# master2 / master3: one member at a time, so the cluster keeps quorum
for NODE in master2 master3; do
  ssh "root@$NODE" "mkdir -p /data/etcd-bak && systemctl stop etcd.service && mv /usr/local/bin/etcd* /data/etcd-bak/"
  scp /usr/local/bin/etcd* "root@$NODE:/usr/local/bin/"
  ssh "root@$NODE" "systemctl start etcd.service"
done

# Verify
etcdctl version
# etcdctl version: 3.5.13
```
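When etcd members are restarted one after another, it helps to wait until the member that was just restarted is healthy again before touching the next one. The following is a minimal sketch of a generic retry helper for that; `retry` is a hypothetical name, and the `etcdctl endpoint health` call in the comment may need `--endpoints` and TLS flags that depend on your setup.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  _retry_max=$1
  _retry_delay=$2
  shift 2
  _retry_n=1
  while ! "$@"; do
    if [ "$_retry_n" -ge "$_retry_max" ]; then
      echo "gave up after $_retry_n attempts: $*" >&2
      return 1
    fi
    _retry_n=$((_retry_n + 1))
    sleep "$_retry_delay"
  done
}

# Example (not executed here): wait up to 30 s for the local member.
# retry 10 3 etcdctl endpoint health
```

Placed right after each `systemctl start etcd.service`, this would stop the rollout as soon as one member fails to come back, instead of taking a second member down.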
3. Upgrade the K8s control plane
3.1 Upgrade master1
```bash
# Back up
mkdir -p /usr/local/bin/bak
mv /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} /usr/local/bin/bak
# Install the new version
cd /data/softs
tar -xf kubernetes-server-linux-amd64.tar.gz \
  --strip-components=3 \
  -C /usr/local/bin \
  kubernetes/server/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy}
# Restart all master components
systemctl restart kube-apiserver.service
systemctl restart kube-controller-manager.service
systemctl restart kube-scheduler.service
systemctl restart kubelet.service
systemctl restart kube-proxy.service
# Verify
kubectl get node
# master1 should still be Ready and now report VERSION v1.28.8;
# nodes that have not been upgraded yet still report v1.28.5
```
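The tarball name `kubernetes-server-linux-amd64.tar.gz` carries no version, so it is easy to unpack the wrong release. A minimal sketch of a guard that can run before the restarts: `assert_binary_version` is a hypothetical helper, and it relies on the component binaries printing a single line of the form `Kubernetes vX.Y.Z` for `--version`.

```bash
# assert_binary_version EXPECTED BINARY
# Succeeds only if `BINARY --version` prints exactly "Kubernetes vEXPECTED".
assert_binary_version() {
  _abv_out=$("$2" --version) || return 1
  case $_abv_out in
    "Kubernetes v$1")
      return 0
      ;;
    *)
      echo "$2: expected v$1, got: $_abv_out" >&2
      return 1
      ;;
  esac
}

# Example (not executed here):
# assert_binary_version 1.28.8 /usr/local/bin/kubelet
```

`kubectl` is left out on purpose, because its version output has a different format.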
3.2 Upgrade master2 + master3
```bash
for NODE in master2 master3; do
  echo "=== $NODE ==="
  ssh root@$NODE "mkdir -p /usr/local/bin/bak && \
    mv /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} /usr/local/bin/bak"
  scp /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy} \
    root@$NODE:/usr/local/bin/
  ssh root@$NODE "systemctl restart kube-apiserver.service && \
    systemctl restart kube-controller-manager.service && \
    systemctl restart kube-scheduler.service && \
    systemctl restart kubelet.service && \
    systemctl restart kube-proxy.service"
done
```
3.3 Upgrade the workers
```bash
for NODE in worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9; do
  echo "=== $NODE ==="
  ssh root@$NODE "mkdir -p /usr/local/bin/bak && \
    mv /usr/local/bin/kube{let,-proxy} /usr/local/bin/bak"
  scp /usr/local/bin/kube{let,-proxy} root@$NODE:/usr/local/bin/
  ssh root@$NODE "systemctl restart kubelet.service && \
    systemctl restart kube-proxy.service"
done
```
3.4 Verify
```bash
kubectl get node
# All nodes Ready, VERSION shows v1.28.8
```
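With a dozen nodes, checking the VERSION column by eye is error-prone. The following is a minimal sketch that filters the node list instead; `stale_nodes` is a hypothetical helper and assumes the default column layout of `kubectl get node --no-headers` (NAME STATUS ROLES AGE VERSION, with VERSION last).

```bash
# stale_nodes TARGET_VERSION
# Reads `kubectl get node --no-headers` output on stdin and prints
# "NAME STATUS VERSION" for every node that is not exactly Ready or whose
# VERSION differs from TARGET_VERSION. Returns 1 if any such node exists.
stale_nodes() {
  awk -v want="v${1#v}" '
    $2 != "Ready" || $NF != want { print $1, $2, $NF; bad = 1 }
    END { exit bad + 0 }
  '
}

# Example (not executed here):
# kubectl get node --no-headers | stale_nodes 1.28.8
```

A cordoned node reports `Ready,SchedulingDisabled` and is therefore listed too, which is a useful reminder to uncordon it.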
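4. Isolating a node during the upgrade
Before upgrading a master, mark it unschedulable so that no new Pods are scheduled onto a master that is restarting:
```bash
# Isolate
kubectl cordon master1
kubectl drain master1 --delete-emptydir-data --ignore-daemonsets --force
systemctl restart kube-apiserver.service
# ...
# Restore
kubectl uncordon master1
```
drain evicts every non-DaemonSet Pod so that its controller recreates it on another node. --delete-emptydir-data lets drain proceed for Pods that use emptyDir volumes; the emptyDir contents are deleted, which is usually safe because emptyDir is scratch space.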
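The drain / work / uncordon sequence can be wrapped so that the uncordon step is never forgotten. This is a minimal sketch under stated assumptions: `with_drained_node` is a hypothetical helper, the `KUBECTL` override exists only to preview the calls, and leaving the node cordoned when the wrapped command fails is a design choice of this sketch, not something the procedure above prescribes.

```bash
# with_drained_node NODE COMMAND [ARGS...]
# Drains NODE (drain cordons it as well), runs COMMAND, and uncordons NODE
# only if COMMAND succeeded. Set KUBECTL=echo to print the kubectl calls
# instead of running them.
with_drained_node() {
  _wdn_node=$1
  shift
  _wdn_kubectl=${KUBECTL:-kubectl}
  "$_wdn_kubectl" drain "$_wdn_node" --ignore-daemonsets --delete-emptydir-data --force || return 1
  if "$@"; then
    "$_wdn_kubectl" uncordon "$_wdn_node"
  else
    echo "command failed; $_wdn_node stays cordoned for inspection" >&2
    return 1
  fi
}

# Example (not executed here):
# with_drained_node master1 systemctl restart kube-apiserver.service
```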
5. Uninstalling the K8s cluster
5.1 Stopping the services
```bash
# 1. Check the state of every K8s service
systemctl status kube-apiserver.service
systemctl status kube-controller-manager.service
systemctl status kube-scheduler.service
systemctl status kubelet.service
systemctl status kube-proxy.service
# 2. Stop and disable each of them
for SVC in kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy; do
  systemctl stop "$SVC.service"
  systemctl disable "$SVC.service"
done
```
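Because everything in section 5 is destructive, it can be worth previewing the commands on one node before running them for real. The following is a minimal sketch of a dry-run wrapper; `run`, `stop_k8s_services`, and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```bash
# run COMMAND [ARGS...]
# Executes COMMAND, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

K8S_SERVICES="kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy"

# Stops and disables every service listed in K8S_SERVICES.
stop_k8s_services() {
  for _svc in $K8S_SERVICES; do
    run systemctl stop "${_svc}.service"
    run systemctl disable "${_svc}.service"
  done
}

# Example (not executed here):
# DRY_RUN=1 stop_k8s_services   # preview
# stop_k8s_services             # real run
```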
5.2 Removing the systemd units
```bash
# Remove each unit file and its multi-user.target.wants symlink
for SVC in kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy; do
  rm -f "/etc/systemd/system/$SVC.service"
  rm -f "/etc/systemd/system/multi-user.target.wants/$SVC.service"
done
systemctl daemon-reload
```
5.3 Removing directories
```bash
rm -rf /etc/kubernetes
rm -rf /usr/local/bin/kube{let,ctl,-apiserver,-controller-manager,-scheduler,-proxy}
rm -rf /var/lib/kubelet
rm -rf /var/log/kubernetes
rm -rf /etc/cni/net.d
rm -rf /root/.kube
rm -rf ~/bootstrap/bootstrap.secret.yaml
```
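Deleting these directories cannot be undone, and the summary of this article recommends keeping backups such as admin.kubeconfig. A minimal sketch of archiving a directory before it is removed follows; `archive_before_delete` and the `/data/k8s-bak` destination in the example are hypothetical and not part of the original procedure.

```bash
# archive_before_delete SRC DEST_DIR
# Packs SRC into DEST_DIR/<basename>-<timestamp>.tar.gz and prints the
# archive path. A missing SRC is skipped without error.
archive_before_delete() {
  _abd_src=$1
  _abd_dest=$2
  if [ ! -e "$_abd_src" ]; then
    echo "skip: $_abd_src does not exist" >&2
    return 0
  fi
  mkdir -p "$_abd_dest" || return 1
  _abd_out="$_abd_dest/$(basename "$_abd_src")-$(date +%Y%m%d%H%M%S).tar.gz"
  tar -czf "$_abd_out" -C "$(dirname "$_abd_src")" "$(basename "$_abd_src")" || return 1
  echo "$_abd_out"
}

# Example (not executed here):
# archive_before_delete /etc/kubernetes /data/k8s-bak && rm -rf /etc/kubernetes
```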
5.4 Clearing environment variables
```bash
# Current shell only
unset KUBECONFIG
# Permanently
sudo sed -i '/KUBECONFIG/d' /etc/profile ~/.bashrc ~/.profile
source ~/.bashrc
```
5.5 Stopping and removing etcd
```bash
systemctl status etcd.service
systemctl stop etcd.service
systemctl disable etcd.service
rm -rf /etc/systemd/system/etcd.service
rm -rf /usr/local/bin/etcd*
rm -rf /etc/etcd
rm -rf /var/lib/etcd
systemctl daemon-reload
```
5.6 Stopping and removing the container runtime
```bash
# Stop everything Docker-related
systemctl stop docker.service
systemctl disable docker.service
systemctl stop cri-docker.socket
systemctl disable cri-docker.socket
systemctl stop cri-docker.service
systemctl disable cri-docker.service
systemctl stop containerd.service
systemctl disable containerd.service
# Remove systemd units, configuration, binaries, data, and sockets
rm -rf /etc/systemd/system/{docker.service,docker.socket,cri-docker.service,cri-docker.socket,containerd.service}
rm -rf /etc/docker
rm -rf /usr/local/bin/{containerd*,docker*,runc,cri-dockerd,ctr}
rm -rf /var/lib/docker
rm -rf /var/lib/containerd
rm -rf /run/cri-dockerd.sock
rm -rf /var/run/docker.sock
systemctl daemon-reload
```
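5.7 Other cleanup
```bash
# Networking (IPIP mode / CNI)
modprobe -r ipip 2>/dev/null
ip link delete tunl0 2>/dev/null
rm -rf /opt/cni/bin
rm -rf /var/lib/cni
rm -rf /var/lib/calico 2>/dev/null
# bootstrap directory
rm -rf ~/bootstrap
# Old images: this only works while dockerd is still running, so run it before 5.6;
# after 5.6 the images are already gone together with /var/lib/docker
docker system prune -af 2>/dev/null
```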
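5.8 Verify
```bash
# Processes
ps -ef | grep -E "kube|etcd|docker|containerd" | grep -v grep
# Expect no output
# Ports (10257 / 10259 are the controller-manager / scheduler secure ports in 1.28)
ss -tlnp | grep -E "6443|2379|2380|10250|10251|10252|10255|10257|10259"
# Expect no output
# Leftover binaries
ls /usr/local/bin | grep -E "kube|etcd|containerd|docker"
# Expect no output
```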
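The "expect no output" checks above are easy to skip past when run across many nodes. As a minimal sketch, the leftover-binary check can be turned into a function with an exit status so a loop over nodes can fail loudly; `leftover_binaries` is a hypothetical helper, and the name list mirrors the binaries removed in sections 5.3, 5.5, and 5.6.

```bash
# leftover_binaries DIR
# Prints Kubernetes / etcd / container-runtime binaries still present in DIR.
# Returns 1 if any are found, 0 if DIR is clean or does not exist.
leftover_binaries() {
  _lb_found=$(ls "$1" 2>/dev/null \
    | grep -E '^(kube|etcd|containerd|docker|cri-dockerd)|^(runc|ctr)$' || true)
  if [ -n "$_lb_found" ]; then
    printf '%s\n' "$_lb_found"
    return 1
  fi
}

# Example (not executed here):
# leftover_binaries /usr/local/bin || echo "cleanup incomplete" >&2
```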
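6. Last-resort cleanup for namespaces stuck in Terminating
If a namespace stays in Terminating during the uninstall (typical culprits: Rook-Ceph, cert-manager), clear its finalizers through the finalize subresource. The curl call goes through kubectl proxy, so the proxy has to be running first:
```bash
# Generic method
# kubectl proxy listens on 127.0.0.1:8001 by default; run it in the background or in another terminal
kubectl proxy --port=8001 &
ns=rook-ceph
kubectl get namespace "$ns" -o json | jq '.spec = {"finalizers":[]}' > temp.json
curl -H "Content-Type: application/json" -X PUT \
  --data-binary @temp.json "http://127.0.0.1:8001/api/v1/namespaces/$ns/finalize"
```
Or, when the blocking finalizers sit in metadata.finalizers rather than spec.finalizers (this variant does not need the proxy):
```bash
kubectl patch ns rook-ceph --type=merge -p '{"metadata":{"finalizers":null}}'
```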
6.1 PVC stuck in Terminating
```bash
kubectl patch pvc jenkins-agent-pvc -n kube-ops --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl patch pv pvc-c87554c7-6b20-42ed-8cdd-8be1d630a744 \
  -p '{"metadata":{"finalizers":null}}' --type=merge
```
6.2 Pod stuck in Terminating
```bash
kubectl patch pod nacos-0 -n safety-xjprod \
  --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete po <pod> -n <ns> --grace-period=0 --force
```
6.3 Force-deleting namespaces in the background
```bash
nohup kubectl delete ns cattle-impersonation-system --grace-period=0 --force > /dev/null 2>&1 &
nohup kubectl delete ns cattle-system --grace-period=0 --force > /dev/null 2>&1 &
```
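Before force-deleting anything, it helps to know exactly which namespaces are stuck. The following is a minimal sketch that extracts them from the namespace list; `terminating_namespaces` is a hypothetical helper and assumes the default columns of `kubectl get ns --no-headers` (NAME STATUS AGE).

```bash
# terminating_namespaces
# Reads `kubectl get ns --no-headers` output on stdin and prints the name
# of every namespace whose STATUS column is Terminating.
terminating_namespaces() {
  awk '$2 == "Terminating" { print $1 }'
}

# Example (not executed here):
# kubectl get ns --no-headers | terminating_namespaces
```

The resulting names can then be fed one by one into the finalizer cleanup from section 6.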
6.4 KubeSphere and other third-party components
```bash
# List the related CRDs and RBAC objects
kubectl get crd,clusterrole,clusterrolebinding | grep kubesphere
# Edit the cluster CRD: delete metadata.finalizers, then save
kubectl edit crd clusters.cluster.kubesphere.io
# Delete the namespace in the background
nohup kubectl delete namespace kubesphere-system --force --grace-period=0 > /dev/null 2>&1 &
```
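7. Troubleshooting quick reference
| Symptom | Cause | Fix |
|---|---|---|
| kubectl get node still shows the old version after the upgrade | kubelet on that node was not replaced or not restarted | Run kubelet --version on every node and restart kubelet where needed |
| etcdctl: this member is not a leader | The leader was not upgraded first | Upgrade the leader first; restarting it triggers a new election automatically |
| kubectl drain reports "cannot delete Pods with local storage" | Pods on the node use emptyDir volumes | Add --delete-emptydir-data --force |
| kubelet is still present after the uninstall | daemon-reload was not run | Run systemctl daemon-reload, then systemctl reset-failed |
| The node is still listed in the cluster | kubectl delete node was not run | kubectl delete node <name> |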
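8. Summary
Upgrading and uninstalling are the "bottom of the toolbox" operations for a K8s cluster:
- Patch upgrades only swap binaries, about 2 minutes per node
- Uninstall order: stop services → remove units → remove directories → clear environment variables
- Finalizers are the most common reason a namespace gets stuck; patching them away usually resolves it
- Keeping backups (etcd snapshot, admin.kubeconfig) never hurts
Next: scaling out K8s nodes with a one-shot script for joining workers to the cluster.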