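Why these add-ons are needed
A freshly installed ("bare") K8s 1.28 cluster has a working control plane and registered nodes, but it cannot run real workloads yet (nodes typically stay NotReady until a CNI plugin is installed). Three pieces are still missing:
- Network plugin (Calico / Flannel / Cilium): assigns Pod IPs and provides cross-node Pod connectivity, plus Network Policy in the case of Calico and Cilium
- DNS add-on (CoreDNS): lets Pods resolve in-cluster services by Service name
- Metrics add-on (Metrics-server): serves the metrics API used by kubectl top and HPA
This article deploys the combination this series treats as the most stable for 1.28: Calico v3.29 + CoreDNS 1.10 + Metrics-server v0.6.4.
Versions covered: Calico v3.29.3 / CoreDNS 1.10.1 / Metrics-server v0.6.4 / K8s 1.28.5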
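1. Calico network plugin
1.1 Choosing an installation method
Calico supports two deployment methods:
- Operator mode (recommended, K8s 1.20+): managed through tigera-operator
- Static manifests mode: apply the YAML manifests directly
Operator mode is the recommended choice for production. The commands below download calico.yaml, which is the static manifest, so the rest of this section follows the manifest route:
| mkdir -p /data/k8scnf/calico && cd /data/k8scnf/calico
# Download v3.29.3
curl -L -o calico.yaml https://raw.githubusercontent.com/projectcalico/calico/v3.29.3/manifests/calico.yaml
|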
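1.2 Key parameters
Once Calico is deployed, every Pod IP is allocated from the cluster CIDR, which is 172.218.0.0/12 in this cluster. In calico.yaml the pool is set through the CALICO_IPV4POOL_CIDR environment variable of calico-node.
The Pod CIDR must match the --cluster-cidr value passed to kube-controller-manager.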
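A quick sanity check on the CIDR itself can help before rolling it out. The following is a minimal sketch and not part of the original setup; cidr_network is a hypothetical helper name. It prints the canonical network address of an IPv4 CIDR using plain shell arithmetic. Applied to the value above, it shows that 172.218.0.0/12 is not aligned on a /12 boundary, so tools that normalise CIDRs would read it as 172.208.0.0/12. It is worth confirming which value the cluster was actually configured with.

```shell
# cidr_network: print the canonical network address of an IPv4 CIDR.
# Hypothetical helper for illustration; uses only POSIX shell arithmetic.
# The body runs in a subshell, so IFS and the variables do not leak.
cidr_network() (
  ip=${1%/*}        # address part, e.g. 172.218.0.0
  prefix=${1#*/}    # prefix length, e.g. 12
  IFS=.
  set -- $ip        # split the dotted quad into $1..$4
  addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  net=$(( addr & mask ))
  printf '%d.%d.%d.%d/%d\n' \
    $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
    $(( (net >> 8) & 255 ))  $(( net & 255 )) "$prefix"
)
```

Usage: cidr_network 172.218.0.0/12 prints 172.208.0.0/12. If the output differs from the input, the configured address has host bits set.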
1.3 Apply
| kubectl apply -f /data/k8scnf/calico/calico.yaml
|
1.4 Verify
| kubectl get pods -A
# NAMESPACE NAME READY STATUS RESTARTS AGE
# kube-system calico-kube-controllers-7ddc4f45bc-9j22g 1/1 Running 0 2m
# kube-system calico-node-b68zm 1/1 Running 0 2m
# kube-system calico-node-fstr6 1/1 Running 0 2m
# kube-system calico-node-xp8pp 1/1 Running 0 2m
|
The network layer is healthy once every Node has its own calico-node-* Pod in the Running state.
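The check above can be scripted. This is a small sketch, assuming the default column layout of kubectl get pods -A shown in the listing; calico_not_ready is a hypothetical helper name. It reads that output on stdin and prints every calico-node Pod that is not Running or not fully ready.

```shell
# calico_not_ready: read `kubectl get pods -A` output on stdin and print the
# name of every calico-node Pod that is not Running with all containers ready.
# Hypothetical helper; prints nothing when every calico-node Pod is healthy.
calico_not_ready() {
  awk '$2 ~ /^calico-node-/ {
         split($3, ready, "/")            # READY column, e.g. "1/1"
         if ($4 != "Running" || ready[1] != ready[2]) print $2
       }'
}
```

Usage, assuming kubectl is configured for the cluster: kubectl get pods -A | calico_not_ready. Empty output means no unhealthy calico-node Pod was found; it does not by itself prove that every Node has one, so compare the Pod count with kubectl get node as well.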
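1.5 IPIP / BGP mode
Calico encapsulates cross-node Pod traffic with IPIP (IP-in-IP) by default. If the data-center routers support BGP, you can peer with them and route Pod traffic without encapsulation, which removes the IPIP overhead:
| # calicoctl get bgpPeer
# or edit the calico-node DaemonSet
|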
1.6 Uninstalling Calico
| # Delete the K8s resources
kubectl delete -f calico.yaml
# On every node, check for leftover network state and unload the IPIP module
ip addr show | grep -i tunl # a tunl0 entry means IPIP is still loaded
modprobe -r ipip
# Remove CNI leftovers (note: /opt/cni/bin/* also removes the standard CNI plugin binaries)
rm -rf /etc/cni/net.d/*
rm -rf /opt/cni/bin/*
rm -rf /var/lib/cni/*
rm -rf /var/lib/calico
systemctl restart kubelet
|
1.7 Common problems
Error:
| felix/ipsets.go 569: Bad return code from 'ipset list'.
error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace."
|
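Cause: the kernel and the ipset userspace tool are incompatible. The ipset binary used by Felix (v7.11 in the message above) does not support revision 7 of the hash:ip,port set type that the kernel provides.
Fix: upgrade Calico to v3.29.3, which resolves the problem. Clusters still on v3.26.1 must upgrade. Mirror the new images into the private registry first: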
| docker pull docker.io/calico/node:v3.29.3
docker tag calico/node:v3.29.3 <private-registry>/base/calico/node:v3.29.3
docker push <private-registry>/base/calico/node:v3.29.3
# Repeat for the cni and kube-controllers images
|
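Repeating the three commands for each image is easy to get wrong by hand. As a sketch, the helper below prints the full command list for node, cni, and kube-controllers; mirror_calico_cmds is a hypothetical name, and the registry argument stands in for the <private-registry> placeholder above. It only prints, so the output can be reviewed before anything is run.

```shell
# mirror_calico_cmds: print the pull/tag/push commands for the three Calico
# images named in the text (node, cni, kube-controllers).
# Usage: mirror_calico_cmds <registry> <version>
# Hypothetical helper; it prints commands and does not call docker itself.
mirror_calico_cmds() (
  registry=$1
  version=$2
  for image in node cni kube-controllers; do
    echo "docker pull docker.io/calico/$image:$version"
    echo "docker tag calico/$image:$version $registry/base/calico/$image:$version"
    echo "docker push $registry/base/calico/$image:$version"
  done
)
```

Usage: run mirror_calico_cmds with your registry host and v3.29.3, check the nine printed lines, then pipe them to sh on a machine that can reach both registries.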
2. Deploying CoreDNS
2.1 Install Helm
| cd /data/softs
tar xvf helm-v3.13.3-linux-amd64.tar.gz
cp linux-amd64/helm /usr/local/bin/
rm -rf linux-amd64
helm version
# version.BuildInfo{Version:"v3.13.3", ...}
|
2.2 Add the repository
| helm repo add coredns https://coredns.github.io/helm
helm repo update
# Pull the chart locally for offline use
helm pull coredns/coredns
# Error: Get "...coredns-1.29.0.tgz": dial tcp 20.205.243.166:443: connect: connection refused
# Offline fallback: download the chart archive manually from the GitHub release, then unpack it
tar xvf coredns-1.29.0.tgz
mv /data/coredns /data/k8scnf
cd /data/k8scnf/coredns
|
2.3 Edit values.yaml
Set a fixed clusterIP for the CoreDNS Service:
| service:
  clusterIP: "10.96.0.10"
|
Point the image at a mirror registry inside China (run from the chart directory, /data/k8scnf/coredns):
| sed -i "s#registry.k8s.io/#registry.aliyuncs.com/google_containers/#g" values.yaml
|
2.4 Install
| helm install coredns ./ -n kube-system
|
2.5 Verify
| # Start a throwaway test Pod
kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools
# Inside the container
> host kubernetes
kubernetes.default.svc.cluster.local has address 10.96.0.1
|
2.6 kubectl apply method (alternative)
| # Template from the K8s 1.28.5 source tree
curl -L -o coredns.yaml https://raw.githubusercontent.com/kubernetes/kubernetes/v1.28.5/cluster/addons/dns/coredns/coredns.yaml.base
# Fill in the three placeholders and set the image
# __DNS__DOMAIN__ → cluster.local
# __DNS__MEMORY__LIMIT__ → 150Mi
# __DNS__SERVER__ → 10.96.0.10
# image: registry.aliyuncs.com/google_containers/coredns:v1.10.1
kubectl apply -f /data/k8scnf/coredns/coredns.yaml
|
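The placeholder edits listed above can be applied in one pass with sed. This is a sketch under stated assumptions: render_coredns is a hypothetical helper name, and the last expression assumes the template's image line starts with registry.k8s.io/coredns/coredns:, which should be checked against the downloaded file.

```shell
# render_coredns: read coredns.yaml.base on stdin and write the filled-in
# manifest to stdout, using the values listed in the text.
# Assumption: the upstream image line is "image: registry.k8s.io/coredns/coredns:<tag>".
render_coredns() {
  sed -e 's/__DNS__DOMAIN__/cluster.local/g' \
      -e 's/__DNS__MEMORY__LIMIT__/150Mi/g' \
      -e 's/__DNS__SERVER__/10.96.0.10/g' \
      -e 's#image: registry.k8s.io/coredns/coredns:#image: registry.aliyuncs.com/google_containers/coredns:#'
}
```

Usage, assuming the template was saved as coredns.yaml.base: render_coredns < coredns.yaml.base > /data/k8scnf/coredns/coredns.yaml, then grep the result for any remaining __DNS__ string before applying it.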
2.7 Performance tuning
| # Set the replica count
kubectl -n kube-system scale --replicas=3 deployment/coredns
# HPA (relies on the metrics API provided by Metrics-server, section 3)
cat << "EOF" | kubectl apply -f -
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  targetCPUUtilizationPercentage: 50
EOF
|
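To get a feel for what targetCPUUtilizationPercentage: 50 means in practice, the sketch below reproduces the core scaling rule from the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the min and max. It is a simplification; hpa_desired is a hypothetical helper, and it ignores the controller's tolerance band and stabilization behaviour.

```shell
# hpa_desired: approximate the replica count the HPA would ask for.
# Usage: hpa_desired <current-replicas> <current-cpu-%> <target-cpu-%> <min> <max>
# Simplified sketch: no tolerance band, no stabilization window.
hpa_desired() (
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceiling of $1 * $2 / $3
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
)
```

With the HPA above (target 50, min 2, max 10), three replicas averaging 90% CPU give hpa_desired 3 90 50 2 10, which prints 6. Utilization is measured relative to the containers' CPU requests, so the CoreDNS Deployment needs a CPU request for this HPA to work.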
2.8 Common problems
CoreDNS stays in CrashLoopBackOff:
| kubectl describe pods -n kube-system coredns-687c56b785-fz2mx
# Cause: /etc/resolv.conf lists the loopback address 127.0.0.53 as a nameserver, so CoreDNS forwards queries back to itself in a loop
|
Fix: on every node:
| sudo apt install -y unbound
sudo systemctl stop systemd-resolved
sudo systemctl disable systemd-resolved
sudo rm -rf /etc/resolv.conf
sudo touch /etc/resolv.conf
sudo systemctl daemon-reload
sudo systemctl restart docker.service
sudo systemctl restart kubelet
|
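Before and after applying the fix, it can be useful to confirm whether a node's resolv.conf still points at a loopback resolver. A minimal sketch, with loopback_nameservers as a hypothetical helper name:

```shell
# loopback_nameservers: read a resolv.conf on stdin and print every nameserver
# that points back at the local host (127.0.0.0/8 or ::1).
# Hypothetical helper; prints nothing when no loopback resolver is configured.
loopback_nameservers() {
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }'
}
```

Usage on a node: loopback_nameservers < /etc/resolv.conf. On a host still managed by systemd-resolved this prints 127.0.0.53. The CoreDNS loop-plugin troubleshooting notes also describe a less invasive route, pointing the kubelet's resolvConf setting at /run/systemd/resolve/resolv.conf, which may be worth considering where disabling systemd-resolved is not an option.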
3. Deploying Metrics-server
3.1 Download
| mkdir -p /data/k8scnf/metrics-server && cd /data/k8scnf/metrics-server
# HA manifest (for production)
curl -L -o high-availability.yaml \
https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.6.4/high-availability.yaml
|
3.2 Change the image and arguments
| # Switch the image to a mirror registry inside China
sed -i "s#registry.k8s.io/metrics-server#registry.cn-hangzhou.aliyuncs.com/google_containers#g" high-availability.yaml
|
Add --kubelet-insecure-tls (skips kubelet certificate verification; needed on clusters whose kubelet certificates are self-signed):
| spec:
  template:
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls
        image: registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server:v0.6.4
|
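Mount the host pki directory. With --kubelet-insecure-tls set, metrics-server does not verify kubelet certificates, so this mount only matters if that flag is later replaced by --kubelet-certificate-authority pointing at a CA file under this path:
| # under the metrics-server container
volumeMounts:
- mountPath: /tmp
  name: tmp-dir
- name: ca-ssl
  mountPath: /etc/kubernetes/pki
# under spec.template.spec
volumes:
- emptyDir: {}
  name: tmp-dir
- name: ca-ssl
  hostPath:
    path: /etc/kubernetes/pki
|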
Change apiVersion: policy/v1beta1 to policy/v1 (required on K8s 1.25+, where the policy/v1beta1 PodDisruptionBudget API has been removed):
| sed -i 's#policy/v1beta1#policy/v1#g' high-availability.yaml
|
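Because the manifest is edited in several places, a quick pre-flight check before applying it can catch a missed step. The sketch below covers three of the edits from this section (image mirror, --kubelet-insecure-tls, and the policy/v1 change); check_ms_manifest is a hypothetical helper and does plain text matching, not YAML parsing.

```shell
# check_ms_manifest: read a metrics-server manifest on stdin and report which of
# three edits from this section are still missing. Exits 0 when none are.
# Hypothetical helper; plain substring matching, not a YAML parser.
check_ms_manifest() (
  manifest=$(cat)
  status=0
  case $manifest in
    *registry.k8s.io/*) echo "image still points at registry.k8s.io"; status=1 ;;
  esac
  case $manifest in
    *--kubelet-insecure-tls*) ;;
    *) echo "--kubelet-insecure-tls is missing"; status=1 ;;
  esac
  case $manifest in
    *policy/v1beta1*) echo "policy/v1beta1 is still referenced"; status=1 ;;
  esac
  exit "$status"
)
```

Usage: check_ms_manifest < high-availability.yaml. No output and exit status 0 mean none of the three problems were found.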
3.3 Deploy
| kubectl apply -f /data/k8scnf/metrics-server/high-availability.yaml
|
3.4 Verify
| kubectl get pods -A | grep metrics
# kube-system metrics-server-5ff5cfd88f-j68v5 1/1 Running 0 15s
# kube-system metrics-server-5ff5cfd88f-zx47p 1/1 Running 0 15s
kubectl top node
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# master1 170m 4% 1706Mi 45%
# master2 186m 4% 1669Mi 44%
# master3 148m 3% 1641Mi 43%
kubectl top pod -A
# Shows CPU / memory for every Pod
|
3.5 Troubleshooting
kubectl top node reports "Metrics API not available":
| kubectl logs -n kube-system -l k8s-app=metrics-server
# Usually --kubelet-insecure-tls was not added, or the pki mount is wrong
|
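Besides the logs, the aggregated API registration itself can be checked. The sketch below assumes the usual column layout of kubectl get apiservice (NAME, SERVICE, AVAILABLE, AGE) and the standard registration name v1beta1.metrics.k8s.io; metrics_api_available is a hypothetical helper that succeeds only when the AVAILABLE column reads True.

```shell
# metrics_api_available: read `kubectl get apiservice v1beta1.metrics.k8s.io`
# output on stdin; succeed only if the AVAILABLE column reads True.
# Hypothetical helper; assumes columns NAME SERVICE AVAILABLE AGE.
metrics_api_available() {
  awk '$1 == "v1beta1.metrics.k8s.io" { found = ($3 == "True") }
       END { exit found ? 0 : 1 }'
}
```

Usage: kubectl get apiservice v1beta1.metrics.k8s.io | metrics_api_available && echo ok. When the check fails, the AVAILABLE column usually carries a short reason in parentheses that narrows down whether the Pods, the Service, or TLS is at fault.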
4. Verifying the cluster is Ready
| # Core cluster components (componentstatuses has been deprecated since v1.19 but still responds on 1.28)
kubectl get cs
# NAME STATUS MESSAGE ERROR
# scheduler Healthy ok
# controller-manager Healthy ok
# etcd-0 Healthy ok
# Nodes
kubectl get node
# All should be Ready
# Deploy a test Pod
kubectl run busybox --image=busybox:1.28 -- sleep 3600
kubectl exec busybox -- nslookup kubernetes
# Server: 10.96.0.10
# Address 1: 10.96.0.10 coredns.kube-system.svc.cluster.local
# Name: kubernetes
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
|
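At this point the cluster is observable, schedulable, and capable of service discovery, which is the baseline for production use.
5. Summary
Networking + DNS + Metrics are the minimum a K8s cluster needs before it can run workloads:
- Calico is the network plugin this setup relies on for 1.28; its IPIP mode works on any network that lets IP-in-IP (IP protocol 4) traffic pass between nodes
- CoreDNS can be installed with either helm or kubectl apply; remember to disable systemd-resolved
- Metrics-server needs --kubelet-insecure-tls on clusters with self-signed certificates
Next: deploying K8s Dashboard + Reloader in practice.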