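Written in 2021-09, in the K8s 1.22 era, when Calico 3.22, CoreDNS 1.8, metrics-server 0.6, and Dashboard v2.5 were the stable releases. This article starts from a K8s cluster deployed from binaries and adds the four essential plugins. The commands below pin later versions (Calico v3.26.1, CoreDNS v1.10.1, metrics-server v0.6.4, Dashboard v2.7.0, with sample output from K8s v1.28.5), so match the versions to your own cluster.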
1. Plugin overview
Once the three master components (apiserver / scheduler / controller-manager) are running, the cluster still cannot run workloads. Four kinds of plugins are missing:
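| Category | Purpose | Required? |
|---|---|---|
| CNI (network) | Cross-node Pod-to-Pod communication | Required |
| DNS | Service name resolution (CoreDNS) | Required |
| Metrics | Pod/Node resource metrics (kubectl top) | Strongly recommended |
| Dashboard | Web UI | Optional |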
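2. CNI: the Calico network plugin
Calico is one of the most popular CNIs in the K8s ecosystem. It routes Pod traffic at L3 rather than through a bridge device, performs well, and supports NetworkPolicy.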
2.1 Prepare the working directory
```shell
mkdir -p /k8s/calico/config && cd /k8s/calico/config
```
2.2 tigera-operator.yaml
```shell
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml
# Verify
kubectl get pods -n tigera-operator
```
2.3 custom-resources.yaml
```shell
cat > /k8s/calico/config/custom-resources.yaml << 'EOF'
# This section includes base Calico installation configuration.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: the ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 172.218.0.0/16  # must not overlap kube-apiserver's --service-cluster-ip-range
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
---
# This section configures the Calico API server.
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
EOF
kubectl create -f /k8s/calico/config/custom-resources.yaml
# Verify
kubectl get pods -n calico-system
# NAME                                       READY   STATUS    RESTARTS   AGE
# calico-kube-controllers-7ddc4f45bc-6vwhr   1/1     Running   0          2m26s
# calico-node-2sq7m                          1/1     Running   0          2m26s
# calico-node-9jt4c                          1/1     Running   0          2m26s
# calico-node-znjtn                          1/1     Running   0          2m26s
```
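The requirement that the Pod CIDR must not overlap the service range is easy to get wrong by eye. Below is a minimal sketch of an overlap check in plain shell. The names `ip_to_int`, `cidr_range`, and `cidrs_overlap` are hypothetical helpers, not part of Calico or kubectl, and the `10.96.0.0/12` service range in the usage comment is an assumed example (the article only shows the addresses 10.96.0.1 and 10.96.0.10), so substitute your own `--service-cluster-ip-range`.

```shell
# Hypothetical helpers, not part of Calico or kubectl.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  printf '%s\n' "$1" | {
    IFS=. read -r o1 o2 o3 o4
    echo $(( (o1 << 24) | (o2 << 16) | (o3 << 8) | o4 ))
  }
}

# Print "first last" (as integers) for a CIDR such as 172.218.0.0/16.
cidr_range() {
  local base mask first
  base=$(ip_to_int "${1%/*}")
  mask=$(( (0xFFFFFFFF << (32 - ${1#*/})) & 0xFFFFFFFF ))
  first=$(( base & mask ))
  echo "$first $(( first | (mask ^ 0xFFFFFFFF) ))"
}

# Succeed (exit 0) when the two CIDRs share at least one address.
cidrs_overlap() {
  local a b
  a=$(cidr_range "$1")
  b=$(cidr_range "$2")
  # ${a% *} is the first address of a range, ${a#* } the last.
  [ "${a% *}" -le "${b#* }" ] && [ "${b% *}" -le "${a#* }" ]
}

# Usage (the service range is an assumed example value):
#   if cidrs_overlap 172.218.0.0/16 10.96.0.0/12; then echo "overlap"; fi
```

Two ranges overlap exactly when each one starts no later than the other ends, which is why two integer comparisons are enough.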
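2.4 Alternative deployment: calico.yaml
If you would rather not use the operator, you can apply the single large manifest directly:
```shell
wget https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
# Change the Pod CIDR
sed -i 's|192.168.0.0/16|172.218.0.0/16|g' calico.yaml
# Or edit the file by hand; if these two lines are commented out in the
# manifest, uncomment them as well so the setting takes effect:
#   - name: CALICO_IPV4POOL_CIDR
#     value: "172.218.0.0/16"
kubectl apply -f calico.yaml
```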
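If `CALICO_IPV4POOL_CIDR` is commented out in the downloaded manifest, replacing the address alone leaves the setting inactive. The sketch below assumes GNU sed and a manifest in which the variable appears as a `- name:` line followed directly by its `value:` line, each possibly behind a leading `# `. That layout is an assumption about the file, so inspect your copy first. `set_calico_pool_cidr` is a hypothetical helper name.

```shell
# Hypothetical helper: uncomment CALICO_IPV4POOL_CIDR and set its value.
# Assumes GNU sed and that the value line directly follows the name line.
set_calico_pool_cidr() {
  local manifest=$1 cidr=$2
  sed -i -E '/^[[:space:]]*(# ?)?- name: CALICO_IPV4POOL_CIDR[[:space:]]*$/ {
    s/^([[:space:]]*)# ?/\1/
    n
    s/^([[:space:]]*)# ?/\1/
    s|value:.*|value: "'"$cidr"'"|
  }' "$manifest"
}

# Usage:
#   set_calico_pool_cidr calico.yaml 172.218.0.0/16
#   grep -n -A1 'CALICO_IPV4POOL_CIDR' calico.yaml   # review the result
```

Removing exactly `# ` (hash plus one space) keeps the remaining indentation, so the `value:` key lines up under `name:` in the env list.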
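2.5 Uninstall Calico
```shell
# 1. Delete the K8s resources
#    (for the operator install, delete custom-resources.yaml and tigera-operator.yaml instead)
kubectl delete -f calico.yaml
# 2. Check for leftovers (the tunl0 interface)
ip addr show | grep tunl
modprobe -r ipip
# 3. Clean up local CNI files
#    (this removes every CNI config and binary on the node, not only Calico's)
rm -rf /etc/cni/net.d/*
rm -rf /opt/cni/bin/*
rm -rf /var/lib/cni/*
# 4. Restart kubelet
systemctl restart kubelet
```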
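The `grep tunl` check above only catches the IPIP tunnel device. As a hedged sketch, the filter below also looks for `vxlan.calico` (created in VXLAN mode) and the per-Pod `cali…` veth interfaces. `calico_leftover_ifaces` is a hypothetical helper name, and it expects the one-line-per-interface output of `ip -o link show` on stdin.

```shell
# Hypothetical helper: print interface names that look like Calico leftovers.
# Input: output of `ip -o link show` on stdin.
calico_leftover_ifaces() {
  awk -F': ' '{
    name = $2
    sub(/@.*/, "", name)   # strip the "@peer" suffix, e.g. tunl0@NONE
    if (name ~ /^(tunl0|vxlan\.calico|cali[0-9a-f]+)$/) print name
  }'
}

# Usage on a node:
#   ip -o link show | calico_leftover_ifaces
```

An empty result after the uninstall steps suggests no Calico interfaces remain on that node.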
3. CoreDNS: cluster DNS
K8s Services are reached by domain name (for example mysql.default.svc.cluster.local), and CoreDNS provides that resolution.
3.1 Option 1: install with Helm (recommended)
```shell
# Install helm (if it is not installed yet)
wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz
tar xvf helm-v3.13.3-linux-amd64.tar.gz
sudo cp linux-amd64/helm /usr/local/bin/
# Add the chart repository
helm repo add coredns https://coredns.github.io/helm
helm repo update
# Pull the chart
helm pull coredns/coredns
tar xvf coredns-*.tgz
# Edit values.yaml
# 1. Set clusterIP (it must fall inside kube-apiserver's --service-cluster-ip-range);
#    the conventional 10.96.0.10 is fine
sed -i 's|clusterIP:|clusterIP: "10.96.0.10"|' coredns/values.yaml
grep -n 'clusterIP' coredns/values.yaml   # confirm the key is set and not commented out
# 2. Point the image at the Aliyun mirror (avoids timeouts pulling from registry.k8s.io)
sed -i "s#registry.k8s.io/#registry.aliyuncs.com/google_containers/#g" coredns/values.yaml
# Install
helm install coredns ./coredns/ -n kube-system
# Verify
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
3.2 Option 2: kubectl apply with the upstream YAML
```shell
# Download the base manifest
wget https://raw.githubusercontent.com/kubernetes/kubernetes/v1.28.5/cluster/addons/dns/coredns/coredns.yaml.base
# Replace the placeholders
sed -i 's|__DNS__DOMAIN__|cluster.local|g' coredns.yaml.base
sed -i 's|__DNS__MEMORY__LIMIT__|150Mi|g' coredns.yaml.base
sed -i 's|__DNS__SERVER__|10.96.0.10|g' coredns.yaml.base
sed -i 's|__DNS__IMAGE__|registry.aliyuncs.com/google_containers/coredns:v1.10.1|g' coredns.yaml.base
# Confirm the image line now points at the mirror
grep -n 'image:' coredns.yaml.base
mv coredns.yaml.base coredns.yaml
kubectl apply -f coredns.yaml
```
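A `sed` substitution that matches nothing fails silently, so a manifest can reach `kubectl apply` with a placeholder still in it. A minimal sketch of a guard follows. `check_rendered` is a hypothetical helper name, and the pattern assumes placeholders follow the `__UPPER_CASE__` shape used above.

```shell
# Hypothetical helper: list any __PLACEHOLDER__ tokens left in a manifest.
# Returns 0 when none remain, 1 otherwise.
check_rendered() {
  local manifest=$1
  if grep -nE '__[A-Z_]+__' "$manifest"; then
    echo "unreplaced placeholders remain in $manifest" >&2
    return 1
  fi
}

# Usage:
#   check_rendered coredns.yaml && kubectl apply -f coredns.yaml
```

Chaining the apply behind the check keeps a half-rendered file from being submitted.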
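3.3 Verify DNS
```shell
# Start a throwaway dnstools Pod for testing ("/ #" below is the prompt inside the Pod)
kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools
# Resolve the kubernetes Service
/ # host kubernetes
# kubernetes.default.svc.cluster.local has address 10.96.0.1
# Resolve across namespaces
# (with the raw manifest from 3.2 the Service is named kube-dns, so query kube-dns.kube-system)
/ # host coredns.kube-system
# coredns.kube-system.svc.cluster.local has address 10.96.0.10
```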
3.4 Common error: CrashLoopBackOff
```shell
# Check the logs
kubectl logs -n kube-system coredns-687c56b785-fz2mx
# "loop" error: 127.0.0.53 in /etc/resolv.conf
```
Root cause: Ubuntu 22.04 enables systemd-resolved by default (listening on 127.0.0.53:53), so the node's /etc/resolv.conf points at a loopback address. CoreDNS inherits that file and forwards queries to 127.0.0.53, which inside the Pod is CoreDNS itself, so the queries loop.
Fix:
```shell
sudo apt install -y unbound
sudo systemctl stop systemd-resolved
sudo systemctl disable systemd-resolved
sudo rm -rf /etc/resolv.conf
sudo touch /etc/resolv.conf
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart docker and kubelet
sudo systemctl restart docker
sudo systemctl restart kubelet
```
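To see which nodes are affected before touching them, or to confirm the fix afterwards, a small check for loopback nameservers can help. This is a sketch: `loopback_nameservers` is a hypothetical helper name, and it reads whatever resolv.conf-style file it is given, defaulting to /etc/resolv.conf.

```shell
# Hypothetical helper: print loopback nameservers (127.x.x.x or ::1)
# found in a resolv.conf-style file. Returns 0 if at least one is found.
loopback_nameservers() {
  awk '
    $1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2; found = 1 }
    END { exit !found }
  ' "${1:-/etc/resolv.conf}"
}

# Usage on a node:
#   if loopback_nameservers; then echo "CoreDNS would loop on this node"; fi
```

On hosts that keep systemd-resolved running, the real upstream servers are usually written to /run/systemd/resolve/resolv.conf, which the same function can inspect.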
4. Metrics Server: resource metrics
kubectl top node / kubectl top pod rely on the metrics collected by metrics-server. The HPA (Horizontal Pod Autoscaler) uses them as well.
4.1 Single-instance version
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.6.4/components.yaml
```
4.2 High-availability version
```shell
mkdir -p /k8s/metrics-server && cd /k8s/metrics-server
wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.6.4/high-availability.yaml
# Key changes (required):
# 1. Point the image at the Aliyun mirror
sed -i "s#registry.k8s.io/metrics-server#registry.cn-hangzhou.aliyuncs.com/google_containers#g" high-availability.yaml
# 2. Add --kubelet-insecure-tls (for self-signed kubelet certificates)
#    Edit high-availability.yaml, find the args section, and add:
#    - --kubelet-insecure-tls
# 3. On K8s 1.25+, change the PodDisruptionBudget apiVersion
sed -i 's|policy/v1beta1|policy/v1|g' high-availability.yaml
# Deploy
kubectl apply -f /k8s/metrics-server/high-availability.yaml
```
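Step 2 above is a manual edit. For repeatable installs it can be scripted. The sketch below assumes the manifest's args list contains a `- --metric-resolution=...` entry (as in the args list shown in section 4.4) and inserts the new flag right after it with the same indentation. `add_insecure_tls_arg` is a hypothetical helper name.

```shell
# Hypothetical helper: add "- --kubelet-insecure-tls" after the
# "- --metric-resolution=..." arg, copying its indentation.
# Does nothing if the flag is already present.
add_insecure_tls_arg() {
  local manifest=$1 tmp
  if grep -q -- '--kubelet-insecure-tls' "$manifest"; then
    return 0
  fi
  tmp=$(mktemp)
  awk '
    { print }
    /^[ \t]*- --metric-resolution=/ {
      match($0, /^[ \t]*/)
      printf "%s- --kubelet-insecure-tls\n", substr($0, 1, RLENGTH)
    }
  ' "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  rm -f "$tmp"
}

# Usage:
#   add_insecure_tls_arg high-availability.yaml
#   grep -n -B5 'kubelet-insecure-tls' high-availability.yaml   # review the args list
```

The early return makes the helper safe to run more than once against the same file.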
4.3 Verify
```shell
# Wait 1 to 2 minutes
kubectl top node
# NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# master1   170m         4%     1706Mi          45%
# master2   186m         4%     1669Mi          44%
# master3   148m         3%     1641Mi          43%
kubectl top pod -A
# NAMESPACE     NAME                                       CPU(cores)   MEMORY(bytes)
# kube-system   calico-kube-controllers-7ddc4f45bc-6vwhr   6m           55Mi
# kube-system   coredns-687c56b785-fz2mx                   2m           57Mi
```
4.4 Common errors
Error "x509: certificate signed by unknown authority":
```yaml
# Edit high-availability.yaml so that metrics-server mounts the CA directory
spec:
  template:
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --kubelet-insecure-tls  # <- required
        - --metric-resolution=15s
        image: registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server:v0.6.4
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
        - mountPath: /etc/kubernetes/pki
          name: ca-ssl
      volumes:
      - emptyDir: {}
        name: tmp-dir
      - name: ca-ssl
        hostPath:
          path: /etc/kubernetes/pki
```
5. Dashboard: Web UI
The official K8s Dashboard provides a web console. v2.7+ uses token authentication by default.
5.1 Deploy
```shell
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
```
5.2 Expose via NodePort (ClusterIP only by default)
```shell
# Edit the Service
kubectl edit svc kubernetes-dashboard -n kubernetes-dashboard
# Change type: ClusterIP to type: NodePort
# and add nodePort: 32443 under ports, so that the spec reads:
# spec:
#   type: NodePort
#   ports:
#   - port: 443
#     targetPort: 8443
#     nodePort: 32443
```
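The interactive edit can also be expressed as a patch, which is easier to script. The sketch below only builds the patch body and validates the port. `nodeport_patch` is a hypothetical helper name, and the 30000-32767 bounds assume the API server's default NodePort range has not been changed.

```shell
# Hypothetical helper: print a patch that switches a Service to NodePort.
# Rejects ports outside 30000-32767 (the default NodePort range).
nodeport_patch() {
  local service_port=$1 node_port=$2
  case $node_port in
    ''|*[!0-9]*) echo "nodePort must be a number" >&2; return 1 ;;
  esac
  if [ "$node_port" -lt 30000 ] || [ "$node_port" -gt 32767 ]; then
    echo "nodePort $node_port is outside 30000-32767" >&2
    return 1
  fi
  printf '{"spec":{"type":"NodePort","ports":[{"port":%d,"nodePort":%d}]}}\n' \
    "$service_port" "$node_port"
}

# Usage (requires cluster access):
#   kubectl -n kubernetes-dashboard patch svc kubernetes-dashboard \
#     -p "$(nodeport_patch 443 32443)"
```

With kubectl's default strategic merge, the `ports` entry is matched on `port`, so the existing `targetPort: 8443` should be left in place. Check the result with `kubectl get svc -n kubernetes-dashboard` afterwards.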
5.3 Create an admin user
```shell
# Make sure the target directory exists before writing the manifest
mkdir -p /k8s/dashboard
cat > /k8s/dashboard/dashboard-user.yaml << 'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kube-system
EOF
kubectl apply -f /k8s/dashboard/dashboard-user.yaml
```
5.4 Generate a login token
```shell
# K8s 1.24+: use create token
kubectl -n kube-system create token admin-user
# Copy the output and paste it at login
# K8s 1.23 and earlier: read the token from the Secret
kubectl -n kube-system get secret | grep admin-user
kubectl -n kube-system describe secret <SECRET_NAME>
```
5.5 Log in
Open in a browser:
```
https://192.168.139.150:32443/
```
Choose "Token" and paste the token generated above.
6. Verify full cluster functionality
With the four plugins installed, run through a cluster health self-check:
```shell
# 1. All nodes are Ready
kubectl get nodes
# NAME      STATUS   ROLES           AGE   VERSION
# master1   Ready    control-plane   1d    v1.28.5
# master2   Ready    control-plane   1d    v1.28.5
# master3   Ready    control-plane   1d    v1.28.5
# worker1   Ready    <none>          1d    v1.28.5
# worker2   Ready    <none>          1d    v1.28.5
# 2. All Pods are Running
kubectl get pods -A
# 3. Deploy a test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - name: busybox
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
# Wait until the Pod is Ready before running exec against it
kubectl wait --for=condition=Ready pod/busybox -n default --timeout=120s
# 4. DNS resolution
kubectl exec busybox -n default -- nslookup kubernetes
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name:      kubernetes
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
# 5. Cross-namespace resolution
kubectl exec busybox -n default -- nslookup coredns.kube-system
# Should return 10.96.0.10
# 6. Resource metrics
kubectl top node
kubectl top pod -A
# 7. Deploy Nginx with three replicas
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF
# Check whether the replicas are spread across nodes
kubectl get pod -o wide
# NAME                                READY   STATUS    IP               NODE
# nginx-deployment-7c5ddbdf54-7dmkk   1/1     Running   172.218.136.5    master3
# nginx-deployment-7c5ddbdf54-dwxt7   1/1     Running   172.218.137.68   master1
# nginx-deployment-7c5ddbdf54-dxdpm   1/1     Running   172.218.180.6    master2
# Clean up
kubectl delete deployment nginx-deployment
kubectl delete pod busybox
```
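Step 2 of the self-check means reading a long Pod list by eye. As a sketch, the filter below prints only the Pods that need attention. `unhealthy_pods` is a hypothetical helper name, and it assumes the column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, ...).

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` on stdin and
# print "namespace/name ready status" for Pods that are not fully up.
# Completed Pods are skipped. Returns 1 if anything is printed.
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }
    {
      split($3, ready, "/")          # READY column, e.g. "0/1"
      if ($4 != "Running" || ready[1] != ready[2]) {
        print $1 "/" $2, $3, $4
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}

# Usage:
#   kubectl get pods -A --no-headers | unhealthy_pods && echo "all Pods healthy"
```

Comparing the two halves of the READY column also catches Pods that are Running but not yet passing their readiness checks.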
7. Common pitfalls
- CNI Pod CIDR overlaps service-cluster-ip-range: Calico reports an "ip pool in use" error
- CoreDNS CrashLoopBackOff: /etc/resolv.conf contains 127.0.0.53; stop systemd-resolved
- metrics-server reports x509: add --kubelet-insecure-tls or mount /etc/kubernetes/pki
- Dashboard login returns 401: on 1.24+ use kubectl create token; on 1.23 and earlier use the Secret
- Dashboard shows no namespaces: the ServiceAccount is not bound to cluster-admin
8. Next steps
- K8s cluster management (2021-12-15): upgrades, node isolation
- K8s resource limits and probes (2022-03-15): Pod tuning
- One-step deployment with Kubeadm (2022-06-15): production deployment in practice