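Why etcd must be deployed separately
etcd is the only state store behind K8s: the state of every Pod, Service, and ConfigMap is written to it. Production requirements for etcd:
- A 3-node or 5-node cluster (Raft needs a majority, so 3 nodes tolerate 1 failure and 5 tolerate 2)
- TLS on all traffic (both peer and client)
- Deployment separate from kube-apiserver (do not share the master nodes, and do not run etcd in a container)
This article issues the full certificate set with cfssl and deploys a 3-node etcd cluster.
Versions covered: etcd v3.5.11 / cfssl 1.6.4 / Ubuntu 22.04
Deployment targets: one member each on master1, master2, and master3. Note that this walkthrough places the members on the master hosts; to meet the separation requirement above, run the same steps on dedicated machines instead.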
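The 3-or-5-node requirement follows from Raft's majority rule. A minimal sketch of that arithmetic (the function names `quorum` and `tolerance` are made up for this illustration):

```shell
# Quorum size for a Raft cluster of N members: a strict majority.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority is still reachable.
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.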
1. Install the cfssl tools
    # The release binaries are assumed to be already downloaded to /data/softs
    cd /data/softs
    cp cfssl_1.6.4_linux_amd64 /usr/local/bin/cfssl
    cp cfssljson_1.6.4_linux_amd64 /usr/local/bin/cfssljson
    cp cfssl-certinfo_1.6.4_linux_amd64 /usr/local/bin/cfssl-certinfo
    chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson /usr/local/bin/cfssl-certinfo
    ls /usr/local/bin
    # cfssl cfssl-certinfo cfssljson containerd ...
2. Generate the etcd certificates
All certificates are generated on master1 only and then copied to the other masters with scp.
2.1 Create the certificate directories
    mkdir -p /etc/etcd/ssl
    mkdir -p /data/softs/pki && cd /data/softs/pki
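The 11 JSON config files needed (in production these are usually distributed from an ops repository):
    admin-csr.json  apiserver-csr.json  ca-config.json
    ca-csr.json  etcd-ca-csr.json  etcd-csr.json
    front-proxy-ca-csr.json  front-proxy-client-csr.json
    kube-proxy-csr.json  manager-csr.json  scheduler-csr.json
Only ca-config.json, etcd-ca-csr.json, and etcd-csr.json are used in this article; the others belong to the remaining K8s components.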
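The article does not show the contents of the three etcd-related files. A minimal sketch of what they might look like follows; the subject fields (CN, O, OU, locality), the RSA 2048 key, and the 87600h expiry are assumptions for illustration, not the author's values. The one constraint taken from the article is that ca-config.json must define a profile named `kubernetes`, because section 2.2 passes `-profile=kubernetes`.

```shell
# Write the files to a scratch directory by default; set PKI_DIR to
# /data/softs/pki to use them with the commands in section 2.2.
PKI_DIR="${PKI_DIR:-$(mktemp -d)}"

# Signing policy: the "kubernetes" profile allows both server and client auth,
# since etcd members act as servers and as clients of each other.
cat > "$PKI_DIR/ca-config.json" <<'EOF'
{
  "signing": {
    "default": { "expiry": "87600h" },
    "profiles": {
      "kubernetes": {
        "usages": ["signing", "key encipherment", "server auth", "client auth"],
        "expiry": "87600h"
      }
    }
  }
}
EOF

# CSR for the etcd CA itself (subject values are placeholders).
cat > "$PKI_DIR/etcd-ca-csr.json" <<'EOF'
{
  "CN": "etcd",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [
    { "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "etcd", "OU": "Etcd Security" }
  ],
  "ca": { "expiry": "87600h" }
}
EOF

# CSR for the certificate shared by the etcd members; the SAN list is supplied
# on the command line through -hostname rather than in this file.
cat > "$PKI_DIR/etcd-csr.json" <<'EOF'
{
  "CN": "etcd",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [
    { "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "etcd", "OU": "Etcd Security" }
  ]
}
EOF

ls "$PKI_DIR"
```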
2.2 Generate the etcd CA and certificate
    # Generate the CA root certificate
    cfssl gencert -initca etcd-ca-csr.json | cfssljson -bare /etc/etcd/ssl/etcd-ca
    # Generate the etcd server certificate (the hostname and IP list must cover all 3 nodes)
    cfssl gencert \
      -ca=/etc/etcd/ssl/etcd-ca.pem \
      -ca-key=/etc/etcd/ssl/etcd-ca-key.pem \
      -config=ca-config.json \
      -hostname=127.0.0.1,master1,master2,master3,<master1-ip>,<master2-ip>,<master3-ip> \
      -profile=kubernetes \
      etcd-csr.json | cfssljson -bare /etc/etcd/ssl/etcd
Result:
    ls /etc/etcd/ssl/
    # etcd-ca.csr etcd-ca-key.pem etcd-ca.pem
    # etcd.csr etcd-key.pem etcd.pem
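Because a member that is missing from the `-hostname` list will fail TLS verification, it can help to build the list from one place instead of typing it by hand. A small sketch, in which the node names come from the article and the IPs are documentation-range stand-ins for `<master1-ip>` and friends:

```shell
# Node names from the article; the IPs are example values, replace with real ones.
NODES="master1 master2 master3"
IPS="192.0.2.11 192.0.2.12 192.0.2.13"

# printf repeats its format for every argument, producing "a,b,c,".
# $NODES and $IPS are left unquoted on purpose so they split into words.
HOSTNAMES=$(printf '%s,' 127.0.0.1 $NODES $IPS)
HOSTNAMES=${HOSTNAMES%,}   # drop the trailing comma

echo "-hostname=$HOSTNAMES"
```

After issuing the certificate, `cfssl-certinfo -cert /etc/etcd/ssl/etcd.pem` should list the same entries among the certificate's subject alternative names.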
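2.3 Copy the certificates to the other masters
    for NODE in master2 master3; do
      ssh "$NODE" "mkdir -p /etc/etcd/ssl"
      # The CA private key (etcd-ca-key.pem) stays on master1: etcd only needs
      # the CA certificate at runtime, and the key is only used for signing.
      for FILE in etcd-ca.pem etcd-key.pem etcd.pem; do
        scp "/etc/etcd/ssl/${FILE}" "$NODE:/etc/etcd/ssl/${FILE}"
      done
    done
Internal IPs appear as placeholders such as <master1-ip> throughout; substitute the real addresses when running the commands.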
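Two checks are worth running once the files are in place: private keys should be readable by their owner only, and the copies should be byte-identical to the originals. The sketch below is one way to do both; it assumes GNU coreutils (`sha256sum`) and defaults to a scratch directory with stand-in files so it can be tried safely. Set `SSL_DIR=/etc/etcd/ssl` on a real node.

```shell
SSL_DIR="${SSL_DIR:-$(mktemp -d)}"

# Stand-in files so the sketch runs anywhere; on a real node they already exist.
for f in etcd-ca.pem etcd.pem etcd-key.pem; do
  [ -e "$SSL_DIR/$f" ] || echo "placeholder $f" > "$SSL_DIR/$f"
done

# Private keys: owner read/write only. Certificates can stay world-readable.
chmod 600 "$SSL_DIR"/*-key.pem
chmod 644 "$SSL_DIR/etcd-ca.pem" "$SSL_DIR/etcd.pem"

# On the source node: record checksums next to the files.
( cd "$SSL_DIR" && sha256sum etcd-ca.pem etcd.pem etcd-key.pem > SHA256SUMS )

# On each target node (after copying SHA256SUMS as well): verify the copies.
( cd "$SSL_DIR" && sha256sum -c SHA256SUMS )
```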
3. Install the etcd binaries
    cd /data/softs
    tar -xf etcd-v3.5.11-linux-amd64.tar.gz
    mv etcd-v3.5.11-linux-amd64/etcd /usr/local/bin/
    mv etcd-v3.5.11-linux-amd64/etcdctl /usr/local/bin/
    mv etcd-v3.5.11-linux-amd64/etcdutl /usr/local/bin/
    etcdctl version
    # etcdctl version: 3.5.11
    # API version: 3.5
Push them to master2 / master3:
    for NODE in master2 master3; do
      scp /usr/local/bin/etcd* "$NODE:/usr/local/bin/"
    done
4. etcd configuration files
4.1 etcd.config.yml (master1 example, saved as /etc/etcd/etcd.config.yml)
    name: 'master1'
    data-dir: /var/lib/etcd
    wal-dir: /var/lib/etcd/wal
    snapshot-count: 5000
    heartbeat-interval: 100
    election-timeout: 1000
    # 0 means "use the default quota" (2 GB)
    quota-backend-bytes: 0
    listen-peer-urls: 'https://<master1-ip>:2380'
    # The loopback listener is plaintext; use https://127.0.0.1:2379 if all
    # client traffic, including local etcdctl, must go over TLS.
    listen-client-urls: 'https://<master1-ip>:2379,http://127.0.0.1:2379'
    max-snapshots: 3
    max-wals: 5
    initial-advertise-peer-urls: 'https://<master1-ip>:2380'
    advertise-client-urls: 'https://<master1-ip>:2379'
    initial-cluster: 'master1=https://<master1-ip>:2380,master2=https://<master2-ip>:2380,master3=https://<master3-ip>:2380'
    initial-cluster-token: 'etcd-k8s-cluster'
    initial-cluster-state: 'new'
    strict-reconfig-check: false
    enable-v2: true
    enable-pprof: true
    client-transport-security:
      cert-file: '/etc/etcd/ssl/etcd.pem'
      key-file: '/etc/etcd/ssl/etcd-key.pem'
      client-cert-auth: true
      trusted-ca-file: '/etc/etcd/ssl/etcd-ca.pem'
      # Explicit certificates are configured, so self-signed auto TLS is off.
      auto-tls: false
    peer-transport-security:
      cert-file: '/etc/etcd/ssl/etcd.pem'
      key-file: '/etc/etcd/ssl/etcd-key.pem'
      peer-client-cert-auth: true
      trusted-ca-file: '/etc/etcd/ssl/etcd-ca.pem'
      auto-tls: false
For master2 / master3, change the name field and the node's own IP in listen-peer-urls, listen-client-urls, initial-advertise-peer-urls, and advertise-client-urls. initial-cluster is identical on all three nodes.
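Since only the name and the node's own IP differ between the three files, they can be rendered from a single template. A minimal sketch: the `__NAME__` and `__IP__` markers, the output file names, and the documentation-range IPs are all invented for this example, and the template is cut down to the per-node fields.

```shell
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# Template holding only the fields that differ per node.
cat > "$OUT_DIR/etcd.config.yml.tpl" <<'EOF'
name: '__NAME__'
listen-peer-urls: 'https://__IP__:2380'
listen-client-urls: 'https://__IP__:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://__IP__:2380'
advertise-client-urls: 'https://__IP__:2379'
EOF

# name=ip pairs; replace the example IPs with the real node addresses.
for PAIR in master1=192.0.2.11 master2=192.0.2.12 master3=192.0.2.13; do
  NAME=${PAIR%%=*}   # text before the first "="
  IP=${PAIR#*=}      # text after the first "="
  sed -e "s/__NAME__/$NAME/g" -e "s/__IP__/$IP/g" \
    "$OUT_DIR/etcd.config.yml.tpl" > "$OUT_DIR/etcd.config.$NAME.yml"
done

cat "$OUT_DIR/etcd.config.master2.yml"
```

In a full template the shared keys (initial-cluster, the TLS sections, and so on) would sit next to these lines unchanged.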
4.2 etcd.service unit
    [Unit]
    Description=Etcd Service
    Documentation=https://coreos.com/etcd/docs/latest/
    After=network.target

    [Service]
    Type=notify
    ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
    Restart=on-failure
    RestartSec=10
    LimitNOFILE=65536

    [Install]
    WantedBy=multi-user.target
    Alias=etcd3.service
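5. Start and verify
5.1 Start
    # Run on all three nodes at roughly the same time: with Type=notify the
    # first member does not report ready until a quorum has formed.
    systemctl daemon-reload
    systemctl enable --now etcd.service
    systemctl status etcd.service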
5.2 Verify the cluster state
    export ETCDCTL_API=3
    etcdctl --endpoints="<master1-ip>:2379,<master2-ip>:2379,<master3-ip>:2379" \
      --cacert=/etc/etcd/ssl/etcd-ca.pem \
      --cert=/etc/etcd/ssl/etcd.pem \
      --key=/etc/etcd/ssl/etcd-key.pem \
      endpoint status --write-out=table
Expected output:
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | <master1-ip>:2379 | b55511e3290daa2b |  3.5.11 |   20 kB |      true |      false |         2 |          8 |                  8 |        |
    | <master2-ip>:2379 | 64aba84b878707bb |  3.5.11 |   20 kB |     false |      false |         2 |          8 |                  8 |        |
    | <master3-ip>:2379 | 3eafefedb1462a99 |  3.5.11 |   20 kB |     false |      false |         2 |          8 |                  8 |        |
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The cluster is healthy when exactly one endpoint shows IS LEADER = true, RAFT TERM is the same on every endpoint, and ERRORS is empty everywhere.
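Those three conditions can be checked mechanically. The sketch below parses the table format shown above with awk; `check_status` is a name made up for this example, and the sample rows use hypothetical endpoint names so the block runs without a cluster.

```shell
# Reads `endpoint status --write-out=table` output on stdin and checks:
# exactly one leader, a single shared RAFT TERM, and no ERRORS entries.
check_status() {
  awk -F'|' '
    # Data rows start with "|"; skip the header row.
    /^\|/ && $2 !~ /ENDPOINT/ {
      leader = $6; term = $8; err = $11
      gsub(/ /, "", leader); gsub(/ /, "", term); gsub(/ /, "", err)
      rows++
      if (leader == "true") leaders++
      terms[term] = 1
      if (err != "") errors++
    }
    END {
      n = 0
      for (t in terms) n++
      if (rows > 0 && leaders == 1 && n == 1 && errors == 0) {
        print "healthy"
      } else {
        print "unhealthy"
        exit 1
      }
    }'
}

# Sample rows in the same column order as the real table.
SAMPLE='| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
| n1:2379 | b55511e3290daa2b | 3.5.11 | 20 kB | true | false | 2 | 8 | 8 | |
| n2:2379 | 64aba84b878707bb | 3.5.11 | 20 kB | false | false | 2 | 8 | 8 | |
| n3:2379 | 3eafefedb1462a99 | 3.5.11 | 20 kB | false | false | 2 | 8 | 8 | |'

printf '%s\n' "$SAMPLE" | check_status
```

Against a live cluster, pipe the `etcdctl ... endpoint status --write-out=table` command from above into `check_status`. `etcdctl endpoint health` with the same TLS flags gives a complementary per-endpoint check.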
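5.3 etcd performance benchmark
5.4 Backup and restore
    # Backup (uses the default endpoint 127.0.0.1:2379, the loopback listener from 4.1)
    etcdctl snapshot save /tmp/etcd-snapshot.db
    # In etcd 3.5 the offline snapshot commands live in etcdutl;
    # `etcdctl snapshot status` and `etcdctl snapshot restore` are deprecated.
    etcdutl snapshot status /tmp/etcd-snapshot.db --write-out=table
    # Restore (stop etcd first; run on every member with its own --name and peer URL)
    etcdutl snapshot restore /tmp/etcd-snapshot.db \
      --name master1 \
      --initial-cluster master1=https://...,master2=...,master3=... \
      --initial-cluster-token etcd-k8s-cluster \
      --initial-advertise-peer-urls https://<master1-ip>:2380 \
      --data-dir /var/lib/etcd-restore
    # Before starting etcd again, point data-dir and wal-dir in etcd.config.yml
    # at the restored directory, or move it into place.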
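For regular backups, a timestamped file name plus simple retention keeps the snapshot directory from growing without bound. The sketch below assumes a naming scheme of `etcd-snapshot-<UTC timestamp>.db` and a helper called `prune_snapshots`; both are invented for this example. It only prints the `etcdctl` command so that it can be tried without a running cluster.

```shell
# Keep only the newest $2 snapshots in directory $1. File names embed a
# sortable timestamp, so a reverse lexical sort lists the newest first.
prune_snapshots() {
  dir=$1
  keep=$2
  ls -1 "$dir" | grep '^etcd-snapshot-.*\.db$' | sort -r | tail -n +"$((keep + 1))" |
  while IFS= read -r f; do
    rm -f -- "$dir/$f"
  done
}

BACKUP_DIR="${BACKUP_DIR:-$(mktemp -d)}"

# Name the next snapshot after the current UTC time.
SNAPSHOT="$BACKUP_DIR/etcd-snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
echo "would run: etcdctl snapshot save $SNAPSHOT"

# Retain the 7 most recent snapshots (the retention count is an arbitrary example).
prune_snapshots "$BACKUP_DIR" 7
```

Run from cron or a systemd timer, with the `echo` replaced by the real `etcdctl snapshot save` call, this gives a rolling window of backups. Copying the files off the etcd hosts is still advisable.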
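6. Troubleshooting quick reference
| Symptom | Cause | Fix |
|---|---|---|
| etcdctl rejects v3 subcommands such as endpoint status | An old client (before v3.4) defaults to the v2 API | export ETCDCTL_API=3 |
| certificate verify failed | CA mismatch | Check trusted-ca-file |
| RAFT TERM does not advance | The cluster has not elected a leader | Check that all 3 nodes are listed in initial-cluster |
| db size exceeds quota | The default quota is 2GB | Raise quota-backend-bytes, then compact, defragment, and run etcdctl alarm disarm |
| A node cannot rejoin after its peer URL changed | The old peer entry is still registered | Update it with etcdctl member update, or remove the stale member and wipe that node's data-dir before restarting |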
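quota-backend-bytes takes a raw byte count, which is easy to mistype. A tiny helper for the conversion (the name `gib_to_bytes` is made up here, and 8 GiB is used only as an example value; the etcd documentation suggests not going much beyond that):

```shell
# Convert a quota in GiB to the byte value quota-backend-bytes expects.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

echo "quota-backend-bytes: $(gib_to_bytes 8)"
```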
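7. Summary
The etcd cluster is the "heart" of K8s state:
- Run 3 or 5 nodes (Raft majority)
- Use TLS for all traffic (both peer and client)
- The certificate hostname list must include all 3 nodes
- Back up regularly (etcdctl snapshot save)
Next: binary deployment of the K8s 1.28 core components: apiserver/scheduler/controller-manager.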