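Why deploy Ceph with Rook
Ceph is the veteran of open-source distributed storage and exposes three storage interfaces:
- RBD (block storage): exclusive to a single Pod, ReadWriteOnce, highest performance of the three
- CephFS (shared file system): shared by multiple Pods, ReadWriteMany, suited to data sharing
- RGW (object storage): S3-compatible, suited to large-file backups
Rook is a CNCF graduated project that turns Ceph's complex manual deployment into the K8s Operator pattern: CRDs declare the storage cluster, and the Operator maintains all the daemons.
Versions covered: Rook v1.17.4 / Ceph v19.2.2 (Squid) / K8s 1.28.5
Kernel requirement: ≥ 4.17 (needed for CephFS quota enforcement)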
1. Architecture
```text
    ┌─────────────────────────────┐
    │      K8s Control Plane      │
    └──────────────┬──────────────┘
                   │ managed by the Operator
                   ▼
    ┌─────────────────────────────┐
    │  Rook Ceph Operator (CRD)   │
    └────┬─────────┬─────────┬────┘
         │         │         │
     ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
     │  MON  │ │  MGR  │ │  OSD  │
     │  x3   │ │  x2   │ │N disks│
     └───┬───┘ └───────┘ └───┬───┘
         │                   │
         └─────────┬─────────┘
                   ▼
       ┌───────────────────────┐
       │  RBD / CephFS / RGW   │
       └───────────────────────┘
```
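Components that run once Rook is deployed:
- Operator: the control side; watches and reconciles the storage daemons
- Agent: ran on every storage node in older Rook releases; in current releases, including v1.17, the CSI plugin Pods (csi-rbdplugin, csi-cephfsplugin) fill this per-node role
- OSD: one per disk; stores the data
- MON: monitors the cluster and records its topology (the cluster maps)
- MDS: metadata service for CephFS
- RGW: S3 interface (optional)
- MGR: unified management entry point (dashboard, metrics)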
2. Deploy the Rook Operator
2.1 Clone the repository
```shell
git clone --single-branch --branch v1.17.4 https://github.com/rook/rook.git
cd rook/deploy/examples
```
2.2 Mirror the images
By default the images are pulled from quay.io, registry.k8s.io and docker.io. In production, retag and push them to a private Harbor first:
```shell
docker login -u admin -p {{HARBOR_PASSWORD}} <harbor-ip>:13001
# CSI driver and sidecar images used by the operator
docker pull quay.io/cephcsi/cephcsi:v3.14.0
docker tag quay.io/cephcsi/cephcsi:v3.14.0 <harbor-ip>:13001/base/cephcsi/cephcsi:v3.14.0
docker push <harbor-ip>:13001/base/cephcsi/cephcsi:v3.14.0
docker pull registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0
docker tag registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0 <harbor-ip>:13001/base/sig-storage/csi-node-driver-registrar:v2.13.0
docker push <harbor-ip>:13001/base/sig-storage/csi-node-driver-registrar:v2.13.0
docker pull registry.k8s.io/sig-storage/csi-resizer:v1.13.2
docker tag registry.k8s.io/sig-storage/csi-resizer:v1.13.2 <harbor-ip>:13001/base/sig-storage/csi-resizer:v1.13.2
docker push <harbor-ip>:13001/base/sig-storage/csi-resizer:v1.13.2
docker pull registry.k8s.io/sig-storage/csi-provisioner:v5.2.0
docker tag registry.k8s.io/sig-storage/csi-provisioner:v5.2.0 <harbor-ip>:13001/base/sig-storage/csi-provisioner:v5.2.0
docker push <harbor-ip>:13001/base/sig-storage/csi-provisioner:v5.2.0
docker pull registry.k8s.io/sig-storage/csi-snapshotter:v8.2.1
docker tag registry.k8s.io/sig-storage/csi-snapshotter:v8.2.1 <harbor-ip>:13001/base/sig-storage/csi-snapshotter:v8.2.1
docker push <harbor-ip>:13001/base/sig-storage/csi-snapshotter:v8.2.1
docker pull registry.k8s.io/sig-storage/csi-attacher:v4.8.1
docker tag registry.k8s.io/sig-storage/csi-attacher:v4.8.1 <harbor-ip>:13001/base/sig-storage/csi-attacher:v4.8.1
docker push <harbor-ip>:13001/base/sig-storage/csi-attacher:v4.8.1
# Rook operator image and Ceph daemon image
docker pull docker.io/rook/ceph:v1.17.4
docker tag docker.io/rook/ceph:v1.17.4 <harbor-ip>:13001/base/rook/ceph:v1.17.4
docker push <harbor-ip>:13001/base/rook/ceph:v1.17.4
docker pull quay.io/ceph/ceph:v19.2.2
docker tag quay.io/ceph/ceph:v19.2.2 <harbor-ip>:13001/base/ceph/ceph:v19.2.2
docker push <harbor-ip>:13001/base/ceph/ceph:v19.2.2
```
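The pull/tag/push triple repeats for every image, so it can be collapsed into a loop. This is a minimal sketch, assuming the same `base/` project prefix as in the commands for this step; `mirror_target`, `mirror_images` and the `DRY_RUN` switch are hypothetical helpers, not part of Rook or Docker:

```shell
# Map a source image to its path in the private registry:
# drop the source registry host and prepend <registry>/base/.
mirror_target() {
  src="$1"; registry="$2"
  printf '%s/base/%s\n' "$registry" "${src#*/}"
}

# Pull, retag and push every image passed after the registry argument.
# With DRY_RUN=1 the docker commands are printed instead of executed.
mirror_images() {
  registry="$1"; shift
  for src in "$@"; do
    dst=$(mirror_target "$src" "$registry")
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker pull $src"
      echo "docker tag $src $dst"
      echo "docker push $dst"
    else
      docker pull "$src" && docker tag "$src" "$dst" && docker push "$dst"
    fi
  done
}
```

Running `DRY_RUN=1 mirror_images <harbor-ip>:13001 quay.io/ceph/ceph:v19.2.2 docker.io/rook/ceph:v1.17.4` first shows the commands that would run, which makes it easy to compare the target paths with what the manifests will reference.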
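Point operator.yaml and cluster.yaml at the private registry. The patterns match whatever version tag the checked-out manifests carry, with or without a docker.io/ prefix:
```shell
sed -i -E "s#image: (docker\.io/)?rook/ceph:v[0-9.]+#image: <harbor-ip>:13001/base/rook/ceph:v1.17.4#g" operator.yaml
sed -i -E "s#image: quay\.io/ceph/ceph:v[0-9.]+#image: <harbor-ip>:13001/base/ceph/ceph:v19.2.2#g" cluster.yaml
```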
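Rewriting the Rook and Ceph image lines does not touch the CSI images. Those are selected through the `ROOK_CSI_*_IMAGE` settings of the `rook-ceph-operator-config` ConfigMap in operator.yaml. The following sketch prints those settings for a given registry, assuming the `base/` layout and image tags from step 2.2; `csi_image_overrides` is a hypothetical helper name:

```shell
# Print the ConfigMap entries that point the CSI driver and its
# sidecars at the mirrored images in the given registry.
csi_image_overrides() {
  registry="$1"
  cat <<EOF
ROOK_CSI_CEPH_IMAGE: "$registry/base/cephcsi/cephcsi:v3.14.0"
ROOK_CSI_REGISTRAR_IMAGE: "$registry/base/sig-storage/csi-node-driver-registrar:v2.13.0"
ROOK_CSI_RESIZER_IMAGE: "$registry/base/sig-storage/csi-resizer:v1.13.2"
ROOK_CSI_PROVISIONER_IMAGE: "$registry/base/sig-storage/csi-provisioner:v5.2.0"
ROOK_CSI_SNAPSHOTTER_IMAGE: "$registry/base/sig-storage/csi-snapshotter:v8.2.1"
ROOK_CSI_ATTACHER_IMAGE: "$registry/base/sig-storage/csi-attacher:v4.8.1"
EOF
}
```

The output of `csi_image_overrides <harbor-ip>:13001` is meant to be placed under `data:` of that ConfigMap, in place of the commented-out defaults, before operator.yaml is applied.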
2.3 Apply the manifests
```shell
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
```
2.4 Create the CephCluster
Key settings in cluster.yaml:
```yaml
spec:
  cephVersion:
    image: <harbor-ip>:13001/base/ceph/ceph:v19.2.2
  dataDirHostPath: /var/lib/rook
  storage:
    useAllNodes: true     # use every node
    useAllDevices: true   # use every disk that has no partitions
```
Or name the nodes and devices explicitly:
```yaml
spec:
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
    nodes:
      - name: "master1"   # must match the node's kubernetes.io/hostname label
        devices:
          - name: "sdb"
          - name: "sdc"
      - name: "master2"
        devices:
          - name: "sdb"
          - name: "sdc"
      - name: "master3"
        devices:
          - name: "sdb"
          - name: "sdc"
```
Then create the cluster:
```shell
kubectl create -f cluster.yaml
```
2.5 Verify
```shell
kubectl -n rook-ceph get pods
# expect 3 mon, 2 mgr, several osd, plus csi-rbdplugin, csi-cephfsplugin and others
```
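The expected counts can also be checked per daemon type rather than by eye. This is a minimal sketch, assuming the `app=rook-ceph-mon`, `app=rook-ceph-mgr` and `app=rook-ceph-osd` labels that Rook sets on its daemon Pods; `daemon_count` and `daemon_summary` are hypothetical helper names:

```shell
# Count the Running Pods that carry a given label in the rook-ceph namespace.
daemon_count() {
  kubectl -n rook-ceph get pods -l "$1" \
    --field-selector=status.phase=Running -o name | wc -l | tr -d ' '
}

# Print one line per daemon type in the form "<type> <count>".
daemon_summary() {
  for d in mon mgr osd; do
    echo "$d $(daemon_count "app=rook-ceph-$d")"
  done
}
```

On a cluster deployed as described here, `daemon_summary` should report 3 for mon and 2 for mgr; the osd count depends on how many disks were picked up.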
3. Block storage (RBD) StorageClass
3.1 Create a CephBlockPool
```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host   # replicas are spread across hosts
  replicated:
    size: 3             # 3 replicas
    requireSafeReplicaSize: true
```
3.2 Create the StorageClass
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph   # namespace of the Rook cluster
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
```
3.3 Test with a PVC
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: rook-ceph
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rbd-demo-pod
  namespace: rook-ceph
spec:
  containers:
    - name: rbd-demo-pod
      image: busybox:stable-glibc
      command: ["sleep", "3600"]
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rbd-pvc
```
Verify:
```shell
kubectl exec rbd-demo-pod -n rook-ceph -- df -ah
# expect /dev/rbd0 at about 1G
# clean up, assuming the two manifests were saved as rbd-pod.yaml and rbd-pvc.yaml
kubectl delete -f rbd-pod.yaml
kubectl delete -f rbd-pvc.yaml
```
4. CephFS shared file system
4.1 Create a CephFilesystem
```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
      requireSafeReplicaSize: true
  dataPools:
    - name: replicated   # yields the Ceph pool myfs-replicated used by the StorageClass
      failureDomain: host
      replicated:
        size: 3
        requireSafeReplicaSize: true
  metadataServer:
    activeCount: 1   # 3 recommended in production
  preserveFilesystemOnDelete: true
```
4.2 StorageClass
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated   # <fsName>-<dataPools[].name>
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
```
4.3 Test RWX
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: busybox-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: rook-cephfs
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox-container
          image: busybox:stable-glibc
          command: ["sleep", "3600"]
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: busybox-data-pvc
```
If both Pods can read and write /data, CephFS is providing ReadWriteMany access.
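The shared access can be confirmed from the command line by writing in one Pod and reading in the other. This is a minimal sketch, assuming the busybox Deployment runs in the `default` namespace with both replicas Running; `rwx_check` and the file name `rwx-test` are hypothetical:

```shell
# Write a marker file from the first busybox Pod and read it back
# from the second one. Prints the file content on success.
rwx_check() {
  ns="${1:-default}"
  pods=$(kubectl -n "$ns" get pods -l app=busybox -o name)
  writer=$(printf '%s\n' "$pods" | sed -n 1p)
  reader=$(printf '%s\n' "$pods" | sed -n 2p)
  if [ -z "$reader" ]; then
    echo "need two busybox Pods" >&2
    return 1
  fi
  kubectl -n "$ns" exec "$writer" -- sh -c 'echo rwx-ok > /data/rwx-test'
  kubectl -n "$ns" exec "$reader" -- cat /data/rwx-test
}
```

Calling `rwx_check` should print the marker written by the first Pod; an error from the second `kubectl exec` would suggest the two Pods are not seeing the same volume.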
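5. Limitations of RBD block storage
RBD is a block device mounted ReadWriteOnce, so it can only be attached on one node at a time. During a Deployment rolling update:
- the old Pod still has the RBD volume mounted
- the new Pod is scheduled to a different node and tries to mount the same PV → this fails
- until the old Pod is deleted, RBD will not let another Pod attach the volume
Workarounds:
- use storage that supports RWX, such as CephFS or GlusterFS
- label a Node and pin the Pod to it (not recommended: single point of failure)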
6. Dashboard
```shell
kubectl apply -f dashboard-external-https.yaml
```
Get the password:
```shell
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode
```
The default user is admin; the password can be changed after logging in.
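To find where the dashboard is reachable, the NodePort of the external Service can be read back. This sketch assumes the Service created by dashboard-external-https.yaml is named `rook-ceph-mgr-dashboard-external-https`, as in Rook's example manifest; `dashboard_nodeport` is a hypothetical helper name:

```shell
# Print the NodePort of the external HTTPS dashboard Service.
dashboard_nodeport() {
  kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard-external-https \
    -o jsonpath='{.spec.ports[0].nodePort}'
}
```

The dashboard should then be reachable at `https://<node-ip>:<port>`, where `<port>` is the value printed by `dashboard_nodeport`.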
7. Toolbox
```shell
kubectl create -f toolbox.yaml
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
```
Common commands:
```shell
ceph status                   # cluster status
ceph osd status               # OSD status
ceph osd tree                 # OSD topology
ceph df                       # capacity usage
ceph health detail            # health details
ceph mgr module enable rook   # enable the rook orchestrator module
ceph orch set backend rook    # use Rook as the orchestrator backend
```
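For one-off checks it is not necessary to open an interactive shell in the toolbox. A minimal sketch, assuming the `rook-ceph-tools` Deployment from toolbox.yaml is running; `ceph_tools` is a hypothetical wrapper name:

```shell
# Run a single ceph command through the toolbox Deployment.
ceph_tools() {
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}
```

For example, `ceph_tools health detail` or `ceph_tools osd tree` runs the corresponding command from the list above and returns to the local shell.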
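8. Common problems
8.1 MON_DISK_LOW: mon is low on available space
The file system holding /var/lib/rook, or /var on the node where the mon container runs, is full:
```shell
docker system prune -af
# on nodes that run containerd instead of Docker: crictl rmi --prune
journalctl --vacuum-time=1w
journalctl --vacuum-size=500M
# the biggest consumers are usually /var/lib/docker and /var/lib/kubelet
du -sh /var/lib/* | sort -hr | head
```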
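Ceph raises MON_DISK_LOW when free space on the mon's file system falls below `mon_data_avail_warn`, which is 30% by default. The check can be approximated on the node before and after cleaning up. This is a minimal sketch; `avail_pct` and `check_mon_disk` are hypothetical helper names, and the percentage is derived from `df`, so it may differ slightly from Ceph's own figure:

```shell
# Print the approximate percentage of free space on the file system
# that holds the given path (100 minus df's "Capacity" column).
avail_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Report whether free space is below a threshold.
# Defaults: path /var/lib/rook, threshold 30 (percent).
check_mon_disk() {
  path="${1:-/var/lib/rook}"; min="${2:-30}"
  pct=$(avail_pct "$path")
  if [ "$pct" -lt "$min" ]; then
    echo "LOW: $path has ${pct}% free (threshold ${min}%)"
    return 1
  fi
  echo "OK: $path has ${pct}% free"
}
```

Running `check_mon_disk` on each mon node after the cleanup shows whether the node is back above the warning threshold.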
8.2 MON_CLOCK_SKEW detected on mon.b, mon.c
The clocks have drifted. Check ntp/chrony on every node. To loosen the tolerance, inspect the rook-config-override ConfigMap:
```shell
kubectl -n rook-ceph get cm rook-config-override -o yaml
```
and set the following in it:
```yaml
# loosen the tolerance
apiVersion: v1
data:
  config: |
    [global]
    mon clock drift allowed = 1
```
Restart the mons:
```shell
kubectl -n rook-ceph delete pod -l app=rook-ceph-mon
```
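Checking chrony on each node can be scripted. This is a minimal sketch, assuming chrony is the time daemon and that `chronyc tracking` prints a `System time : <seconds> seconds fast/slow of NTP time` line; `chrony_offset` and `check_skew` are hypothetical helper names. The default limit of 0.05 s mirrors Ceph's default `mon_clock_drift_allowed`; note that it measures each node's offset from NTP, not the skew between two mons:

```shell
# Read `chronyc tracking` output on stdin and print the absolute
# offset from NTP time in seconds.
chrony_offset() {
  awk '/^System time/ { print $4 }'
}

# Read `chronyc tracking` output on stdin and compare the offset
# with a limit in seconds (default 0.05).
check_skew() {
  limit="${1:-0.05}"
  offset=$(chrony_offset)
  awk -v o="$offset" -v l="$limit" 'BEGIN {
    if (o + 0 > l + 0) { print "SKEW " o; exit 1 }
    print "OK " o
  }'
}
```

On a node, `chronyc tracking | check_skew` reports whether the local offset alone already exceeds the tolerance.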
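8.3 TOO_FEW_OSDS
There are fewer than 3 OSDs. With the default of 3 replicas, Ceph needs at least 3. Check on every node that raw disks were actually discovered:
```shell
lsblk -f
# the disks must be raw: no partitions and no file system
```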
8.4 Uninstall stuck in Terminating
```shell
NAMESPACE=rook-ceph
kubectl proxy &
kubectl get namespace $NAMESPACE -o json | jq '.spec = {"finalizers":[]}' > temp.json
curl -k -H "Content-Type: application/json" -X PUT --data-binary @temp.json 127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize
kill %1   # stop the background kubectl proxy
```
8.5 CephCluster cannot be deleted
```shell
kubectl edit cephcluster.ceph.rook.io -n rook-ceph
# delete the finalizers field
```
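The same change can be made without an editor by patching the resource. This sketch assumes the CephCluster is named `rook-ceph`, the name used in Rook's example cluster.yaml; `clear_cephcluster_finalizers` is a hypothetical helper name:

```shell
# Remove all finalizers from a CephCluster in the rook-ceph namespace.
# The cluster name defaults to rook-ceph.
clear_cephcluster_finalizers() {
  name="${1:-rook-ceph}"
  kubectl -n rook-ceph patch cephcluster "$name" --type merge \
    -p '{"metadata":{"finalizers":[]}}'
}
```

As with the manual edit, this skips Rook's own cleanup, so it is best kept for clusters that are being torn down anyway.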
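8.6 Disk data left behind after uninstall
```shell
# destructive: wipes /dev/sdb; repeat for every disk that backed an OSD
sgdisk --zap-all /dev/sdb
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct,dsync
blkdiscard /dev/sdb   # for SSDs
```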
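9. Summary
Rook-Ceph is the de facto standard for running Ceph on K8s:
- RBD suits high-performance RWO workloads (databases)
- CephFS suits shared RWX workloads (data shared across Pods)
- MON and OSD need at least 3 replicas; production needs at least 3 nodes and 9 disks
- full disks and clock skew are the two most common kinds of failure
Next: GlusterFS 10 distributed file system: seven volume types + Kadalu CSI integration.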