Rook-Ceph 1.17 Distributed Storage: RBD Block Storage + CephFS + OSD/MON in Practice

Deploying Rook 1.17 with Ceph 19 (Squid): RBD block storage, CephFS, and MON/OSD/MGR troubleshooting

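Why deploy Ceph with Rook

Ceph is the veteran of open-source distributed storage and exposes three storage interfaces:

  • RBD (block storage): one volume per consumer, ReadWriteOnce, the highest performance of the three
  • CephFS (shared file system): shared by many Pods, ReadWriteMany, suited to shared data
  • RGW (object storage): S3-compatible, suited to backups of large files

Rook is a CNCF graduated project that turns Ceph's complex manual deployment into the Kubernetes Operator pattern: CRDs describe the storage cluster, and the Operator keeps every daemon running.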

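Versions covered: Rook v1.17.4 / Ceph v19.2.2 (Squid) / K8s 1.28.5. Kernel requirement: ≥ 4.17 (older kernels do not enforce CephFS quotas, so requested PVC sizes are not enforced).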


1. Architecture

              ┌─────────────────────────────────┐
              │        K8s Control Plane        │
              └────────────────┬────────────────┘
                               │ managed by the Operator
              ┌────────────────┴────────────────┐
              │    Rook Ceph Operator (CRD)     │
              └─┬──────────────┬──────────────┬─┘
                │              │              │
        ┌───────▼──────┐   ┌───▼───┐   ┌──────▼──────┐
        │     MON      │   │  MGR  │   │     OSD     │
        │ (3 replicas) │   │  (2)  │   │  (N disks)  │
        └───────┬──────┘   └───────┘   └──────┬──────┘
                │                             │
                └──────────────┬──────────────┘
                  ┌────────────▼────────────┐
                  │   RBD / CephFS / RGW    │
                  └─────────────────────────┘

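Components that run after Rook is deployed:

  • Operator: the control loop that deploys and watches the storage daemons
  • CSI drivers (csi-rbdplugin, csi-cephfsplugin): run on every node and provision and mount volumes (the per-node Agent of early Rook releases no longer exists in 1.17)
  • OSD: one per disk, stores the data
  • MON: maintains the cluster maps and quorum
  • MDS: metadata service for CephFS
  • RGW: S3 endpoint (optional)
  • MGR: management endpoint for metrics, the dashboard, and orchestration modules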

2. Deploy the Rook Operator

2.1 Clone the repository

git clone --single-branch --branch v1.17.4 https://github.com/rook/rook.git
cd rook/deploy/examples

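2.2 Mirror the images

The default images are pulled from quay.io, registry.k8s.io, and docker.io. In production, retag and push them to a private Harbor registry first: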

docker login -u admin -p {{HARBOR_PASSWORD}} <harbor-ip>:13001

# operator
docker pull quay.io/cephcsi/cephcsi:v3.14.0
docker tag quay.io/cephcsi/cephcsi:v3.14.0 <harbor-ip>:13001/base/cephcsi/cephcsi:v3.14.0
docker push <harbor-ip>:13001/base/cephcsi/cephcsi:v3.14.0

docker pull registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0
docker tag registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0 <harbor-ip>:13001/base/sig-storage/csi-node-driver-registrar:v2.13.0
docker push <harbor-ip>:13001/base/sig-storage/csi-node-driver-registrar:v2.13.0

docker pull registry.k8s.io/sig-storage/csi-resizer:v1.13.2
docker tag registry.k8s.io/sig-storage/csi-resizer:v1.13.2 <harbor-ip>:13001/base/sig-storage/csi-resizer:v1.13.2
docker push <harbor-ip>:13001/base/sig-storage/csi-resizer:v1.13.2

docker pull registry.k8s.io/sig-storage/csi-provisioner:v5.2.0
docker tag registry.k8s.io/sig-storage/csi-provisioner:v5.2.0 <harbor-ip>:13001/base/sig-storage/csi-provisioner:v5.2.0
docker push <harbor-ip>:13001/base/sig-storage/csi-provisioner:v5.2.0

docker pull registry.k8s.io/sig-storage/csi-snapshotter:v8.2.1
docker tag registry.k8s.io/sig-storage/csi-snapshotter:v8.2.1 <harbor-ip>:13001/base/sig-storage/csi-snapshotter:v8.2.1
docker push <harbor-ip>:13001/base/sig-storage/csi-snapshotter:v8.2.1

docker pull registry.k8s.io/sig-storage/csi-attacher:v4.8.1
docker tag registry.k8s.io/sig-storage/csi-attacher:v4.8.1 <harbor-ip>:13001/base/sig-storage/csi-attacher:v4.8.1
docker push <harbor-ip>:13001/base/sig-storage/csi-attacher:v4.8.1

docker pull docker.io/rook/ceph:v1.17.4
docker tag docker.io/rook/ceph:v1.17.4 <harbor-ip>:13001/base/rook/ceph:v1.17.4
docker push <harbor-ip>:13001/base/rook/ceph:v1.17.4

docker pull quay.io/ceph/ceph:v19.2.2
docker tag quay.io/ceph/ceph:v19.2.2 <harbor-ip>:13001/base/ceph/ceph:v19.2.2
docker push <harbor-ip>:13001/base/ceph/ceph:v19.2.2
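
The repeated pull/tag/push triplets above could also be driven by a loop. The following is a minimal sketch, not part of the original procedure: the helper names `mirror_target` and `mirror_all` are hypothetical, and it assumes every target path is simply the upstream path with the registry host replaced by the mirror prefix, which is what the commands above do.

```shell
# Image list copied from the commands above.
IMAGES='
quay.io/cephcsi/cephcsi:v3.14.0
registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0
registry.k8s.io/sig-storage/csi-resizer:v1.13.2
registry.k8s.io/sig-storage/csi-provisioner:v5.2.0
registry.k8s.io/sig-storage/csi-snapshotter:v8.2.1
registry.k8s.io/sig-storage/csi-attacher:v4.8.1
docker.io/rook/ceph:v1.17.4
quay.io/ceph/ceph:v19.2.2
'

# mirror_target MIRROR IMAGE (hypothetical helper)
# Drop the upstream registry host and prepend the mirror prefix.
mirror_target() {
  printf '%s/%s\n' "$1" "${2#*/}"
}

# mirror_all MIRROR (hypothetical helper)
# Pull, retag and push every image; stops at the first failure.
mirror_all() {
  for img in $IMAGES; do
    target=$(mirror_target "$1" "$img")
    docker pull "$img" && docker tag "$img" "$target" && docker push "$target" || return 1
  done
}

# Usage, after `docker login`:
#   mirror_all "<harbor-ip>:13001/base"
```

Keeping the list in one variable makes it easier to bump versions in a single place when Rook is upgraded.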

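Point operator.yaml and cluster.yaml at the private registry. Match on the image name rather than a hard-coded old tag, so the patterns still apply to the manifests shipped with the v1.17.4 checkout:

sed -i -E "s#image: (docker\.io/)?rook/ceph:v[0-9.]+#image: <harbor-ip>:13001/base/rook/ceph:v1.17.4#g" operator.yaml
sed -i -E "s#image: quay\.io/ceph/ceph:v[0-9.]+#image: <harbor-ip>:13001/base/ceph/ceph:v19.2.2#g" cluster.yaml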
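
After the substitution it is worth confirming that no active `image:` line still points at a public registry, since a pattern that silently fails to match leaves the manifest unchanged. A small sketch of such a check, with a hypothetical helper name `unmirrored_images`:

```shell
# unmirrored_images FILE MIRROR_PREFIX (hypothetical helper)
# Print every active `image:` line in a manifest that does not reference the
# mirror prefix. Commented-out lines are ignored. Empty output means every
# image line points at the private registry.
unmirrored_images() {
  grep -E '^[[:space:]]*image:' "$1" | grep -vF "$2"
}

# Usage (run in rook/deploy/examples):
#   unmirrored_images operator.yaml "<harbor-ip>:13001/base"
#   unmirrored_images cluster.yaml  "<harbor-ip>:13001/base"
```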

2.3 Apply the manifests

kubectl create -f crds.yaml -f common.yaml -f operator.yaml

2.4 Create the CephCluster

Key settings in cluster.yaml:

spec:
  cephVersion:
    image: <harbor-ip>:13001/base/ceph/ceph:v19.2.2
  dataDirHostPath: /var/lib/rook
  storage:
    useAllNodes: true              # use every node
    useAllDevices: true            # use every raw device (no partitions, no filesystem)

Or list nodes and devices explicitly:

spec:
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
    nodes:
      - name: "master1"
        devices:
          - name: "sdb"
          - name: "sdc"
      - name: "master2"
        devices:
          - name: "sdb"
          - name: "sdc"
      - name: "master3"
        devices:
          - name: "sdb"
          - name: "sdc"

Create the cluster:

kubectl create -f cluster.yaml

2.5 Verify

kubectl -n rook-ceph get pods
# Expect 3 mon, 2 mgr, several osd, plus csi-rbdplugin, csi-cephfsplugin, and others
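
Pods being Running does not by itself mean Ceph is healthy. A sketch of a polling helper that waits for `HEALTH_OK` in the CephCluster status follows; the function name `wait_for_health` is hypothetical, and it assumes the CephCluster resource is named `rook-ceph`, as in the example cluster.yaml.

```shell
# wait_for_health [TRIES] [DELAY_SECONDS] (hypothetical helper)
# Polls the CephCluster status until Ceph reports HEALTH_OK.
# Assumes the CephCluster resource is named "rook-ceph".
wait_for_health() {
  tries=${1:-60}
  delay=${2:-10}
  while [ "$tries" -gt 0 ]; do
    health=$(kubectl -n rook-ceph get cephcluster rook-ceph \
      -o jsonpath='{.status.ceph.health}' 2>/dev/null)
    if [ "$health" = "HEALTH_OK" ]; then
      echo "cluster is healthy"
      return 0
    fi
    echo "current health: ${health:-unknown}, retrying in ${delay}s"
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait up to 10 minutes (60 tries, 10 s apart)
#   wait_for_health 60 10
```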

3. Block storage (RBD) StorageClass

3.1 Create the CephBlockPool

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host              # spread replicas across hosts
  replicated:
    size: 3                       # 3 replicas
    requireSafeReplicaSize: true

3.2 Create the StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph            # namespace of the Rook cluster
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
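
Because the class sets `allowVolumeExpansion: true`, a bound PVC can later be grown by raising its storage request (volumes can only grow, not shrink). A minimal sketch, in which the helper names and the claim name are hypothetical:

```shell
# pvc_resize_patch SIZE (hypothetical helper)
# Build the JSON merge patch that raises a PVC's storage request.
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# resize_pvc NAMESPACE NAME SIZE (hypothetical helper)
resize_pvc() {
  kubectl -n "$1" patch pvc "$2" --type merge -p "$(pvc_resize_patch "$3")"
}

# Usage (hypothetical claim name):
#   resize_pvc rook-ceph my-claim 2Gi
```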

3.3 Test with a PVC

rbd-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: rook-ceph
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block

rbd-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: rbd-demo-pod
  namespace: rook-ceph
spec:
  containers:
    - name: rbd-demo-pod
      image: busybox:stable-glibc
      command: ["sleep", "3600"]
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rbd-pvc

Verify:

kubectl exec rbd-demo-pod -n rook-ceph -- df -ah
# Expect /dev/rbd0 (about 1G) mounted at /var/lib/www/html

kubectl delete -f rbd-pod.yaml
kubectl delete -f rbd-pvc.yaml

4. CephFS shared file system

4.1 Create the CephFilesystem

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
      requireSafeReplicaSize: true
  dataPools:
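    - name: replicated            # pool is created as myfs-replicated, the name the StorageClass references
      failureDomain: host
      replicated:
        size: 3
        requireSafeReplicaSize: true
  metadataServer:
    activeCount: 1                # 3 suggested for production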
  preserveFilesystemOnDelete: true

4.2 StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete

4.3 Test RWX

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: busybox-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: rook-cephfs
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox-container
        image: busybox:stable-glibc
        command: ["sleep", "3600"]
        volumeMounts:
        - name: data-volume
          mountPath: /data
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: busybox-data-pvc

Both Pods can read and write /data, which is the RWX capability CephFS provides.
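
The shared access can be checked by writing a file from one replica and reading it back from the other. A sketch under the assumption that the Deployment above runs in the current namespace with the `app=busybox` label; the function name `check_rwx` and the file name are hypothetical.

```shell
# check_rwx (hypothetical helper)
# Write a marker file through the first busybox Pod and read it back
# through the second one. Succeeds only if both see the same content.
check_rwx() {
  pods=$(kubectl get pods -l app=busybox -o name) || return 1
  first=$(printf '%s\n' "$pods" | sed -n 1p)
  second=$(printf '%s\n' "$pods" | sed -n 2p)
  [ -n "$second" ] || { echo "need two running Pods" >&2; return 1; }
  kubectl exec "$first" -- sh -c 'echo hello-rwx > /data/rwx-test' || return 1
  [ "$(kubectl exec "$second" -- cat /data/rwx-test)" = "hello-rwx" ]
}

# Usage:
#   check_rwx && echo "RWX works"
```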


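5. Limits of RBD block storage

An RBD volume is a block device provisioned as ReadWriteOnce, so it can be attached read-write to only one node at a time. During a Deployment rolling update:

  1. The old Pod still has the RBD volume mounted
  2. The new Pod is scheduled to a different node and tries to mount the same PV, which fails
  3. Until the old Pod is deleted, RBD will not let a Pod on another node attach the volume

Workarounds

  1. Use storage that supports RWX, such as CephFS or GlusterFS
  2. Label a Node and pin the Pod to it (not recommended: single point of failure)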

6. Dashboard

kubectl apply -f dashboard-external-https.yaml

Get the password:

kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode

The default user is admin; the password can be changed after logging in.


7. Toolbox (troubleshooting tools)

kubectl create -f toolbox.yaml
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

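Common commands:

ceph status              # cluster status
ceph osd status          # OSD status
ceph osd tree            # OSD topology
ceph df                  # capacity usage
ceph health detail       # health details
ceph mgr module enable rook      # enable the rook orchestrator module
ceph orch set backend rook       # make rook the orchestrator backend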
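
The same commands can be run non-interactively through the toolbox Deployment, which is handy for scripts and cron checks. A sketch that maps `ceph health` output to a monitoring-style exit code; the helper names and the exit-code convention (0 ok, 1 warning, 2 error, 3 unknown) are assumptions, not something Rook prescribes.

```shell
# health_exit_code STATUS_LINE (hypothetical helper)
# Map the leading word of `ceph health` output to an exit code.
health_exit_code() {
  case "$1" in
    HEALTH_OK*)   echo 0 ;;
    HEALTH_WARN*) echo 1 ;;
    HEALTH_ERR*)  echo 2 ;;
    *)            echo 3 ;;   # unknown or empty output
  esac
}

# check_ceph (hypothetical helper)
# Run `ceph health` through the toolbox and return the mapped code.
check_ceph() {
  status=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health 2>/dev/null)
  echo "$status"
  return "$(health_exit_code "$status")"
}

# Usage:
#   check_ceph || echo "cluster needs attention"
```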

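8. Common problems

8.1 MON_DISK_LOW: mon is low on available space

The filesystem holding /var/lib/rook (usually /var on the node that runs the mon) is nearly full. Reclaim space on that node:

docker system prune -af            # Docker nodes; on containerd nodes use: crictl rmi --prune
journalctl --vacuum-time=1w
journalctl --vacuum-size=500M

# The biggest consumers are usually /var/lib/docker and /var/lib/kubelet
du -sh /var/lib/* | sort -hr | head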
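
Ceph raises MON_DISK_LOW when the free space under the mon data directory falls below `mon_data_avail_warn`, which defaults to 30%. A sketch of checking that margin on a node before the warning fires; the helper names are hypothetical, and the percentage is derived from the `df -P` Capacity column, so it is an approximation.

```shell
# avail_percent (hypothetical helper)
# Read `df -P` output on stdin and print the free percentage of the
# first listed filesystem (100 minus the Capacity column).
avail_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# mon_disk_ok PATH [THRESHOLD] (hypothetical helper)
# Succeeds when PATH has more than THRESHOLD percent free (default 30).
mon_disk_ok() {
  free=$(df -P "$1" | avail_percent)
  [ "$free" -gt "${2:-30}" ]
}

# Usage on a mon node:
#   mon_disk_ok /var/lib/rook || echo "free space is at or below 30%"
```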

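8.2 MON_CLOCK_SKEW detected on mon.b, mon.c

The mon clocks have drifted apart. First check that ntp/chrony is running and synchronized on every node. If a small residual skew remains, the tolerance can be loosened in the rook-config-override ConfigMap:

kubectl -n rook-ceph get cm rook-config-override -o yaml

# Loosen the tolerance to 1 s (Ceph's default is 0.05 s) by setting data.config:
apiVersion: v1
data:
  config: |
    [global]
    mon clock drift allowed = 1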

Restart the mons so they pick up the change:

kubectl -n rook-ceph delete pod -l app=rook-ceph-mon
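
Raising the tolerance only hides the symptom, so it helps to measure the actual offset on each node. A sketch that parses the `System time` line of `chronyc tracking`; it assumes chrony is the time daemon, and the helper name `offset_exceeds` is hypothetical.

```shell
# offset_exceeds LIMIT_SECONDS (hypothetical helper)
# Reads `chronyc tracking` output on stdin. Succeeds when the reported
# system clock offset is larger than LIMIT_SECONDS.
offset_exceeds() {
  awk -v lim="$1" '
    /^System time/ { found = 1; bad = ($4 + 0 > lim + 0) }
    END { exit !(found && bad) }
  '
}

# Usage on each node (0.05 s is the Ceph default tolerance):
#   chronyc tracking | offset_exceeds 0.05 && echo "clock offset too large"
```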

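8.3 TOO_FEW_OSDS

The cluster has fewer OSDs than the default pool replica count of 3. Check on each node that Rook actually found raw disks:

lsblk -f
# Each disk should be raw: no partitions and no filesystem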
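
On nodes with many devices, the raw candidates can be filtered mechanically. The sketch below reads `lsblk -P` key/value output and prints disks that have neither a filesystem signature nor any child device (partition or LVM volume). The helper name `raw_disks` is hypothetical, and the filter is only an approximation of Rook's own device discovery.

```shell
# raw_disks (hypothetical helper)
# Reads `lsblk -P -o NAME,TYPE,FSTYPE,PKNAME` output on stdin and prints
# the names of disks with no filesystem and no child devices.
raw_disks() {
  awk '
    {
      name = type = fstype = pk = ""
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        gsub(/"/, "", kv[2])
        if (kv[1] == "NAME")   name = kv[2]
        if (kv[1] == "TYPE")   type = kv[2]
        if (kv[1] == "FSTYPE") fstype = kv[2]
        if (kv[1] == "PKNAME") pk = kv[2]
      }
      if (pk != "") has_child[pk] = 1           # remember parents of partitions/LVs
      if (type == "disk" && fstype == "") cand[++n] = name
    }
    END {
      for (i = 1; i <= n; i++)
        if (!(cand[i] in has_child)) print cand[i]
    }
  '
}

# Usage on a storage node:
#   lsblk -P -o NAME,TYPE,FSTYPE,PKNAME | raw_disks
```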

8.4 Uninstall stuck in Terminating

NAMESPACE=rook-ceph
kubectl proxy &

kubectl get namespace $NAMESPACE -o json | jq '.spec = {"finalizers":[]}' > temp.json
curl -k -H "Content-Type: application/json" -X PUT --data-binary @temp.json 127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize

8.5 CephCluster cannot be deleted

kubectl edit cephcluster.ceph.rook.io -n rook-ceph
# Remove the finalizers field under metadata

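8.6 Disk data left behind after uninstall

Wipe each OSD disk before reusing it (this destroys all data on the device):

sgdisk --zap-all /dev/sdb
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct,dsync
blkdiscard /dev/sdb                # for SSD/NVMe; fails on devices without discard support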
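
With several disks per node, a dry-run generator makes it easier to review the destructive commands before running them. This is only a sketch: the helper name `wipe_plan` is hypothetical, and the final `rm -rf /var/lib/rook` line assumes the `dataDirHostPath` used earlier in cluster.yaml, which also has to be removed before the node can join a fresh cluster.

```shell
# wipe_plan DEVICE... (hypothetical helper)
# Print, without executing, the cleanup commands for each device,
# followed by removal of the Rook data directory.
wipe_plan() {
  for dev in "$@"; do
    echo "sgdisk --zap-all $dev"
    echo "dd if=/dev/zero of=$dev bs=1M count=100 oflag=direct,dsync"
    echo "blkdiscard $dev"
  done
  echo "rm -rf /var/lib/rook"
}

# Usage: review the plan first, then pipe it to a shell on the node.
#   wipe_plan /dev/sdb /dev/sdc
#   wipe_plan /dev/sdb /dev/sdc | sh -x
```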

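9. Summary

Rook-Ceph is the de facto standard for running Ceph on K8s:

  1. RBD suits high-performance RWO workloads (databases)
  2. CephFS suits shared RWX workloads (data shared across Pods)
  3. Run at least 3 MONs and keep 3 replicas across OSDs; for production, plan for at least 3 nodes and 9 disks
  4. Full disks and clock skew are the two most common failures

Next: GlusterFS 10 Distributed File System: Seven Volume Types + Kadalu CSI Integration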
