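NFS Storage in Practice: Single Node + Keepalived Active-Standby + nfs-subdir Dynamic Provisioning

Why NFS is still useful in the K8s era

Many people assume the K8s era calls for "cloud-native storage" such as Rook-Ceph or Longhorn, and that NFS is an "antique". In practice:

NFS remains the most dependable choice for RWX workloads (multiple Pods reading and writing the same data at the same time)
NFS + nfs-subdir-external-provisioner is the "shortest path" to dynamic PVs on K8s: a StorageClass is ready in about 5 minutes
NFS + Keepalived + rsync makes a "poor man's" high-availability setup: no 3 replicas needed, and a single-machine failure recovers in about 30 seconds
In domestic-technology and on-premises deployments, NFS is the fallback that "runs even without Ceph"

Applies to: NFS 4.1 / nfs-subdir-external-provisioner 4.0.18 / Ubuntu 22.04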

1. Single-node NFS server

1.1 Install the server

apt install -y nfs-kernel-server

# Verify
systemctl is-enabled nfs-server
systemctl status nfs-server
cat /proc/fs/nfsd/versions
# +3 +4 +4.1 +4.2

1.2 Shared directories

mkdir -p /home/nfs
chown -R nobody:nogroup /home/nfs
chmod 777 /home/nfs

mkdir -p /nfs
chown -R nobody:nogroup /nfs
chmod 777 /nfs

1.3 Configure /etc/exports

echo "/home/nfs  *(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports
echo "/nfs  *(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports

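Option reference:

*: any network may connect (in production, replace this with a specific subnet)
rw: read-write access
sync: the server replies only after the write has been committed to stable storage
no_root_squash: the client's root user keeps root privileges on the export (convenient, but risky on untrusted networks)
no_subtree_check: disables subtree checking, i.e. the server does not verify that each requested file lies inside the exported subtree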
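
The note about replacing * with a specific subnet can be sketched as a small helper. This is only an illustration: the function name export_line is made up, and the subnet 10.0.0.0/8 is an assumption that should be replaced with the real client network.

```shell
# Hypothetical helper: print an /etc/exports entry limited to one client subnet.
# The option list mirrors the one used in this article.
export_line() {
  # $1 = exported directory, $2 = client subnet in CIDR form
  printf '%s  %s(rw,sync,no_root_squash,no_subtree_check)\n' "$1" "$2"
}

# Example: only clients in 10.0.0.0/8 (assumed subnet) may mount /home/nfs
export_line /home/nfs 10.0.0.0/8
```

The printed line can be appended to /etc/exports in place of the * entry, followed by exportfs -a.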

Apply:

exportfs -a
systemctl restart nfs-server
exportfs -v
showmount -e

1.4 Install the client on all nodes

apt install -y nfs-common
showmount -e <nfs-server-ip>
# Export list for <nfs-server-ip>:
# /home/nfs *

1.5 Test mounting from a client

mount -t nfs <nfs-server-ip>:/home/nfs /mnt
df -h
umount /mnt

1.6 Common client mount failures

# Some machines fail with: mount.nfs: Connection timed out
# Workaround: force NFSv3 over TCP without locking
mount -t nfs -o vers=3,nolock,proto=tcp <nfs-server-ip>:/nfs /mnt/nfs

# On K8s, add the same options to the StorageClass
# (they only apply to PVs provisioned after the change)
kubectl edit storageclass nfs-client

mountOptions:
  - vers=3
  - nolock
  - proto=tcp
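
The fallback above can be scripted so that a node tries the preferred version first and only then drops to NFSv3. The following is a rough sketch, not a tested production script: try_mount and the MOUNT_CMD override are hypothetical names introduced here so the logic can be dry-run without root.

```shell
# Hypothetical helper: try NFSv4.1 first, then fall back to the NFSv3 options from this section.
# MOUNT_CMD may be overridden (e.g. with "true" or "false") for a dry run.
try_mount() {
  # $1 = server:/export, $2 = local mount point
  for opts in "vers=4.1" "vers=3,nolock,proto=tcp"; do
    if ${MOUNT_CMD:-mount} -t nfs -o "$opts" "$1" "$2"; then
      echo "mounted with $opts"
      return 0
    fi
  done
  return 1
}

# Usage (as root): try_mount <nfs-server-ip>:/nfs /mnt/nfs
```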

2. Using NFS storage in K8s

2.1 Referencing NFS directly in a Pod

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - image: redis
        name: redis
        env:
        - name: ALLOW_EMPTY_PASSWORD
          value: "yes"
        volumeMounts:
        - name: redis-persistent-storage
          mountPath: /data
      volumes:
      - name: redis-persistent-storage
        nfs:
          path: /home/nfs
          server: <nfs-server-ip>

2.2 Static PV + PVC

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv            # PersistentVolumes are cluster-scoped, so no namespace
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle   # Recycle is deprecated upstream; Retain is the usual choice
  storageClassName: nfs
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    path: /nfs
    server: <nfs-server-ip>

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: go
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

Force-deleting a PV stuck in Terminating:

kubectl patch pv nfs-pv -p '{"metadata":{"finalizers":null}}'

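2.3 Dynamic StorageClass (recommended)

nfs-subdir-external-provisioner is the usual tool for dynamic NFS provisioning: it watches PVCs and automatically creates a subdirectory on the NFS export for each one.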
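
To make the "one subdirectory per PVC" behaviour concrete, here is a small sketch of how the directory name appears to be built, judging from the directory names seen later in this article: PVC namespace, PVC name and PV name joined with dashes. The helper pv_dir_name and the PV name in the example are made up for illustration.

```shell
# Hypothetical helper: reproduce the subdirectory name the provisioner is expected to create.
pv_dir_name() {
  # $1 = PVC namespace, $2 = PVC name, $3 = PV name
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Example with a made-up PV name
pv_dir_name nfs test-claim pvc-0a1b2c3d
```

Knowing the pattern makes it easier to find a given PVC's data on the NFS server with a plain ls.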

cd /data/softs
tar -xvf nfs-subdir-external-provisioner-4.0.18.tgz
mv nfs-subdir-external-provisioner /data/k8scnf/nfs/
cd /data/k8scnf/nfs/nfs-subdir-external-provisioner/

Edit values.yaml:

image:
  repository: registry.cn-shenzhen.aliyuncs.com/atomic/nfs-subdir-external-provisioner
nfs:
  server: <nfs-server-ip>
  path: /home/nfs

Install:

kubectl create namespace nfs
helm install nfs-subdir-external-provisioner . -n nfs

Set it as the default StorageClass:

kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Uninstall:

helm uninstall nfs-subdir-external-provisioner -n nfs
# Clean up anything Helm left behind, then remove the namespace last
kubectl delete serviceaccount nfs-client-provisioner -n nfs --ignore-not-found
kubectl delete role/leader-locking-nfs-subdir-external-provisioner -n nfs --ignore-not-found
kubectl delete rolebinding/leader-locking-nfs-subdir-external-provisioner -n nfs --ignore-not-found
kubectl delete clusterrolebinding run-nfs-subdir-external-provisioner --ignore-not-found
kubectl delete storageclass nfs-client --ignore-not-found
kubectl delete namespace nfs

2.4 Test dynamic PVs

cat << "EOF" > test-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
  namespace: nfs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
  storageClassName: nfs-client
EOF

cat << "EOF" > test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: nfs
spec:
  containers:
  - name: test-pod
    image: busybox
    command: ["/bin/sh", "-c", "echo SUCCESS > /data/SUCCESS; sleep 3600"]
    volumeMounts:
    - name: nfs-pvc
      mountPath: /data
  volumes:
  - name: nfs-pvc
    persistentVolumeClaim:
      claimName: test-claim
  restartPolicy: Never
EOF

kubectl apply -f test-claim.yaml
kubectl apply -f test-pod.yaml

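# After about a minute, on the NFS server
# (the provisioner above was pointed at /home/nfs)
ls /home/nfs/nfs-test-claim-pvc-*/
# The SUCCESS file should be listed

kubectl delete -f test-pod.yaml
kubectl delete -f test-claim.yaml
# With the chart's default archiveOnDelete: true, the directory is renamed to
# archived-<name> rather than removed; set archiveOnDelete to false for automatic cleanup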

3. Active-standby pair + rsync/lsyncd

3.1 Architecture

┌──────────────┐     keepalived     ┌──────────────┐
│ worker4      │ ←── VRRP VIP ────→ │ worker2      │
│ nfs-server   │   <10.x.x.x>       │ nfs-server   │
│ rsync primary│                    │ rsync standby│
└──────────────┘                    └──────────────┘
       ↓                                   ↓
   /nfs <──── lsyncd + rsync real-time sync ────/nfs

3.2 Install NFS + keepalived on both machines

Both machines run the NFS service (serving the same data), and keepalived decides which of them holds the VIP.

# Run on both machines
apt install -y nfs-kernel-server
mkdir -p /nfs
chown -R nobody:nogroup /nfs
chmod 777 /nfs
echo "/nfs  *(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports
exportfs -a
systemctl restart nfs-server

3.3 keepalived configuration (nopreempt is required)

vrrp_instance VI_1 {
    state BACKUP                  # both nodes start as BACKUP (no MASTER); nopreempt requires this
    interface enp3s0
    virtual_router_id 111
    priority 100                  # on worker4; use 80 on worker2
    advert_int 1
    nopreempt                     # required: a recovered node must not grab the VIP back, which avoids flapping and data loss
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        <nfs-vip>
    }
}

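Health-check script (if NFS is down and cannot be restarted, stop keepalived so the VIP moves to the other node):

#!/bin/bash
# Count running nfsd kernel threads
count=$(ps -C nfsd --no-header | wc -l)
if [ "$count" -eq 0 ]; then
  # Try to bring NFS back once
  systemctl restart nfs-server.service
  sleep 2
  if [ "$(ps -C nfsd --no-header | wc -l)" -eq 0 ]; then
    # Still down: stop keepalived so the VIP is released
    pkill keepalived
  fi
fi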

In an NFS active-standby pair, nfs_check.sh matters more than keepalived's own chk_*.sh scripts: when NFS dies the node must give up the VIP on its own, otherwise client mounts hang.
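
The article does not show how nfs_check.sh gets invoked. One common way is keepalived's vrrp_script / track_script mechanism; the sketch below prints such a fragment. The function name render_nfs_track, the script path /etc/keepalived/nfs_check.sh, the check name chk_nfs and the 2-second interval are all assumptions, not taken from the original setup.

```shell
# Hypothetical helper: print a keepalived fragment that runs the health check periodically.
render_nfs_track() {
  # $1 = path to the health-check script (assumed location)
  cat << EOF
vrrp_script chk_nfs {
    script "$1"
    interval 2        # run the check every 2 seconds
}

# add inside the existing vrrp_instance VI_1 { ... } block:
    track_script {
        chk_nfs
    }
EOF
}

render_nfs_track /etc/keepalived/nfs_check.sh
```

With track_script, a non-zero exit code from the script is also enough to put the instance into FAULT state and release the VIP, so stopping keepalived from inside the script becomes optional.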

3.4 Real-time sync with rsync + lsyncd

lsyncd listens for inotify events and triggers rsync, a common choice for near-real-time file sync on Linux.

apt install -y rsync lsyncd
systemctl enable --now rsync lsyncd

Kernel parameter tuning (required; otherwise lsyncd runs out of inotify watches):

sysctl -w fs.inotify.max_queued_events="99999999"
sysctl -w fs.inotify.max_user_watches="99999999"
sysctl -w fs.inotify.max_user_instances="65535"

cat >> /etc/sysctl.conf << "EOF"
fs.inotify.max_queued_events=99999999
fs.inotify.max_user_watches=99999999
fs.inotify.max_user_instances=65535
EOF
sysctl -p
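
To judge whether the limits are large enough, it can help to compare them with the size of the tree being watched. The sketch below assumes that lsyncd needs roughly one inotify watch per directory; watches_needed is a hypothetical helper name.

```shell
# Hypothetical helper: count the directories below a tree.
# Assumption: lsyncd places roughly one inotify watch on each directory.
watches_needed() {
  find "$1" -type d | wc -l
}

# Compare against the current kernel limit when the tree exists on this host
if [ -d /nfs ]; then
  echo "directories under /nfs: $(watches_needed /nfs)"
  echo "max_user_watches:       $(cat /proc/sys/fs/inotify/max_user_watches)"
fi
```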

rsync daemon configuration (/etc/rsyncd.conf):

log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsyncd.lock
port = 873
[backup]
path = /nfs
comment = sync nfs from client
uid = root
gid = root
ignore errors
use chroot = no
read only = no
list = no
max connections = 200
timeout = 600
auth users = root
secrets file = /etc/rsync.password
hosts allow = 10.0.0.0/8

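Password files:

# Daemon side: user:password pairs, referenced by "secrets file" in rsyncd.conf
echo 'root:{{NFS_RSYNC_PASSWORD}}' > /etc/rsync.password
chmod 600 /etc/rsync.password

# Client side: password only, referenced by password_file in the lsyncd config
echo "{{NFS_RSYNC_PASSWORD}}" > /etc/rsyncd.password
chmod 600 /etc/rsyncd.password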

The real password is replaced here by the placeholder {{NFS_RSYNC_PASSWORD}}; the original is recorded in _drafts/私人笔记.md.
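
rsync refuses password and secrets files that other users can access, which is why the chmod 600 step matters. A quick way to check this is sketched below; secret_is_private is a hypothetical helper, and it relies on GNU stat as shipped with Ubuntu.

```shell
# Hypothetical helper: succeed only if "other" has no permission bits on the file.
secret_is_private() {
  mode=$(stat -c '%a' "$1") || return 1
  case "$mode" in
    *0) return 0 ;;   # last octal digit 0: no access for "other"
    *)  return 1 ;;
  esac
}

# Check the two files used in this section, if they exist on this host
for f in /etc/rsync.password /etc/rsyncd.password; do
  if [ -e "$f" ]; then
    if secret_is_private "$f"; then echo "$f ok"; else echo "$f is accessible by others"; fi
  fi
done
```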

lsyncd configuration (worker4 pushes /nfs to worker2):

settings {
  logfile = "/var/log/lsyncd/lsyncd.log",
  statusFile = "/var/log/lsyncd/lsyncd.status",
  inotifyMode = "CloseWrite",
  maxProcesses = 8,
}
sync {
  default.rsync,
  source = "/nfs",
  target = "root@<worker2-ip>::backup",
  delete = true,
  delay = 1,
  exclude = {".*"},
  rsync = {
    binary = "/usr/bin/rsync",
    archive = true,
    compress = false,
    verbose = true,
    password_file = "/etc/rsyncd.password",
    _extra = {"--bwlimit=20000"}
  }
}

The configuration on worker2 is symmetric: point target at worker4.

3.5 Verification

# On worker4
mount -t nfs <nfs-vip>:/nfs /mnt
cd /mnt
echo "test" > test.txt
cd ../
umount /mnt

# On worker2
ls /nfs
# test.txt should appear

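3.6 Simulating a failure

Stop keepalived on worker4 → the VIP moves to worker2
Start keepalived on worker4 again (with nopreempt the VIP stays on worker2)
Stop keepalived on worker2 → the VIP moves back to worker4
Pinging the VIP from other machines on the internal network keeps succeeding throughout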
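
While running through these steps it is handy to see which node currently holds the VIP. The sketch below parses the output of ip -o -4 addr show; holds_vip is a hypothetical helper, and it reads from stdin so the parsing can be tried without touching a real interface.

```shell
# Hypothetical helper: succeed if the given address appears in `ip -o -4 addr show` output.
holds_vip() {
  # $1 = VIP to look for; stdin = output of `ip -o -4 addr show`
  awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 } END { exit !found }'
}

# Usage on worker4 or worker2:
#   ip -o -4 addr show | holds_vip <nfs-vip> && echo "this node holds the VIP"
```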

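4. Persisting Ingress logs

The ingress-nginx container is stateless, and its access log goes to stdout (inside the container) by default. In production it is often preferable to write the log to NFS so that ELK or Loki can collect it.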

cat << "EOF" > /data/k8scnf/ingress-nginx/ingress-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ingress-nfs
  namespace: ingress-nginx
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client
EOF

kubectl apply -f /data/k8scnf/ingress-nginx/ingress-nfs.yaml

Edit the ingress-nginx-controller:

kubectl edit -n ingress-nginx DaemonSet/ingress-nginx-controller

volumeMounts:
  - mountPath: /var/log/nginx/
    name: ingress-nfs
volumes:
  - name: ingress-nfs
    persistentVolumeClaim:
      claimName: ingress-nfs

Edit the ConfigMap to change the log format:

kubectl edit -n ingress-nginx configmaps/ingress-nginx-controller

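A directory named ingress-nginx-ingress-nfs-pvc-xxx/ is created automatically on the NFS server (the prefix is the PVC's namespace), and the access logs of all ingress-controller Pods land there.

Single point of failure: NFS itself is a single point; if it goes down, ingress logging hangs. For production, put it behind a keepalived VIP (the setup in this article).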

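5. Troubleshooting

Symptom	Cause	Fix
`mount.nfs: Connection timed out`	Firewall	Open ports 2049/111/20048
`mount.nfs: Operation not permitted`	NFS version mismatch	Add `-o vers=3,nolock,proto=tcp`
PVC stays Pending	nfs-common missing on a node	Run `apt install nfs-common` on every node
nfs-subdir fails to install	The chart's image cannot be pulled	Switch to a domestic mirror and push the image to a private Harbor
VIP flaps between the two nodes	nopreempt not set	Add `nopreempt` to keepalived.conf
lsyncd sync lags badly	Not enough inotify watches	Tune the `fs.inotify.*` parameters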
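
For the firewall row, a quick reachability probe from a client can narrow things down before touching firewall rules. This is a minimal sketch that only tests TCP, using bash's /dev/tcp and the coreutils timeout command; port_open and check_nfs_ports are hypothetical helper names.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 2 seconds.
port_open() {
  # $1 = host, $2 = TCP port
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Probe the three ports named in the table above (TCP only)
check_nfs_ports() {
  for port in 111 2049 20048; do
    if port_open "$1" "$port"; then
      echo "$1:$port open"
    else
      echo "$1:$port closed or filtered"
    fi
  done
}

# Usage: check_nfs_ports <nfs-server-ip>
```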

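6. The state of NFS in 2024

When this article was written in 2016, NFS was still the "poor man's first choice" for dynamic storage on K8s. Looking back 8 years later (2024), the protocol itself has gained NFS 4.2, pNFS and other enhancements, but the bigger change is in its niche: cloud-native storage such as Ceph, Longhorn and OpenEBS has eaten into NFS's share of K8s deployments. The current picture follows.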

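6.1 The NFS protocol itself: 4.2 and pNFS

NFS version	Released	Key features	Status in 2024
NFSv3	1995	Asynchronous writes, TCP support	Still used for legacy compatibility
NFSv4.0	2003	Stateful protocol, compound operations, Kerberos	Linux default
NFSv4.1	2010	pNFS (parallel NFS), sessions	Mainstream
NFSv4.2	2016	Server-side copy, sparse files, space reservation, application data blocks (ADBs)	Standard in 2024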

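The core idea of pNFS is to separate the metadata path from the data path: the metadata server only hands out layouts, while clients move data directly to and from multiple storage nodes, which in theory aggregates bandwidth in parallel (similar to striping in CephFS).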

# Mount with NFSv4.1, the minimum version for pNFS (the server must support it)
mount -t nfs -o nfsvers=4.1 <server>:/export /mnt
# or use nfsvers=4.2 to enable the 4.2 features
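Real-world adoption: enterprise storage (NetApp / Dell EMC Isilon / IBM Spectrum Scale) has supported pNFS for a long time, but the Linux kernel's pNFS client only became stable in the 5.x series, and many K8s nodes still mount plain NFSv3/4.1 without pNFS layouts.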
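
To see what a node actually negotiated, the mount options in /proc/mounts can be inspected. The sketch below prints the protocol version for each NFS mount; nfs_versions is a hypothetical helper, and the version alone does not prove that pNFS layouts are in use.

```shell
# Hypothetical helper: print "<mountpoint> <version>" for every NFS mount.
nfs_versions() {
  # stdin = lines in /proc/mounts format
  awk '$3 == "nfs" || $3 == "nfs4" {
    n = split($4, opt, ",")
    for (i = 1; i <= n; i++)
      if (opt[i] ~ /^vers=/) print $2, substr(opt[i], 6)
  }'
}

# Inspect the mounts on this host, if /proc/mounts is available
if [ -r /proc/mounts ]; then
  nfs_versions < /proc/mounts
fi
```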

6.2 The state of nfs-subdir-external-provisioner

Version	Period	Status
v3.x	2019-2021	Legacy API
v4.0.x	2022-2023	Mainstream at the time
v4.0.18+	2023-2024	Currently recommended
v4.0.20+	2024	Compatible with k8s 1.28+

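Key points about v4.x:

It remains an external provisioner rather than a CSI (Container Storage Interface) driver; the CSI route is the separate NFS CSI driver for K8s
Supports Kubernetes 1.25+ (the older nfs-client-provisioner from the external-storage repository is deprecated)
The Helm chart is served from a standard Helm repository: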

# Recommended install as of 2024 (Helm 3.8+)
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<nfs-server-ip> \
  --set nfs.path=/home/nfs \
  -n nfs-client --create-namespace

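6.3 Alternatives compared (2024)

Option	Type	Best for	Complexity	Performance	HA
NFS + nfs-subdir	File (RWX)	Small to mid scale, traditional workloads	Low	Medium	Self-built with keepalived
Longhorn	Block (RWX via NFS)	First choice for cloud-native K8s	Medium	Medium	Built in
Rook-Ceph	Block/file/object	Large scale, mixed workloads	High	High	Built in (multiple replicas)
OpenEBS	Block/file	Domestic-stack deployments, CSI-friendly	Medium	Medium	Optional
CubeFS	File/object/block	China-domestic cloud-native	Medium	High	Built in
Vitess / TiDB	Database-specific	Persistent databases	High	High	Built in
Local Path / HostPath	Local disk	Single node / StatefulSet	Low	Highest	None
AWS EBS / GCP PD	Cloud disk	Public cloud	Low	High	Built in
NFS CSI driver for K8s	Official NFS CSI driver	NFS 4.1+	Low	Medium	Self-built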

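6.4 Selection advice (2024)

Scenario	Recommendation
CI/CD scratch builds, config shared by many Pods	NFS + nfs-subdir is still the best fit (simple RWX)
Production databases (MySQL / PG / Redis)	Cloud disk / Rook-Ceph RBD / local disk (RWO performance)
AI / big data	CephFS / Lustre / JuiceFS (high bandwidth)
China-domestic stack	CubeFS / Inspur AS13000
Small team, simple needs	Longhorn (one-command install, friendly UI)
On-premises + Xinchuang	NFS + nfs-subdir (native to Linux, compatible with domestic OSes)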

6.5 Hands-on: installing Longhorn in one step

# Longhorn is cloud-native block storage originally developed by Rancher
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.6.4/deploy/longhorn.yaml

# Default StorageClass
kubectl get sc longhorn

# Create a StatefulSet backed by Longhorn
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg
spec:
  serviceName: pg
  replicas: 1
  selector:
    matchLabels: { app: pg }
  template:
    metadata:
      labels: { app: pg }
    spec:
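      containers:
      - name: pg
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD        # required by the postgres image; placeholder value
          value: "<pg-password>"
        - name: PGDATA                   # use a subdirectory so initdb does not trip over lost+found
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data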
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 10Gi
EOF

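Longhorn takes care of 3 replicas, cross-node synchronization, online expansion and backups on its own, so none of the keepalived + rsync plumbing that NFS needs is required.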

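6.6 Where NFS is still "useful" in 2024

New projects tend to pick Longhorn or Ceph first, but NFS remains the best choice in these cases:

Sharing data across namespaces or clusters (Longhorn is single-cluster by default)
Mixed Windows + Linux clients (NFS is a widely supported protocol)
AI training where the dataset lives on NFS and many training Pods read it at once (RWX)
Traditional workloads such as bank ERP and government office systems (which mandate NFS or SMB)
Domestic-technology / Xinchuang projects (NFS ships with the Linux kernel and depends on no commercial software)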

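6.7 In one sentence

The 2016-era NFS 4.1 + nfs-subdir + keepalived setup still works in 2024, but for new projects Longhorn (block + simple) or Ceph (large scale + mixed workloads) is the suggested starting point.
NFS is not "obsolete"; it has moved from the default choice to a fallback, to be used where the scenario fits.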

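7. Summary

NFS still has a place in K8s storage selection:

A dynamic StorageClass (nfs-subdir) makes NFS as easy to consume as a cloud disk
keepalived + rsync + lsyncd make up a "poor man's" high-availability setup
The single-point risk of NFS remains; for critical workloads consider Ceph / Longhorn
New options in 2024: Longhorn (first choice) / Rook-Ceph (large scale) / the China-domestic CubeFS (Xinchuang)

Next: Rook-Ceph 1.17 distributed storage: RBD block storage + CephFS + OSD/MON. Recommended for 2024: read the official Longhorn documentation first (one-command install, 3 replicas + backups), then move on to Ceph.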