Loki + Promtail logging stack: helm deployment + custom YAML + troubleshooting

Log aggregation with Loki 2.9.3, collection with a Promtail DaemonSet, helm deployment, a hand-written YAML alternative, and a fix for container permission errors

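Three ways to collect logs in K8s

Method                           Resource usage               Best suited for
DaemonSet collection (default)   Lowest                       Small and medium clusters
Sidecar collection               Higher (one agent per Pod)   Large clusters, multi-tenant isolation
Application pushes directly      Depends on the application   Custom requirements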

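This post deploys Loki + Promtail, the most common logging stack in the K8s era. It is much lighter than ELK (the rough figure is 10x) because Loki indexes only labels rather than the full log text, and its query language, LogQL, resembles PromQL.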

Versions covered: Loki 2.9.3 / Promtail / K8s 1.28.5


1. Loki architecture

                 ┌──────────────────┐
                 │     Grafana      │  query
                 └─────────┬────────┘
                           │ LogQL
                 ┌─────────▼────────┐
                 │       Loki       │  storage + index
                 │  (write/read/    │
                 │   backend/       │
                 │   gateway)       │
                 └─────────┬────────┘
                           │ push
        ┌──────────────────┴──────────────────┐
        │                                     │
   ┌────▼────┐                           ┌────▼────┐
   │Promtail │ ← tails /var/log/pods/    │Promtail │
   │ (Node1) │                           │ (Node2) │
   └─────────┘                           └─────────┘
        │                                     │
   /var/log/pods/                        /var/log/pods/
   /var/lib/docker/containers/           /var/lib/docker/containers/

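Three building blocks:

  • Promtail (or Grafana Agent): a DaemonSet with one Pod per node that tails container logs
  • Loki: storage + querying (in its simple scalable mode it is split into write / read / backend components behind a gateway; the loki-stack chart used below runs it as a single loki-0 Pod)
  • Grafana: visual querying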
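The "push" arrow in the diagram is an HTTP POST to Loki's /loki/api/v1/push endpoint. As a rough sketch of what Promtail sends, the helper below builds a minimal JSON body by hand. The function name loki_push_payload and the app=demo label are made up for illustration, and the sketch does no JSON escaping.

```shell
# Build the JSON body for Loki's POST /loki/api/v1/push endpoint.
# Arguments: label name, label value, timestamp (Unix epoch in nanoseconds), log line.
# No JSON escaping is done, so keep quotes and backslashes out of the inputs.
loki_push_payload() {
  printf '{"streams":[{"stream":{"%s":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$2" "$3" "$4"
}

# Example (needs a reachable Loki, e.g. through a port-forward to port 3100):
# loki_push_payload app demo "$(date +%s)000000000" "hello from curl" |
#   curl -s -H 'Content-Type: application/json' --data-binary @- \
#     http://localhost:3100/loki/api/v1/push
```

Each stream is identified by its label set, and every entry is a [timestamp, line] pair. This is why labels, not log content, end up in the index.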

2. Deploying with helm (recommended)

2.1 Add the repository

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Pull and unpack the chart
helm pull grafana/loki-stack
tar xvf loki-stack-2.10.0.tgz
cd loki-stack

2.2 Back up the default values

cp values.yaml values-prod.yaml

2.3 Edit values-prod.yaml

loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: nfs-client
    accessModes:
      - ReadWriteOnce
    size: 10Gi

promtail:
  enabled: true

grafana:
  enabled: true
  service:
    type: NodePort
  persistence:
    enabled: true
    storageClassName: nfs-client
    accessModes:
      - ReadWriteOnce
    size: 10Gi

2.4 Resolve conflicting RBAC objects

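# If an earlier install left cluster-wide objects behind, helm fails with:
# ClusterRole "loki-promtail" in namespace "" exists and cannot be imported into the current release
kubectl get clusterrole | grep loki

# Delete the leftovers so the release can recreate them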
kubectl delete ClusterRole loki-promtail
kubectl delete ClusterRole loki-grafana-clusterrole
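Deleting is the quickest fix. An alternative, if you would rather keep the existing objects, is to mark them as owned by the release: Helm 3.2 and later adopts a resource that carries the app.kubernetes.io/managed-by label and the meta.helm.sh/release-name and meta.helm.sh/release-namespace annotations. The sketch below only prints the commands. The helper name adopt_clusterrole_cmds is made up, and the release name loki and namespace logging are taken from the deploy step in this post.

```shell
# Print (not run) the kubectl commands that mark an existing ClusterRole as
# owned by a helm release. Arguments: ClusterRole name, release name, namespace.
adopt_clusterrole_cmds() {
  printf 'kubectl label clusterrole %s app.kubernetes.io/managed-by=Helm --overwrite\n' "$1"
  printf 'kubectl annotate clusterrole %s meta.helm.sh/release-name=%s meta.helm.sh/release-namespace=%s --overwrite\n' "$1" "$2" "$3"
}

adopt_clusterrole_cmds loki-promtail loki logging
```

Review the printed commands before running them; a matching ClusterRoleBinding may need the same treatment.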

2.5 Deploy

kubectl create ns logging

helm upgrade --install loki . \
  -f values-prod.yaml \
  -n logging

kubectl logs -f -n logging loki-0

2.6 Verify

kubectl get pod -n logging
# loki-0          1/1     Running
# loki-promtail-xxx      1/1     Running
# loki-grafana-xxx       1/1     Running
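On a cluster with many nodes the Pod list gets long, so it can help to print only the Pods that need attention. The filter below is a small sketch (the name unhealthy_pods is made up) that reads the plain kubectl get pod table and keeps Pods that are not Running or whose READY count is incomplete.

```shell
# Read `kubectl get pod` output on stdin and print the name of every Pod that
# is not Running or whose READY count is incomplete (for example 1/2).
unhealthy_pods() {
  awk '
    $1 == "NAME" || NF < 3 { next }   # skip the header and blank lines
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Usage:
# kubectl get pod -n logging | unhealthy_pods
```

An empty result means every Pod in the namespace is Running and fully ready. Completed Job Pods would also be listed, which does not matter for this stack.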

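3. Deploying from your own YAML (more control)

If you would rather not pull in everything the helm chart bundles, you can run Loki directly from a StatefulSet plus a ConfigMap.

3.1 Prepare the image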

docker pull grafana/loki:2.9.3
docker tag grafana/loki:2.9.3 <harbor>/base/grafana/loki:2.9.3
docker push <harbor>/base/grafana/loki:2.9.3
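The Promtail image used later in this post needs the same treatment if the nodes can only reach the private registry. A small sketch that prints the pull/tag/push commands for both images follows; harbor.example.com is a placeholder for your own registry host and mirror_ref is a made-up helper name.

```shell
# Compose the private-registry reference for an upstream image.
# Arguments: registry host, upstream image reference.
mirror_ref() {
  printf '%s/base/%s\n' "$1" "$2"
}

# Print the commands for both images (harbor.example.com is a placeholder).
for image in grafana/loki:2.9.3 grafana/promtail:2.9.3; do
  echo "docker pull $image"
  echo "docker tag $image $(mirror_ref harbor.example.com "$image")"
  echo "docker push $(mirror_ref harbor.example.com "$image")"
done
```

If you mirror the images this way, remember to point the image fields in the manifests at the mirrored references.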

3.2 Full YAML

/data/k8scnf/loki.yaml (key parts only):

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: loki
  namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki
  namespace: logging
  labels:
    app: loki
data:
  loki.yaml: |
    auth_enabled: false
    ingester:
      chunk_idle_period: 3m
      chunk_block_size: 262144
      chunk_retain_period: 1m
      max_transfer_retries: 0
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      wal:
        enabled: true
        dir: /data/wal
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 8h
    schema_config:
      configs:
      - from: "2024-01-19"
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    server:
      http_listen_port: 3100
    storage_config:
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
    chunk_store_config:
      max_look_back_period: 0s
    table_manager:
      retention_deletes_enabled: true
      retention_period: 48h
    compactor:
      working_directory: /data/loki/boltdb-shipper-compactor
      shared_store: filesystem
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: logging
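spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  serviceName: loki
  # apps/v1 requires a selector that matches the Pod template labels
  selector:
    matchLabels:
      app: loki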
  template:
    metadata:
      labels:
        app: loki
    spec:
      serviceAccountName: loki
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      initContainers:
        - name: fix-permissions
          image: busybox:latest
          securityContext:
            privileged: true
            runAsGroup: 0
            runAsNonRoot: false
            runAsUser: 0
          command:
          - sh
          - -c
          - >-
            id;
            mkdir -p /data/loki;
            chown 10001:10001 /data -R;
            ls -la /data/
          volumeMounts:
          - mountPath: /data
            name: loki-storage
      containers:
        - name: loki
          image: grafana/loki:2.9.3
          args:
            - -config.file=/etc/loki/config/loki.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/loki/config/loki.yaml
              subPath: loki.yaml
            - name: loki-storage
              mountPath: /data
          ports:
            - name: http-metrics
              containerPort: 3100
          livenessProbe:
            httpGet:
              path: /ready
              port: http-metrics
            initialDelaySeconds: 45
          readinessProbe:
            httpGet:
              path: /ready
              port: http-metrics
            initialDelaySeconds: 45
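          securityContext:
            readOnlyRootFilesystem: true
      volumes:
        # Backs the "config" volumeMount above with the loki ConfigMap
        - name: config
          configMap:
            name: loki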
  volumeClaimTemplates:
    - metadata:
        name: loki-storage
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 3Gi
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: logging
spec:
  type: ClusterIP
  ports:
    - port: 3100
      name: http-metrics
  selector:
    app: loki
---
apiVersion: v1
kind: Service
metadata:
  name: loki-outer
  namespace: logging
spec:
  type: NodePort
  ports:
    - port: 3100
      nodePort: 32537
  selector:
    app: loki

3.3 Deploy

kubectl create ns logging
kubectl apply -f /data/k8scnf/loki.yaml

4. Deploying Promtail

Promtail is Loki's "client"; it runs as a DaemonSet with one Pod per node.

4.1 ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_name]
            target_label: __service__
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: __host__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - action: replace
            source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
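          - action: replace
            source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Tell Promtail which files to tail for each discovered container
          - action: replace
            source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            replacement: /var/log/pods/*$1/*.log
            target_label: __path__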
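Promtail only tails files named by the special __path__ label, so each discovered Pod has to be mapped to a glob under /var/log/pods. The kubelet lays logs out as /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log. The helper below is a sketch (pod_log_glob is a made-up name) that builds the glob from a Pod UID and container name, which is handy for checking on a node that the files are really there.

```shell
# Build the glob that matches one container's log files on a node.
# Arguments: Pod UID, container name.
pod_log_glob() {
  printf '/var/log/pods/*%s/%s/*.log\n' "$1" "$2"
}

# Usage on a node (the UID comes from `kubectl get pod <pod> -o jsonpath='{.metadata.uid}'`):
# ls -l $(pod_log_glob "$uid" "$container")
```

If the ls finds nothing, Promtail will not find anything either, whatever the relabel rules say.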

4.2 DaemonSet + RBAC

apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail
rules:
  - apiGroups: [""]
    resources: [nodes, nodes/proxy, services, endpoints, pods]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail
subjects:
  - kind: ServiceAccount
    name: promtail
    namespace: logging
roleRef:
  kind: ClusterRole
  name: promtail
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.9.3
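          args:
            - -config.file=/etc/promtail/promtail.yaml
          env:
            # Promtail keeps only targets whose __host__ label matches its
            # hostname, so expose the node name as HOSTNAME
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName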
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: docker
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: pods
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail
        - name: docker
          hostPath:
            path: /var/lib/docker/containers
        - name: pods
          hostPath:
            path: /var/log/pods

5. Verification and queries

5.1 Loki health

kubectl -n logging port-forward svc/loki 3100:3100
# In a second terminal (or a browser), request:
curl http://localhost:3100/ready
# A response of "ready" means Loki is up
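Loki can take a little while after start-up before /ready answers "ready", so scripts are better off polling than checking once. A minimal sketch follows; the function name wait_for_loki and its defaults of 30 attempts, 2 seconds apart, are arbitrary choices.

```shell
# Poll Loki's /ready endpoint until it answers "ready" or the attempts run out.
# Arguments: base URL, max attempts (default 30), delay in seconds (default 2).
wait_for_loki() {
  url=$1 attempts=${2:-30} delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(curl -s "$url/ready")" = "ready" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage, with the port-forward from this section running:
# wait_for_loki http://localhost:3100 && echo "Loki is ready"
```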

5.2 Add Loki as a Grafana data source

Grafana → Configuration → Data sources → Add data source → Loki

URL: http://loki:3100

5.3 Query logs

Grafana Explore → select the Loki data source → run a query:

{namespace="kube-system"}
{namespace="default"} |= "error"
{app="my-app"} | json | line_format "{{.msg}}"

{job="kubernetes-pods"} |= "ERROR" returns every log line that contains ERROR, provided your streams carry a job label with that value.
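The same queries work without Grafana through Loki's HTTP API, which is useful when you need to tell a Grafana problem from a Loki problem. The sketch below wraps GET /loki/api/v1/query_range. The name loki_query is made up, and when no start or end is given Loki searches the last hour.

```shell
# Run a LogQL query against Loki's HTTP API and print the raw JSON response.
# Arguments: Loki base URL, LogQL query, max number of lines (default 100).
loki_query() {
  curl -s -G "$1/loki/api/v1/query_range" \
    --data-urlencode "query=$2" \
    --data-urlencode "limit=${3:-100}"
}

# Usage, with a port-forward to Loki on localhost:3100:
# loki_query http://localhost:3100 '{namespace="default"} |= "error"' 20
```

The -G flag makes curl send the url-encoded fields as query-string parameters, so braces and quotes in the LogQL expression need no manual escaping.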


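6. Common problems

6.1 loki-0 1/2 CrashLoopBackOff

This is a container permission problem: Loki runs as a non-root user (UID 10001 in the manifest from section 3.2) and cannot write to its data volume. Add an initContainer to the YAML that fixes the ownership: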

initContainers:
  - name: fix-permissions
    image: busybox:latest
    securityContext:
      privileged: true
      runAsGroup: 0
      runAsUser: 0
    command:
    - sh
    - -c
    - >-
      mkdir -p /data/loki;
      chown 10001:10001 /data -R;
      ls -la /data/
    volumeMounts:
    - mountPath: /data
      name: storage  # must match your volume name (loki-storage in section 3.2)

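In my testing 2.7.3 hit this bug while 2.6.1 ran fine. I suspect a change in Loki's configuration affects the container start order. The workaround I used was to switch to 2.6.1.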
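Before blaming the Loki version, it is worth confirming that ownership really is the cause. The sketch below reads the numeric owner of a directory in a portable way (both helper names are made up). The same check could guard the chown in the initContainer so that a large volume is not re-chowned on every start.

```shell
# Print the numeric owner UID of a directory (portable: parses `ls -nd`).
owner_uid() {
  ls -nd "$1" | awk '{ print $3 }'
}

# Succeed when the directory (second argument) is owned by the UID (first argument).
owned_by() {
  [ "$(owner_uid "$2")" = "$1" ]
}

# Usage inside the initContainer command:
# owned_by 10001 /data || chown -R 10001:10001 /data
```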

6.2 no such file or directory: /var/log/pods/...

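The Docker data directory was moved at some point, so Promtail can no longer find the container logs.

Fix: change Promtail's /var/lib/docker/containers mount to the actual path, or reinstall Docker at the default path.

Temporary workaround: restart the failing Pod: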

kubectl delete po <pod> -n kube-system
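To find the actual path for the permanent fix, ask Docker for its data root and append containers. A small sketch, where containers_dir is a made-up helper and /data/docker stands in for whatever your node reports:

```shell
# Derive the container log directory from Docker's data root.
# Argument: the data root (the default location is /var/lib/docker).
containers_dir() {
  printf '%s/containers\n' "${1%/}"
}

containers_dir /data/docker   # prints /data/docker/containers

# On a node with Docker installed:
# containers_dir "$(docker info --format '{{.DockerRootDir}}')"
```

Use the result for both the hostPath and the mountPath of the docker volume in the Promtail DaemonSet, on the assumption that the entries under /var/log/pods are symlinks with absolute targets that must resolve inside the container as well.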

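6.3 Promtail keeps reporting Buffer filling up

By default Promtail drops logs once its buffer is full. Tune the client batch settings; the values shown are Promtail's defaults, so raise them to suit your log volume: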

positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576

6.4 Log retention is too short

limits_config:
  retention_period: 168h  # 7 days; only enforced when the compactor runs with retention_enabled: true
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
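Both settings take hour-based durations, and the table manager expects its retention_period to be a multiple of the index period (24h in the schema used here). Starting from whole days keeps that condition satisfied. A tiny sketch, with a made-up helper name:

```shell
# Convert a retention in whole days to the hour-based duration used in the config.
retention_hours() {
  printf '%dh\n' "$(( $1 * 24 ))"
}

retention_hours 7    # prints 168h
retention_hours 30   # prints 720h
```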

6.5 RBAC conflict during helm deployment

kubectl delete ClusterRole loki-promtail
kubectl delete ClusterRole loki-grafana-clusterrole

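7. Summary

Loki + Promtail has become the de facto standard lightweight logging stack for K8s:

  1. DaemonSet collection uses the fewest resources and suits small and medium clusters
  2. Loki's four components (write / read / backend / gateway) allow horizontal scaling
  3. LogQL resembles PromQL, so anyone used to Prometheus picks it up quickly
  4. Container permission problems are fixed with an initContainer
  5. Retention is 48h in the config shown here; 7 to 30 days is a better fit for production

Next: K8s in practice: Deployment/StatefulSet/DaemonSet differences + probes + hostNetwork scheduling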
