Arthas in Practice: A Go-To Tool for Troubleshooting Java in Production

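Why this article: in production, reproducing an intermittent problem can take days, and restarting the service destroys the live state you wanted to inspect. Arthas shows method arguments, return values, exceptions, flame graphs, and thread stacks live, with no restart and no code change. That is its biggest difference from jstack and jmap.
Intended readers: operations and backend engineers who need to troubleshoot Java applications in K8s or containers.
Prerequisites: JVM basics (heap, threads), common Linux commands, and basic Docker/K8s operations.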

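1. What Arthas is

Arthas is an open-source Java diagnostic tool from Alibaba. It connects to a running JVM through the Java Attach API, so without restarting the application or changing code you get:

Dimension	What you can see
Application overview	Real-time dashboard (threads / memory / GC / runtime)
Method-level observation	Arguments, return values, thrown exceptions, call cost
Class loading	Decompilation, class loader inspection, class search
Performance	profiler flame graphs, top-N threads by CPU
Offline	heap dump, thread snapshot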
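
To make the Attach API part concrete, here is a minimal sketch that lists the JVMs visible to the current user through the JDK's own com.sun.tools.attach classes (module jdk.attach, so it needs a JDK rather than a bare JRE). JvmLister is a hypothetical name used only for illustration, and the sketch makes no claim about how arthas-boot discovers processes internally.

```java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Arthas.
final class JvmLister {

    /** Returns one "pid displayName" row per JVM visible through the Attach API. */
    static List<String> listJvms() {
        List<String> rows = new ArrayList<>();
        for (VirtualMachineDescriptor vm : VirtualMachine.list()) {
            rows.add(vm.id() + " " + vm.displayName());
        }
        return rows;
    }

    public static void main(String[] args) {
        listJvms().forEach(System.out::println);
    }
}
```

A JVM that does not show up in this list for your user is unlikely to be attachable by that user, which is worth checking before suspecting Arthas itself.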

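When to use: Arthas is the first choice when production shows a problem you can feel but the logs do not explain, such as an endpoint that is slow with no visible cause, memory that occasionally spikes, threads that end up BLOCKED for no obvious reason, or a third-party jar whose method names do not match what you expect.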

2. Quick start

2.1 Three deployment scenarios

Scenario	Command skeleton
Run directly	`curl -O https://arthas.aliyun.com/arthas-boot.jar && java -jar arthas-boot.jar`
Docker container	`docker cp arthas-boot.jar <container>:/ && docker exec -it <container> bash`
K8s Pod	`kubectl cp arthas-boot.jar <ns>/<pod>:/ && kubectl exec -it -n <ns> <pod> -- bash`

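Why this approach for Docker/K8s: once started, Arthas attaches to the target JVM, which requires both to be in the same PID namespace. Copying the jar straight into the container is the most reliable route: nothing has to be preinstalled in the image, and the Pod spec needs no sidecar.

After startup, arthas-boot scans the Java processes in the current container or host and lets you pick one interactively: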

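[INFO] Found existing java process, please choose one and input the serial number
* [1]: 12345 /jar/your-app.jar
1
[INFO] Attach process 12345 success.

Type the serial number and press Enter. After a successful attach you land at the [arthas@12345]$ prompt, where the number is the PID of the target process.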

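2.2 In K8s, pin the HPA replica count first

While the HPA scales a production workload, the pod you are attached to may be the one that gets removed. Temporarily set minReplicas equal to maxReplicas, both at the current replica count, so the replica count cannot change:

# Freeze the HPA at the current replica count <n>
kubectl patch hpa <hpa-name> -n <ns> \
  -p '{"spec":{"minReplicas":<n>, "maxReplicas":<n>}}'

# Restore the original bounds when done (1 and 2 are example values)
kubectl patch hpa <hpa-name> -n <ns> \
  -p '{"spec":{"minReplicas":1, "maxReplicas":2}}'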

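2.3 Offline environments

# On a machine with internet access: download the full package
curl -Lo arthas.zip 'https://arthas.aliyun.com/download/latest_version?mirror=aliyun'
# After copying arthas.zip to the offline host: unpack and start
unzip arthas.zip -d arthas
cd arthas && java -jar arthas-boot.jar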

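3. Core command cheat sheet

Ordered by how often you reach for each one:

3.1 Dashboard commands

dashboard              # overall panel; refreshes every 5 s by default
dashboard -n 1         # refresh once and exit (friendlier in scripts)
thread                 # all threads, busiest CPU first
thread -n 3            # top 3 threads by CPU
thread -b              # find the thread blocking others (synchronized locks only)
thread <tid>           # stack of a single thread
jvm                    # JVM info (startup arguments, class loading)
memory                 # heap / non-heap usage by region
heapdump /tmp/h.bin    # comparable to jmap -dump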
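
To show the kind of situation thread -b reports, the sketch below deliberately parks one thread on a synchronized monitor held by another, then asks the JDK's ThreadMXBean who owns the lock. BlockerDemo and its thread names are hypothetical, and this only illustrates the underlying JVM data; it is not Arthas's implementation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class: reproduces the situation that `thread -b` reports.
final class BlockerDemo {

    /** Returns {id of the thread holding the monitor, lock owner id seen by the blocked thread}. */
    static long[] findBlocker() {
        final Object lock = new Object();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (lock) {
                lockHeld.countDown();
                try {
                    release.await();            // keep the monitor until told to let go
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (lock) { /* entered only after holder exits */ }
        }, "waiter");

        try {
            holder.start();
            lockHeld.await();
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);               // wait until waiter is stuck on the monitor
            }
            ThreadInfo info = ManagementFactory.getThreadMXBean().getThreadInfo(waiter.getId());
            long ownerId = info.getLockOwnerId();
            release.countDown();
            holder.join();
            waiter.join();
            return new long[] { holder.getId(), ownerId };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] ids = findBlocker();
        System.out.println("holder=" + ids[0] + " lockOwner=" + ids[1]);
    }
}
```

The two ids match because the blocked thread's lock owner is exactly the thread holding the monitor, which is the thread you would want to inspect next with thread <tid>.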

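3.2 Method observation commands (bytecode enhancement)

# Method statistics, reported every 5 seconds
monitor -c 5 com.x.Service method

# Arguments and return value
watch com.x.Service method '{params, returnObj}' -x 3

# Exceptions
watch com.x.Service method 'throwExp' -x 3 --exception

# Call cost and call path
trace com.x.Service method

# Call stack leading to this method
stack com.x.Service method

# Time tunnel: record every invocation
tt -t com.x.Service method
tt -l                  # list all records
tt -i 1001 -p          # replay one recorded invocation

Important warning: monitor/watch/trace (and likewise stack/tt) are all implemented through bytecode enhancement, which weaves an aspect into the method and adds overhead in production. Always run stop or reset when you are done; otherwise the enhancement stays in effect.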
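
As a rough analogy for what an observation aspect does, the sketch below wraps a service in a JDK dynamic proxy and records params, returnObj, and throwExp around each call. This is an assumption-laden stand-in: Arthas rewrites bytecode rather than using proxies, and WatchSketch, Service, and LOG are hypothetical names. It does show why every observed call pays for extra work.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a dynamic proxy standing in for Arthas's bytecode enhancement.
final class WatchSketch {

    interface Service {
        int parse(String s);
    }

    /** Observations, named after the variables used in watch expressions. */
    static final List<String> LOG = new ArrayList<>();

    static Service observe(Service target) {
        return (Service) Proxy.newProxyInstance(
                Service.class.getClassLoader(),
                new Class<?>[] { Service.class },
                (proxy, method, args) -> {
                    String prefix = method.getName() + " params=" + Arrays.toString(args);
                    try {
                        Object returnObj = method.invoke(target, args);
                        LOG.add(prefix + " returnObj=" + returnObj);
                        return returnObj;
                    } catch (InvocationTargetException e) {
                        LOG.add(prefix + " throwExp=" + e.getCause());
                        throw e.getCause();     // rethrow what the real method threw
                    }
                });
    }

    public static void main(String[] args) {
        Service s = observe(Integer::parseInt);
        s.parse("42");
        LOG.forEach(System.out::println);
    }
}
```

Every call now does string building and list appends in addition to the real work, which is the same kind of cost the warning is about.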

3.3 Performance profiling

profiler start --event alloc           # sample allocations
profiler status
profiler stop --file /tmp/p.html       # write the flame graph
# Copy it out (run on the host, not inside Arthas)
docker cp <container>:/tmp/p.html .

3.4 Decompiling

jad com.x.Service method      # decompile a method
sc com.x.*                    # search classes
sm com.x.* method             # search methods

4. Typical scenarios

4.1 CPU usage suddenly spikes

Three steps:

# 1) Find the 3 busiest threads
thread -n 3

# 2) See what that thread is doing
thread <tid>

# 3) Follow the stack to the hot method
trace com.x.SomeService someHotMethod
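
For a feel of the data behind step 1, the sketch below ranks the current JVM's threads by CPU time using the JDK's ThreadMXBean. It is a simplification under stated assumptions: it ranks by cumulative CPU time since each thread started, whereas a live tool would normally compare two samples over a short interval. TopCpuThreads is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: ranks live threads by cumulative CPU time.
final class TopCpuThreads {

    /** Returns up to n rows of {threadId, cpuNanos}, busiest first. */
    static List<long[]> topN(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<long[]> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows;                        // this JVM cannot report per-thread CPU time
        }
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);   // -1 if the thread has already died
            if (cpuNanos >= 0) {
                rows.add(new long[] { id, cpuNanos });
            }
        }
        rows.sort((a, b) -> Long.compare(b[1], a[1]));
        return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
    }

    public static void main(String[] args) {
        for (long[] row : topN(3)) {
            System.out.println("tid=" + row[0] + " cpuMs=" + row[1] / 1_000_000);
        }
    }
}
```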

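Or, more directly, take a flame graph:

profiler start --duration 30 --file /tmp/flame.html --event cpu

After 30 seconds of sampling, open flame.html in a browser. Width along the horizontal axis is the share of CPU samples; the vertical axis is the call stack.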

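4.2 Intermittent slow requests

If an endpoint is slow only once an hour, ordinary logging will not catch it. Use the tt time tunnel:

tt -t com.x.Controller slowApi
tt -l          # list all records
tt -i 1001     # inspect the record with index 1001
tt -i 1001 -p  # replay that invocation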

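4.3 Seeing a method's real arguments (decoded into readable form)

Sometimes a method argument is a ByteBuf (the Netty case), and its toString() in the logs is unreadable. Arthas can call ByteBufUtil.hexDump on it instead:

// Original method
public void channelRead(ChannelHandlerContext ctx, Object msg) {
    ByteBuf buf = (ByteBuf) msg;
    // ...
}

watch com.example.InboundHandler channelRead \
  '@io.netty.buffer.ByteBufUtil@hexDump(params[1])' -b

-b triggers the observation before the method is invoked. Because hexDump returns a String, no -x expansion depth is needed here. Once the hex is printed, compare it byte by byte against the protocol table.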
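
To show the shape of output to expect, here is a JDK-only sketch that renders bytes the same way: two lowercase hex digits per byte with no separators. HexDumpSketch is a hypothetical stand-in written for illustration and is not Netty's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for ByteBufUtil.hexDump, using only the JDK.
final class HexDumpSketch {

    /** Renders each byte as two lowercase hex digits with no separators. */
    static String hexDump(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));   // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));          // low nibble
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexDump("Hi".getBytes(StandardCharsets.US_ASCII)));   // prints 4869
    }
}
```

Each pair of characters is one byte, so byte position k of the frame sits at characters 2k and 2k+1 of the dump, which makes it straightforward to line up against a protocol table.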

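4.4 Attach fails inside a container

The most common error is Permission denied:

java.io.IOException: Permission denied
    at sun.tools.attach.LinuxVirtualMachine.sendQuitTo(Native Method)

Root cause: attach goes through /proc/<pid>/root/tmp/.java_pid<pid>, so the attaching process must run under the same UID as the target process and must have write access via /proc.

Solutions:

Scenario	Fix
Docker	Start the container with `--privileged` or `--cap-add=SYS_PTRACE`
K8s	Enabled by default on containerd; older containerd versions need `securityContext.privileged: true`
Nothing works	Restart the container as a last resort (the live state is lost)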
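
Because the UID requirement is easy to overlook, a quick pre-check can help. The sketch below, Linux only, compares the owner of /proc/<pid> with the owner of /proc/self. It rests on the assumption that the owner of /proc/<pid> reflects the process's effective user, which is the normal case; AttachPrecheck is a hypothetical name, and this is not the check the JDK performs internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-check, Linux only: is the target process owned by the same user as this one?
final class AttachPrecheck {

    /** True when /proc/<pid> and /proc/self have the same owner; false if either cannot be read. */
    static boolean sameOwner(long pid) {
        try {
            Path target = Paths.get("/proc", Long.toString(pid));
            Path self = Paths.get("/proc", "self");
            return Files.getOwner(target).equals(Files.getOwner(self));
        } catch (IOException | UnsupportedOperationException | SecurityException e) {
            return false;                       // no /proc, or no permission to inspect it
        }
    }

    public static void main(String[] args) {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        System.out.println("same owner as pid " + pid + ": " + sameOwner(pid));
    }
}
```

If this prints false for the target PID, switching to the target's user before starting arthas-boot is worth trying before reaching for extra capabilities.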

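5. Common pitfalls and best practices

5.1 Bytecode enhancement left in place

Classes enhanced by watch/trace/monitor are not reset automatically when the Arthas client exits. On the next attach they are still the enhanced versions, which can lead to the puzzling "the production code doesn't match what I decompile locally."

Diagnose and clean up: reset reports how many classes it restored, so the same command both confirms that enhanced classes were left behind and reverts them:

reset com.x.EnhancedClass     # restore a single class
reset                         # restore all enhanced classes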

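5.2 Don't forget to exit after attaching

quit   # disconnects the client; the Arthas server in the target process keeps running
stop   # shuts the Arthas server down completely

After attach, Arthas has two parts: the client (the session you opened over ssh) and the agent (running inside the target JVM). quit only disconnects the client and leaves the agent running; stop shuts everything down.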

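5.3 Don't run the profiler for long

profiler start runs indefinitely by default and keeps collecting samples. Thirty seconds to a few minutes of sampling is enough; cap it with --duration 30:

profiler start --duration 30 --file /tmp/p.html --event alloc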

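5.4 Security note

Arthas can perform almost any runtime operation (execute OGNL, change logger levels, force GC), so it must never be exposed to the public internet. For remote debugging, use arthas tunnel with authentication, and keep the tunnel address on the internal network.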

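Summary

In one sentence: Arthas is a Swiss Army knife for diagnosing Java in production. dashboard shows the big picture, thread/trace chase hot spots, watch/tt capture evidence, profiler produces flame graphs, and jad shows what is really running.

The 5 most-used commands:

dashboard        # overview
thread -n 3      # top 3 by CPU
trace <class> <method>   # method cost
watch <class> <method> '{params,returnObj}'   # arguments and return value
profiler start --duration 30 --file flame.html  # flame graph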

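The 5 easiest pitfalls to hit:

attach in a container fails with Permission denied → add --cap-add SYS_PTRACE
Forgetting to pin the HPA replica count before troubleshooting in K8s → use kubectl patch hpa
watch/trace not reset after use → classes stay enhanced and behave differently on the next attach
profiler left running overnight without --duration → fills the disk
Arthas exposed to the public internet → anyone can execute OGNL, which amounts to RCE

Source: Arthas official documentation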