Date: 2026-04-30
Environment: macOS 25.4.0 | OpenClaw 2026.4.26 | Node.js v22.22.0
Author: 云与AI
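In this incident the main session hung for a long time, the DingTalk Stream connection repeatedly dropped and reconnected, and every model in the fallback chain timed out. Restarting the Gateway restored service, but the outage exposed several layered problems: failed authentication with model providers, broken hook registration, and degraded vector memory. This post records the full diagnosis and fix as a reference for similar issues.

Investigation began at about 7:14 in the morning, after a user report. The following symptoms were occurring at the same time: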
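| Symptom | Details |
|---|---|
| Main session stuck | state=processing, age keeps growing (970s → 1270s → 2251s), queueDepth=0 but the session is never released |
| Model timeouts | M2.7 → M2.5 → M2.5-highspeed → VL-01 → M2 → M2.1; all 14 fallback levels timed out |
| DingTalk disconnects | Lost heartbeats trigger a reconnection roughly every 16 minutes, WebSocket code 1006 |
| bailian API failure | All 8 fallback models return 401 invalid_api_key |
| Hook registration failure | pg-memory / self-learning: Handler 'default' is not a function |
| Vector memory degraded | sqlite-vec unavailable, chunks_vec not updated |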
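
The first row of the table suggests a simple detection rule: a session that stays in state=processing while its age keeps increasing from one sample to the next is probably stuck. Below is a minimal sketch in Python, assuming the samples are collected by some external poller. The `SessionSample` type, the `is_stuck` helper, and the 300-second threshold are illustrative assumptions, not part of OpenClaw.

```python
from dataclasses import dataclass


@dataclass
class SessionSample:
    """One observation of a session (hypothetical shape, not an OpenClaw type)."""
    state: str
    age_s: int  # seconds the session has spent in its current state


def is_stuck(samples: list[SessionSample], max_age_s: int = 300) -> bool:
    """Return True if every sample is 'processing', age strictly grows,
    and the latest age exceeds max_age_s (an assumed threshold)."""
    if len(samples) < 2:
        return False
    if any(s.state != "processing" for s in samples):
        return False
    ages = [s.age_s for s in samples]
    growing = all(a < b for a, b in zip(ages, ages[1:]))
    return growing and ages[-1] > max_age_s


# Ages observed during this incident: 970s -> 1270s -> 2251s
observed = [SessionSample("processing", a) for a in (970, 1270, 2251)]
print(is_stuck(observed))
```

A check like this could run from a scheduled job and raise an alert well before a user notices the hang.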
# Gateway health
openclaw gateway status
# Channel connection status
openclaw channels status dingtalk
# Cron job status
openclaw cron list

The output shows:
- running (pid 36585, state active) ✅
- enabled, configured, running, connected ✅

Key log paths:
- /tmp/openclaw/openclaw-2026-04-30.log
- ~/.openclaw/logs/gateway.err.log
- ~/.openclaw/logs/config-audit.jsonl

# Extract the most recent error events
tail -100 ~/.openclaw/logs/gateway.err.log
# Follow model fallback decisions in real time
tail -f /tmp/openclaw/openclaw-2026-04-30.log | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        d = json.loads(line.strip())
    except json.JSONDecodeError:
        continue  # skip lines that are not valid JSON
    if not isinstance(d, dict):
        continue
    msg = str(d.get('1', ''))
    if 'model-fallback' in msg or 'stuck session' in msg or 'error' in msg.lower():
        print(str(d.get('time', ''))[11:19], msg[:200], flush=True)
"

# Check the current session via the sessions_list API
sessions_list(includeLastMessage=true, limit=5)
# Result: sessionId=main, status=running, queueDepth=1, age keeps growing
# The session is marked as running but is actually stuck in the LLM request phase

# Test the bailian API (Tongyi Qianwen) directly
curl -s --max-time 15 -X POST "https://coding.dashscope.aliyuncs.com/v1/chat/completions" \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-plus","messages":[{"role":"user","content":"hi"}],"max_tokens":5}'
# Returns: {"code":"invalid_api_key","message":"invalid access token or token expired"}
# Test the minimax-cn API
curl -s --max-time 15 -X POST "https://api.minimaxi.com/anthropic/v1/messages" \
  -H "x-api-key: <key>" \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model":"MiniMax-M2.1","messages":[{"role":"user","content":"hi"}],"max_tokens":5}'
# Returns: {"type":"authentication_error","message":"Please carry the API secret key in the Authorization header"}

# Show the current fallback chain configuration
python3 -c "
import json
with open('/Users/cloudmesh/.openclaw/openclaw.json') as fh:
    data = json.load(fh)
fallbacks = data.get('agents', {}).get('defaults', {}).get('model', {}).get('fallbacks', [])
# Print the chain as a numbered list
for i, name in enumerate(fallbacks, start=1):
    print(f'{i}. {name}')
"

The fallback chain has 17 entries. Eight of them are bailian (Alibaba Cloud Bailian) models, and all eight no longer work.
Bailian API key expired:
- Setting: models.providers.bailian.apiKey
- bailian/* models return 401, which triggers fallback

MiniMax API authentication method changed:
- authHeader: true (HTTP header authentication)
- Expected format: Authorization: Bearer <secret_key>
- The minimax:cn profile has only mode: api_key and no actual key

Normal flow: user message → processing → waiting for LLM → responding → idle
Stuck flow: user message → processing → [LLM timeout] → [fallback timeout] → [waits forever]

Why the session got stuck:
- state stays at processing
- abortedLastRun: false indicates that no abort signal was issued

heartbeatMisses: 112+
heartbeatTriggeredReconnects: 19
socketCloseEvents: 14
runtimeDisconnects: 19

The DingTalk Stream connection triggered a reconnection each time heartbeats were lost.
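
The counters above are consistent with a client that reconnects once too many heartbeats go unanswered. The DingTalk Stream client's actual logic is not shown in this post, so the sketch below only illustrates the general pattern. The `HeartbeatMonitor` class and the threshold of 3 consecutive misses are assumptions made for the example.

```python
class HeartbeatMonitor:
    """Counts missed heartbeats and signals a reconnect after
    `max_misses` consecutive misses (the threshold is an assumption)."""

    def __init__(self, max_misses: int = 3):
        self.max_misses = max_misses
        self.consecutive_misses = 0
        self.total_misses = 0
        self.reconnects = 0

    def on_pong(self) -> None:
        """A heartbeat reply arrived: reset the consecutive counter."""
        self.consecutive_misses = 0

    def on_miss(self) -> bool:
        """A heartbeat went unanswered. Returns True when the caller
        should reconnect now."""
        self.consecutive_misses += 1
        self.total_misses += 1
        if self.consecutive_misses >= self.max_misses:
            self.reconnects += 1
            self.consecutive_misses = 0
            return True
        return False


monitor = HeartbeatMonitor(max_misses=3)
decisions = [monitor.on_miss() for _ in range(6)]
print(decisions, monitor.total_misses, monitor.reconnects)
```

In a pattern like this, missed heartbeats accumulate faster than reconnects, which matches the shape of the counters reported here (many misses, fewer reconnects).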
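[ERROR] Handler 'default' from pg-memory is not a function
[ERROR] Handler 'default' from self-learning is not a function

Cause: the export format of the Skill's hook entry file does not match what the Gateway expects. The Gateway expects default to be a function, but the actual export is an object or some other type.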
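[memory] chunks_vec not updated — sqlite-vec unavailable. Vector recall degraded.

The sqlite-vec extension is not installed, so vector retrieval falls back to text matching. Conversations are unaffected, but memory recall is less precise.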
openclaw gateway restart

Result, from the log after the restart:
[07:26:26] [INFO] gateway ready
[07:26:26] [INFO] heartbeat: started
[07:26:27] [INFO] [default] DingTalk Stream client connected successfully
[07:26:27] [INFO] [default] DingTalk Stream client connected successfully
[07:27:22] [INFO] ⇄ res ✓ chat.history 26358ms

Remaining fixes:
- Update models.providers.bailian.apiKey
- Ensure handler.default is a function

# Periodically check session state
openclaw sessions list | grep -E "stuck|processing"
# Check model availability
curl -s --max-time 10 -X POST "https://api.minimaxi.com/anthropic/v1/messages" \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" \
  -d '{"model":"MiniMax-M2.7","messages":[{"role":"user","content":"test"}],"max_tokens":5}'
# Check the bailian API
curl -s --max-time 10 "https://coding.dashscope.aliyuncs.com/v1/models" \
  -H "Authorization: Bearer <key>"

# Check Gateway and channel status every hour
cron:
  - name: "health-check"
    schedule: "0 * * * *"
    payload:
      kind: "systemEvent"
      text: "Health check: check gateway status, channels, and model availability"

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "minimax/MiniMax-M2.7",
        "fallbacks": [
          "minimax/MiniMax-M2.5",
          "minimax/MiniMax-M2.1"
        ],
        "timeoutMs": 30000,
        "maxRetries": 2
      }
    },
    "session": {
      "maxProcessingTimeMs": 300000,
      "autoAbortStuck": true
    }
  }
}

| Scenario | Command |
|---|---|
| Check Gateway status | openclaw gateway status |
| Check channel connection | openclaw channels status dingtalk |
| Check cron jobs | openclaw cron list |
| Check active sessions | openclaw sessions list |
| Restart the Gateway | openclaw gateway restart |
| Follow the detailed log | tail -f /tmp/openclaw/openclaw-YYYY-MM-DD.log |
| Check the error log | tail -100 ~/.openclaw/logs/gateway.err.log |
| Verify API availability | curl -s -X POST <endpoint> -H "Authorization: Bearer <key>" ... |
| Check configuration | openclaw config get agents.defaults.model |
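
To see why the suggested configuration keeps the chain short, it helps to estimate the worst-case wait before a session gives up. The sketch below first drops entries from a failing provider and then computes an upper bound on total wait time. It assumes fallback entries use the provider/model form seen in this post, and that each model gets one attempt plus maxRetries retries, each bounded by timeoutMs. Whether OpenClaw applies these settings in exactly this way is an assumption. `prune_fallbacks` and `worst_case_ms` are hypothetical helpers, and the sample chain is illustrative rather than the real 17-entry list.

```python
def prune_fallbacks(fallbacks, dead_providers):
    """Keep entries whose provider prefix (text before the first '/')
    is not in dead_providers. Order is preserved."""
    return [m for m in fallbacks if m.split("/", 1)[0] not in dead_providers]


def worst_case_ms(num_models, timeout_ms, max_retries):
    """Upper bound on total wait if every attempt on every model times out.
    Assumes (1 + max_retries) sequential attempts per model."""
    return num_models * (1 + max_retries) * timeout_ms


# Illustrative chain, not the actual configuration from the incident
chain = [
    "minimax/MiniMax-M2.5",
    "bailian/kimi-k2.5",
    "bailian/qwen3.5-plus",
    "minimax/MiniMax-M2.1",
]
pruned = prune_fallbacks(chain, dead_providers={"bailian"})
print(pruned)

# Primary model plus the two remaining fallbacks, with the suggested settings
short_chain = worst_case_ms(num_models=1 + len(pruned), timeout_ms=30_000, max_retries=2)
# Primary model plus a 17-entry fallback chain, same per-attempt settings
long_chain = worst_case_ms(num_models=1 + 17, timeout_ms=30_000, max_retries=2)
print(short_chain, long_chain)
```

Under these assumptions the short chain gives up after at most 270 s, which fits inside the 300 s maxProcessingTimeMs, whereas an 18-model chain could in principle wait 1620 s (27 minutes).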
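A steadily growing age= value is the core signal: it means the LLM request is stuck.

04:44 bailian/kimi-k2.5 → 401 invalid_api_key (first failure)
05:00 DingTalk Stream disconnects (attempt 1)
05:16 Main session stuck, age=971s; fallback begins
05:32 DingTalk heartbeat lost, reconnection triggered
06:01 DingTalk reconnects again
06:18 DingTalk reconnects again
06:34 Main session stuck, age=1977s
06:50 DingTalk reconnection
07:11 Main session stuck, age=2251s; M2.1 also times out
07:25 Gateway restart begins
07:26 Gateway ready, DingTalk connected
07:27 Session back to normal
07:33 User confirms the issue is fixed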