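The health check mechanism monitors the running state of the Gateway and its Agents to keep the system stable and reliable.
Overview
OpenClaw provides health checks at several levels:
- Gateway service health checks
- Agent status monitoring
- Dependency service checks
- Resource usage monitoring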
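Gateway Health Checks
HTTP Endpoints
The Gateway exposes standard health check endpoints:
# Basic health check
curl http://localhost:18789/health
# Detailed status (the URL is quoted so the shell does not interpret "?")
curl "http://localhost:18789/health?verbose=true"
# Readiness check
curl http://localhost:18789/ready
# Liveness check
curl http://localhost:18789/alive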
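The same endpoints can be polled from a script. The following is a minimal sketch using only the Python standard library; the helper names `http_status` and `probe_all` are hypothetical, and the base URL is simply the one used in the curl examples above.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:18789"  # host and port taken from the curl examples
ENDPOINTS = ("/health", "/ready", "/alive")


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout


def probe_all(base_url=BASE_URL, fetch=http_status):
    """Map each endpoint path to the status code reported by fetch."""
    return {path: fetch(base_url + path) for path in ENDPOINTS}
```

Calling `probe_all()` with no arguments queries a Gateway on the local machine; passing a different `fetch` function makes the helper easy to exercise without a running server.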
Response Format
Health checks return JSON:
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime": 86400,
  "version": "0.1.0",
  "components": {
    "gateway": "healthy",
    "agents": "healthy",
    "database": "healthy",
    "network": "healthy"
  }
}
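A monitoring script will usually want to know which component is failing, not only the top-level status. The sketch below parses a response of the shape shown above. The function name `failing_components` is hypothetical, and the value `"unhealthy"` in the sample is an assumption, since the source only shows `"healthy"`.

```python
import json

# Sample body with one failing component. The "unhealthy" value is assumed.
SAMPLE = """
{
  "status": "healthy",
  "components": {
    "gateway": "healthy",
    "agents": "healthy",
    "database": "unhealthy",
    "network": "healthy"
  }
}
"""


def failing_components(body):
    """Return the sorted names of components whose state is not "healthy"."""
    report = json.loads(body)
    components = report.get("components", {})
    return sorted(name for name, state in components.items() if state != "healthy")
```

Treating every value other than `"healthy"` as a failure keeps the check conservative if the Gateway reports states this sketch does not know about.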
Status Codes
- 200 OK: the system is healthy
- 503 Service Unavailable: the system is unavailable
- 429 Too Many Requests: the load is too high
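A client can translate these status codes into a short label before deciding how to react. This is a small sketch; the label strings and the `"unknown"` fallback are assumptions made for illustration, not values defined by OpenClaw.

```python
# Labels are illustrative; only the three status codes come from the list above.
STATUS_MEANINGS = {
    200: "healthy",
    503: "unavailable",
    429: "overloaded",
}


def describe_status(code):
    """Return a short label for a health check status code."""
    return STATUS_MEANINGS.get(code, "unknown")
```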
Configuring Health Checks
Basic Configuration
{
  "gateway": {
    "health": {
      "enabled": true,
      "endpoint": "/health",
      "interval": "30s", // check interval
      "timeout": "5s",   // timeout per check
      "retries": 3       // number of retries
    }
  }
}
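To make the `interval`, `timeout`, and `retries` settings concrete, here is a rough sketch of how a checker could interpret them. Both helper names are hypothetical. It assumes that `retries` counts additional attempts after the first one and that durations use only the `s`, `m`, and `h` suffixes seen in this document; the actual Gateway behavior may differ.

```python
import re

_UNIT_SECONDS = {"s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert strings such as "30s", "1m", or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)(s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def check_with_retries(check, retries, timeout_s):
    """Call check(timeout_s) once, then up to `retries` more times on failure."""
    for _ in range(retries + 1):
        try:
            if check(timeout_s):
                return True
        except OSError:
            pass  # network errors and timeouts count as a failed attempt
    return False
```

With the configuration above, a check would be attempted up to four times, each with a 5 second timeout, before being reported as failed.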
Component Checks
{
  "gateway": {
    "health": {
      "checks": {
        "agents": {
          "enabled": true,
          "critical": true // a failure marks the system unhealthy
        },
        "database": {
          "enabled": true,
          "critical": false // a failure only raises a warning
        },
        "network": {
          "enabled": true,
          "critical": true
        },
        "disk": {
          "enabled": true,
          "critical": false,
          "threshold": 90 // disk usage threshold (%)
        },
        "memory": {
          "enabled": true,
          "critical": false,
          "threshold": 85 // memory usage threshold (%)
        }
      }
    }
  }
}
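The `critical` flag decides how a failed check affects the overall result. The sketch below shows one plausible way to combine individual results under that rule; the function name `overall_status` and the `"warning"` label are assumptions for illustration.

```python
# Mirrors the "critical" flags from the configuration above.
CRITICAL = {
    "agents": True,
    "database": False,
    "network": True,
    "disk": False,
    "memory": False,
}


def overall_status(results, critical=CRITICAL):
    """Combine per-check results (name -> True if passed) into one status."""
    failed = [name for name, ok in results.items() if not ok]
    if any(critical.get(name, False) for name in failed):
        return "unhealthy"  # at least one critical check failed
    if failed:
        return "warning"  # only non-critical checks failed
    return "healthy"
```

For example, a failing `database` check alone yields `"warning"`, while a failing `network` check yields `"unhealthy"` regardless of the other results.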
CLI Health Checks
Basic Commands
# Check overall health
openclaw health
# Check a specific component
openclaw health --component gateway
openclaw health --component agents
# Verbose output
openclaw health --verbose
# JSON output
openclaw health --json
Diagnostic Commands
# Run full diagnostics
openclaw doctor
# Check a specific area
openclaw doctor --check config
openclaw doctor --check permissions
openclaw doctor --check network
# Fix problems automatically
openclaw doctor --fix
Agent Health Monitoring
Agent Status
# List the status of all Agents
openclaw agent list
# Show a specific Agent
openclaw agent status main
# Check the health of an Agent
openclaw agent health main
Configuring Agent Monitoring
{
  "agents": {
    "list": [{
      "id": "main",
      "health": {
        "enabled": true,
        "checkInterval": "1m",
        "responseTimeout": "30s",
        "unhealthyThreshold": 3, // consecutive failures
        "healthyThreshold": 2    // successes required to recover
      }
    }]
  }
}
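The two thresholds add hysteresis so that a single failed or successful check does not flip the Agent's state. The following sketch models that behavior under the assumption that both counters track consecutive results; the class name `AgentHealth` is hypothetical.

```python
class AgentHealth:
    """Track healthy/unhealthy state with consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._failures = 0   # consecutive failed checks
        self._successes = 0  # consecutive successful checks

    def record(self, ok):
        """Record one check result and return the current state."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With the defaults shown, an Agent is marked unhealthy on its third consecutive failure and is marked healthy again only after two consecutive successes.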
Dependency Service Checks
External Services
{
  "health": {
    "dependencies": {
      "openai": {
        "enabled": true,
        "url": "https://api.openai.com/v1/models",
        "timeout": "10s",
        "critical": false
      },
      "database": {
        "enabled": true,
        "type": "postgres",
        "critical": true
      }
    }
  }
}
Network Connectivity
# Test external connectivity
openclaw health --check-connectivity
# Test a specific service
openclaw health --check-service openai
openclaw health --check-service anthropic
Resource Monitoring
System Resources
{
  "monitoring": {
    "resources": {
      "enabled": true,
      "interval": "30s",
      "thresholds": {
        "cpu": 80,        // CPU usage (%)
        "memory": 85,     // memory usage (%)
        "disk": 90,       // disk usage (%)
        "openFiles": 1000 // number of open files
      },
      "alerts": {
        "enabled": true,
        "channels": ["log", "webhook"]
      }
    }
  }
}
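Threshold evaluation can be pictured as a simple comparison of each reading against its limit. The sketch below assumes that a reading must be strictly above the threshold to count as exceeded; the source does not say whether the comparison is inclusive, and the function name `exceeded` is hypothetical.

```python
# Threshold values copied from the configuration above.
THRESHOLDS = {"cpu": 80, "memory": 85, "disk": 90, "openFiles": 1000}


def exceeded(usage, thresholds=THRESHOLDS):
    """Return the sorted metric names whose reading is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items() if usage.get(name, 0) > limit
    )
```

Metrics missing from `usage` are treated as zero, so a partial reading never triggers an alert for a metric that was not measured.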
Viewing Resource Usage
# Show resource usage
openclaw status --resources
# Monitor continuously
openclaw monitor
# Export monitoring data
openclaw monitor --export metrics.json
Kubernetes Integration
Liveness Probe
livenessProbe:
  httpGet:
    path: /alive
    port: 18789
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness Probe
readinessProbe:
  httpGet:
    path: /ready
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 2
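The probe settings translate into rough timing figures that are worth checking before deployment. The calculation below is a back-of-the-envelope estimate only: it ignores kubelet scheduling jitter and assumes probes run exactly every `periodSeconds`. Both function names are hypothetical.

```python
def worst_case_detection_s(period_s, failure_threshold, timeout_s):
    """Rough upper bound from a hang to the last failed liveness probe.

    Assumes the process hangs right after a successful probe and every
    later probe runs into its timeout.
    """
    return period_s * failure_threshold + timeout_s


def earliest_ready_s(initial_delay_s, period_s, success_threshold):
    """Earliest time after container start at which readiness can be reached."""
    return initial_delay_s + period_s * (success_threshold - 1)


# Values from the probe definitions above.
liveness_estimate = worst_case_detection_s(10, 3, 5)  # 35 seconds
readiness_estimate = earliest_ready_s(10, 5, 2)       # 15 seconds
```

Under these assumptions a hung Gateway is restarted after roughly 35 seconds, and a fresh Pod can start receiving traffic about 15 seconds after its container starts.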
Alert Configuration
Alert Rules
{
  "alerts": {
    "enabled": true,
    "rules": [
      {
        "name": "gateway-unhealthy",
        "condition": "health.status != 'healthy'",
        "severity": "critical",
        "duration": "5m"
      },
      {
        "name": "high-memory",
        "condition": "memory.usage > 85",
        "severity": "warning",
        "duration": "10m"
      },
      {
        "name": "agent-crashed",
        "condition": "agent.status == 'crashed'",
        "severity": "critical",
        "duration": "1m"
      }
    ]
  }
}
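Each rule pairs a condition with a `duration`. The sketch below assumes that `duration` means the condition must hold continuously for that long before the alert fires, which is a common convention but is not spelled out in the source. The class name `DurationRule` is hypothetical, and times are passed in explicitly as seconds.

```python
class DurationRule:
    """Fire only after the condition has held continuously for duration_s."""

    def __init__(self, name, duration_s):
        self.name = name
        self.duration_s = duration_s
        self._since = None  # time at which the condition last became true

    def update(self, condition_true, now):
        """Record one evaluation at time `now` and return True if firing."""
        if not condition_true:
            self._since = None  # any gap resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.duration_s
```

For the `high-memory` rule, `DurationRule("high-memory", 600)` would stay silent through a short spike and fire only once usage has been above the limit for ten minutes without interruption.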
Alert Notifications
{
  "alerts": {
    "notifications": {
      "slack": {
        "enabled": true,
        "webhook": "https://hooks.slack.com/...",
        "channel": "#openclaw-alerts"
      },
      "email": {
        "enabled": true,
        "to": ["ops@example.com"],
        "smtp": {
          "host": "smtp.gmail.com",
          "port": 587
        }
      },
      "webhook": {
        "enabled": true,
        "url": "https://api.example.com/alerts"
      }
    }
  }
}
Log Integration
Health check results are recorded in the logs automatically:
{
  "logging": {
    "health": {
      "enabled": true,
      "level": "info",
      "logSuccess": false, // whether to log successful checks
      "logFailure": true   // always log failures
    }
  }
}
Automatic Failure Recovery
Automatic Restart
{
  "gateway": {
    "recovery": {
      "enabled": true,
      "restartOnUnhealthy": true,
      "maxRestarts": 5,
      "restartWindow": "1h"
    }
  }
}
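The `maxRestarts` and `restartWindow` settings together cap how often the Gateway may be restarted. One way to picture this is a sliding-window limiter, sketched below. Whether OpenClaw uses a sliding or a fixed window is not stated in the source, so the sliding window is an assumption, and the class name `RestartLimiter` is hypothetical.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_s seconds."""

    def __init__(self, max_restarts=5, window_s=3600):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._restarts = deque()  # timestamps of recent restarts

    def allow(self, now):
        """Return True and record the restart if the budget permits it."""
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False
        self._restarts.append(now)
        return True
```

A cap like this prevents a persistently failing Gateway from entering an endless restart loop; once the budget is used up, the failure has to be handled by an operator or an alert.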
Agent Automatic Recovery
{
  "agents": {
    "defaults": {
      "recovery": {
        "enabled": true,
        "restartOnCrash": true,
        "backoffStrategy": "exponential",
        "maxRetries": 3
      }
    }
  }
}
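With an exponential backoff strategy, the wait before each restart attempt grows by a constant factor. The sketch below illustrates the resulting schedule. The base delay of 1 second and the factor of 2 are assumptions, since the source names only the strategy and `maxRetries`; the function name `backoff_delays` is hypothetical.

```python
def backoff_delays(max_retries=3, base_s=1.0, factor=2.0):
    """Return the wait in seconds before each retry attempt."""
    return [base_s * factor ** attempt for attempt in range(max_retries)]
```

With the assumed defaults, `backoff_delays()` yields waits of 1, 2, and 4 seconds, after which the Agent would be left stopped.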
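Best Practices
- Always enable health checks in production
- Set sensible thresholds to avoid false alarms
- Configure multiple alert channels so that problems get a timely response
- Review health check logs regularly
More Information
For more monitoring and operations best practices, see the official documentation.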