6.16 Health Checks

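The health check mechanism monitors the running state of the Gateway and its Agents so that failures are detected early and the system stays stable and reliable.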

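Overview

OpenClaw provides health checks at several levels:

  • Gateway service health checks
  • Agent status monitoring
  • Dependency service checks
  • Resource usage monitoring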

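Gateway Health Checks

HTTP Endpoints

The Gateway exposes the following health check endpoints:

# Basic health check
curl http://localhost:18789/health

# Detailed status (URL quoted so the shell does not interpret "?")
curl "http://localhost:18789/health?verbose=true"

# Readiness check
curl http://localhost:18789/ready

# Liveness check
curl http://localhost:18789/alive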

Response Format

The health check endpoint returns JSON:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime": 86400,
  "version": "0.1.0",
  "components": {
    "gateway": "healthy",
    "agents": "healthy",
    "database": "healthy",
    "network": "healthy"
  }
}
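
As a rough illustration of consuming this response, the sketch below parses the sample payload above and lists any component that is not reported as "healthy". The function name `unhealthy_components` is hypothetical, and the only field layout assumed is the one shown in the sample.

```python
import json

# Sample payload copied from the response format shown above.
SAMPLE = """
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime": 86400,
  "version": "0.1.0",
  "components": {
    "gateway": "healthy",
    "agents": "healthy",
    "database": "healthy",
    "network": "healthy"
  }
}
"""


def unhealthy_components(payload: str) -> list[str]:
    """Return the sorted names of components whose state is not "healthy"."""
    report = json.loads(payload)
    components = report.get("components", {})
    return sorted(name for name, state in components.items() if state != "healthy")


if __name__ == "__main__":
    print(unhealthy_components(SAMPLE))
```

A monitoring script could feed the body returned by the /health endpoint into this function and raise an alert whenever the resulting list is non-empty.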

Status Codes

  • 200 OK - the system is healthy
  • 503 Service Unavailable - the system is unavailable
  • 429 Too Many Requests - the system is overloaded
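
A client can branch on these status codes without reading the response body. The following is a minimal sketch using only the Python standard library; the labels returned by `classify` and the helper name `probe` are illustrative choices, not part of OpenClaw.

```python
import urllib.error
import urllib.request
from http import HTTPStatus


def classify(status_code: int) -> str:
    """Map the documented status codes to a short label (labels are illustrative)."""
    if status_code == HTTPStatus.OK:
        return "healthy"
    if status_code == HTTPStatus.TOO_MANY_REQUESTS:
        return "overloaded"
    if status_code == HTTPStatus.SERVICE_UNAVAILABLE:
        return "unavailable"
    return "unknown"


def probe(url: str, timeout: float = 5.0) -> str:
    """Request the URL and classify the result; network errors count as unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still available.
        return classify(err.code)
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return "unreachable"


if __name__ == "__main__":
    print(probe("http://localhost:18789/health"))
```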

Configuring Health Checks

Basic Configuration

{
  "gateway": {
    "health": {
      "enabled": true,
      "endpoint": "/health",
      "interval": "30s",        // check interval
      "timeout": "5s",          // timeout per check
      "retries": 3              // number of retries
    }
  }
}

Component Checks

{
  "gateway": {
    "health": {
      "checks": {
        "agents": {
          "enabled": true,
          "critical": true      // a failure marks the system unhealthy
        },
        "database": {
          "enabled": true,
          "critical": false     // a failure only produces a warning
        },
        "network": {
          "enabled": true,
          "critical": true
        },
        "disk": {
          "enabled": true,
          "critical": false,
          "threshold": 90       // disk usage threshold (%)
        },
        "memory": {
          "enabled": true,
          "critical": false,
          "threshold": 85       // memory usage threshold (%)
        }
      }
    }
  }
}
      }
    }
  }
}
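
One way to read the `critical` flag is sketched below: a failed critical check turns the overall status to "unhealthy", while a failed non-critical check is only collected as a warning. This is an illustration of the semantics described in the comments above, not OpenClaw's actual implementation; the `aggregate` function and the shape of `results` are assumptions.

```python
def aggregate(checks: dict, results: dict) -> tuple[str, list[str]]:
    """Combine per-component results into an overall status plus warnings.

    checks  -- the "checks" section of the configuration
    results -- hypothetical mapping of component name to True (passed) / False (failed)
    """
    status = "healthy"
    warnings = []
    for name, cfg in checks.items():
        if not cfg.get("enabled", True):
            continue  # disabled checks never affect the outcome
        if results.get(name, False):
            continue  # check passed; a missing result is treated as a failure
        if cfg.get("critical", False):
            status = "unhealthy"
        else:
            warnings.append(name)
    return status, warnings


if __name__ == "__main__":
    checks = {
        "agents": {"enabled": True, "critical": True},
        "database": {"enabled": True, "critical": False},
    }
    print(aggregate(checks, {"agents": True, "database": False}))
```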

CLI Health Checks

Basic Commands

# Check overall health
openclaw health

# Check a specific component
openclaw health --component gateway
openclaw health --component agents

# Verbose output
openclaw health --verbose

# JSON output
openclaw health --json

Diagnostic Commands

# Run the full set of diagnostics
openclaw doctor

# Check for a specific class of problem
openclaw doctor --check config
openclaw doctor --check permissions
openclaw doctor --check network

# Attempt automatic fixes
openclaw doctor --fix

Agent Health Monitoring

Agent Status

# List the status of all Agents
openclaw agent list

# Show a specific Agent
openclaw agent status main

# Check an Agent's health
openclaw agent health main

Configuring Agent Monitoring

{
  "agents": {
    "list": [{
      "id": "main",
      "health": {
        "enabled": true,
        "checkInterval": "1m",
        "responseTimeout": "30s",
        "unhealthyThreshold": 3,    // consecutive failures before marking unhealthy
        "healthyThreshold": 2       // consecutive successes required to recover
      }
    }]
  }
}
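
The two thresholds describe a simple hysteresis: an Agent is only marked unhealthy after several consecutive failures, and only marked healthy again after several consecutive successes. The sketch below models that behaviour under the assumption that both counters reset whenever the opposite outcome is observed; the `HealthTracker` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks health using consecutive-failure and consecutive-success thresholds."""

    unhealthy_threshold: int = 3
    healthy_threshold: int = 2
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, ok: bool) -> bool:
        """Record one check result and return the current health state."""
        if ok:
            self._failures = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    tracker = HealthTracker()
    for outcome in [False, False, False, True, True]:
        print(outcome, tracker.record(outcome))
```

With the defaults above, a single failed check never flips the state, which is the point of using thresholds rather than reacting to each probe individually.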

Dependency Service Checks

External Services

{
  "health": {
    "dependencies": {
      "openai": {
        "enabled": true,
        "url": "https://api.openai.com/v1/models",
        "timeout": "10s",
        "critical": false
      },
      "database": {
        "enabled": true,
        "type": "postgres",
        "critical": true
      }
    }
  }
}

Network Connectivity

# Test external connectivity
openclaw health --check-connectivity

# Test a specific service
openclaw health --check-service openai
openclaw health --check-service anthropic

Resource Monitoring

System Resources

{
  "monitoring": {
    "resources": {
      "enabled": true,
      "interval": "30s",
      "thresholds": {
        "cpu": 80,          // CPU usage (%)
        "memory": 85,       // memory usage (%)
        "disk": 90,         // disk usage (%)
        "openFiles": 1000   // number of open files
      },
      "alerts": {
        "enabled": true,
        "channels": ["log", "webhook"]
      }
    }
  }
}
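
To make the threshold semantics concrete, the sketch below compares a snapshot of resource readings against the configured limits and reports which ones are exceeded. It assumes a threshold is breached only when the reading is strictly greater than the limit; the `exceeded` function and the snapshot format are hypothetical.

```python
# Limits copied from the "thresholds" section of the configuration above.
THRESHOLDS = {"cpu": 80, "memory": 85, "disk": 90, "openFiles": 1000}


def exceeded(usage: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return the names of resources whose reading is strictly above its limit.

    Resources missing from the usage snapshot are treated as 0 (not exceeded).
    """
    return [name for name, limit in thresholds.items() if usage.get(name, 0) > limit]


if __name__ == "__main__":
    snapshot = {"cpu": 95, "memory": 60, "disk": 91, "openFiles": 120}
    print(exceeded(snapshot))
```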

Viewing Resource Usage

# Show current resource usage
openclaw status --resources

# Monitor continuously
openclaw monitor

# Export monitoring data
openclaw monitor --export metrics.json

Kubernetes Integration

Liveness Probe

livenessProbe:
  httpGet:
    path: /alive
    port: 18789
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Readiness Probe

readinessProbe:
  httpGet:
    path: /ready
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 2

Alert Configuration

Alert Rules

{
  "alerts": {
    "enabled": true,
    "rules": [
      {
        "name": "gateway-unhealthy",
        "condition": "health.status != 'healthy'",
        "severity": "critical",
        "duration": "5m"
      },
      {
        "name": "high-memory",
        "condition": "memory.usage > 85",
        "severity": "warning",
        "duration": "10m"
      },
      {
        "name": "agent-crashed",
        "condition": "agent.status == 'crashed'",
        "severity": "critical",
        "duration": "1m"
      }
    ]
  }
}
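
The `duration` field suggests that a rule should fire only after its condition has held continuously for that long. The sketch below illustrates that interpretation; it is an assumption about the semantics, and `parse_duration` and `SustainedCondition` are hypothetical names. Only the "s", "m", and "h" suffixes that appear in this section are handled.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as "30s", "5m", or "1h" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


class SustainedCondition:
    """Fires once a condition has been true without interruption for `duration`."""

    def __init__(self, duration: str):
        self.required = parse_duration(duration)
        self.since = None  # time at which the condition last became true

    def update(self, condition_true: bool, now: float) -> bool:
        """Feed one evaluation (with its timestamp in seconds); return True if firing."""
        if not condition_true:
            self.since = None  # any interruption resets the timer
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.required


if __name__ == "__main__":
    rule = SustainedCondition("5m")
    for t in (0, 120, 300):
        print(t, rule.update(True, now=t))
```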

Alert Notifications

{
  "alerts": {
    "notifications": {
      "slack": {
        "enabled": true,
        "webhook": "https://hooks.slack.com/...",
        "channel": "#openclaw-alerts"
      },
      "email": {
        "enabled": true,
        "to": ["ops@example.com"],
        "smtp": {
          "host": "smtp.gmail.com",
          "port": 587
        }
      },
      "webhook": {
        "enabled": true,
        "url": "https://api.example.com/alerts"
      }
    }
  }
}

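Logging Integration

Health check results are written to the log automatically:

{
  "logging": {
    "health": {
      "enabled": true,
      "level": "info",
      "logSuccess": false,    // whether to log successful checks
      "logFailure": true      // always log failed checks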
    }
  }
}

Automatic Failure Recovery

Automatic Restart

{
  "gateway": {
    "recovery": {
      "enabled": true,
      "restartOnUnhealthy": true,
      "maxRestarts": 5,
      "restartWindow": "1h"
    }
  }
}
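
The combination of `maxRestarts` and `restartWindow` reads like a rate limit on restarts. One plausible interpretation, sketched below, is a sliding window: a restart is allowed only if fewer than `maxRestarts` restarts happened within the last `restartWindow`. The `RestartLimiter` class is hypothetical, and whether OpenClaw uses a sliding or a fixed window is not stated in this section.

```python
from collections import deque


class RestartLimiter:
    """Allows at most `max_restarts` restarts within any `window_seconds` span."""

    def __init__(self, max_restarts: int = 5, window_seconds: float = 3600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._history = deque()  # timestamps of restarts still inside the window

    def allow(self, now: float) -> bool:
        """Return True and record the restart if it is within the budget."""
        # Drop restarts that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_restarts:
            return False
        self._history.append(now)
        return True


if __name__ == "__main__":
    limiter = RestartLimiter(max_restarts=5, window_seconds=3600)
    print([limiter.allow(now=t * 60) for t in range(7)])
```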

Agent Automatic Recovery

{
  "agents": {
    "defaults": {
      "recovery": {
        "enabled": true,
        "restartOnCrash": true,
        "backoffStrategy": "exponential",
        "maxRetries": 3
      }
    }
  }
}
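
An "exponential" backoff strategy usually means that the wait before each restart attempt grows by a constant factor. The sketch below computes such a schedule for `maxRetries` attempts; the base delay, growth factor, and cap are illustrative assumptions, since this section does not specify the values OpenClaw uses.

```python
def backoff_delays(max_retries: int = 3, base: float = 1.0,
                   factor: float = 2.0, cap: float = 60.0) -> list[float]:
    """Return the delay in seconds before each retry: base * factor**attempt, capped."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


if __name__ == "__main__":
    print(backoff_delays())
```

Capping the delay keeps a long series of failures from producing unreasonably long waits, and production implementations often add random jitter so that many Agents do not retry at the same moment.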
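
Best Practices

  • Always enable health checks in production
  • Set reasonable thresholds to avoid false alarms
  • Configure multiple alert channels to ensure a timely response
  • Review health check logs regularly

More Information

For more monitoring and operations best practices, see the official documentation.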