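The Complete Guide to Kubernetes Cluster Health Checks: From Monitoring Metrics to Troubleshooting and Maintenance, Helping Operators Locate Problems Quickly and Keep Systems Highly Available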

威震华夏关云长 · Posted on 2025-9-18 15:30:17

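Introduction

As the leading container orchestration platform, Kubernetes is widely used to deploy and manage enterprise applications. As clusters grow in size and complexity, keeping them healthy becomes critical. A healthy cluster not only serves traffic reliably but also recovers quickly when something goes wrong, which protects business continuity. This article covers the methods, tools, and best practices for Kubernetes cluster health checks, so that operators can build a complete monitoring setup, locate and fix problems quickly, and keep the system highly available.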

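Kubernetes Cluster Monitoring Basics

Core Components and Architecture

Before going deeper into monitoring, we need to understand the core components of a Kubernetes cluster and how they fit together. A typical cluster consists of a control plane and worker nodes.

Control plane components:

• kube-apiserver: the single entry point to the cluster; serves the REST API
• etcd: a distributed key-value store that holds all cluster state
• kube-scheduler: decides which node each Pod runs on
• kube-controller-manager: runs the controller processes
• cloud-controller-manager: integrates with the cloud provider

Worker node components:

• kubelet: makes sure the containers described by each Pod are running
• kube-proxy: maintains the network rules on the node
• Container runtime: for example containerd, or Docker Engine (which needs the cri-dockerd adapter since Kubernetes 1.24)

Knowing these components tells us what to monitor and which metrics matter.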

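Common Monitoring Metrics

Kubernetes monitoring metrics fall into the following categories.

Resource metrics:

• CPU usage: CPU consumed by nodes and Pods
• Memory usage: memory consumed by nodes and Pods
• Disk usage: disk space used on each node
• Network I/O: network traffic of nodes and Pods

Control plane metrics:

• Component status: whether each core component is running normally
• API server response time: latency of API requests
• etcd performance: read/write latency and storage size

Workload and object state metrics:

• Pod status: number of Pods that are Running, Pending, Failed, or Succeeded
• Node status: number of nodes that are Ready, NotReady, or unreachable
• Deployment status: desired replicas, available replicas, and rollout state
• PVC status: binding state and capacity usage

Application metrics:

• HTTP success rate: distribution of HTTP response status codes
• Request latency: response time of the application
• Error rate: share of requests that fail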

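Monitoring Tools and Approaches

Built-in Kubernetes Monitoring

Kubernetes ships with some basic monitoring capabilities that show the overall state of a cluster.

kubectl, the Kubernetes command-line tool, can report the status of most cluster objects.

# Check control plane component status
# (componentstatuses has been deprecated since v1.19 and may be empty or inaccurate on newer clusters)
kubectl get componentstatuses
# Check node status
kubectl get nodes
# List Pods in all namespaces
kubectl get pods --all-namespaces
# Show details for a specific resource
kubectl describe nodes <node-name>
kubectl describe pods <pod-name> -n <namespace>
# Show resource usage (requires Metrics Server)
kubectl top nodes
kubectl top pods --all-namespaces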
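The node check above can also be scripted. The following is a minimal Python sketch, assuming the JSON produced by `kubectl get nodes -o json`; the function names and the sample data are hypothetical and trimmed to the fields the code reads.

```python
import json


def node_readiness(nodes_json: str) -> dict:
    """Map each node name to the status of its Ready condition.

    The status is the string reported by Kubernetes: "True", "False" or "Unknown".
    """
    readiness = {}
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c["status"] for c in conditions if c["type"] == "Ready"), "Unknown")
        readiness[node["metadata"]["name"]] = ready
    return readiness


def not_ready_nodes(nodes_json: str) -> list:
    """Return the sorted names of nodes whose Ready condition is not "True"."""
    return sorted(name for name, status in node_readiness(nodes_json).items() if status != "True")


# Hypothetical sample, trimmed to the fields used above.
SAMPLE_NODES = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "MemoryPressure", "status": "False"},
                               {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "node-c"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]})

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE_NODES))
```

In practice the input would come from something like `kubectl get nodes -o json` piped into the script; a node whose kubelet has stopped reporting shows up with a Ready status of "Unknown" rather than "False".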

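Metrics Server collects CPU and memory usage for nodes and Pods and exposes it through the Metrics API. It is a cluster add-on and is not installed by default on every distribution.

Install Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Once it is running, the kubectl top command reports resource usage.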

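Third-Party Monitoring Tools

Prometheus is an open-source monitoring and alerting system that fits Kubernetes environments particularly well. Grafana is a visualization tool for building rich dashboards.

Deploying Prometheus and Grafana:

Helm is the quickest way to deploy both:

# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the kube-prometheus-stack chart (Prometheus, Alertmanager, and Grafana) into the monitoring namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
# Check the deployment
kubectl get pods -n monitoring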

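Configuring Prometheus to monitor Kubernetes:

With the Prometheus Operator, Prometheus discovers scrape targets through ServiceMonitor resources. An example ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    # By default kube-prometheus-stack only picks up ServiceMonitors labeled with its Helm release name
    release: prometheus
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: web
      interval: 30s
      path: /metrics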

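Grafana dashboards:

The Grafana community publishes many ready-made Kubernetes dashboards that can be imported by ID. Commonly cited IDs include the following (check the title on grafana.com before importing, since dashboards are renamed and replaced over time):

• 315: Kubernetes Cluster Monitoring
• 6417: Kubernetes Compute Resources / Pod
• 11074: Kubernetes API Server

Besides Prometheus and Grafana, several other monitoring tools are worth considering:

1. Datadog: a comprehensive cloud-native monitoring service
2. Sysdig: focused on container security and performance monitoring
3. New Relic: a full-stack observability platform
4. Elastic Stack (ELK): log collection, search, and analysis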

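Health Check Strategies

Pod Health Checks

Kubernetes provides three Pod health check mechanisms: the liveness probe, the readiness probe, and the startup probe.

A liveness probe determines whether a container is still healthy. If the probe fails, the kubelet kills the container and restarts it according to the Pod's restart policy.

Liveness probe example:

apiVersion: v1
kind: Pod
metadata:
  name: liveness-pod
spec:
  containers:
    - name: liveness-container
      image: registry.k8s.io/busybox
      args:
        - /bin/sh
        - -c
        # Healthy for the first 30 seconds, then the probe file disappears
        - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
      livenessProbe:
        exec:
          command:
            - cat
            - /tmp/healthy
        initialDelaySeconds: 5
        periodSeconds: 5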

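A readiness probe determines whether a container is ready to receive traffic. If the probe fails, the Pod is removed from the endpoints of every Service that selects it, but the container is not restarted.

Readiness probe example:

apiVersion: v1
kind: Pod
metadata:
  name: readiness-pod
spec:
  containers:
    - name: readiness-container
      image: registry.k8s.io/busybox
      args:
        - /bin/sh
        - -c
        # Ready for the first 30 seconds, then the probe file disappears
        - touch /tmp/ready; sleep 30; rm -rf /tmp/ready; sleep 600
      readinessProbe:
        exec:
          command:
            - cat
            - /tmp/ready
        initialDelaySeconds: 5
        periodSeconds: 5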

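A startup probe determines whether the application in a container has finished starting. Liveness and readiness probes are disabled until the startup probe succeeds.

Startup probe example:

apiVersion: v1
kind: Pod
metadata:
  name: startup-pod
spec:
  containers:
    - name: startup-container
      image: registry.k8s.io/busybox
      args:
        - /bin/sh
        - -c
        # Simulates an application that needs 60 seconds to start
        - sleep 60; touch /tmp/started; sleep 600
      startupProbe:
        exec:
          command:
            - cat
            - /tmp/started
        failureThreshold: 30
        periodSeconds: 10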
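The two numbers in the startup probe decide how long an application is allowed to take before the kubelet gives up and restarts the container. The arithmetic is sketched below in Python; the function names are hypothetical, and the result is the approximate upper bound (probe execution time and timeouts are ignored).

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container has to pass its startup probe.

    The probe runs every period_seconds after the initial delay, and the
    container is restarted after failure_threshold consecutive failures.
    """
    return initial_delay_seconds + failure_threshold * period_seconds


def starts_in_time(app_startup_seconds: int, failure_threshold: int,
                   period_seconds: int, initial_delay_seconds: int = 0) -> bool:
    """True if the application is expected to come up before the budget runs out."""
    budget = startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds)
    return app_startup_seconds < budget


if __name__ == "__main__":
    # Values from the startup probe example: failureThreshold=30, periodSeconds=10
    print(startup_budget_seconds(30, 10))
    # The example container needs 60 seconds before /tmp/started exists
    print(starts_in_time(60, 30, 10))
```

With the values from the example, the container has about 300 seconds to start, comfortably more than the 60 seconds it needs. A budget that is too tight shows up as a restart loop during startup, so it is usually sized from the slowest start observed plus a margin.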

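Node Health Checks

Node health checks are an important part of keeping a Kubernetes cluster stable. Kubernetes tracks node state through the kubelet, which reports it, and the node controller, which acts on it.

# Check node status
kubectl get nodes
# Show node details, including conditions and recent events
kubectl describe node <node-name>
# Show resource usage on the node (requires Metrics Server)
kubectl top node <node-name>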

Node Problem Detector is a daemon that watches each node's health and reports problems as node conditions and events.

Deploying Node Problem Detector:

# DaemonSet lives in the apps/v1 API group
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
        - name: node-problem-detector
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.7
          command:
            - "/bin/node-problem-detector"
            - "--logtostderr"
            - "--system-log-monitors=/config/kernel-monitor.json,/config/docker-monitor.json"
          volumeMounts:
            - name: log
              mountPath: /var/log
            - name: config
              mountPath: /config
      volumes:
        - name: log
          hostPath:
            path: /var/log/
        - name: config
          # This ConfigMap must be created separately and contain the monitor JSON files
          configMap:
            name: node-problem-detector-config

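Cluster Component Health Checks

Checking the control plane components is essential to keeping the whole cluster working.

# Check control plane component status (deprecated since v1.19)
kubectl get componentstatuses
# Check connectivity to the API server
kubectl cluster-info
# Check API server health, including individual checks such as its connection to etcd
kubectl get --raw='/readyz?verbose'
# Check Pod status in every namespace
kubectl get pods --all-namespaces -o wide

The control plane components also expose health endpoints that can be queried over HTTPS. Depending on how the cluster was installed, the scheduler and controller manager may listen only on 127.0.0.1, in which case the commands must be run on the control plane node itself:

# API server health
curl -k https://<api-server-ip>:6443/healthz
# etcd health (a TLS-enabled etcd, as set up by kubeadm, requires client certificates)
curl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://<etcd-ip>:2379/health
# Scheduler health (secure port 10259; the old insecure port 10251 has been removed)
curl -k https://<scheduler-ip>:10259/healthz
# Controller manager health (secure port 10257; the old insecure port 10252 has been removed)
curl -k https://<controller-manager-ip>:10257/healthz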
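The verbose output of the API server's /healthz, /livez, and /readyz endpoints lists one check per line, prefixed with `[+]` for a passing check and `[-]` for a failing one. A small Python sketch that turns that text into a list of failing checks follows; the function names and the sample output are hypothetical, and the line format is assumed to match what current API servers print.

```python
def parse_verbose_health(text: str) -> dict:
    """Parse '?verbose' health output into {check_name: passed}."""
    checks = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[+]"):
            checks[line[3:].split()[0]] = True
        elif line.startswith("[-]"):
            checks[line[3:].split()[0]] = False
        # Any other line (for example the final summary) is ignored
    return checks


def failed_checks(text: str) -> list:
    """Return the sorted names of checks that did not pass."""
    return sorted(name for name, ok in parse_verbose_health(text).items() if not ok)


# Hypothetical output captured from: kubectl get --raw='/readyz?verbose'
SAMPLE_OUTPUT = """\
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
readyz check failed
"""

if __name__ == "__main__":
    print(failed_checks(SAMPLE_OUTPUT))
```

A failing etcd check here means the API server cannot reach its datastore, which narrows the investigation to etcd itself or the network path to it.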

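Log Management and Analysis

Log Collection Strategies

In a Kubernetes environment, logs are one of the main tools for troubleshooting. Because Pods are ephemeral, their logs need to be persisted and managed centrally.

By default, the standard output and standard error of each container are written to files on the node and linked under /var/log/containers. Pod logs can be read with the following commands:

# Show Pod logs
kubectl logs <pod-name> -n <namespace>
# Show logs from a specific container
kubectl logs <pod-name> -c <container-name> -n <namespace>
# Show the last 100 lines
kubectl logs --tail=100 <pod-name> -n <namespace>
# Follow the log output
kubectl logs -f <pod-name> -n <namespace>
# Show logs from the previous container instance (after a restart)
kubectl logs --previous <pod-name> -n <namespace>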
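The files under /var/log/containers come in two formats depending on the container runtime, and a log collector has to parse the right one. The Python sketch below shows both, assuming the CRI text format used by containerd and CRI-O (timestamp, stream, a tag that is F for a full line or P for a partial one, then the message) and the JSON lines written by Docker's json-file driver. The function names and sample lines are hypothetical.

```python
import json


def parse_cri_line(line: str) -> dict:
    """Parse one CRI-format log line: '<timestamp> <stream> <F|P> <message>'."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        raise ValueError("not a CRI log line: %r" % line)
    timestamp, stream, tag = parts[:3]
    message = parts[3] if len(parts) == 4 else ""
    return {"time": timestamp, "stream": stream, "partial": tag == "P", "message": message}


def parse_docker_json_line(line: str) -> dict:
    """Parse one line written by Docker's json-file log driver."""
    record = json.loads(line)
    log = record["log"]
    return {
        "time": record["time"],
        "stream": record["stream"],
        # A record that does not end in a newline is a fragment of a longer line
        "partial": not log.endswith("\n"),
        "message": log.rstrip("\n"),
    }


if __name__ == "__main__":
    print(parse_cri_line("2025-09-18T07:30:17.123456789Z stderr F connection refused: db:5432"))
    print(parse_docker_json_line('{"log":"hello\\n","stream":"stdout","time":"2025-09-18T07:30:17Z"}'))
```

Feeding CRI-format lines to a JSON parser is a common reason a collector appears to run but ships nothing useful, so it is worth confirming the runtime before choosing the parser.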

Running a log collection agent on every node as a DaemonSet is the usual approach. The following example uses Fluentd as the agent:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
    version: v1
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-logging
  template:
    metadata:
      labels:
        k8s-app: fluentd-logging
        version: v1
    spec:
      serviceAccountName: fluentd
      tolerations:
        # Also run on control plane nodes; newer clusters use the control-plane taint
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            # Must match the name of the Elasticsearch Service in this namespace
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
            - name: FLUENT_ELASTICSEARCH_SCHEME
              value: "http"
            - name: FLUENTD_SYSTEMD_CONF
              value: "disable"
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            # Only needed when the node runtime is Docker
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers

Log Analysis Tools

EFK (Elasticsearch + Fluentd + Kibana) is a widely used log analysis stack for Kubernetes.

Deploying Elasticsearch (a single-node setup suitable for testing):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: kube-system
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
          ports:
            - containerPort: 9200
              name: http
              protocol: TCP
            - containerPort: 9300
              name: transport
              protocol: TCP
          env:
            - name: discovery.type
              value: single-node
          resources:
            limits:
              cpu: 1000m
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 1Gi
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: kube-system
spec:
  ports:
    - port: 9200
      name: http
  selector:
    app: elasticsearch

Deploying Kibana:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:7.10.1
          ports:
            - containerPort: 5601
          env:
            - name: ELASTICSEARCH_HOSTS
              value: http://elasticsearch:9200
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: kube-system
spec:
  ports:
    - port: 5601
      protocol: TCP
      targetPort: 5601
  selector:
    app: kibana
  # LoadBalancer exposes Kibana outside the cluster; restrict access accordingly
  type: LoadBalancer

Loki is a log aggregation system inspired by Prometheus. It integrates well with Grafana and offers a lightweight alternative to EFK.

Deploying Loki (the configuration keys below match the Loki 2.4 image used here; later versions rename or remove some of them):

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: monitoring
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
    storage_config:
      boltdb_shipper:
        active_index_directory: /data/loki/index
        cache_location: /data/loki/cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    chunk_store_config:
      max_look_back_period: 0s
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
    compactor:
      working_directory: /data/loki/compactor
      shared_store: filesystem
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: monitoring
spec:
  # serviceName is required for a StatefulSet and refers to the Service below
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:2.4.0
          args:
            - -config.file=/etc/loki/loki.yaml
          ports:
            - containerPort: 3100
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: storage
              mountPath: /data
      volumes:
        - name: config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: monitoring
spec:
  ports:
    - port: 3100
      name: http
  selector:
    app: loki

Deploying Promtail (the log collection agent):

apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          # Use "cri: {}" instead when the node runtime is containerd or CRI-O
          - docker: {}
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_label_app
            target_label: app
          - source_labels:
              - __meta_kubernetes_pod_node_name
            target_label: node
          - source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          - source_labels:
              - __meta_kubernetes_pod_label_app_kubernetes_io_component
            target_label: component
          # Tell Promtail which files to tail; without __path__ no logs are read
          - source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      # This ServiceAccount must exist and be allowed to get, list, and watch Pods
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.4.0
          args:
            - -config.file=/etc/promtail/promtail.yaml
          ports:
            - containerPort: 9080
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: pods
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: pods
          hostPath:
            path: /var/log/pods

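Fault Diagnosis and Troubleshooting

Common Failure Types

Common failure types in a Kubernetes cluster include node, Pod, network, and storage failures.

Node failures can be caused by hardware problems, operating system crashes, or network problems. The node shows up with a status of NotReady.

Troubleshooting steps:

# Check node status
kubectl get nodes
# Show node details
kubectl describe node <node-name>
# Follow the kubelet logs on the node
ssh <node-name> 'journalctl -u kubelet -f'
# Check resource usage on the node
kubectl top node <node-name>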

Pod failures can be caused by insufficient resources, image problems, or configuration errors. The Pod shows up as Pending, CrashLoopBackOff, or Error.

Troubleshooting steps:

# Check Pod status
kubectl get pods -n <namespace>
# Show Pod details
kubectl describe pod <pod-name> -n <namespace>
# Show Pod logs
kubectl logs <pod-name> -n <namespace>
# If the Pod has restarted, show logs from the previous instance
kubectl logs --previous <pod-name> -n <namespace>
# Show events for the Pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
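When many Pods are involved, the same triage can be done over the JSON output of `kubectl get pods --all-namespaces -o json`. The Python sketch below is one way to do it; the function name and the sample data are hypothetical. It prefers the waiting reason of a container (for example CrashLoopBackOff or ImagePullBackOff) and falls back to the Pod phase.

```python
import json

# Pod phases that indicate something needs attention
PROBLEM_PHASES = ("Pending", "Failed", "Unknown")


def pod_problems(pods_json: str) -> list:
    """Return (namespace, name, reason) for every Pod that looks unhealthy."""
    problems = []
    for pod in json.loads(pods_json)["items"]:
        meta = pod["metadata"]
        status = pod.get("status", {})
        waiting_reasons = [
            cs["state"]["waiting"].get("reason", "")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        waiting_reasons = [r for r in waiting_reasons if r]
        phase = status.get("phase", "Unknown")
        if waiting_reasons:
            problems.append((meta["namespace"], meta["name"], waiting_reasons[0]))
        elif phase in PROBLEM_PHASES:
            problems.append((meta["namespace"], meta["name"], phase))
    return problems


# Hypothetical sample, trimmed to the fields used above.
SAMPLE_PODS = json.dumps({"items": [
    {"metadata": {"namespace": "shop", "name": "web-1"},
     "status": {"phase": "Running", "containerStatuses": [{"state": {"running": {}}}]}},
    {"metadata": {"namespace": "shop", "name": "api-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "batch", "name": "job-1"},
     "status": {"phase": "Pending"}},
]})

if __name__ == "__main__":
    for namespace, name, reason in pod_problems(SAMPLE_PODS):
        print(f"{namespace}/{name}: {reason}")
```

Note that a Pod in CrashLoopBackOff still reports the phase Running, which is why the container state is checked before the phase.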

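Network failures can break Pod-to-Pod communication or make Services unreachable.

Troubleshooting steps:

# Check Pod IPs
kubectl get pods -n <namespace> -o wide
# Check Services
kubectl get svc -n <namespace>
# Check endpoints (an empty list means no ready Pod matches the Service selector)
kubectl get endpoints -n <namespace>
# Open a shell in a Pod to test connectivity (use /bin/bash only if the image includes it)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Run inside the Pod
ping <target-ip>
nslookup <service-name>
curl <service-url>
# Check the network plugin Pods
kubectl get pods -n kube-system | grep -E 'calico|flannel|weave|cilium'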

Storage failures can leave a PVC unbound or prevent a Pod from mounting its volume.

Troubleshooting steps:

# Check PVC status
kubectl get pvc -n <namespace>
# Check PV status
kubectl get pv
# Show PVC details
kubectl describe pvc <pvc-name> -n <namespace>
# Check storage classes
kubectl get storageclass
# Check the storage provisioner Pods
kubectl get pods -n kube-system | grep -E 'csi|provisioner'

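Troubleshooting Workflow

A systematic workflow helps locate and resolve problems quickly.

First, determine whether the problem affects the whole cluster, a single node, or one application.

# Check overall cluster status
kubectl cluster-info
# Check node status
kubectl get nodes
# Check core component status
kubectl get pods -n kube-system

Next, collect the logs, events, and metrics related to the problem.

# Show cluster events in chronological order
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'
# Show resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Collect node information
kubectl get nodes -o wide

Then analyze the likely causes from what was collected.

# Check Pod status
kubectl get pods --all-namespaces -o wide
# List Pods that have not been scheduled to a node
kubectl get pods --all-namespaces --field-selector spec.nodeName=
# List failed Pods
kubectl get pods --all-namespaces --field-selector status.phase=Failed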

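Apply the fix that the analysis points to.

# Restart a Pod (it is recreated only if a controller such as a Deployment manages it)
kubectl delete pod <pod-name> -n <namespace>
# Scale a Deployment
kubectl scale deployment <deployment-name> --replicas=<new-replica-count> -n <namespace>
# Drain a node (--delete-emptydir-data replaces the deprecated --delete-local-data)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Finally, verify that the problem is resolved.

# Check Pod status
kubectl get pods -n <namespace>
# Check Service availability
kubectl get svc -n <namespace>
curl <service-url>
# Check application logs
kubectl logs <pod-name> -n <namespace>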

Useful Troubleshooting Commands and Tools

# Show cluster information
kubectl cluster-info
# Show details for a resource
kubectl describe <resource-type> <resource-name>
# Show the YAML of a resource
kubectl get <resource-type> <resource-name> -o yaml
# Show resource labels
kubectl get <resource-type> --show-labels
# Select resources by label
kubectl get <resource-type> -l <label-key>=<label-value>
# Show resource annotations
kubectl get <resource-type> <resource-name> -o jsonpath='{.metadata.annotations}'
# Show the container images of a Pod
kubectl get pods <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}{"\n"}'
# Show the resource requests and limits of a Pod
kubectl get pods <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}{"\n"}'

# Create a debug Pod
kubectl run debug-pod --image=busybox -- sleep 3600
# Open a shell in the Pod
kubectl exec -it debug-pod -- /bin/sh
# Run the following inside the Pod.
# busybox has no package manager (apt-get is not available), but it already ships ping, nslookup, wget, and telnet.
# For curl or dig, start the debug Pod from an image that includes them.
# Test DNS resolution
nslookup kubernetes.default
nslookup <service-name>.<namespace>.svc.cluster.local
# Test network connectivity
ping <target-ip>
wget -qO- <target-url>
telnet <target-host> <target-port>
# Show the routing table
ip route

# Show the resources allocated on each node
kubectl describe nodes | grep -i -A 8 "allocated resources"
# Show resource quota usage in a namespace
kubectl describe quota -n <namespace>
# Show limit ranges
kubectl get limitrange -n <namespace>
# Show current Pod resource usage, sorted (requires Metrics Server; this is a snapshot, not history)
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

Automated Operations and Alerting

Alerting Strategy

A good alerting strategy lets operators catch and handle problems early, before a small issue grows into a major outage.

The following alert levels are a reasonable starting point:

Critical: affects core business and needs immediate action

• Core service unavailable
• Risk of data loss
• Security vulnerability

High: affects business functionality and needs prompt action

• Non-core service unavailable
• Severe performance degradation
• Resources about to run out

Medium: a potential risk that needs attention

• High resource utilization
• Rising error rate
• Backup failure

Low: informational and only needs to be recorded

• Configuration change
• Planned maintenance
• Slight performance degradation

Example Prometheus alerting rules:

groups:
  - name: kubernetes-apps
    rules:
      - alert: PodCrashLooping
        # The restart counter only ever grows, so alert on its increase over the window, not on the raw total
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping (namespace {{ $labels.namespace }})"
          description: "Pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) has restarted {{ $value }} times in the last 15 minutes."
      - alert: PodNotReady
        expr: sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) == 1
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Pod {{ $labels.pod }} is not ready (namespace {{ $labels.namespace }})"
          description: "Pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) has been in a non-ready state for more than 10 minutes."
  - name: kubernetes-resources
    rules:
      - alert: NodeMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Node memory usage is high (instance {{ $labels.instance }})"
          description: "Node memory usage is above 85% (current value: {{ $value }}%)"
      - alert: NodeDiskUsage
        # Exclude pseudo filesystems, which would otherwise produce noisy alerts
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Node disk usage is high (instance {{ $labels.instance }})"
          description: "Node disk usage is above 85% (current value: {{ $value }}%)"
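To make the NodeMemoryUsage rule concrete, the Python sketch below reproduces its arithmetic and a simplified model of the `for:` clause, under which an alert only fires once the condition has held continuously for the given duration. The function names and sample values are hypothetical, and the model ignores details of real Prometheus rule evaluation such as staleness handling.

```python
def memory_usage_percent(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Same arithmetic as the rule: (1 - MemAvailable / MemTotal) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def first_firing_time(samples, threshold: float, for_seconds: int):
    """Return the timestamp at which the alert would start firing, or None.

    samples is a list of (timestamp_seconds, value) pairs, one per rule
    evaluation, in chronological order. The condition must stay above the
    threshold for at least for_seconds; any dip resets the timer.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            if timestamp - active_since >= for_seconds:
                return timestamp
        else:
            active_since = None
    return None


if __name__ == "__main__":
    gib = 1024 ** 3
    # A node with 16 GiB of memory and 2 GiB still available
    print(memory_usage_percent(2 * gib, 16 * gib))
    steady = [(0, 90), (60, 91), (120, 88), (180, 86), (240, 87), (300, 89)]
    print(first_firing_time(steady, threshold=85, for_seconds=300))
```

The second function shows why a short `for:` value makes alerts noisy and a long one makes them slow: a single evaluation below the threshold restarts the wait.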

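Automated Recovery

Automated recovery reduces manual intervention and shortens the time to resolution.

Kubernetes restarts failed containers automatically, and the restart policy controls when it does so:

apiVersion: v1
kind: Pod
metadata:
  name: restart-pod
spec:
  restartPolicy: OnFailure # Allowed values: Always, OnFailure, Never
  containers:
    - name: restart-container
      image: busybox
      # Exits with an error, so the container is restarted under OnFailure
      command: ["sh", "-c", "exit 1"]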

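The Horizontal Pod Autoscaler (HPA) adjusts the number of replicas of a Deployment based on CPU utilization or other metrics. Utilization targets are measured against the containers' resource requests, so the target Pods must set them:

# autoscaling/v2 is the stable API; v2beta2 was removed in Kubernetes 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70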
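The scaling decision follows the formula documented for the HPA: desired replicas = ceil(current replicas × current metric value / target metric value), with no change when the ratio is within a tolerance of 1 (0.1 by default), and with the largest result winning when several metrics are configured. The Python sketch below is a simplified illustration of that arithmetic with hypothetical function names; it leaves out details such as unready Pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica count suggested by one metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # Close enough to the target: do nothing
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def desired_replicas_multi(current_replicas: int, metrics, min_replicas: int,
                           max_replicas: int) -> int:
    """With several metrics, the HPA uses the largest proposed replica count.

    metrics is a list of (current_value, target_value) pairs.
    """
    return max(
        desired_replicas(current_replicas, current, target, min_replicas, max_replicas)
        for current, target in metrics
    )


if __name__ == "__main__":
    # 4 replicas at 80% average CPU utilization against the 50% target from the example
    print(desired_replicas(4, 80, 50, min_replicas=2, max_replicas=10))
    # CPU at 40% (target 50%) and memory at 90% (target 70%)
    print(desired_replicas_multi(4, [(40, 50), (90, 70)], min_replicas=2, max_replicas=10))
```

One consequence of taking the maximum is that a memory target can keep a Deployment scaled out even when CPU is idle, which is worth remembering when memory usage does not drop as load decreases.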

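The Vertical Pod Autoscaler (VPA) adjusts the resource requests and limits of Pods automatically. It is a separate add-on that must be installed in the cluster, and it should not be combined with an HPA that scales on the same CPU or memory metrics:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app-deployment
  updatePolicy:
    # Auto lets the VPA evict Pods to apply new requests
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "50Mi"
        maxAllowed:
          cpu: "1"
          memory: "500Mi"
        controlledResources: ["cpu", "memory"]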

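A Kubernetes Operator or custom controller can implement more complex recovery logic. The following is a simple polling controller that deletes failed Pods so that their owning controller recreates them:

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Create a Kubernetes client from the in-cluster service account
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	// Poll Pods in all namespaces
	for {
		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			// A transient API error should not crash the controller; retry on the next cycle
			fmt.Printf("Failed to list Pods: %v\n", err)
			time.Sleep(30 * time.Second)
			continue
		}
		// Check the status of each Pod
		for _, pod := range pods.Items {
			// Skip healthy Pods, and bare Pods that nothing would recreate after deletion
			if pod.Status.Phase != corev1.PodFailed || len(pod.OwnerReferences) == 0 {
				continue
			}
			fmt.Printf("Found failed Pod: %s/%s\n", pod.Namespace, pod.Name)
			// Delete the failed Pod so that its Deployment or StatefulSet recreates it
			err := clientset.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
			if err != nil {
				fmt.Printf("Failed to delete Pod: %v\n", err)
			} else {
				fmt.Printf("Deleted failed Pod: %s/%s\n", pod.Namespace, pod.Name)
			}
		}
		// Wait before checking again
		time.Sleep(30 * time.Second)
	}
}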

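Best Practices and Case Studies

Health Check Best Practices

• Liveness probe: detects whether the application is still alive; set a suitable initial delay and check interval
• Readiness probe: detects whether the application is ready for traffic, so that requests are not routed to a Pod too early
• Startup probe: for slow-starting applications, prevents the liveness probe from killing the container while it is still starting

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  replicas: 3
  # selector and the matching template labels are required for a Deployment
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app-container
          image: myapp:1.0
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5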

• Monitor the cluster infrastructure (nodes, network, storage)
• Monitor the Kubernetes components (API server, scheduler, controller manager, and so on)
• Monitor application performance and availability
• Set sensible alert thresholds and notification policies

# Example Prometheus scrape configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          # Keep only the API server endpoints
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          # Scrape kubelet metrics through the API server proxy
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Scrape only Pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

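• Collect logs from all components and applications centrally
• Use a structured log format so logs are easy to query and analyze
• Set a log retention policy that balances storage cost against query needs
• Alert on logs to catch abnormal patterns early

# Example Fluentd log collection configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      # Container log files are named <pod>_<namespace>_<container>-<id>.log; replace the placeholders or use *.log for everything
      path /var/log/containers/*_<namespace>_<app-name>-*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      # json matches Docker's json-file driver; containerd and CRI-O write the CRI text format and need a CRI parser
      format json
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </source>
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match kubernetes.**>
      @type elasticsearch
      # Must match the name of the Elasticsearch Service
      host elasticsearch
      port 9200
      index_name fluentd
      type_name _doc
    </match>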

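• Use autoscaling to absorb load changes
• Configure automatic failure recovery
• Automate deployment and rollback
• Build self-healing systems to reduce manual intervention

# Example autoscaling configuration with scaling behavior
# autoscaling/v2 is the stable API; v2beta2 was removed in Kubernetes 1.26
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    # Scale down slowly: at most 10% of the replicas per minute, after 5 minutes of stable readings
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    # Scale up quickly: whichever policy allows the larger change applies
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Max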

Real-World Failure Case Studies

Case 1: Pods stuck in Pending because of insufficient resources

Problem: several Pods in the cluster stay in the Pending state and cannot be scheduled to any node.

Investigation:

1. Check Pod status:

kubectl get pods --all-namespaces | grep Pending

2. Describe a Pending Pod. Its events show "0/3 nodes are available: 3 Insufficient cpu" or "3 Insufficient memory".
3. Check resource usage and allocation on the nodes:

kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

4. CPU and memory on every node are above 90%, and the unallocated remainder is too small to schedule new Pods.
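The "Insufficient cpu" message comes from a comparison of resource requests, not live usage: a Pod fits only if its requests plus the requests already placed on the node stay within the node's allocatable capacity. The Python sketch below illustrates that check, including parsing of Kubernetes quantity strings such as `500m` and `2Gi`. The function names and figures are hypothetical, and only the common quantity suffixes are handled.

```python
BINARY_SUFFIXES = {"Ki": 2 ** 10, "Mi": 2 ** 20, "Gi": 2 ** 30, "Ti": 2 ** 40}
DECIMAL_SUFFIXES = {"k": 10 ** 3, "M": 10 ** 6, "G": 10 ** 9, "T": 10 ** 12}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity ('500m', '2', '0.5') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity ('128Mi', '1G', '1048576') to bytes."""
    for suffixes in (BINARY_SUFFIXES, DECIMAL_SUFFIXES):
        for suffix, factor in suffixes.items():
            if quantity.endswith(suffix):
                return round(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)


def insufficient_resources(allocatable: dict, requested: dict, pod_request: dict) -> list:
    """Return the resources for which the Pod's request does not fit on the node."""
    missing = []
    if (cpu_millicores(requested["cpu"]) + cpu_millicores(pod_request["cpu"])
            > cpu_millicores(allocatable["cpu"])):
        missing.append("cpu")
    if (memory_bytes(requested["memory"]) + memory_bytes(pod_request["memory"])
            > memory_bytes(allocatable["memory"])):
        missing.append("memory")
    return missing


if __name__ == "__main__":
    node_allocatable = {"cpu": "4", "memory": "16Gi"}
    node_requested = {"cpu": "3800m", "memory": "10Gi"}  # Sum of requests already on the node
    print(insufficient_resources(node_allocatable, node_requested,
                                 {"cpu": "500m", "memory": "2Gi"}))
```

Because the scheduler looks at requests, a node can reject a Pod while `kubectl top` shows it nearly idle, and it can accept one while it is heavily loaded; both views are needed during this kind of investigation.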

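Solution:

1. Short term: add nodes manually or delete unnecessary Pods to free resources.

# Delete test or non-critical Pods
kubectl delete pod <pod-name> -n <namespace>
# Add nodes manually (when running on a cloud provider)
# For example, increase the size of the node group on AWS
aws autoscaling set-desired-capacity --auto-scaling-group-name <asg-name> --desired-capacity <new-size>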

2. Long term:

• Configure the Cluster Autoscaler
• Set resource requests and limits to avoid wasting resources
• Use Pod priority and preemption so that critical applications get resources first

# Example Cluster Autoscaler configuration
# (abbreviated: the RBAC rules below are a subset of what the upstream manifests grant,
# and the image version should match the cluster's Kubernetes minor version)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.0
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --balance-similar-node-groups
            - --expander=priority
            - --skip-nodes-with-system-pods=false
            - --cloud-provider=aws
            - --nodes=<min-nodes>:<max-nodes>:<node-group-name>

Case 2: Slow API server caused by an oversized etcd database

Problem: the Kubernetes API server responds much more slowly than usual, kubectl commands are sluggish, and Pod create and update operations time out.

Investigation:

1. Check API server status (componentstatuses is deprecated since v1.19; on newer clusters use kubectl get --raw='/readyz?verbose'):

kubectl get componentstatuses

2. Check the API server logs:

kubectl logs -n kube-system <kube-apiserver-pod-name>

3. Check the health of the etcd cluster:

kubectl get pods -n kube-system | grep etcd
kubectl exec -it <etcd-pod-name> -n kube-system -- etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health

4. etcd responds slowly and some requests time out.
5. Check the etcd status and database size:

kubectl exec -it <etcd-pod-name> -n kube-system -- etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=table

6. The etcd database has grown too large, which is degrading performance.
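Whether the database is "too large" can be judged against the backend quota (`--quota-backend-bytes`; 4294967296 bytes is 4 GiB). The Python sketch below reads the output of `etcdctl endpoint status --write-out=json`, assuming the field names `Endpoint`, `Status.dbSize`, and `Status.dbSizeInUse` as emitted by etcd 3.4 and later. The function name, the warning threshold, and the sample data are hypothetical.

```python
import json

DEFAULT_QUOTA_BYTES = 4 * 1024 ** 3  # 4 GiB, i.e. --quota-backend-bytes=4294967296


def db_report(status_json: str, quota_bytes: int = DEFAULT_QUOTA_BYTES,
              warn_ratio: float = 0.7) -> list:
    """Summarize database size per endpoint.

    quota_used is the on-disk size as a fraction of the quota, and
    reclaimable_bytes is the space defragmentation could return
    (on-disk size minus the size actually in use).
    """
    report = []
    for entry in json.loads(status_json):
        status = entry["Status"]
        db_size = status["dbSize"]
        in_use = status.get("dbSizeInUse", db_size)
        quota_used = db_size / quota_bytes
        report.append({
            "endpoint": entry["Endpoint"],
            "quota_used": quota_used,
            "reclaimable_bytes": db_size - in_use,
            "warn": quota_used >= warn_ratio,
        })
    return report


# Hypothetical sample, trimmed to the fields used above.
SAMPLE_STATUS = json.dumps([
    {"Endpoint": "https://[127.0.0.1]:2379",
     "Status": {"dbSize": 3 * 1024 ** 3, "dbSizeInUse": 1 * 1024 ** 3}},
])

if __name__ == "__main__":
    for row in db_report(SAMPLE_STATUS):
        print(row)
```

A large gap between the on-disk size and the size in use suggests that defragmentation will help; if both are close to the quota, the keyspace itself has to shrink through compaction or cleanup of unneeded objects.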

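Solution:

1. Short term: compact the etcd revision history, then defragment the database to shrink it.

# Run on an etcd node. Compact up to the current revision (reported by endpoint status --write-out=json) first; defrag only returns space that compaction has freed.
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key compact <current-revision>
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key defrag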

2. Long term:

• Give the etcd nodes more resources (CPU, memory, and fast disks)
• Tune the etcd configuration, including the heartbeat interval and election timeout
• Monitor and alert on etcd so that performance problems are caught early
• Back up, compact, and defragment etcd regularly

# Example tuned etcd configuration
# (on kubeadm clusters etcd runs as a static Pod defined in /etc/kubernetes/manifests/etcd.yaml)
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
    - name: etcd
      image: registry.k8s.io/etcd:3.4.13-0
      command:
        - etcd
        - --name=etcd-0
        - --data-dir=/var/lib/etcd
        - --listen-client-urls=https://0.0.0.0:2379
        - --advertise-client-urls=https://<node-ip>:2379
        - --listen-peer-urls=https://0.0.0.0:2380
        - --initial-advertise-peer-urls=https://<node-ip>:2380
        - --initial-cluster=etcd-0=https://<node-ip>:2380
        - --initial-cluster-token=my-etcd-token
        - --initial-cluster-state=new
        - --client-cert-auth=true
        - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
        - --cert-file=/etc/kubernetes/pki/etcd/server.crt
        - --key-file=/etc/kubernetes/pki/etcd/server.key
        - --peer-client-cert-auth=true
        - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
        - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
        - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
        # Tuning parameters (intervals in milliseconds, quota in bytes = 4 GiB, compaction retention in hours)
        - --heartbeat-interval=300
        - --election-timeout=2000
        - --max-snapshots=5
        - --max-wals=5
        - --quota-backend-bytes=4294967296
        - --auto-compaction-retention=1
      resources:
        requests:
          cpu: 200m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 4Gi
      volumeMounts:
        - name: etcd-data
          mountPath: /var/lib/etcd
        # The certificate paths above must be mounted into the container
        - name: etcd-certs
          mountPath: /etc/kubernetes/pki/etcd
  volumes:
    - name: etcd-data
      hostPath:
        path: /var/lib/etcd
        type: DirectoryOrCreate
    - name: etcd-certs
      hostPath:
        path: /etc/kubernetes/pki/etcd
        type: DirectoryOrCreate

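Case 3: Pod-to-Pod communication failure caused by the network plugin

Problem: Pods in the cluster cannot reach each other, Services are unreachable, and applications malfunction.

Investigation:

1. Check Pod status and network configuration:

kubectl get pods --all-namespaces -o wide

2. Check Services and their endpoints:

kubectl get svc --all-namespaces
kubectl get endpoints --all-namespaces

3. Check the network plugin Pods:

kubectl get pods -n kube-system | grep -E 'calico|flannel|weave|cilium'

4. Check the network plugin logs:

kubectl logs -n kube-system <network-plugin-pod-name>

5. Create a test Pod and test connectivity:

kubectl run test-pod --image=busybox -- sleep 3600
kubectl exec -it test-pod -- /bin/sh
# Run inside the Pod
ping <target-pod-ip>
nslookup <service-name>

6. The network plugin Pods are restarting abnormally, and the inter-node network configuration is incorrect.

Solution:

1. Short term: restart the network plugin Pods and fix the configuration.

# Restart the network plugin Pods (their DaemonSet recreates them)
kubectl delete pod -n kube-system -l k8s-app=<network-plugin-name>
# Check the network configuration
# For Calico
kubectl get ippools.crd.projectcalico.org -o yaml
# For Flannel
kubectl get configmap -n kube-system kube-flannel-cfg -o yaml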

2. Long term:

• Monitor and alert on the network plugin
• Check network configuration and state regularly
• Prepare an incident response plan for network plugin failures
• Consider multiple network plugins or network policies to improve reliability

# Example network plugin alerting rules
# (the job names network-plugin and network-connectivity are placeholders for scrape jobs defined in your Prometheus configuration)
apiVersion: v1
kind: ConfigMap
metadata:
  name: network-plugin-monitoring
  namespace: monitoring
data:
  network-rules.yml: |
    groups:
      - name: network-plugin
        rules:
          - alert: NetworkPluginPodDown
            expr: up{job="network-plugin"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Network plugin pod is down"
              description: "Network plugin pod {{ $labels.pod }} has been down for more than 5 minutes."
          - alert: NetworkConnectivityFailure
            expr: probe_success{job="network-connectivity"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Network connectivity test failed"
              description: "Network connectivity test between pods failed for more than 5 minutes."

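Summary and Outlook

Cluster health checks are central to keeping a Kubernetes system highly available. This article walked through the full path from monitoring metrics to troubleshooting and maintenance: choosing monitoring tools, implementing health check strategies, managing and analyzing logs, diagnosing faults, automating operations and alerting, and applying best practices illustrated by case studies.

An effective health check system should include:

1. Comprehensive monitoring coverage: from the infrastructure up to the application layer
2. Sensible health check strategies: probes designed around how each application behaves
3. Centralized log management: collection, storage, analysis, and alerting
4. Systematic troubleshooting: a standardized process for handling failures
5. Intelligent automation: less manual intervention and faster recovery

As cloud-native technology evolves, Kubernetes health checking is evolving with it. Trends to expect include:

1. AI-driven predictive maintenance: machine learning applied to monitoring data to predict problems
2. Finer-grained fault localization: request-level diagnosis through techniques such as distributed tracing
3. Adaptive health checks: check strategies that adjust to system load and state
4. Cross-cluster health management: unified health checks across multi-cloud and hybrid-cloud environments
5. Stronger self-healing: more sophisticated automated recovery mechanisms

As operators, we need to keep learning and practicing, build a complete health check system, and keep our Kubernetes clusters running reliably as the foundation for the business. The methods and tools in this article should help you manage and maintain Kubernetes clusters, locate and fix problems quickly, and keep systems highly available.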
