
ETCD Single-Node Fault Emergency Recovery

2022-08-11 07:04:00 !Nine thought & & gentleman!

Series contents

Building an etcd cluster with containers



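Preface

In production, etcd clusters often suffer either a single-node failure or a whole-cluster failure, and each case needs its own repair procedure. This article is an emergency recovery manual for the single-node case.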


1. Overall recovery process

Because etcd uses the Raft protocol, a cluster of n nodes can tolerate (n-1)/2 failed nodes, rounded down. When a single node fails in a three-node cluster, the cluster therefore remains available and business reads and writes are not affected.
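The tolerance figure follows from Raft's majority rule: a cluster of n members needs n/2 + 1 votes (integer division) to make progress. A minimal sketch of that arithmetic; the function names quorum and fault_tolerance are made up for this illustration.

```shell
# quorum N: votes needed for a majority in an N-member cluster
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that may fail while a majority still remains
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerance=$(fault_tolerance "$n")"
done
```

Note that a 2-member cluster has a tolerance of 0, so between removing the failed member and re-adding it the cluster cannot absorb another failure.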
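The overall recovery process is as follows:

Remove the failed node from the cluster with member remove
Delete the dirty data on the failed node and rebuild it
Add the node back to the cluster with member add
The cluster synchronizes data and recovers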

2. Detailed recovery steps

2.1 Environment

Three virtual machines were created locally with vmstation. Their details are as follows:

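| Node name | Node IP | Node spec | OS | etcd version | Docker version |
|-----------|----------------|-----------|------------|------|------|
| etcd1 | 192.168.92.128 | 1c1g 20g | CentOS 7.4 | v3.5 | 13.1 |
| etcd2 | 192.168.92.129 | 1c1g 20g | CentOS 7.4 | v3.5 | 13.1 |
| etcd3 | 192.168.92.130 | 1c1g 20g | CentOS 7.4 | v3.5 | 13.1 |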

Assume that the etcd2 node has failed and that its local data is corrupted.

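2.2 Remove the failed node from the cluster

Use the member remove command to remove the failed node. The cluster then has only 2 members. As long as the removed node was not the leader, this does not trigger a new leader election, and the cluster keeps running normally.

Check the current cluster state: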

export ETCDCTL_API=3
export ETCD_ENDPOINTS=192.168.92.128:2379,192.168.92.129:2379,192.168.92.130:2379
etcdctl --endpoints=$ETCD_ENDPOINTS --write-out=table member list
etcdctl --endpoints=$ETCD_ENDPOINTS --write-out=table endpoint status
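The removal itself takes the hexadecimal member ID reported by member list. A minimal sketch, assuming the default (simple) output format of etcdctl member list, where each line reads ID, status, name, peer URLs, client URLs, is-learner; the helper name member_id_by_name and the sample IDs are hypothetical.

```shell
# member_id_by_name NAME: read `etcdctl member list` output (simple format)
# on stdin and print the ID of the member called NAME.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Hypothetical sample output; real member IDs will differ.
sample='1a2b3c4d5e6f7a8b, started, etcd1, http://192.168.92.128:2380, http://192.168.92.128:2379, false
2b3c4d5e6f7a8b9c, started, etcd2, http://192.168.92.129:2380, http://192.168.92.129:2379, false
3c4d5e6f7a8b9c0d, started, etcd3, http://192.168.92.130:2380, http://192.168.92.130:2379, false'

printf '%s\n' "$sample" | member_id_by_name etcd2

# Against a live cluster (not run here):
#   id=$(etcdctl --endpoints=$ETCD_ENDPOINTS member list | member_id_by_name etcd2)
#   etcdctl --endpoints=$ETCD_ENDPOINTS member remove "$id"
```

Pointing --endpoints only at the healthy members while etcd2 is down may avoid connection errors from the failed endpoint.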


2.2 Delete abnormal node data

2.2.1 Stop the failed member's container

docker stop etcd2
docker rm etcd2   # free the container name so that docker run --name etcd2 works later

2.2.2 Delete the data
The data directory is mounted with -v /data/etcd:/data/etcd, so deleting the contents of that host directory clears the etcd data.

 rm -rf /data/etcd/*
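Because rm -rf with a glob is unforgiving, the deletion can be wrapped in a small guard. This is only a sketch; the function name clean_data_dir is hypothetical, and the demonstration runs against a temporary directory rather than /data/etcd.

```shell
# clean_data_dir DIR: empty DIR, refusing obviously dangerous arguments.
clean_data_dir() {
  dir="$1"
  case "$dir" in
    ""|"/") echo "refusing to clean '$dir'" >&2; return 1 ;;
  esac
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  # ${dir:?} aborts if dir is empty, so the glob can never expand to /*
  rm -rf "${dir:?}"/*
}

# Demonstration on a throwaway directory that mimics an etcd data dir layout.
demo=$(mktemp -d)
mkdir -p "$demo/member/snap" "$demo/member/wal"
clean_data_dir "$demo" && echo "remaining entries: $(ls -A "$demo" | wc -l | tr -d ' ')"
rm -rf "$demo"

# On the failed node (not run here):
#   clean_data_dir /data/etcd
```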

2.3 Re-add the node to the cluster

Run the following command to add the node back to the cluster. Once the node has started, data synchronization and leader election complete automatically.

export ETCDCTL_API=3
export ETCD_ENDPOINTS=192.168.92.128:2379,192.168.92.129:2379,192.168.92.130:2379
etcdctl --endpoints=$ETCD_ENDPOINTS member add etcd2 --peer-urls=http://192.168.92.129:2380
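After a confirmation line, etcdctl member add typically prints a set of ETCD_* variables describing how the new member should be started, including the existing cluster state. A small sketch of pulling those lines out of the output; the helper name member_add_env is hypothetical, and the sample text (IDs and member order) is made up for illustration.

```shell
# member_add_env: read `etcdctl member add` output on stdin and keep only the
# ETCD_* assignments printed for configuring the new member.
member_add_env() {
  grep '^ETCD_'
}

# Hypothetical sample output; the member and cluster IDs are invented.
sample='Member 2b3c4d5e6f7a8b9c added to cluster 9d8c7b6a5f4e3d2c

ETCD_NAME="etcd2"
ETCD_INITIAL_CLUSTER="etcd1=http://192.168.92.128:2380,etcd2=http://192.168.92.129:2380,etcd3=http://192.168.92.130:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.92.129:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"'

printf '%s\n' "$sample" | member_add_env
```

These values correspond to the --name, --initial-cluster, --initial-advertise-peer-urls and --initial-cluster-state flags passed to etcd when the node is started.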


2.4 Start the node

2.4.1 Complete startup script

cat start_etcd.sh
#!/bin/sh

name="etcd2"
host="192.168.92.129"
cluster="etcd1=http://192.168.92.128:2380,etcd2=http://192.168.92.129:2380,etcd3=http://192.168.92.130:2380"

# The node rejoins a running cluster, so --initial-cluster-state is "existing".
docker run -d --privileged=true -p 2379:2379 -p 2380:2380 \
  -v /data/etcd:/data/etcd --name "$name" --net=host \
  quay.io/coreos/etcd:v3.5.0 /usr/local/bin/etcd --name "$name" \
  --data-dir /data/etcd \
  --listen-client-urls "http://$host:2379" --advertise-client-urls "http://$host:2379" \
  --listen-peer-urls "http://$host:2380" --initial-advertise-peer-urls "http://$host:2380" \
  --initial-cluster "$cluster" --initial-cluster-token tkn \
  --initial-cluster-state existing \
  --log-level info --logger zap --log-outputs stderr

Note: because the etcd data has been deleted, the node fetches its data from the other members when it starts. The --initial-cluster-state parameter must therefore be changed from new to existing:

--initial-cluster-state existing

2.4.2 Check the logs

docker logs 8bf31834f8ce

2.5 Wait for the cluster to finish syncing and recover

Check the current member information of the cluster:

export ETCDCTL_API=3
export ETCD_ENDPOINTS=192.168.92.128:2379,192.168.92.129:2379,192.168.92.130:2379
etcdctl --endpoints=$ETCD_ENDPOINTS --write-out=table member list
etcdctl --endpoints=$ETCD_ENDPOINTS --write-out=table endpoint status
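Rather than re-running these commands by hand, the wait can be scripted as a retry loop. A minimal sketch; wait_until is a hypothetical helper, and the commented usage assumes that etcdctl endpoint health exits with a non-zero status while any listed endpoint is unhealthy.

```shell
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Returns 1 if CMD never succeeds.
wait_until() {
  tries="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Against a live cluster (not run here): poll for up to about one minute.
#   wait_until 30 2 etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health \
#     && echo "all endpoints healthy"
```

Once the loop succeeds, member list and endpoint status should show etcd2 as started and close to the other members' raft index.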



Summary

Because the cluster keeps multiple replicas of the data, a single failed node does not bring down the whole cluster. Recovery only requires that the node be started normally and that its data be resynchronized.

Original site

Copyright notice
This article was written by [!Nine thought & & gentleman!]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/223/202208110516537339.html