Watchdog Fault Detection and Automatic Recovery on Linux

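A Linux watchdog is a mechanism for detecting and recovering from system failures. It is usually implemented as a timer: the system must notify the watchdog at regular intervals to keep the timer from expiring. If a failure prevents the system from sending that notification, the watchdog acts once the timer expires, for example by running a recovery action or rebooting the system.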

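Linux watchdogs come in hardware and software forms. With the rise of virtualization and cloud services, software watchdogs are now the more commonly used kind, and this article covers the software watchdog.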

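A software watchdog is a program that runs inside the operating system. It has the following characteristics:

It depends on the operating system being up, so it can itself be affected by software failures.
The timeout is configurable, typically from a few seconds to several minutes; the default is 60 seconds.
When the timer expires, it can trigger various actions, such as writing a log entry, sending an alert, running a script, or rebooting the system.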

Fault detection and automatic recovery

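The Linux kernel exposes watchdog functionality through watchdog drivers; for a software watchdog this is typically the softdog module. The driver is accessed through the /dev/watchdog character device. A user program opens the /dev/watchdog device file and writes to it periodically. If the program does not write to the device within the timeout, the watchdog triggers an action such as rebooting the system.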
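
The open-and-write cycle described above can be sketched in a few lines of shell. This is only an illustration, not how the watchdog daemon itself is implemented: the helper name pet_watchdog and its arguments are hypothetical. The final 'V' write relies on the "magic close" convention, which asks drivers that support it to disarm the timer on close; whether that works depends on the driver and its settings.

```shell
# Sketch: keep a watchdog device "fed" by writing to it at a fixed interval.
# WARNING: opening a real /dev/watchdog arms the timer. Try this against a
# scratch file first, as in the usage note below.

# pet_watchdog DEVICE COUNT INTERVAL   (hypothetical helper)
# The body runs in a subshell so the descriptor and variables stay local.
pet_watchdog() (
    device=$1 count=$2 interval=$3
    # Open the device once and keep descriptor 3 for all writes.
    exec 3>"$device" || exit 1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '.' >&3          # any write restarts the countdown
        sleep "$interval"
        i=$((i + 1))
    done
    # "Magic close": write 'V' right before closing so that drivers which
    # support it stop the timer instead of resetting the machine later.
    printf 'V' >&3
    exec 3>&-
)
```

Running pet_watchdog /tmp/fake-watchdog 3 1 against a scratch file leaves three dots followed by V in that file, which is a safe way to see the write pattern before pointing anything at the real device.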

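The watchdog daemon can run the checks listed below (as described in the output of man watchdog.conf). Once started, it relies on the kernel's system reset facility to reboot the system when one of these predefined serious problems occurs (a check fails). While all checks pass, the daemon keeps writing to the /dev/watchdog device file to tell the kernel to postpone the reset; if the kernel receives no notification within the timeout (one minute by default), it reboots the system. The watchdog can check the following items.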

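Load average thresholds: can be set for the 1-minute, 5-minute, and 15-minute averages.
Minimum available virtual memory (threshold).
The time allowed for a stat() call on a specified file to return (1 minute by default); useful when the system depends on NFS.
Whether the processes named in the specified pid files are still running.
Pinging specified IP addresses and checking whether any is unreachable.
Monitoring incoming traffic on a specified interface.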
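
To make the load average check concrete, here is a minimal shell sketch of the idea. It is not the daemon's own code: the helper names load_exceeds and check_load are hypothetical, and the input file is assumed to have the same layout as /proc/loadavg. As the max-load comments in the configuration file say, a limit of 0 disables that particular test.

```shell
# load_exceeds CURRENT LIMIT   (hypothetical helper)
# Succeeds (exit 0) when CURRENT is above LIMIT; a LIMIT of 0 disables it.
load_exceeds() {
    awk -v cur="$1" -v max="$2" 'BEGIN { exit !(max > 0 && cur > max) }'
}

# check_load FILE MAX1 MAX5 MAX15   (hypothetical helper)
# FILE is assumed to look like /proc/loadavg: "1min 5min 15min ...".
check_load() {
    read -r l1 l5 l15 _rest < "$1" || return 2
    if load_exceeds "$l1" "$2" || load_exceeds "$l5" "$3" ||
       load_exceeds "$l15" "$4"; then
        echo "load too high: $l1 $l5 $l15"
        return 1
    fi
    echo "load ok: $l1 $l5 $l15"
}
```

For example, check_load /proc/loadavg 24 18 12 compares the current load against the sample values from the max-load-1, max-load-5, and max-load-15 lines of the configuration file.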

For details of what watchdog can check, see Red Hat's documentation on configuring Watchdog.

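Installing watchdog

The following uses Rocky Linux 9.3 as an example to show how to install and use watchdog.

Rocky Linux does not install the watchdog package by default. Run the following command to install the watchdog package and its dependencies.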

# dnf install watchdog -y

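The watchdog configuration file

The watchdog configuration file is /etc/watchdog.conf. After modifying it, restart watchdog for the new settings to take effect.

To have watchdog use the device driver access "file", edit /etc/watchdog.conf and uncomment the following line.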

watchdog-device                = /dev/watchdog
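
If you prefer to script this edit rather than open an editor, something along these lines may work. It is a sketch under assumptions: the helper name uncomment_key is hypothetical, it relies on GNU sed's -i option (the sed shipped with Rocky Linux), and it only handles lines of the form "#key = value".

```shell
# uncomment_key FILE KEY   (hypothetical helper)
# Removes the leading '#' from a line of the form "#KEY = value" and
# leaves every other line untouched. Requires GNU sed for -i.
uncomment_key() {
    sed -i "s|^#\($2[[:space:]]*=\)|\1|" "$1"
}

# Example (as root; keep a backup of the original file first):
#   cp /etc/watchdog.conf /etc/watchdog.conf.bak
#   uncomment_key /etc/watchdog.conf watchdog-device
#   systemctl restart watchdog
```

Trying the function on a copy of the file first is a cheap way to confirm that only the intended line changes.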

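Under heavy load, the watchdog daemon may be swapped out, which can lead to a system reboot. To avoid this, set the realtime parameter to yes in the watchdog.conf configuration file.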

realtime                = yes

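By default, the Linux kernel does not reboot the system after a kernel panic. This is controlled by the kernel parameter kernel.panic=0 (automatic reboot after a panic is disabled). When using watchdog, we want the system to reboot after a kernel panic so that watchdog can resume monitoring, so change kernel.panic from halt (0) to reboot (any nonzero value). Setting kernel.panic=60 (reboot 60 seconds after a kernel panic) is recommended when using watchdog.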
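
One way to apply the kernel.panic setting is a sysctl drop-in file, sketched below. The helper name write_panic_conf and the file name 90-watchdog-panic.conf are made up for this example; any file name ending in .conf under /etc/sysctl.d should behave the same way.

```shell
# write_panic_conf DIR SECONDS   (hypothetical helper)
# Writes a sysctl drop-in that sets kernel.panic to SECONDS.
# The file name 90-watchdog-panic.conf is an arbitrary choice.
write_panic_conf() {
    printf 'kernel.panic = %s\n' "$2" > "$1/90-watchdog-panic.conf"
}

# Example (as root):
#   write_panic_conf /etc/sysctl.d 60   # persist across reboots
#   sysctl --system                     # reload all sysctl configuration
#   sysctl kernel.panic                 # show the value now in effect
```

For a change that lasts only until the next reboot, sysctl -w kernel.panic=60 sets the value directly without writing a file.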

The default watchdog configuration file is shown below.

# cat /etc/watchdog.conf
# ====================================================================
# Configuration for the watchdog daemon. For more information on the
# parameters in this file use the command 'man watchdog.conf'
# ====================================================================

# =================== The hardware timer settings ====================
#
# For this daemon to be effective it really needs some hardware timer
# to back up any reboot actions. If you have a server then see if it
# has IPMI support. Otherwise for Intel-based machines try the iTCO_wdt
# module, otherwise (or if that fails) then see if any of the following
# module load and work:
#
# it87_wdt it8712f_wdt w83627hf_wdt w83877f_wdt w83977f_wdt
#
# If all else fails then 'softdog' is better than no timer at all!
# Or work your way through the modules listed under:
#
# /lib/modules/`uname -r`/kernel/drivers/watchdog/
#
# To see if they load, present /dev/watchdog, and are capable of
# resetting the system on time-out.

# Uncomment this to use the watchdog device driver access "file".

#watchdog-device                = /dev/watchdog

# Uncomment and edit this line for hardware timeout values that differ
# from the default of one minute.

#watchdog-timeout       = 60

# If your watchdog trips by itself when the first timeout interval
# elapses then try uncommenting the line below and changing the
# value to 'yes'.

#watchdog-refresh-use-settimeout        = auto

# If you have a buggy watchdog device (e.g. some IPMI implementations)
# try uncommenting this line and setting it to 'yes'.

#watchdog-refresh-ignore-errors = no

# ====================== Other system settings ========================
#
# Interval between tests. Should be a couple of seconds shorter than
# the hardware time-out value.

#interval               = 1

# The number of intervals skipped before a log message is written (i.e.
# a multiplier for 'interval' in terms of syslog messages)

#logtick        = 1

# Directory for log files (probably best not to change this)

#log-dir                = /var/log/watchdog

# Email address for sending the reboot reason. This needs sendmail to
# be installed and properly configured. Maybe you should just enable
# syslog forwarding instead?

#admin                  = root

# Lock the daemon in to memory as a real-time process. This greatly
# decreases the chance that watchdog won't be scheduled before your
# machine is really loaded.

realtime                = yes
priority                = 1

# ====================== How to handle errors  =======================
#
# If you have a custom binary/script to handle errors then uncomment
# this line and provide the path. For 'v1' test binary files they also
# handle error cases.
# With enforcing SELinux policy please use the /usr/libexec/watchdog/scripts/

# or /etc/watchdog.d/ for your test-binary and repair-binary configuration.
#repair-binary          = /usr/sbin/repair
#repair-timeout         = 60

# The retry-timeout and repair limit are used to handle errors in a
# more robust manner. Errors must persist for longer than this to
# action a repair or reboot, and if repair-maximum attempts are
# made without the test passing a reboot is initiated anyway.

#retry-timeout          = 60
#repair-maximum         = 1

# Configure the delay on reboot from sending SIGTERM to all processes
# and to following up with SIGKILL for any that are ignoring the polite
# request to stop.

#sigterm-delay          = 5

# ====================== User-specified tests ========================
#
# Specify the directory for auto-added 'v1' test programs (any executable
# found in the 'test-directory should be listed).

#test-directory = /etc/watchdog.d

# Specify any v0 custom tests here. Multiple lines are permitted, but
# having any 'v1' programs/scripts discovered in the 'test-directory' is
# the better way.

#test-binary            =

# Specify the time-out value for a test error to be reported.

#test-timeout           = 60

# ====================== Typical tests ===============================
#
# Specify any IPv4 numeric addresses to be probed.
# NOTE: You should check you have permission to ping any machine before
# using it as a test. Also remember if the target goes down then this
# machine will reboot as a result!

#ping                   = 172.16.0.1
#ping                   = 192.168.1.1

# Set the number of ping attempts in each 'interval' of time. Default
# is 3 and it completes on the first successful ping.
# NOTE: Round-trip delay has to be less than 'interval' / 'ping-count'
# for test success, but this is unlikely to be exceeded except possibly
# on satellite links (very unlikely case!).

#ping-count             = 3

# Specify any network interface to be checked for activity.

#interface              = eth0

# Specify any files to be checked for presence, and if desired, checked
# that they have been updated more recently than 'change' seconds.

#file                   = /var/log/syslog
#change                 = 1407

# Uncomment to enable load average tests for 1, 5 and 15 minute
# averages. Setting one of these values to '0' disables it. These
# values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher
# than 25 in most cases).

#max-load-1             = 24
#max-load-5             = 18
#max-load-15            = 12

# Check available memory on the machine.
#
# The min-memory check is a passive test from reading the file
# /proc/meminfo and computed from MemFree + Buffers + Cached
# If this is below a few tens of MB you are likely to have problems.
#
# The allocatable-memory is an active test checking it can be paged
# in to use.
#
# Maximum swap should be based on normal use, probably a large part of
# available swap but paging 1GB of swap can take tens of seconds.
#
# NOTE: This is the number of pages, to get the real size, check how
# large the pagesize is on your machine (typically 4kB for x86 hardware).

#min-memory             = 1
#allocatable-memory     = 1
#max-swap = 0

# Check for over-temperature. Typically the temperature-sensor is a
# 'virtual file' under /sys and it contains the temperature in
# milli-Celsius. Usually these are generated by the 'sensors' package,
# but take care as device enumeration may not be fixed.

#temperature-sensor     =
#max-temperature        = 90

# When using custom service pid check with custom service
# systemd unit file please be aware the "Requires="
# does dependent service deactivation.
# Using "Before=watchdog.service" or "Before=watchdog-ping.service"
# in the custom service unit file may be the desired operation instead.
# See man 5 systemd.unit for more details.
#
# Check for a running process/daemon by its PID file. For example,
# check if rsyslogd is still running by enabling the following line:

#pidfile                = /var/run/rsyslogd.pid

This article has introduced what watchdog does and how to install it.