smartctl, a disk monitoring tool

2019-08-19

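Preface

To operate a storage system well, you need solid monitoring of the hardware state of its disks. Do not wait until a disk has failed before starting maintenance; prepare the replacement while the disk is about to fail. Otherwise, the data-recovery events in a distributed storage system become very troublesome.

Our focus is on SSDs, so monitoring for SSD lifetime is included (the remaining erase cycles reported by Media_Wearout_Indicator). From what we have seen, one full round of writes across the cluster lowers the erase lifetime by roughly 2-6%. Bad blocks are rare: as long as enough erase cycles remain, they almost never appear.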

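Monitoring frequency and scope

Check disk information every 30 minutes or every hour to find disks that are near the end of their life or have poor read/write speed.
Items to monitor: status, read/write speed, bad blocks, remaining life, capacity.
Running the checks as Ansible Tower jobs is recommended; an alerting system is an alternative.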

Installing the monitoring tool

# yum install smartmontools

Checking disk information

# smartctl -a /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.19.8-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD10JPVX-22JC3T0
Serial Number:    WD-WX21A16EUYJL
LU WWN Device Id: 5 0014ee 606838394
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Aug 19 10:20:36 2019 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED  <- overall assessment: passed

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (18600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 209) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   182   021    Pre-fail  Always       -       1750
  4 Start_Stop_Count        0x0032   001   001   000    Old_age   Always       -       476678
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20480
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       59
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       873911
194 Temperature_Celsius     0x0022   114   109   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl commands

  Show the error log summary for the target disk
# smartctl -l error   /dev/sdb

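  Show detailed information for the target disk (-A prints only the attribute table, -a prints all SMART information)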
# smartctl -A   /dev/sdb
# smartctl -a   /dev/sdb

  Enable SMART support on the target disk
# smartctl --smart=on --offlineauto=on --saveauto=on /dev/sdb

  Check the SMART health status
# smartctl -H /dev/sdb
SMART overall-health self-assessment test result: PASSED <- if this says FAILED, replace the disk immediately.
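
The health check lends itself to a scheduled job that covers several disks at once. The following is a minimal sketch, assuming the result line has the format shown above; the function name `check_disks` and the `SMARTCTL` override variable are hypothetical and not part of smartmontools.

```shell
# Hypothetical helper: runs `smartctl -H` on each device given as an argument
# and prints an alert for every device whose result is not PASSED.
# SMARTCTL may be set to another command name (an assumption made here so the
# function can be exercised without real disks).
check_disks() {
    local dev result failed=0
    for dev in "$@"; do
        result=$("${SMARTCTL:-smartctl}" -H "$dev" | awk -F': *' '
            /overall-health self-assessment test result/ {
                split($2, words, " ")   # first word after the colon
                print words[1]
                exit
            }')
        if [ "$result" != "PASSED" ]; then
            echo "ALERT: $dev health is ${result:-UNKNOWN}"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check_disks /dev/sdb /dev/sdc || echo "at least one disk needs attention"
```

The function returns a non-zero status when any disk fails the check, so a cron job or an Ansible task can treat that status as the trigger for an alert.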
  
  Show all information for the target disk (the most complete output).
# smartctl -x /dev/sdb

  Query the remaining life of disk sdb1 (the key attribute is `Media_Wearout_Indicator`)
# smartctl -a -d sat+megaraid,5 /dev/sdb1 | egrep "ID#|Reallocated_Sector_Ct|Media_Wearout_Indicator|Available_Reservd_Space"

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0

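1. Media_Wearout_Indicator: wear consumed so far; 100 means no wear.
It reflects how many erase cycles the NAND on the SSD has gone through. The value starts at 100 and decreases linearly as erase cycles accumulate, in proportion to the erase count on its way from 0 to the maximum. Once the value reaches 1 it stops decreasing, which means some NAND on the SSD has reached its maximum erase count. At that point, back up the data and replace the SSD.
2. Reallocated_Sector_Ct: the number of bad blocks that have developed since the drive left the factory. The initial value is 100; once bad blocks appear, the count rises from 1, increasing by 1 for every 4 bad blocks.
3. Available_Reservd_Space: the reserved space remaining on the SSD. The initial value is 100, meaning 100%, and the threshold is 10; a value of 10 means the reserved space cannot shrink any further.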
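
For automated monitoring, the normalized VALUE column of these attributes can be extracted and compared against a limit. This is a minimal sketch, assuming the attribute table layout shown above (attribute name in the second column, VALUE in the fourth); the function names and any limit passed in are hypothetical and should be tuned per drive model.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# normalized VALUE column (4th field) of the attribute named in $1.
# Adding 0 strips leading zeros, so "095" is printed as 95.
smart_attr_value() {
    awk -v name="$1" '$2 == name { print $4 + 0; exit }'
}

# Succeeds (status 0) when the attribute named in $1 is present on stdin
# and its VALUE is at or below the limit given in $2.
smart_attr_at_or_below() {
    local value
    value=$(smart_attr_value "$1")
    [ -n "$value" ] && [ "$value" -le "$2" ]
}

# Example (the limit 20 is an arbitrary illustration):
#   smartctl -A /dev/sdb | smart_attr_at_or_below Media_Wearout_Indicator 20 \
#       && echo "plan an SSD replacement"
```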

Testing disk read speed

# hdparm -Tt /dev/sdb
/dev/sdb:
 Timing cached reads:   14788 MB in  1.99 seconds = 7417.68 MB/sec
 Timing buffered disk reads: 336 MB in  3.00 seconds = 111.96 MB/sec
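
To flag slow disks automatically, the buffered read figure can be pulled out of this output and compared with a minimum. This is a minimal sketch, assuming the output format shown above; the function names and the minimum speed are hypothetical.

```shell
# Hypothetical helper: reads `hdparm -t` (or -Tt) output on stdin and prints
# the buffered disk read speed in MB/sec (the second-to-last field).
buffered_read_mbps() {
    awk '/Timing buffered disk reads/ { print $(NF - 1); exit }'
}

# Succeeds (status 0) when the buffered read speed on stdin is at least the
# integer minimum given in $1. The fractional part of the speed is dropped
# before comparing.
read_speed_ok() {
    local speed
    speed=$(buffered_read_mbps)
    [ -n "$speed" ] && [ "${speed%.*}" -ge "$1" ]
}

# Example (the minimum of 80 MB/sec is an arbitrary illustration):
#   hdparm -t /dev/sdb | read_speed_ok 80 || echo "/dev/sdb reads are slow"
```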

Checking for bad blocks

  Scan the target disk for bad blocks in read-only mode
# badblocks -s -v /dev/sdb
Checking blocks 0 to 976762583
Checking for bad blocks (read-only test):   0.71% done, 1:04 elapsed. (0/0/0 errors)
  
  Scan the whole of sdb in read-only mode and write the bad blocks found to a file.
# badblocks -s -v -o /root/badblocks.log /dev/sdb
Checking blocks 0 to 976762583
Checking for bad blocks (read-only test):   1.16% done, 1:47 elapsed. (0/0/0 errors)
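
The file written with -o contains one bad block number per line, so a monitoring job can simply count its lines. This is a minimal sketch; the function name `count_bad_blocks` is hypothetical.

```shell
# Hypothetical helper: prints how many bad block numbers are recorded in the
# file given as $1 (a file produced with `badblocks -o`). Only lines made up
# entirely of digits are counted, and an empty file yields 0.
count_bad_blocks() {
    awk '/^[0-9]+$/ { n++ } END { print n + 0 }' "$1"
}

# Example:
#   [ "$(count_bad_blocks /root/badblocks.log)" -gt 0 ] && echo "bad blocks found on /dev/sdb"
```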

Copying a failing disk or partition

  Copy a whole disk
# dd if=/dev/vdf of=/dev/vdh bs=4M

  Copy partition /dev/vdf4 to /dev/vdh4, using a 4 MiB block size.
# dd if=/dev/vdf4 of=/dev/vdh4 bs=4M
 
  Clone the target disk over the network onto a disk in a remote machine.
# dd if=/dev/sda bs=4M|ssh 192.168.1.13 "dd of=/dev/sdc conv=fdatasync"
# dd if=/dev/hdc5 |ssh 218.X.X.X dd of=/root/hdc5 

  Back up a remote machine's disk to the local machine
# ssh 218.X.X.X dd if=/dev/hdc5 | dd of=/root/hdc5
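
After a copy it may be worth confirming that the destination matches the source. This is a minimal sketch, assuming `sha256sum` from coreutils is available, that both paths can be read in full, and that they have the same length (for example two partitions of equal size, or a partition and its image file); the function name `same_checksum` is hypothetical.

```shell
# Hypothetical helper: succeeds (status 0) when the two paths hold identical
# bytes, returns 1 when they differ, and 2 when either path cannot be read.
# Feeding the data through stdin keeps file names out of the sha256sum
# output, so the two results can be compared directly.
same_checksum() {
    local a b
    a=$(sha256sum < "$1") || return 2
    b=$(sha256sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Example (assumes the two partitions are the same size):
#   same_checksum /dev/vdf4 /dev/vdh4 || echo "copy does not match the source"
```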

OSD removal script

# cat  delete_osd.sh
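#!/bin/bash
# Remove one OSD from the cluster. Usage: ./delete_osd.sh <osd-id>

if [ -z "$1" ]; then
    echo "usage: $0 <osd-id>" >&2
    exit 1
fi

# Mark the OSD out, then stop its daemon.
sudo ceph osd out "$1"
sleep 2
sudo systemctl stop "ceph-osd@$1.service"
sleep 2
# Remove the OSD from the CRUSH map, delete its auth key, and drop it from the OSD map.
sudo ceph osd crush remove "osd.$1"
sleep 2
sudo ceph auth del "osd.$1"
sleep 2
sudo ceph osd rm "$1"
sleep 2
# Unmount and delete the OSD data directory if it is still present.
if [ -d "/var/lib/ceph/osd/ceph-$1" ]; then
    sudo umount "/var/lib/ceph/osd/ceph-$1"
    sleep 2
    sudo rm -rf "/var/lib/ceph/osd/ceph-$1"
fi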
# After the OSD was removed, the PV, VG, and LV belonging to it were looked up and removed by hand.


lvremove -y /dev/ceph-3aa53c5f-1cec-4be8-93d2-e74c4c8cadc7/osd-block-8962ee76-8f69-42b4-ae5c-66284ced7152
lvremove -y /dev/ceph-077ea2ea-9610-420b-a941-c0b0676210b9/osd-block-44a63df1-96ce-46e1-a529-ae9bcb515e51
lvremove -y /dev/ceph-c04b176c-f77d-4335-badf-d9d7f4cc4938/osd-block-f3ce3a48-228a-4746-8bb8-ad2cc7cd3989

vgremove -y ceph-c04b176c-f77d-4335-badf-d9d7f4cc4938
vgremove -y ceph-077ea2ea-9610-420b-a941-c0b0676210b9
vgremove -y ceph-3aa53c5f-1cec-4be8-93d2-e74c4c8cadc7

pvremove -y /dev/sdck
pvremove -y /dev/sdcp
pvremove -y /dev/sdcy

vgscan --cache
pvscan --cache
lvscan --cache

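After adding OSDs, adjust the recovery speed to avoid affecting production traffic. Larger values speed up recovery but compete more with client I/O, so lower them if workloads suffer.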

sudo ceph tell osd.* injectargs '--osd_recovery_max_single_start 8 --osd_recovery_sleep_hdd 0 --osd_recovery_max_active 8 --osd_max_backfills 8'