3ware 3DM: a RAID1 member drive died

For the past several days I had been getting one error mail after another like this:

3ware 3DM alert -- host: www.example.com

WARNING: Drive sector ECC error corrected on port 1 on controller ID:0. (0x23)

source: www.example.com

Today the following two mails arrived, and the drive died for good:

3ware 3DM alert -- host: www.example.com

WARNING: Drive timeout encountered on port 1 on controller ID:0. Check cables and drives for media errors. (0x9)

source: www.example.com

3ware 3DM alert -- host: www.example.com

ERROR: Disk Array Unit 0 on controller ID:0 is degraded and no longer fault tolerant. Check log for drive errors. (0x2)

Fault tolerant disk arrays become degraded or incomplete when they cannot write to or read from a member drive. The array's data may be read and new data may be written to the array, however, the array is still NO LONGER FAULT TOLERANT.

When an array is degraded or incomplete you have three courses of action:

1. Replace the suspected drive and REBUILD the array using 3DM. See the Configure page.

2. BRING THE SYSTEM DOWN and check cabling and connections. Reboot the system and attempt to REBUILD THE ARRAY using the 3ware Disk Array Configuration Utility BIOS extension.

3. DO NOTHING and continue operating with the array functional but not fault tolerant.

NOTE: Please examine the 3DM ALARMS page for more information regarding the cause of the failure.



source: www.example.com

In logwatch it looks like this:

--------------------- Kernel Begin ------------------------


2 Time(s): 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.

---------------------- Kernel End -------------------------

Glad I was running RAID1. The drive lasted 4 years and 1 month.
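
Since the corrected-ECC warnings showed up well before the drive actually dropped out, it can help to count the 3w-xxxx AEN lines per port straight from the kernel log. The following is only a rough sketch: the function name count_3w_aen is made up for this example, and it assumes the kernel messages look like the logwatch line above and land in a file such as /var/log/messages.

```shell
#!/bin/sh
# Count 3w-xxxx AEN messages per port from kernel log text on stdin.
# Assumes lines shaped like:
#   3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.
count_3w_aen() {
    grep '3w-xxxx: .*AEN:' |
        sed -n 's/.*Port #\([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "port %s: %d AEN message(s)\n", $2, $1 }'
}
```

Usage would be something like `count_3w_aen < /var/log/messages`; a port whose count keeps climbing is a candidate for early replacement.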

nagios-nsca-client: install and a cron check example

To send the results of cron jobs to NSCA, install nagios-nsca-client and libmcrypt.

nagios-nsca-client redhat9
http://packages.sw.be/nagios-nsca/nagios-nsca-client-2.7.2-2.rh9.rf.i386.rpm

libmcrypt redhat9
http://apt.sw.be/redhat/9/en/i386/RPMS.dag/libmcrypt-2.5.7-1.dag.rh90.i386.rpm

# rpm -Uvh libmcrypt-2.5.7-1.dag.rh90.i386.rpm
# rpm -Uvh nagios-nsca-client-2.7.2-2.rh9.rf.i386.rpm

Configure send_nsca. Use the same settings as the receiving nsca daemon.
vi /etc/nagios/send_nsca.cfg
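
"The same settings" here mainly means the shared password and the encryption method. As far as I know, send_nsca.cfg calls the latter encryption_method while the server's nsca.cfg calls it decryption_method. A small sketch for comparing the two sides follows; the function name nsca_shared_settings is made up for this example, and note that it prints the password in clear text.

```shell
#!/bin/sh
# Print the settings that must agree between send_nsca.cfg (client)
# and nsca.cfg (server), normalized so the two outputs can be diffed.
# Warning: this prints the shared password in clear text.
nsca_shared_settings() {
    grep -E '^(password|encryption_method|decryption_method)=' "$1" |
        sed 's/^decryption_method=/encryption_method=/' |
        sort
}
```

Run `nsca_shared_settings /etc/nagios/send_nsca.cfg` on the client and the same function against nsca.cfg on the server (its path depends on the install); the two outputs should be identical.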

Create submit_check_result:
vi /etc/nagios/submit_check_result

#!/bin/sh

central_server="example.jp"

# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = state_string (A string representing the status of
#       the given service - "OK", "WARNING", "CRITICAL"
#       or "UNKNOWN")
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service checks)
#

# Convert the state string to the corresponding return code.
# UNKNOWN (and anything unrecognized) maps to 3, the standard
# Nagios plugin return code for UNKNOWN.
return_code=3

case "$3" in
    OK)
        return_code=0
        ;;
    WARNING)
        return_code=1
        ;;
    CRITICAL)
        return_code=2
        ;;
    UNKNOWN)
        return_code=3
        ;;
    [0-3])
        return_code=$3
        ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/sbin/send_nsca $central_server -c /etc/nagios/send_nsca.cfg

Test submit_check_result:

# chmod 700 submit_check_result
# ./submit_check_result remote 'service name' 0 'OK test'
1 data packet(s) sent to host successfully.
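
To see what would be piped into send_nsca without actually sending anything, the conversion can be sketched as two small functions. The names state_to_code and format_nsca_line are made up for this example; the sketch maps UNKNOWN and any unrecognized state to 3, which is the standard plugin return code for UNKNOWN.

```shell
#!/bin/sh
# Map a state string (or numeric code) to a Nagios return code.
state_to_code() {
    case "$1" in
        OK|0)       echo 0 ;;
        WARNING|1)  echo 1 ;;
        CRITICAL|2) echo 2 ;;
        *)          echo 3 ;;   # UNKNOWN or anything unrecognized
    esac
}

# Build the tab-separated line that send_nsca expects on stdin:
# host_name <TAB> svc_description <TAB> return_code <TAB> plugin_output
format_nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$(state_to_code "$3")" "$4"
}
```

For example, `format_nsca_line remote 'service name' OK 'OK test' | od -c` shows the four fields and the tabs between them.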

A script for sending results to NSCA. This one reports the result of an afbackup run:

#!/bin/sh
# Change this to the host name configured in Nagios
SERVER="example.com"
/usr/local/backup/client/bin/incr_backup
#/usr/sbin/incr_backup
if [ $? -ne 0 ]; then
#  echo "Error";
        OUTPUT="BACKUP is Critical"
        STATE=2;
else
#  echo "OK";
        OUTPUT="BACKUP is OK"
        STATE=0;
fi
/etc/nagios/submit_check_result "$SERVER" CRON_AFBACKUP "$STATE" "$OUTPUT"
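
The same pattern works for any cron job, not just afbackup. Below is a rough generalization; the function name report_cron_job and the SUBMIT override are made up for this example, and the host and service names in the usage line are placeholders.

```shell
#!/bin/sh
# Path to the submit script; override SUBMIT (e.g. SUBMIT=echo) for a dry run.
SUBMIT="${SUBMIT:-/etc/nagios/submit_check_result}"

# report_cron_job HOST SERVICE COMMAND [ARGS...]
# Runs COMMAND and submits OK (0) or CRITICAL (2) as a passive result.
report_cron_job() {
    host="$1"
    service="$2"
    shift 2
    "$@" && rc=0 || rc=$?
    if [ "$rc" -eq 0 ]; then
        state=0
        output="$service is OK"
    else
        state=2
        output="$service is Critical (exit status $rc)"
    fi
    "$SUBMIT" "$host" "$service" "$state" "$output"
}
```

A crontab entry could then call something like `report_cron_job example.com SOME_SERVICE /path/to/job`, where the service name has to match a service_description defined on the Nagios side.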

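services.cfg on the NSCA server. Passive results are matched by host_name and service_description, so both must be identical to what the client script submits.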

define service{
use                     generic-service
host_name               example.com
check_period            none
service_description     CRON_AFBACKUP
check_command           service-is-stale
check_freshness         1
freshness_threshold     90000
max_check_attempts      1
active_checks_enabled   0
}

If no result arrives for 24+1 hours (3600*25 = 90000 seconds), the service goes into an error state.
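
For jobs on a different schedule, the threshold is simply the job interval plus some slack, converted to seconds. A trivial helper, with a made-up name, just to avoid arithmetic slips:

```shell
#!/bin/sh
# Convert hours to seconds for freshness_threshold.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

hours_to_seconds 25   # daily job plus one hour of slack
```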

commands.cfg on the NSCA server:

define command{
    command_name    service-is-stale
    command_line    $USER1$/check_dummy 2 'CRITICAL: Service results are stale!'
}


A useful reference for installing Nagios NSCA:
http://www.on-sky.net/~hs/misc/?NSCA+Howto

ST3250310NS firmware Upgrade

The 06 firmware for the ST3250310NS had been released, so I upgraded.

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963&NewLang=en

ST3250310NS 9CA152-301, 302, 303, 501, 502, 503
http://www.seagate.com/staticfiles/support/downloads/firmware/ES2SN06C-1D2DMoose.iso

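Burn the ISO to a CD-R and boot from the CD.
S shows the model number and firmware version.
A upgrades the ST3250310NS (250 GB).
B upgrades the ST3500320NS (500 GB).
So check with S first, then press the matching key.
All of my drives turned out to be the OEM version of 9CA152-303.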
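
The model and firmware version can also be checked from the running OS before and after the upgrade, using the identity section that `smartctl -i` prints. A rough sketch, where the function name drive_model_and_firmware is made up and the device name in the usage line is an assumption:

```shell
#!/bin/sh
# Extract model and firmware version from `smartctl -i` output on stdin.
# Relies on the "Device Model:" and "Firmware Version:" lines that
# smartctl prints for ATA/SATA drives.
drive_model_and_firmware() {
    awk -F': *' '
        /^Device Model:/     { model = $2 }
        /^Firmware Version:/ { fw = $2 }
        END                  { print model " " fw }
    '
}
```

Usage would be along the lines of `smartctl -i /dev/sda | drive_model_and_firmware`. For drives behind a 3ware controller, smartctl needs the `-d 3ware,N` option and the controller device instead of /dev/sda.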

Installing a network thermometer, and other rack work

I did a number of things at the end of the year, so here are my notes.

[Data center thermal management] Do not leave gaps between servers
http://itpro.nikkeibp.co.jp/article/COLUMN/20081113/319208/

[Data center thermal management] Do not leave space between the rack and the servers
http://itpro.nikkeibp.co.jp/article/COLUMN/20081113/319210/


Using the two articles above as a reference, I improved our own rack.

1. Installed a network thermometer at the top of the rack
http://www.espectc.com/jigyou/web-shop/RT-RS-12N.htm
I dropped the sensor in between the exhaust fans at the top (the first goal is just to understand the current state).


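2. I judged that the switch mounted at the very top of the rear was obstructing the exhaust, so I moved it down 2 or 3 units.
The temperature inside the rack seems to have changed by about 1 degree C (judging from the HDD temperatures on machines where smartd can read them).
It had been blocking half of the exhaust panel, so no wonder.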


3. Planning to install APC 1U 19-inch Toolless Blanking Panels (already purchased)
http://www.apc.com/products/family/index.cfm?id=328


4. Hot air looks likely to wrap around from the sides, so I am thinking about blocking those gaps with styrofoam or something similar.


This one has been installed for several years already. It seems to be fairly effective:
Rack Air Distribution Unit 2U 100V 50/60HZ
http://www.apc.com/products/family/index.cfm?id=107
The filter is supposed to be replaced once a year, so it is about time to change it.
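
The before/after comparison in step 2 used HDD temperatures from SMART. A rough sketch of pulling that number out of `smartctl -A` output follows. The function name hdd_temperature is made up for this example, and it assumes the drive reports attribute 194 (Temperature_Celsius) with the temperature as the first token of the raw value, which not every drive does.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 194 (Temperature_Celsius)
# from `smartctl -A` output on stdin. The raw value is column 10.
hdd_temperature() {
    awk '$1 == 194 && $2 ~ /^Temperature/ { print $10; exit }'
}
```

Something like `smartctl -A /dev/sda | hdd_temperature`, logged from cron before and after a change to the rack, gives a crude but consistent comparison point alongside the network thermometer.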