When I noticed the symptoms, I immediately made an incremental backup and rebooted to run fsck. Predictably, fsck reported a few nasty filesystem errors, but they mostly affected some unmodified files checked into version control and some system fonts.
Anyway, this experience prompted me to finally set up smartmontools so I can keep track of drive errors and identify whether a failure is a controller problem, a connection problem, or a drive problem.
There are many ways to configure smartmontools/smartd, and you can read about them in the man page and across the Internet. These instructions are, in my opinion, the ones that require the fewest modifications to what Ubuntu sets up for you by default.
So first, do the usual sudo apt-get install smartmontools.
This installs postfix, which I configured as an Internet Site, and left with most of the defaults filled in.
The package comes with an init script that lives at /etc/init.d/smartmontools and is set to run at the default runlevels when installed. However, if you try running it manually via sudo /etc/init.d/smartmontools start, you will notice that it prints nothing and fails to start a smartd background process. It turns out that you need to edit /etc/default/smartmontools and uncomment the lines they have there, as shown below:
# Defaults for smartmontools initscript (/etc/init.d/smartmontools)
# This is a POSIX shell fragment
# List of devices you want to explicitly enable S.M.A.R.T. for
# Not needed (and not recommended) if the device is monitored by smartd
# uncomment to start smartd on system startup
start_smartd=yes
# uncomment to pass additional options to smartd on startup
In particular, without start_smartd=yes, smartd will not start at all!
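If you would rather script that edit than open an editor, here is a minimal sketch. It assumes the stock file ships the line commented out as #start_smartd=yes, which is what I saw but may differ between releases; enable_smartd is just a helper name I made up for this post.

```shell
# Uncomment the start_smartd=yes line in a defaults file (path passed as $1).
# Assumption: the line ships commented out as "#start_smartd=yes".
# A copy of the original is kept next to the file with a .bak suffix.
enable_smartd() {
    sed -i.bak 's/^#[[:space:]]*\(start_smartd=yes\)/\1/' "$1"
}

# On a real system (needs root):
#   sudo sh -c '. ./this-snippet.sh; enable_smartd /etc/default/smartmontools'
#   sudo /etc/init.d/smartmontools start
#   pgrep -l smartd    # lists the smartd process if it is running
```

Check the file by eye afterwards; if the pattern did not match, sed leaves the file unchanged without complaining.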
Now, before we're off to the races, we want to go check out /etc/smartd.conf, where most of the real configuration we care about happens. Here is the relevant section of my configuration, with comments about each option.
# DEVICESCAN: Scan for ATA and SCSI devices. All lines after this one are
# ignored.
# -o on: Enable SMART automatic offline testing (periodic data collection).
# -S on: Enable automatic attribute autosave.
# -s (S/../.././02|L/../../6/03): Do a short self-test every night between
# 2-3am and a long self-test every Saturday between 3-4am. The day-of-week
# field runs from 1 (Monday) to 7 (Sunday), so use 7 for Sunday instead.
# -H: If the device says it's not healthy, send mail. This occurs if any
# prefail attributes are past their thresholds.
# -l selftest: Send mail if, since the last check, a self test has found
# additional errors.
# -l error: Send mail if the error log has new errors.
# -f: Send mail if a usage attribute has "failed", i.e. its value has reached
# the threshold set by the manufacturer. Indication of age, not check failures.
# -m: Address to which to send email.
# -M exec PATH: Script to run to send mail and do other things.
# The directive below is broken across two lines for posting, but I would
# join it into a single line in the actual config file.
DEVICESCAN -o on -S on -s (S/../.././02|L/../../6/03) -H -l selftest -l error -f \
-m firstname.lastname@example.org -M exec /usr/share/smartmontools/smartd-runner
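The -s pattern is easy to get subtly wrong, so here is a rough way to sanity-check it from a shell. smartd matches the regular expression against strings of the form T/MM/DD/d/HH (test type, month, day of month, day of week, hour). The sketch below feeds sample strings through grep -E; the ^...$ anchoring and the schedule_matches helper are my own assumptions for illustration, not something smartd provides.

```shell
# The -s argument from the DEVICESCAN line above.
schedule='(S/../.././02|L/../../6/03)'

# Succeeds if a T/MM/DD/d/HH string would trigger a test under $schedule.
# Assumption: anchoring with ^...$ approximates how smartd applies the regex.
schedule_matches() {
    printf '%s\n' "$1" | grep -Eq "^${schedule}\$"
}

# Day-of-week field: 1 = Monday ... 7 = Sunday.
schedule_matches 'S/01/15/3/02' && echo "short test: any day at 02:00"
schedule_matches 'L/01/18/6/03' && echo "long test: weekday 6 (Saturday) at 03:00"
schedule_matches 'L/01/19/7/03' || echo "no long test on weekday 7 (Sunday)"
```

This only checks the pattern itself; whether a test actually runs still depends on smartd being awake during that hour.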
The idea behind this configuration is to be a little bit less verbose than the default of -a, which monitors changes in various parameters, but still generate an email when things go wrong. The results of the self-tests are stored on the hard drive, so if you're having trouble with a drive later, you can check out its bill of health and test record by running smartctl -a on the device (for example, /dev/sda).
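If you only want the highlights rather than the full smartctl -a dump, something like the following sketch pulls out the health verdict and the two logs that the smartd configuration above watches. The drive_report helper and the SMARTCTL override variable are names I invented here; the -H and -l flags are standard smartctl options, and /dev/sda is just a placeholder device.

```shell
# Command used to query the drive. Override it (e.g. SMARTCTL=echo) for a
# dry run that only prints the arguments that would be passed.
SMARTCTL=${SMARTCTL:-smartctl}

# Print a short report for one device (path passed as $1).
drive_report() {
    dev=$1
    $SMARTCTL -H "$dev"            # overall health self-assessment
    $SMARTCTL -l selftest "$dev"   # self-test log stored on the drive
    $SMARTCTL -l error "$dev"      # error log stored on the drive
}

# On a real system (needs root), for example:
#   drive_report /dev/sda
```

Real use generally needs root, so run it from a root shell or set SMARTCTL="sudo smartctl" before calling the function.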