The Infamous IDE Tape Drive
Problem with Kernel 2.4
This is a long tale that I tell to illustrate a) how
frustrating it has been for me to debug some problems in Linux and b) the low
quality of the information that is generally out there to help you do so.
One of my customers had a problem with their nightly tape
backups. The Linux server performed the backups using a Seagate Travan 10/20G
tape drive and the Amanda backup software. Compiling, installing, and
configuring Amanda was complex and painful, but once done, the system ran with
Red Hat Linux version 7.0 trouble-free for months. Then, in order to support
Windows XP Samba clients, I needed to upgrade to a newer version of Samba, which
in turn required me to upgrade the OS. I chose to go to RH 8.0, since 9 had not
been officially released at that time.
After the upgrade, we started having problems with
backups. It appeared that one or more of the tapes in the rotation had gone bad.
With some of the tapes, the nightly backup would complete, but with others,
either the tape would not be recognized or the backup would fail partway
through. I tried recompiling the Amanda package (in fact I did this right after
the upgrade, before attempting to run backups again) but this did not help.
Although it seemed strange that the tapes would choose the week after I had
upgraded Linux to fail, I cheerfully bought new tapes and put them into the
rotation.
Now Amanda insists that its tapes have a header written
onto the tape itself; the header identifies the backup set of the tape, as well
as the sequence number of the tape within the backup set. When a new tape is put
into the rotation, one first runs a program (called amrmtape) to delete the
information in Amanda's database pertaining to the old tape. Then another
program (called amlabel) writes the header onto the tape and adds the
tape to the tape database.
With the new tapes, the label operation succeeded.
Another utility, amcheck, which reads the tape header to make sure the tape
is valid and in the right sequence, would also succeed - if done right after the
label operation. However, the backup operation which next used
that tape would always fail partway through; subsequent check operations
would not recognize the tape.
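For reference, the tape-swap sequence described above can be sketched as a small shell helper. Amanda ships these utilities as amrmtape, amlabel, and amcheck. The helper name relabel_plan and the configuration and label names in the usage line are hypothetical, and the helper only prints the commands so they can be reviewed before being run as the Amanda user.

```shell
# Print the commands for putting a fresh tape into the rotation:
# forget the old tape, write a new header, then verify the header.
# Arguments: <amanda-config-name> <tape-label> (both site-specific).
relabel_plan() {
    if [ $# -ne 2 ]; then
        echo "usage: relabel_plan <config> <label>" >&2
        return 2
    fi
    config=$1
    label=$2
    printf 'amrmtape %s %s\n' "$config" "$label"   # drop old tape from the database
    printf 'amlabel %s %s\n' "$config" "$label"    # write header, add tape to database
    printf 'amcheck %s\n' "$config"                # read the header back
}
```

Running `relabel_plan DailySet1 DailySet1-05` (both names made up for illustration) prints the three commands in order. The point of the story is that all three can succeed and the next real backup can still fail.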
I had been seeing some error messages in the logs for the
customer's server:
ide-tape: ht0: I/O error, pc = 34,...
These showed up occasionally even before the upgrade.
Their increased frequency afterwards caused me to suspect that the hardware was
having a problem. In addition, doing a backup caused other server processes,
such as the mail server and even the terminal, to be unresponsive. I had not
noticed this before, but could not swear it had not been so.
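To put a number on an impression like "increased frequency", one could count the error lines in each log file. This is a minimal sketch, assuming the kernel messages land in /var/log/messages and its rotated copies (the usual Red Hat location, but an assumption here). The helper name count_ide_tape_errors is hypothetical.

```shell
# Count ide-tape I/O error lines in kernel log text read from stdin.
# The pattern mirrors the message quoted above; adjust it if your
# kernel words the error differently.
count_ide_tape_errors() {
    # grep -c prints the count; "|| true" keeps a zero count from
    # being treated as a failure by the calling script.
    grep -c 'ide-tape: .*I/O error' || true
}
```

Comparing `count_ide_tape_errors < /var/log/messages` against the same count for an older rotated log gives a rough before-and-after figure.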
So at this point I tried swapping in a known good tape
drive from an identical server which was also running RH 8.0. By
"known good" I mean that I could label and check tapes with it, as well as
run a trial backup which succeeded. Unfortunately, this made no difference. I
then tried different cables, then new cables, and, ultimately, yet another tape
drive. I tried changing jumper positions on the drive for DMA and compression. I
tried changing which IDE bus the drive was on, and its position on that bus (for
IDE2, that is.) All with no luck.
Most of my efforts over the next weeks focused on trying
to discover what was different between my own server and the one at the
customer. They both were identical Dell PowerEdge computers, bought in the same
order; they both ran Red Hat 8.0, although at home I installed from scratch while
at the customer I had upgraded from 7.0. They both ran the same version of
Amanda, which was compiled from source on each server separately. So I spent a
lot of time trying to find a configuration difference between the two builds of
Amanda, or a configuration difference or driver difference between the two
Linuxes. Eventually I thought to try bringing one of the tapes which failed to
my own server to see if it failed there, too.
I tried out the tape from the customer in my server
with my test backup, and it worked fine. It was only then that I thought about
the fact that my test backup only did about 20 MB, while the customer's backup uses over
a gigabyte. So I increased the size of the test backup by including a share from
my development Windows XP box. This time, the backup failed after about 200
megabytes. Now before moving to RH 8.0, I had done regular backups of over 10 GB at a time on my home server.
After the upgrade, I did not continue the backup process because at that same
time I acquired a Windows 2000 server box and was using the backup on it. Now
knowing that both my server and the customer's had developed a backup problem after
the upgrade, I went
looking for a problem with the newer version of Linux.
I first installed a newer version of Amanda from the rawhide
RPM, in case there was some incompatibility with the old version. This didn't help.
Then I had another brilliant insight, which was to search for the error messages
on Google. I had previously searched for them on Red Hat's site with no success (I often
find searches there unhelpful for one reason or another).
When I searched for the messages on Google, there were many hits.
Most of them were discussion group messages reading something like:
<Joe User> "I upgraded my Linux kernel from 2.2 to 2.4 following
all of the correct procedures. After the upgrade, I find I can't get the tape drive to do anything
useful. Does anyone know what's wrong?"
These messages start showing up sometime in 2000. There
was usually no reply to them at all; or if there was, it was something useless
like "Dude, did you check your configuration file settings?" But from
late in 2002 I found a couple of responses that said "well, I don't use the
ide-tape driver for tape drives any more; instead, I use the ide-scsi
driver" with no implication that there was anything broken in the ide-tape
driver, although it used to work.
So I tried loading up the ide-scsi driver instead of the
ide-tape one (which gets loaded automatically when Linux boots and detects the
tape drive):
rmmod ide-tape
modprobe ide-scsi
When I did this on my home server, my test backup then
succeeded. But then when I did the same thing on the customer's server, none of
the tape commands worked at all any more.
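When two machines behave differently after the same commands, a reasonable first check is which module actually ended up loaded. This is a small sketch that filters `lsmod` output. The helper name loaded_tape_drivers is hypothetical, and the pattern accepts both "-" and "_" because the separator shown in module names differs between kernel versions.

```shell
# Print whichever of the ide-tape / ide-scsi modules appear in
# `lsmod` output supplied on stdin, one name per line.
loaded_tape_drivers() {
    awk '$1 ~ /^ide[-_](tape|scsi)$/ { print $1 }'
}
```

Used as `lsmod | loaded_tape_drivers`. If it prints nothing, neither driver is loaded and the modprobe step probably did not take effect.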
Looking at the system log, I saw that when ide-scsi loads
it identifies the tape drive and assigns it to device "st0" (where
"nst0" is then the non-auto-rewind version of the device). So then I
tried:
mt -f /dev/nst0 status
which returned:
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x0 (default).
Soft error count since last status=0
General status bits on (10000):
 IM_REP_EN
The "status" command returned some info, but it
said that the tape had a zero blocksize. I then tried
mt -f /dev/nst0 rewind
which returned:
mt: The device is offline (not powered on, no tape ?).
Eventually I figured out that the "mt eject"
command that I had in the backup script, which never used to do anything, now
(with the new driver) turns off the tape drive. So, looking through the options for
"mt", I saw that "load" is one of the commands. I ran that, and
then another "status" command, which returned:
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 512 bytes. Density code 0x47 (TR-5).
Soft error count since last status=0
General status bits on (1010000):
 ONLINE IM_REP_EN
Now the backup succeeds completely, and it no longer
loads down the server much at all.
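The check-then-load step could be folded into a backup script along these lines. This is a sketch and not what my script actually contains: tape_state and ensure_tape_loaded are hypothetical helper names, TAPE_DEV is a hypothetical variable, and the only mt subcommands used are the "status" and "load" ones shown above.

```shell
# Read `mt status` text on stdin and print "online" if the ONLINE
# flag appears in the status bits, "offline" otherwise.
tape_state() {
    if grep -q 'ONLINE'; then
        echo online
    else
        echo offline
    fi
}

# Ask the drive for its status and issue "load" only when it is
# offline. Defaults to the non-rewinding device named in the log above.
ensure_tape_loaded() {
    dev=${TAPE_DEV:-/dev/nst0}
    state=$(mt -f "$dev" status 2>/dev/null | tape_state)
    if [ "$state" = "offline" ]; then
        mt -f "$dev" load
    fi
}
```

Calling `ensure_tape_loaded` before the backup starts would cover the case where an earlier "mt eject" has left the drive offline.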
(Update 23 March 2004)
After several weeks, I realized (with a sinking sensation)
that backups were still occasionally failing, even when the tape was "known
good". I tried switching the drivers back to ide-tape, but that had
the same symptoms as before (no-duh!). My initial Google search turned up a
reference to a utility called hdparm which can be used to query and set various
parameters for IDE and/or SCSI devices. Since I also had discovered a boot-time
kernel log message referring to DMA being disabled, and the hdparm command
indicated that the I/O mode for the drive was set to DMA mode 2, I tried using
hdparm to set the device to various PIO modes. Some of these required a hard
reset to recover from, and none worked.
I finally found, via Google, this
reference. It suggests some additional options to specify (via the mt utility).
Specifically:
/bin/mt -f /dev/st0 stsetoptions no-blklimits scsi2logical
I'll let you know later how this worked out for me. The
initial flush of one of my failed backups is proceeding as I write this.