Mayn Idea, Inc.

1125 Ashland Avenue
Dayton, OH 45420
Voice: 1.937.558.5495
Fax: 1.505.217.2213


The Infamous IDE Tape Drive Problem with Kernel 2.4

This is a long tale that I tell to illustrate a) how frustrating it has been for me to debug some problems in Linux and b) the low quality of the information that is generally out there to help you do so.

One of my customers had a problem with their nightly tape backups. The Linux server performed the backups using a Seagate Travan 10/20G tape drive and the Amanda backup software. Compiling, installing, and configuring Amanda was complex and painful, but once done, the system ran trouble-free under Red Hat Linux 7.0 for months. Then, in order to support Windows XP Samba clients, I needed to upgrade to a newer version of Samba, which in turn required me to upgrade the OS. I chose to go to RH 8.0, since 9 had not been officially released at that time.

After the upgrade, we started having problems with backups. It appeared that one or more of the tapes in the rotation had gone bad. With some of the tapes, the nightly backup would complete, but with others, either the tape would not be recognized or the backup would fail partway through. I tried recompiling the Amanda package (in fact I did this right after the upgrade, before attempting to run backups again), but this did not help. Although it seemed strange that the tapes would choose the week after I had upgraded Linux to fail, I cheerfully bought new tapes and put them into the rotation.

Now Amanda insists that its tapes have a header written onto the tape itself; the header identifies the backup set the tape belongs to, as well as the sequence number of the tape within that set. When a new tape is put into the rotation, one first runs a program (called amrmtape) to delete the information in Amanda's database pertaining to the old tape. Then another program (called amlabel) writes the header onto the tape and adds the tape to the tape database.

With the new tapes, the label operation succeeded. Another utility, amcheck, which reads the tape header to make sure the tape is valid and in the right sequence, would also succeed if run right after the label operation. However, the next backup to use that tape would always fail partway through, and subsequent check operations would not recognize the tape.

I had been seeing some error messages in the logs for the customer's server:

ide-tape: ht0: I/O error, pc = 34,...

These showed up occasionally even before the upgrade. Their increased frequency afterwards made me suspect a hardware problem. In addition, running a backup caused other server processes, such as the mail server and even the terminal, to become unresponsive. I had not noticed this before, but could not swear it had not been so.
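To put a number on "increased frequency", one could count the driver's error lines in the system log. The following is only a sketch: the function name count_ide_tape_errors is made up for illustration, and the log path in the usage note is an assumption about where syslog writes kernel messages on your system.

```shell
#!/bin/sh
# Sketch only: count_ide_tape_errors is a hypothetical helper name.

# Print how many ide-tape I/O error lines appear in the log file given as $1.
# grep -c prints the count of matching lines (0 if none match).
count_ide_tape_errors() {
    grep -c 'ide-tape: .*I/O error' "$1"
}
```

On a Red Hat system this might be run as "count_ide_tape_errors /var/log/messages" before and after a backup, to see whether the count grows during the run.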

So at this point I tried swapping in a known good tape drive from an identical server which was also running 8.0. By "known good", I mean that I could label and check tapes with it, and run a trial backup which succeeded. Unfortunately, this made no difference. I then tried different cables, then new cables, and, ultimately, yet another tape drive. I tried changing jumper positions on the drive for DMA and compression. I tried changing which IDE bus the drive was on, and its position on that bus (for IDE2, that is). All with no luck.

Most of my efforts over the next weeks focused on trying to discover what was different between my own server and the one at the customer. They were identical Dell PowerEdge computers, bought in the same order; they both ran Red Hat 8.0, although at home I had installed from scratch while at the customer I had upgraded from 7.0. They both ran the same version of Amanda, which was compiled from source on each server separately. So I spent a lot of time trying to find a configuration difference between the two builds of Amanda, or a configuration or driver difference between the two Linux installations. Eventually I thought to try bringing one of the tapes which failed to my own server to see if it failed there, too.

I tried out the tape from the customer in my server with my test backup, and it worked fine. It was only then that I thought about the fact that my test backup only wrote about 20 MB, while the customer's backup wrote over a gigabyte. So I increased the size of the test backup by including a share from my development Windows XP box. This time, the backup failed after about 200 MB. Now before moving to RH 8.0, I had done regular backups of over 10 GB at a time on my home server. After the upgrade, I did not continue the backup process, because at that same time I acquired a Windows 2000 server box and was using the backup on it. Now knowing that both my server and the customer's had developed a backup problem after the upgrade, I went looking for a problem with the newer version of Linux.

I first installed a newer version of Amanda from the rawhide RPM, in case there was some incompatibility with the old version. This didn't help. Then I had another brilliant insight, which was to search for the error messages on Google. I had previously searched for them on Red Hat's site with no success (I often find searches there unhelpful for one reason or another).

When I searched for the messages on Google, there were many hits. Most of them were discussion group messages reading something like:

<Joe User> "I upgraded my Linux kernel from 2.2 to 2.4 following all of the correct procedures. After the upgrade, I find I can't get the tape drive to do anything useful. Does anyone know what's wrong?"

These messages start showing up sometime in 2000. There was usually no reply to them at all; or if there was, it was something useless like "Dude, did you check your configuration file settings?" But from late in 2002 I found a couple of responses that said "well, I don't use the ide-tape driver for tape drives any more; instead, I use the ide-scsi driver" with no implication that there was anything broken in the ide-tape driver, although it used to work.

So I tried loading the ide-scsi driver instead of the ide-tape one (which gets loaded automatically when Linux boots and detects the tape drive):

rmmod ide-tape

modprobe ide-scsi

When I did this on my home server, my test backup then succeeded. But then when I did the same thing on the customer's server, none of the tape commands worked at all any more.
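When swapping drivers like this, it can help to confirm which one the kernel actually has loaded. Here is a minimal sketch that filters lsmod output for the two modules in question. The function name loaded_ide_tape_drivers is hypothetical, and the pattern accepts both hyphen and underscore spellings on the assumption that different kernel versions may print either.

```shell
#!/bin/sh
# Sketch only: loaded_ide_tape_drivers is a hypothetical helper name.

# Read lsmod-style output on stdin and print the name of each loaded
# module that is either ide-tape or ide-scsi (hyphen or underscore form).
loaded_ide_tape_drivers() {
    awk '$1 ~ /^ide[-_](tape|scsi)$/ { print $1 }'
}
```

Typical use would be "lsmod | loaded_ide_tape_drivers", run once before the rmmod and once after the modprobe.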

Looking at the system log, I saw that when ide-scsi loads, it identifies the tape drive and assigns it to device "st0" ("nst0" being the non-auto-rewind version of the same device). So then I tried:

mt -f /dev/nst0 status

which returned:

SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x0 (default).
Soft error count since last status=0
General status bits on (10000):
 IM_REP_EN

The "status" command returned some info, but it said that the tape had a zero blocksize. I then tried

mt -f /dev/nst0 rewind

which returned:

mt: The device is offline (not powered on, no tape ?).

Eventually I figured out that the "mt eject" command that I had in the backup script, which never used to do anything, now (with the new driver) takes the tape drive offline. So, looking through the options for "mt", I saw that "load" is one of the commands. I ran that, and then another "status" command, which returned:

SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 512 bytes. Density code 0x47 (TR-5).
Soft error count since last status=0
General status bits on (1010000):
 ONLINE IM_REP_EN

Now the backup runs to completion, and it also puts very little load on the server.
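Since a stray "mt eject" can leave the drive offline, a backup script could load the tape and check for the ONLINE status bit before starting. What follows is a sketch under assumptions, not what my script actually contains: the TAPE_DEV variable and both function names are invented for illustration, and the check simply looks for the word ONLINE in the "mt status" output shown above.

```shell
#!/bin/sh
# Sketch only: TAPE_DEV and the function names are illustrative assumptions.
TAPE_DEV="${TAPE_DEV:-/dev/nst0}"

# Succeed if the "mt status" text on stdin reports the ONLINE status bit.
# The match is case-sensitive and whole-word, so the lowercase "offline"
# error message does not count.
tape_is_online() {
    grep -qw 'ONLINE'
}

# Load the tape, then confirm the drive reports ONLINE.
ensure_tape_loaded() {
    mt -f "$TAPE_DEV" load || return 1
    mt -f "$TAPE_DEV" status | tape_is_online
}
```

A backup script might call "ensure_tape_loaded || exit 1" as its first step, so that a drive left offline fails early instead of partway through the run.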

(Update 23 March 2004)

After several weeks, I realized (with a sinking sensation) that backups were still occasionally failing, even when the tape was "known good". I tried switching the driver back to ide-tape, but that produced the same symptoms as before (no-duh!). My initial Google search had turned up a reference to a utility called hdparm, which can be used to query and set various parameters for IDE and/or SCSI devices. Since I had also discovered a boot-time kernel log message referring to DMA being disabled, and the hdparm command indicated that the I/O mode for the drive was set to DMA mode 2, I tried using hdparm to set the device to various PIO modes. Some of these required a hard reset to recover from, and none worked.

I finally found, via Google, this reference. It suggests some additional options to specify (via the mt utility). Specifically:

/bin/mt -f /dev/st0 stsetoptions no-blklimits scsi2logical

I'll let you know later how this worked out for me. The initial flush of one of my failed backups is proceeding as I write this.

 

 

Mayn Idea and the M-Light logo are trademarks of Mayn Idea Inc.
Copyright (C) 2009 Mayn Idea, Inc.
All rights reserved.

Last modified 01 Aug 2009