2009-09-24

RAID1 problems: access beyond end of device

After getting my RAID1 arrays running on my Ubuntu server, I moved the case back into the crawlspace under the stairs and powered it on. Later I tried to open an SSH session, but the connection was denied. Great; I had to lug the thing back into the office to connect a monitor and keyboard. The boot process had stopped because of a failed fsck, with output similar to the following:
e2fsck 1.41.4 (27-Jan-2009)
The filesystem size (according to the superblock) is 244190000 blocks
The physical size of the device is 244189984 blocks
Either the superblock or the partition table is likely to be corrupt!
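The gap between those two sizes is small. A quick sketch of the subtraction, assuming the 4 KiB block size that resize2fs reports later in this post (the variable names are only for illustration):

```shell
#!/bin/sh
# Numbers copied from the e2fsck output above.
fs_blocks=244190000      # size recorded in the filesystem superblock
dev_blocks=244189984     # physical size of the device
block_size=4096          # assumption: 4 KiB blocks, as resize2fs reports later

missing_blocks=$((fs_blocks - dev_blocks))
missing_kib=$((missing_blocks * block_size / 1024))
echo "filesystem overhangs the device by $missing_blocks blocks ($missing_kib KiB)"
```

So the filesystem believes it owns 16 blocks, or 64 KiB, that the device does not actually have.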
I had no idea what would cause this, so I took a look at /var/log/messages and I noticed some distressing errors:
Sep 14 10:31:42 hierax kernel: [48196.013846] attempt to access beyond end of device
Sep 14 10:31:42 hierax kernel: [48196.013855] md1: rw=1, want=390716680, limit=390716672
Sep 14 10:31:42 hierax kernel: [48196.013864] lost page write due to I/O error on md0
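The want and limit figures in that message can be compared directly. Below is a minimal sketch that does so, assuming both are counted in 512-byte sectors (as far as I know that is how the kernel reports them); overrun_from_log is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and, for each one that
# contains "want=N, limit=M", report how far past the end the access was.
# Assumption: both numbers are 512-byte sectors.
overrun_from_log() {
    sed -n 's/.*want=\([0-9][0-9]*\), limit=\([0-9][0-9]*\).*/\1 \2/p' |
    while read -r want limit; do
        sectors=$((want - limit))
        echo "$sectors sectors ($((sectors * 512)) bytes) past the end"
    done
}

echo 'md1: rw=1, want=390716680, limit=390716672' | overrun_from_log
```

For the line above this works out to 8 sectors, which is a single 4 KiB block hanging off the end of the array.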
Trusty Google directed me to a post at pith.org, which documents the problem for a system where the root partition is mounted from a RAID array. From the Software-RAID HOWTO:
When we created the raid device, the physical partition became slightly smaller because a second superblock is stored at the end of the partition. If you reboot the system now, the reboot will fail with an error indicating the superblock is corrupt.
I wish I had known this when I first created the arrays, but it didn't seem to be a major problem, and since my root filesystem isn't RAIDed, the procedure for correcting the problem was a bit simpler for me. Skimming the article for relevant info, I hit on the following command to have fsck do a badblocks scan and repair the errors non-destructively.
sudo e2fsck -cc /dev/md0 # the long way
While the process was running, I started reading the comments to the post, finding one from Paul that was very helpful. He explains that since all the bad blocks are at the end of the partition, you can run badblocks on just the end to get a list of blocks, then run e2fsck on those alone. I killed the process and took the short-cut. Partial usage:
badblocks [-b block_size] [-o output_file] device [last_block [first_block]]
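Note that last_block comes before first_block in that usage line. The two values can be worked out from the sizes e2fsck printed; the sketch below assumes blocks are numbered from 0, so a device of N blocks ends at block N-1 and the first block past the end is block N. tail_range is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "last_block first_block" (in the order badblocks
# expects) covering only the blocks the filesystem claims beyond the device.
# $1 = filesystem size in blocks, $2 = physical device size in blocks.
tail_range() {
    fs_blocks=$1
    dev_blocks=$2
    echo "$((fs_blocks - 1)) $dev_blocks"
}

tail_range 244190000 244189984
```

With the numbers from my fsck output this prints 244189999 244189984.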
Figure out my block size:
tune2fs -l /dev/md0 | grep "Block size"
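If the value is needed in a script rather than read by eye, something like the following could capture it. This is only a sketch: block_size_from_tune2fs is a made-up name, and the sample input stands in for real tune2fs output so the snippet runs without the array present.

```shell
#!/bin/sh
# Hypothetical helper: pull the number from the "Block size:" line of
# `tune2fs -l` output supplied on stdin.
block_size_from_tune2fs() {
    awk '/^Block size:/ { print $NF }'
}

# Sample lines in the layout tune2fs prints. On the real system this would be:
#   bs=$(sudo tune2fs -l /dev/md0 | block_size_from_tune2fs)
bs=$(printf 'Block count:              244190000\nBlock size:               4096\n' |
     block_size_from_tune2fs)
echo "block size: $bs"
```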
Scan just the last 15 blocks:
badblocks -b 4096 -n -o ~/tmp/badblocks_md0.list /dev/md0 244189999 244189984
The -n option tells badblocks to use non-destructive read-write mode. The command printed 15 lines of the following:
badblocks: Invalid argument during seek
I don't know what this means, and it worried me a bit, but I persevered. Time to mark the bad blocks.
halcyon@hierax:~$ sudo e2fsck -l /home/halcyon/tmp/badblocks_md0.list /dev/md0
e2fsck 1.41.4 (27-Jan-2009)
The filesystem size (according to the superblock) is 244190000 blocks
The physical size of the device is 244189984 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort? no

/dev/md0: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 60948514: 244189985 244189986 244189987 244189988 244189989 244189990 244189991 244189992 244189993 244189994 244189995 244189996 244189997 244189998 244189999
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 1 inodes containing multiply-claimed blocks.)

File /pics/fluffy_kittens.jpg (inode #60948514, mod time Mon Sep 29 21:29:56 2008)
  has 15 multiply-claimed block(s), shared with 1 file(s):
         (inode #1, mod time Wed Sep 16 07:37:06 2009)
Clone multiply-claimed blocks? yes

Error reading block 244189985 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189985 (Invalid argument).  Ignore error? yes

Error reading block 244189986 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189986 (Invalid argument).  Ignore error? yes

Error reading block 244189987 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189987 (Invalid argument).  Ignore error? yes

Error reading block 244189988 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189988 (Invalid argument).  Ignore error? yes

Error reading block 244189989 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189989 (Invalid argument).  Ignore error? yes

Error reading block 244189990 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189990 (Invalid argument).  Ignore error? yes

Error reading block 244189991 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189991 (Invalid argument).  Ignore error? yes

Error reading block 244189992 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189992 (Invalid argument).  Ignore error? yes

Error reading block 244189993 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189993 (Invalid argument).  Ignore error? yes

Error reading block 244189994 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189994 (Invalid argument).  Ignore error? yes

Error reading block 244189995 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189995 (Invalid argument).  Ignore error? yes

Error reading block 244189996 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189996 (Invalid argument).  Ignore error? yes

Error reading block 244189997 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189997 (Invalid argument).  Ignore error? yes

Error reading block 244189998 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189998 (Invalid argument).  Ignore error? yes

Error reading block 244189999 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Error writing block 244189999 (Invalid argument).  Ignore error? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #8 (2347, counted=2331).
Fix? yes

Free blocks count wrong for group #7452 (65520, counted=0).
Fix? yes

Free blocks count wrong (70293658, counted=70228122).
Fix? yes


/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 96909/61054976 files (7.1% non-contiguous), 173961878/244190000 blocks
The next step was to resize the filesystem.
halcyon@hierax:~$ sudo resize2fs /dev/md0
resize2fs 1.41.4 (27-Jan-2009)
Resizing the filesystem on /dev/md0 to 244189984 (4k) blocks.
The filesystem on /dev/md0 is now 244189984 blocks long.
And finally to verify that the problem was corrected.
halcyon@hierax:~$ sudo e2fsck -f /dev/md0
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
...
No warning about possibly corrupted superblock!
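The same check can be expressed as arithmetic: the filesystem's block count times its block size must not exceed the size of the device. A minimal sketch, using only figures from this post (fs_fits_device is a hypothetical helper, and the device size in bytes is derived from the block count e2fsck reported rather than measured):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a filesystem of $1 blocks of $2 bytes each
# fits inside a device of $3 bytes.
fs_fits_device() {
    block_count=$1
    block_size=$2
    device_bytes=$3
    [ $((block_count * block_size)) -le "$device_bytes" ]
}

# Assumption: device size derived from the 244189984-block physical size.
device_bytes=$((244189984 * 4096))

fs_fits_device 244190000 4096 "$device_bytes" || echo "before resize: too big"
fs_fits_device 244189984 4096 "$device_bytes" && echo "after resize: fits"
```

On a live system the device size could come from blockdev --getsize64 /dev/md0 and the block count from the "Block count" line of dumpe2fs -h /dev/md0.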