LVM Problems

26 November 2001

Original problem

I returned from Thanksgiving vacation to find that my primary system no longer boots. The immediate cause: before leaving for the weekend, I had used the vgmerge command to merge a volume group consisting of a single PV on a software RAID5 device (/dev/md0) into the volume group containing my root volume, which comprises /dev/sde3 and /dev/sde4, and apparently I neglected to verify that the new configuration would still boot. The initrd didn't contain the RAID modules necessary to set up /dev/md0, so of course vgscan failed to find and assemble the volume group.
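
The merge itself would have been something like the following; the name of the RAID-backed volume group isn't recorded here, so "raid-vg" is only a placeholder:

vgmerge flowers-vg-0 raid-vg  # fold the single-PV VG on /dev/md0 into the root VG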

Unfortunately, after fixing the RAID problem, the vgscan command still fails to recognize any volume groups. Everything is now present and accounted for, and pvscan sees all the component PVs.
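
From the initrd shell, the sequence now goes roughly like this (a sketch, assuming raidtools-style commands and a valid /etc/raidtab for md0):

modprobe raid5        # the module the original initrd was missing
raidstart /dev/md0    # the array now assembles cleanly
pvscan                # reports /dev/sde3, /dev/sde4 and /dev/md0 as PVs
vgscan                # still finds no volume groups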

At this point, my only access to the system is through the minimal (busybox) initrd environment. I've got a complete set of LVM user space tools, built from CVS on 26-Nov-2001. The kernel was built using the LVM CVS as of 19-Nov-2001.

Here's some relevant command output:

Help?

First update: bug in tools/lib/pv_read_all_pv_of_vg.c

I found what appears to be a bug in the pv_read_all_pv_of_vg() function. Details can be found in the lvm-devel mailing list; the relevant message is here.

Unfortunately, while that appears to have solved the problem of vgscan not recognizing my volume group, I'm still dead in the water. vgscan is now reporting the following error:

vgscan -- found inactive volume group "flowers-vg-0"
vgscan -- only found 1995 of 2169 LEs for LV /dev/flowers-vg-0/data0 (1)
vgscan -- ERROR "vg_read_with_pv_and_lv(): allocated LE of LV" can't get data of volume group "flowers-vg-0" from physical volume(s)

Here is the complete debug output of running vgscan after applying the patches described above.

The "data0" volume exists entirely on /dev/md0...so at this point, if there were an easy way to temporarily disable the "data0" volume, that would at least let me access my system again.

27 November 2001

Morning

Andreas Dilger noticed an inconsistency in the on-disk data for /dev/md0 and suggested the following hack to copy the LV table from sde3 to md0:

dd if=/dev/md0 of=/root/md0.sav bs=1k count=160  # make a backup
dd if=/dev/sde3 of=/dev/md0 bs=1 skip=40960 seek=39424 count=83968  # copy the LV table from sde3 into md0

I have made this change.
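
A quick sanity check that the copy landed where intended is to re-read both regions and compare checksums (a sketch; it assumes md5sum is available in the initrd environment):

dd if=/dev/sde3 bs=1 skip=40960 count=83968 | md5sum  # source region on sde3
dd if=/dev/md0 bs=1 skip=39424 count=83968 | md5sum   # freshly written region on md0; the sums should match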

The output of pvdata -avP has been requested, and is now available here for each PV.

I have also made available my patch to pv_read_all_pv_of_vg.c that removes a redundant loop and fixes the problem I reported yesterday.

Later that day...

By disabling the check that was generating the above error, I got one step further -- but vgscan was still failing in the PV consistency check. So I disabled that check as well...

int pv_check_consistency_all_pv ( vg_t *vg) {
   int p = 0;
   int pe = 0;
   int pe_count = 0;
   int ret = 0;

   debug_enter ( "pv_check_consistency_all_pv -- CALLED\n");

   /* jump straight to the exit label, skipping the consistency check entirely */
   goto pv_check_consistency_all_pv_end;

   /* ... original body of the check elided ... */

pv_check_consistency_all_pv_end:
   debug_leave ( "pv_check_consistency_all_pv -- LEAVING with ret: %d\n", ret);
   return ret;
}

...and was finally able to boot the system for the first time since leaving last Wednesday. On the bright side, I have access to my system again. On the other hand, the data0 volume (which resides on md0) appears to be missing.

Andreas pointed at vgmerge as the culprit responsible for wiping out the PE entries and provided a method for manually fixing the PE table.

  • Andreas' message on correcting the problem.
  • The source that implements the solution.
  • This does appear to correct the on-disk data. Running pvdata -avP /dev/md0 results in:

    --- List of physical extents ---
    
    PE: 00000  LV: 002  LE: 00000
    PE: 00001  LV: 002  LE: 00001
    PE: 00002  LV: 002  LE: 00002
    
    [...]
    
    PE: 00172  LV: 002  LE: 00172
    PE: 00173  LV: 002  LE: 00173
    PE: 00174  LV: 002  LE: 00174
    

Unfortunately, while vgscan runs without any errors, and I am able to access the logical volumes root and iso, data0 remains inaccessible (although a device node is created for it).

It's alive!

Very odd. I rebuilt my initrd, making sure that I was using the same version of the libraries and binaries as on my running system, rebooted... and now things appear to be working!
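
The version-mismatch theory is easy enough to check after the fact; with the initrd filesystem mounted somewhere convenient (the /mnt/initrd path below is just a placeholder, and the install paths depend on where the tools were put), comparing checksums shows whether the two environments really carry the same binaries:

md5sum /sbin/vgscan /mnt/initrd/sbin/vgscan      # the two sums should be identical
md5sum /sbin/vgchange /mnt/initrd/sbin/vgchange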

I posted a summary of my experience to the linux-lvm mailing list.


lars@larsshack.org