
Thursday, 20 February 2014

Lustre Troubleshooting

Hi,

Last week, we faced an issue with our KVM infrastructure, where we use a Lustre share to store the disk images of our VMs. A few critical VMs running on one of the nodes went into a paused state, reporting that the disk was full.

The exact error message is given below:

"block I/O error in device 'drive-ide0-0-1': No space left on device (28)"

Upon checking the disk space inside that VM in single-user mode, I could see that it had almost 2 TB of free space.

So I suspected an issue with the disk image itself and ran fsck on it in single-user mode, but that didn't help either. I finally confirmed it was an issue with the Lustre file system when one more VM running from Lustre went into a paused state with the same error.
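
(As a side note, a quick way to see which guests are affected, and to bring them back once the underlying storage issue is fixed, is via virsh. This is just a sketch; the domain name vm01 is a placeholder for your own guest name.)

virsh list --all     # paused guests show up with state "paused"
virsh resume vm01    # resume a guest after the storage problem is resolved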

So I began to troubleshoot Lustre itself. Below are the steps I followed while debugging it.

I checked the Lustre logs and found the following error:


"(vvp_io.c:1018:vvp_io_commit_write()) Write page 369920 of inode ffff8820120601b8 failed -28".


According to the Lustre documentation, error code -28 means "The file system is out-of-space or out of inodes" (http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTroubleshooting.html).
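
For anyone hitting the same symptom: these messages land in the client's kernel log, so something like the commands below should surface them. The syslog path assumes a RHEL/CentOS-style setup; adjust for your distribution.

dmesg | grep "failed -28"
grep LustreError /var/log/messages | tail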


I also ran the command that lists the disk space of all the OSTs in Lustre:

lfs df -h 


UUID                       bytes        Used   Available Use% Mounted on
klust01-MDT0000_UUID        1.2T      933.1M        1.1T   0% /klust01[MDT:0]
klust01-OST0000_UUID        9.1T        8.6T      202.8M 100% /klust01[OST:0]
klust01-OST0001_UUID        9.1T        5.8T        2.8T  68% /klust01[OST:1]
klust01-OST0002_UUID        9.1T        5.5T        3.1T  64% /klust01[OST:2]
klust01-OST0003_UUID        9.1T        5.8T        2.9T  67% /klust01[OST:3]
klust01-OST0004_UUID        9.1T        6.5T        2.2T  75% /klust01[OST:4]
klust01-OST0005_UUID        9.1T        6.6T        2.0T  76% /klust01[OST:5]
klust01-OST0006_UUID        9.1T        7.9T      796.0G  91% /klust01[OST:6]
klust01-OST0007_UUID        9.1T        6.4T        2.2T  74% /klust01[OST:7]
klust01-OST0008_UUID        9.1T        6.8T        1.8T  79% /klust01[OST:8]
klust01-OST0009_UUID        9.1T        6.4T        2.2T  74% /klust01[OST:9]
klust01-OST000a_UUID        9.1T        8.6T       48.2M 100% /klust01[OST:10]
klust01-OST000b_UUID        9.1T        7.9T      733.8G  92% /klust01[OST:11]

filesystem summary:       109.1T       82.8T       20.8T  80% /klust01


In the output above, OST 0 and OST 10 are filled up to 100%.
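
Since error -28 can also mean the file system ran out of inodes, it is worth checking inode usage as well; lfs df takes an -i flag for that:

lfs df -i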

On checking, I could see that both VM disk images had a stripe on OST 10, which was one of the full OSTs.

Below is the command to check which OSTs an image is striped over:

lfs getstripe /var/test.img

lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  8
    obdidx         objid        objid         group
         8           1815776         0x1bb4e0                 0
        10           1728679         0x1a60a7                 0
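
To see what else is eating space on a full OST, lfs find can filter by OST. This is only a sketch: the UUID is taken from the lfs df output above, the size threshold is an arbitrary cut-off, and the options assume a reasonably recent lfs version.

lfs find /klust01 --obd klust01-OST000a_UUID --type f --size +10G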


So finally, we deleted some unused data from OST 10 and then rebooted the node serving that OST, which fixed the issue.
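
If deleting data had not been an option, another approach would have been to migrate a few large files off the full OST onto the emptier ones; Lustre ships an lfs_migrate helper for that. The path below is just an example, not something from our setup.

lfs_migrate /klust01/path/to/some_large_file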

Cheers 
Syamkumar M 

1 comment:

  1. Hello,
    I have a little question for you.

    I have read in several posts that using Lustre as a storage repository causes a big performance problem, because QEMU reads and writes in 4k blocks while Lustre uses 512k.

    What is your experience with this?

