Hi,
Last week, we faced an issue in our KVM infrastructure, where we use a Lustre share to store the disk images of the VMs. A few critical VMs running on one of the nodes went into a paused state, reporting that the disk was full.
The exact error message is given below:
"block I/O error in device 'drive-ide0-0-1': No space left on device (28)"
Upon checking the disk space inside that VM in single-user mode, I could see that it had almost 2 TB of free space.
So I suspected an issue with the disk image and ran fsck on it from single-user mode, but that did not help. I finally confirmed that the problem was with the Lustre file system when one more VM running from Lustre went into a paused state with the same error.
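For reference, the same check can be done from the host without booting the guest. A minimal sketch, assuming a raw image and a util-linux recent enough to support losetup -P; /var/test.img is the image used later in this post:

# Attach the image to a free loop device and scan its partition table
losetup --show -fP /var/test.img    # prints the device, e.g. /dev/loop0
# Read-only check of the first partition (-n: report problems, change nothing)
fsck -n /dev/loop0p1
# Detach the loop device when done
losetup -d /dev/loop0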
I then began troubleshooting for Lustre errors. Below are the steps I followed while debugging Lustre.
I checked the Lustre logs and found the following error:
"(vvp_io.c:1018:vvp_io_commit_write()) Write page 369920 of inode ffff8820120601b8 failed -28".
According to the Lustre documentation, error code -28 means "The file system is out-of-space or out of inodes" (http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTroubleshooting.html).
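For reference, -28 is simply the kernel's ENOSPC errno propagated by the Lustre client, which can be confirmed on any Linux machine that has the kernel headers installed:

# ENOSPC ("No space left on device") is errno 28 on Linux
grep ENOSPC /usr/include/asm-generic/errno-base.h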
I also ran the command that lists the disk usage of all the OSTs in the Lustre file system:
lfs df -h
UUID                   bytes     Used  Available  Use%  Mounted on
klust01-MDT0000_UUID    1.2T   933.1M       1.1T    0%  /klust01[MDT:0]
klust01-OST0000_UUID    9.1T     8.6T     202.8M  100%  /klust01[OST:0]
klust01-OST0001_UUID    9.1T     5.8T       2.8T   68%  /klust01[OST:1]
klust01-OST0002_UUID    9.1T     5.5T       3.1T   64%  /klust01[OST:2]
klust01-OST0003_UUID    9.1T     5.8T       2.9T   67%  /klust01[OST:3]
klust01-OST0004_UUID    9.1T     6.5T       2.2T   75%  /klust01[OST:4]
klust01-OST0005_UUID    9.1T     6.6T       2.0T   76%  /klust01[OST:5]
klust01-OST0006_UUID    9.1T     7.9T     796.0G   91%  /klust01[OST:6]
klust01-OST0007_UUID    9.1T     6.4T       2.2T   74%  /klust01[OST:7]
klust01-OST0008_UUID    9.1T     6.8T       1.8T   79%  /klust01[OST:8]
klust01-OST0009_UUID    9.1T     6.4T       2.2T   74%  /klust01[OST:9]
klust01-OST000a_UUID    9.1T     8.6T      48.2M  100%  /klust01[OST:10]
klust01-OST000b_UUID    9.1T     7.9T     733.8G   92%  /klust01[OST:11]
filesystem summary:   109.1T    82.8T      20.8T   80%  /klust01
In the output above, OST 0 and OST 10 are filled up to 100%.
On checking, I could see that both VM disk images had objects on OST 10, which was the one that had filled up.
Below is the command that shows which OSTs an image is striped across:
lfs getstripe /var/test.img
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_stripe_offset: 8
obdidx      objid       objid   group
     8    1815776    0x1bb4e0       0
    10    1728679    0x1a60a7       0
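The reverse lookup is also possible, i.e. listing every file that has an object on the full OST. Roughly as follows (the exact option spelling, --obd / -O / --ost, varies between Lustre releases):

# List all files in the filesystem that have at least one object on OST 10
lfs find --obd klust01-OST000a_UUID /klust01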
So finally we deleted some unused data from OST 10 and then rebooted the node serving that OST, which fixed the issue.
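As a longer-term measure, the default striping of the directory holding the images can be widened so that a single full OST is less likely to block writes. A minimal sketch, assuming a hypothetical images directory /klust01/vm-images (recent Lustre releases use -S for the stripe size; older ones use -s):

# New files created under this directory get striped across all available OSTs
# -c -1: stripe count = all OSTs, -S 1M: 1 MiB stripe size
lfs setstripe -c -1 -S 1M /klust01/vm-images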
Cheers
Syamkumar M
Hello,
I have a little question for you.
I have read in several posts that using Lustre as a storage repository can cause a big performance problem, because while QEMU reads and writes in 4k blocks, Lustre uses 512k.
What has your experience been?