
Monday 24 February 2014

Fix Sun Grid Engine state 'E' (error)

Hi All,

Some of my nodes running Sun Grid Engine went into an error state, shown as E. The state shows up when I run the command

qstat -f 

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
m1.large@machine1.com          BIP   0/0/1          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
m1.large@machine2.com          BIP   0/0/1          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
m1.large@machine3.com          BIP   0/0/1          0.05     lx26-amd64    E
---------------------------------------------------------------------------------
m1.large@machine4.com          BIP   0/0/1          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------



As shown above, machine3 is in the E state. According to the Sun Grid Engine documentation, E is an error state.
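On a larger cluster it can be tedious to scan the listing by eye. Here is a minimal sketch of a filter that prints only the queue instances whose states column contains E. The function name queues_in_error is my own (it is not part of Grid Engine), and it assumes the six-column layout shown above, where the state is the sixth field and the queue name contains an @.

```shell
# Print the names of queue instances whose state column contains "E".
# Reads `qstat -f` output on standard input.
# Assumes the layout shown above: queue name (containing "@") in field 1
# and the state letters in field 6. Queues with no state have only 5 fields.
queues_in_error() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}
```

It would be used as: qstat -f | queues_in_error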



You can get the exact reason for the error state by running the command

qstat -f -explain E

The E state does not clear on its own. It stays set until it is cleared explicitly.

Because of this error, no jobs are dispatched to that node, and jobs that would have run there stay pending in the queue wait (qw) state.

Once you have resolved the underlying issue on that node, run the command

# qmod -c '*'

This clears the E state of the node, and jobs can be dispatched to it again.
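The '*' wildcard clears the error state everywhere. If you would rather clear only the queue instances you have actually fixed, something like the sketch below may help. The function name clear_queue_errors and the DRY_RUN variable are my own inventions for illustration. It uses qmod -cq, which clears the error state of the named queues only. With DRY_RUN=1 it just prints the commands instead of running them.

```shell
# Clear the error state of the queue instances given as arguments.
# DRY_RUN is a hypothetical switch for this sketch: when set to 1,
# the qmod commands are printed rather than executed.
clear_queue_errors() {
    for q in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "qmod -cq $q"
        else
            qmod -cq "$q"
        fi
    done
}
```

For the node in this post, that would be: clear_queue_errors m1.large@machine3.com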

Cheers
Syamkumar.M

