Troubleshooting Defunct (Zombie) Processes on Linux
Normally a zombie process means that the process has died but remains in the process table because the parent hasn’t called wait() to “reap” it and retrieve its exit status. If you kill the parent, the zombie is reparented to init (process 1), and init reaps it. But the problem we were having was clearly not this. Killing the parent did nothing, and the defunct agent appeared to be holding onto resources. It was not possible to start a new agent because the defunct one still held a socket open, which can never happen with an ordinary defunct process, since all of a process’s file descriptors are closed when it exits. The only workaround was to reboot the system.
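For comparison, the ordinary kind of zombie is easy to reproduce. Here is a minimal Python sketch, assuming a Linux system with /proc mounted. The helper names task_state and demo_zombie are made up for illustration and have nothing to do with the agent.

```python
import os
import time


def task_state(pid, proc_root="/proc"):
    """Return the one-letter state from the 'State:' line of <proc_root>/<pid>/status."""
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        for line in f:
            if line.startswith("State:"):
                # The line looks like "State:\tZ (zombie)"; the letter is the second token.
                return line.split()[1]
    return None


def demo_zombie():
    """Fork a child that exits at once, watch it turn into a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with status 7

    # Parent: the child stays in the process table as "Z" until we call waitpid().
    deadline = time.monotonic() + 5
    state = task_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = task_state(pid)

    _, status = os.waitpid(pid, 0)  # reaping removes the zombie entry
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(demo_zombie())
```

The zombie here disappears the moment waitpid() returns, which is exactly what did not happen with the agent.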
The key to the real answer is that the agent uses multiple threads with the POSIX threads library. Under Linux, individual threads and processes are both viewed as tasks by the process management code. Threads are implemented using one task designated as the “thread group leader” and a “thread group id” present in each task_struct. By default, the ps utility displays just the thread group leader, hiding the other tasks.
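The task/thread-group relationship can be seen from user space. The sketch below assumes Linux with /proc mounted and Python 3.8 or later (for threading.get_native_id); the function name worker_ids is hypothetical. It starts one extra thread and has that thread read its own entry under /proc/self/task.

```python
import os
import threading


def worker_ids():
    """Return (pid, worker_tid, worker_tgid) for one freshly started thread."""
    result = {}

    def worker():
        tid = threading.get_native_id()  # kernel task id of this thread
        # Every thread has its own directory under /proc/<pid>/task/.
        with open(f"/proc/self/task/{tid}/status") as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
        result["tid"] = tid
        result["tgid"] = int(fields["Tgid"])  # thread group id shared by all threads

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return os.getpid(), result["tid"], result["tgid"]


if __name__ == "__main__":
    pid, tid, tgid = worker_ids()
    print(f"pid={pid} worker tid={tid} worker tgid={tgid}")
```

The worker gets its own task id, but its thread group id is the PID of the process, which is the one entry ps shows by default.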
Working at the OpenFabrics Alliance interoperability event last week at UNH-IOL, we once again experienced the defunct problem. This time I attacked it in earnest and discovered “ps aumx” (the “m” being the critical option, which displays threads individually). This showed the thread group leader in the “Z” state, and then the key: another thread was stuck in the “D” (uninterruptible sleep) state. In this state the thread is sleeping in kernel mode and cannot be interrupted by any signal, including SIGKILL (signal 9, which normally cannot be ignored). Thus, this thread is unkillable until whatever condition it is waiting for in kernel mode clears. If that never happens, the only solution is to reboot the system.
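The same per-thread view can be read directly from /proc. This is a rough Python sketch, assuming Linux, where /proc/<pid>/task/<tid>/stat carries the state letter as the first field after the parenthesized command name. The function names are invented for this example.

```python
import os


def parse_state(stat_text):
    """Extract the state letter from the contents of a /proc stat file."""
    # The command name sits in parentheses and may itself contain spaces or ')',
    # so split on the last ')' and take the first field after it.
    return stat_text.rsplit(")", 1)[1].split()[0]


def thread_states(pid, proc_root="/proc"):
    """Map each thread id (TID) of `pid` to its one-letter state."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    states = {}
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                states[int(tid)] = parse_state(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
    return states


def stuck_threads(pid, proc_root="/proc"):
    """Return the sorted TIDs of threads in uninterruptible sleep ('D')."""
    return sorted(t for t, s in thread_states(pid, proc_root).items() if s == "D")


if __name__ == "__main__":
    import sys
    target = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(thread_states(target), stuck_threads(target))
```

Run against a process in the situation described above, one would expect the leader to show “Z” and at least one other TID to show “D”.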
When the SIGINT (or SIGKILL or whatever) was delivered to the agent, all the tasks (threads) received the signal. Yet one couldn’t exit because it was in “D”, so the thread group leader stayed around in the deceptive “Z” state, which in this case indicated that part of the process was still alive. This was confirmed by running ps aumx on the agent after a suspected hang but before attempting to kill it. That time the usual numerous threads were all still listed, and one of them was in state “D”.
So how to debug from here? It was possible to use the Magic SysRq Key to obtain a listing of the current tasks on the system. This displayed a stack trace showing the execution context of each task running in kernel mode, including the agent task stuck in uninterruptible sleep. Using this it was possible to determine what the task was doing which caused it to slip into a coma.
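A per-task alternative to the full SysRq dump is to look at the stuck thread's own files under /proc. This is not what we did at the event, just a sketch of a different route to similar information. It assumes Linux, that /proc/<pid>/task/<tid>/wchan and /proc/<pid>/task/<tid>/stack are available (the stack file usually needs root and kernel support), and the function name kernel_context is made up.

```python
import os


def kernel_context(pid, tid, proc_root="/proc"):
    """Return what the kernel exposes about where task `tid` of `pid` is waiting."""
    base = os.path.join(proc_root, str(pid), "task", str(tid))
    info = {}
    for name in ("wchan", "stack"):
        try:
            with open(os.path.join(base, name)) as f:
                info[name] = f.read().strip()
        except OSError:
            # File missing, task gone, or insufficient privileges.
            info[name] = None
    return info


if __name__ == "__main__":
    import sys
    pid, tid = int(sys.argv[1]), int(sys.argv[2])
    for key, value in kernel_context(pid, tid).items():
        print(f"{key}: {value}")
```

Given the PID of the agent and the TID of the thread in “D”, the output would name the kernel function the thread is sleeping in, when the kernel is willing to reveal it.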
I wanted to make sure this got written up because I spent way too much time searching Google and finding pages that described the usual meaning of “defunct” processes but didn’t touch on this deceptive alternate meaning. So hopefully now it’ll be found! Extra thanks to Professor Robert Russell, who helped me troubleshoot all of this.