[libvirt] two nasty races

I found the root cause of two nasty races today, but ran out of time to write the patches until next week.

1. We have a TOCTTOU race when moving processes between cgroups. virCgroupMoveTask can fail if listing the threads in the parent group picks up multiple threads, then one of those threads exits before we write to the emulator subgroup to relocate those threads; the chain reaction from this failure is that the domain fails to start. qemu spawns short-lived threads all the time (possibly due to glibc using threads under the hood to implement aio as qemu first probes a disk image); with a 3-second-delay hook script, I was able to hit this race about 5% of the time when repeatedly starting multiple domains in parallel. The fix should be to ignore any ESRCH failures when writing during virCgroupAddTaskStrController, but I plan to test that before posting my patch. Present at least since 0.10.2 (Fedora 18); probably earlier (but I didn't check how far back).

2. We have a (rare) deadlock due to the use of a non-async-signal-safe getpwuid_r in between fork and exec. I was extremely surprised when I hit it in real life while trying to debug the first race; but sure enough, the backtrace confirms that glibc grabs a mutex inside getpwuid_r, and if some other thread in the parent process is interrupted in the middle of getpwuid_r while we are forking a child in our thread, then our child is hung; and since the parent depends on a handshake with the child, the parent's worker thread was also hung. I think the fix to this is to call getpwuid_r prior to fork, and change virSetUIDGID to take the list of supplemental groups as a parameter rather than trying to learn it on the fly. Again, something I plan to test once I have the patch written. Present at least since commit f42cf7cb (Dec 2010) with the introduction of virSetUIDGID, and probably further back.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
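For illustration, an untested sketch of the ESRCH-tolerant write proposed in point 1; the helper name here is hypothetical, and this is not the actual virCgroupAddTaskStrController code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the real virCgroupAddTaskStrController:
 * append one thread ID to a cgroup "tasks" file and ignore ESRCH,
 * which the kernel returns if the thread has already exited. */
static int
add_task_tolerant(const char *tasks_path, pid_t tid)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%lld\n", (long long) tid);
    int fd = open(tasks_path, O_WRONLY);
    if (fd < 0)
        return -1;

    int ret = 0;
    if (write(fd, buf, len) < 0 && errno != ESRCH)
        ret = -1;   /* real failure */
    /* errno == ESRCH: the thread exited between listing the parent
     * group and this write, so there is nothing left to move. */

    close(fd);
    return ret;
}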

On 05/17/2013 10:05 PM, Eric Blake wrote:
I found the root cause of two nasty races today, but ran out of time to write the patches until next week.
2. We have a (rare) deadlock due to the use of a non-async-signal-safe getpwuid_r in between fork and exec.
I was also surprised to learn today that setgid() is required to be async-signal-safe, but the more powerful setregid() is not. And guess which one we use :(  But I think it is a bug in POSIX, so I filed this:

http://austingroupbugs.net/view.php?id=699

Also, some good news - we have previously remarked on this list that strlen() and friends are surprisingly not required to be async-signal-safe, but we went ahead and used them anyway. Thankfully, sanity has prevailed, and the next version of POSIX will require them to be async-signal-safe:

http://austingroupbugs.net/view.php?id=692

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
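For illustration, an untested sketch of the pre-fork approach proposed for race 2: look up the target user's supplemental groups before fork() so that nothing between fork and exec has to call into glibc's mutex-protected NSS code. The helper names are hypothetical, the sketch assumes glibc's getgrouplist(), and it is not libvirt's actual virSetUIDGID change:

#include <grp.h>
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of the proposed fix: resolve the supplemental
 * group list in the parent, before fork(). */
static int
prefork_get_groups(uid_t uid, gid_t gid, gid_t *groups, int *ngroups)
{
    char buf[16384];                /* crude but sufficient for a sketch */
    struct passwd pw, *res = NULL;

    if (getpwuid_r(uid, &pw, buf, sizeof(buf), &res) != 0 || res == NULL)
        return -1;

    /* On entry *ngroups is the capacity of groups[]; on return it is
     * the number of groups the user belongs to (including gid). */
    if (getgrouplist(res->pw_name, gid, groups, ngroups) < 0)
        return -1;                  /* caller's array was too small */
    return 0;
}

/* Runs in the child between fork() and exec(): plain syscalls only,
 * consuming the list that was gathered before the fork. */
static int
child_set_ids(uid_t uid, gid_t gid, const gid_t *groups, int ngroups)
{
    if (setgroups(ngroups, groups) < 0 ||
        setgid(gid) < 0 ||
        setuid(uid) < 0)
        return -1;
    return 0;
}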

On Fri, May 17, 2013 at 10:05:25PM -0600, Eric Blake wrote:
I found the root cause of two nasty races today, but ran out of time to write the patches until next week.
1. We have a TOCTTOU race when moving processes between cgroups. virCgroupMoveTask can fail if listing the threads in the parent group picks up multiple threads, then one of those threads exits before we write to the emulator subgroup to relocate those threads; the chain reaction from this failure is that the domain fails to start. qemu spawns short-lived threads all the time (possibly due to glibc using threads under the hood to implement aio as qemu first probes a disk image); with a 3-second-delay hook script, I was able to hit this race about 5% of the time when repeatedly starting multiple domains in parallel. The fix should be to ignore any ESRCH failures when writing during virCgroupAddTaskStrController, but I plan to test that before posting my patch. Present at least since 0.10.2 (Fedora 18); probably earlier (but I didn't check how far back).
It is worse than that - you can't simply ignore ESRCH. In the same way that existing processes can exit, new processes can come into existence. The virCgroupMoveTask method doesn't take any of this into account. The real problem is that we should not be trying to move QEMU after it has been started - we should aim to get it into the right place right away.
2. We have a (rare) deadlock due to the use of a non-async-signal-safe getpwuid_r in between fork and exec. I was extremely surprised when I hit it in real life while trying to debug the first race; but sure enough, the backtrace confirms that glibc grabs a mutex inside getpwuid_r, and if some other thread in the parent process is interrupted in the middle of getpwuid_r while we are forking a child in our thread, then our child is hung; and since the parent depends on a handshake with the child, the parent's worker thread was also hung. I think the fix to this is to call getpwuid_r prior to fork, and change virSetUIDGID to take the list of supplemental groups as a parameter rather than trying to learn it on the fly. Again, something I plan to test once I have the patch written. Present at least since commit f42cf7cb (Dec 2010) with the introduction of virSetUIDGID, and probably further back.
Hmm, interesting. Nasty issue.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/20/2013 03:26 AM, Daniel P. Berrange wrote:
It is worse than that - you can't simply ignore ESRCH. In the same way that existing processes can exit, new processes can come into existence. The virCgroupMoveTask method doesn't take any of this into account.
Are you talking about the potential for reusing a pid? Remember, cgroups is Linux only, and on Linux, we know the kernel behavior - pids (well, really thread ids) are assigned sequentially to the next available opening, and reusing a tid requires cycling through the entire pid space, something that can't happen in a short amount of time.

The problem of pid recycling is more prevalent on other systems, like mingw, where pids are not assigned sequentially, and really can be reused the very next instruction after a previous process pid has been reclaimed. But since we know Linux does not have that behavior, is it really a problem for us?

Ignoring ESRCH will solve the problem of a thread that dies between read and write, but like you said, won't solve the problem of a thread being created between read and write. For that, you'd have to repeatedly read the source file until it is empty.
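For illustration, an untested sketch of that drain-until-empty loop; the names are hypothetical (add_task_tolerant is the ESRCH-ignoring writer sketched earlier in the thread), and this is not the real virCgroupMoveTask:

#include <stdio.h>
#include <sys/types.h>

/* From the earlier sketch: write one tid, ignoring ESRCH. */
static int add_task_tolerant(const char *tasks_path, pid_t tid);

/* Keep re-reading the source group's tasks file and moving whatever
 * shows up, so threads created between one read and the matching
 * writes are caught on a later pass. */
static int
drain_tasks(const char *src_tasks, const char *dst_tasks)
{
    for (;;) {
        FILE *fp = fopen(src_tasks, "r");
        if (fp == NULL)
            return -1;

        long long tid;
        int moved = 0;
        while (fscanf(fp, "%lld", &tid) == 1) {
            if (add_task_tolerant(dst_tasks, (pid_t) tid) < 0) {
                fclose(fp);
                return -1;
            }
            moved++;
        }
        fclose(fp);

        if (moved == 0)
            return 0;   /* the source group is finally empty */
    }
}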
The real problem is that we should not be trying to move QEMU after it has been started - we should aim to get it into the right place right away.
It would also be nice if the kernel would allow a rename(old, new) as a way to atomically move a group of thread ids between two cgroups, instead of forcing us to use the racy read/write interfaces. But I don't know enough about kernel development to even know where to propose this idea.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Mon, May 20, 2013 at 10:09:20AM -0600, Eric Blake wrote:
On 05/20/2013 03:26 AM, Daniel P. Berrange wrote:
It is worse than that - you can't simply ignore ESRCH. In the same way that existing processes can exit, new processes can come into existence. The virCgroupMoveTask method doesn't take any of this into account.
Are you talking about the potential for reusing a pid? Remember, cgroups is Linux only, and on Linux, we know the kernel behavior - pids (well, really thread ids) are assigned sequentially to the next available opening, and reusing a tid requires cycling through the entire pid space, something that can't happen in a short amount of time.
The problem of pid recycling is more prevalent on other systems, like mingw, where pids are not assigned sequentially, and really can be reused the very next instruction after a previous process pid has been reclaimed. But since we know Linux does not have that behavior, is it really a problem for us?
Ignoring ESRCH will solve the problem of a thread that dies between read and write, but like you said, won't solve the problem of a thread being created between read and write. For that, you'd have to repeatedly read the source file until it is empty.
No, I mean that the code does:

  pids = read(tasks file in old group)
  write(pids, tasks file in new group)

In between steps 1 and 2, more tasks can have been added to the old group. So you'll not have moved everything out of the old group when you're done.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|