On Tue, Oct 16, 2018 at 05:57:17PM -0400, John Ferlan wrote:

> On 10/11/18 4:13 AM, Bingsong Si wrote:

> > On CentOS 6, udev_monitor_receive_device will block until the socket becomes

> Is this really CentOS6 only or just where you've seen it?

> > readable, udevEventHandleThread will hold the lock all the time and

> > udevEventHandleCallback hard to get the lock, will block the event poll.

> > To fix this, set dataReady to false after receive an udev event.

> >

> > Signed-off-by: Bingsong Si <owen.si@ucloud.cn>

> > ---

> > src/node_device/node_device_udev.c | 5 +----

> > 1 file changed, 1 insertion(+), 4 deletions(-)

> >

> I've CC'd Erik since he wrote and perhaps remembers all the "gotchas" he

> discovered in the udev callback code.

> I wonder if this has to do with the EAGAIN and EWOULDBLOCK @errno checks

> done in the !device loop that are different in "older" (much older) code.

> Although I have this very vague recollection that there was some problem

> with centos6 that was fixed by some OS patch. Hopefully Erik remembers

> (and maybe we should log it in the code at this point ;-)) - I did do

> some searching, but came up empty.

Remembering a year old issue, let me tell you, my head hurts :) (and we probably

should put a note somewhere, so that we don't have to dig out dinosaurs

again)...the only thing I remember is that there was a reason why I did things

this way and not the way this patch is proposing, and indeed I then found this:

https://www.redhat.com/archives/libvir-list/2017-September/msg00683.html

TL;DR:

The scheduler comes into play here. The problem I had was that the event loop

could be scheduled (and it in fact was) earlier than the handler thread here.

What that essentially means is that by the time the thread actually handled the

event and read the data from the monitor, the event loop fired the very same

event, simply because the data hadn't been retrieved from the socket at that

point yet.

This was mainly connected to the design flaw of that specific version of patch

series. With the current design, setting dataReady immediately after reading the

data or after encoutering the first EAGAIN doesn't matter and the scheduler

wouldn't have an impact either way, that's true. However, with CentOS 6 the

scheduler would still come into play even with your patch (it was much more

noticeable the more devices you had in/added into the system), you'd still

remain blocking on the recv call. The correct fix would be more

complex and IIRC it would involve pulling the monitor object out of the private

data lockable object and would need to be guarded by a separate lock (I haven't

thought about it much though, so I might be wrong).

That said, we already dropped upstream support for CentOS 6, so I'm

not really keen on "fixing" anything, unless the currently supported platforms

suffer from a related issue which would require code changes in which case we

could merge a patch like this upstream. You should upgrade your platform to a

newer CentOS if you want to rely on features provided by new(ish) libvirt.

Erik