On Fri, Oct 08, 2021 at 03:13:27PM +0200, Michal Prívozník wrote:
> On 10/8/21 2:23 PM, Daniel P. Berrangé wrote:
>> On Fri, Oct 08, 2021 at 01:56:24PM +0200, Michal Prívozník wrote:
>>> On 10/8/21 1:36 PM, Daniel P. Berrangé wrote:
>>
>> So, ACK to this patch as a quick fix, but it feels like we
>> ultimately have a more serious problem.
>>
>> I wonder if we should temporarily stop watching events on
>> the incoming server if we get EMFILE first time on accept,
>> and set a timer to re-enable events 60 seconds later. Rinse,
>> repeat.
> Yes, good idea. But I worry that 60 seconds may be too long. Or
> needlessly short - depending on usage. Nevertheless, let us start with
> 60 seconds and we can fine-tune later.
Once we're in this out-of-FDs situation, we're in a bad mess no
matter what, so the length of the timeout isn't a big issue IMHO.
Essentially we just need to stop burning CPU gratuitously.
Even 5 seconds is long enough to achieve that goal, so just
pick a number you're happy with - I don't feel strongly about
it being 60 seconds, it was just an arbitrary number I picked.
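
Something along these lines is what I had in mind - a completely
untested sketch in plain GLib terms rather than our own event APIs,
with made-up names (struct incoming, RETRY_SECS, the callbacks) just
to show the shape of it:

  #include <glib.h>
  #include <glib-unix.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <errno.h>

  #define RETRY_SECS 60              /* or 5 - anything that stops the spin */

  struct incoming {
      int listen_fd;                 /* the listening socket */
      guint watch;                   /* id of the POLLIN watch, 0 while paused */
  };

  static gboolean on_incoming(gint fd, GIOCondition cond, gpointer opaque);

  /* Timer callback: re-arm the POLLIN watch on the listening socket. */
  static gboolean
  resume_incoming(gpointer opaque)
  {
      struct incoming *inc = opaque;

      inc->watch = g_unix_fd_add(inc->listen_fd, G_IO_IN, on_incoming, inc);
      return G_SOURCE_REMOVE;
  }

  static gboolean
  on_incoming(gint fd, GIOCondition cond G_GNUC_UNUSED, gpointer opaque)
  {
      struct incoming *inc = opaque;
      int client = accept(fd, NULL, NULL);

      if (client < 0) {
          if (errno == EMFILE || errno == ENFILE) {
              /* Out of FDs: drop the watch so the main loop stops
               * spinning on POLLIN, and retry once the timer fires. */
              inc->watch = 0;
              g_timeout_add_seconds(RETRY_SECS, resume_incoming, inc);
              return G_SOURCE_REMOVE;
          }
          return G_SOURCE_CONTINUE;  /* transient error, keep listening */
      }

      /* Real code would register the new client with the server here. */
      close(client);
      return G_SOURCE_CONTINUE;
  }

The exact re-enable policy doesn't matter much; the point is just that
the listener stops generating POLLIN wakeups while accept() cannot
succeed.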
> So are you okay with me pushing this patch and sending this ^^ as a
> follow-up?
Yes, consider it Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>>
>>
>>>>> * Each event source is assigned a priority. The default priority,
>>>>> * %G_PRIORITY_DEFAULT, is 0. Values less than 0 denote higher priorities.
>>>>> * Values greater than 0 denote lower priorities. Events from high priority
>>>>> * sources are always processed before events from lower priority sources.
>>>>>
>>>>> and per g_idle_add() documentation:
>>>>>
>>>>> * Adds a function to be called whenever there are no higher priority
>>>>> * events pending to the default main loop. The function is given the
>>>>> * default idle priority, %G_PRIORITY_DEFAULT_IDLE.
>>>>>
>>>>> Now, because we did not accept() the client we are constantly
>>>>> seeing POLLIN on the main socket and thus the removal of the
>>>>> client socket won't ever happen.
>>>>>
>>>>> The fix is to set at least the same priority as other sources,
>>>>> but since we want to just close an FD, let's give it the highest
>>>>> priority and call it before handling other events.
>>>>>
>>>>> This issue can be easily reproduced, for instance:
>>>>>
>>>>> # ulimit -S -n 40 (tweak this number if needed)
>>>>> # ./src/libvirtd
>>>>>
>>>>> from another terminal:
>>>>>
>>>>> # for ((i=0; i<100; i++)); do virsh list & done; virsh list
>>>>>
>>>>> The last `virsh list` must not get stuck.
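
As an aside, in plain GLib terms the priority difference described
above amounts to roughly the following; this is only an illustrative
sketch with made-up names (close_client_cb, schedule_client_close),
not the actual patch:

  #include <glib.h>
  #include <unistd.h>

  /* Hypothetical stand-in for the deferred "close this client" callback. */
  static gboolean
  close_client_cb(gpointer opaque)
  {
      close(GPOINTER_TO_INT(opaque));
      return G_SOURCE_REMOVE;          /* one-shot source */
  }

  static void
  schedule_client_close(int fd)
  {
      /* Plain g_idle_add() queues the callback at G_PRIORITY_DEFAULT_IDLE
       * (200), i.e. below the default-priority I/O watches, so it starves
       * for as long as POLLIN keeps firing on the listening socket:
       *
       *   g_idle_add(close_client_cb, GINT_TO_POINTER(fd));
       *
       * Queueing it at a higher priority gets it dispatched before the
       * default-priority sources are looked at again: */
      g_idle_add_full(G_PRIORITY_HIGH, close_client_cb,
                      GINT_TO_POINTER(fd), NULL);
  }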
>>>>
>>>> The bug below describes libvirt core dumps rather than
>>>> hangs. Do you know why it dumped?
>>>
>>> Because of glib. We called some glib function which internally wanted to
>>> open a pipe which failed. Thus glib aborted. I don't remember the exact
>>> function, but I can look it up if you want.
>>
>> Oh, so this patch is not actually fixing the core dump scenario
>> then. Even after we fix the bug that prevents clients disconnecting,
>> it will still dump if this API is called at a point where we are in
>> an EMFILE state?
> Yes. But I am afraid there's no way around it. Found the stack trace:
>
> Stack trace of thread 52231:
> #0 0x000003ffa90be9e4 raise (libc.so.6)
> #1 0x000003ffa96590ee g_log_default_handler (libglib-2.0.so.0)
> #2 0x000003ffa965935e g_logv (libglib-2.0.so.0)
> #3 0x000003ffa965955a g_log (libglib-2.0.so.0)
> #4 0x000003ffa969dd70 g_wakeup_new (libglib-2.0.so.0)
> #5 0x000003ffa964eb38 g_main_context_new (libglib-2.0.so.0)
> #6 0x000003ffa9946886 vir_event_thread_init (libvirt.so.0)
> #7 0x000003ffa95b5ff8 g_type_create_instance (libgobject-2.0.so.0)
> #8 0x000003ffa95978cc g_object_new_internal (libgobject-2.0.so.0)
> #9 0x000003ffa95990de g_object_new_with_properties (libgobject-2.0.so.0)
> #10 0x000003ffa9599ab6 g_object_new (libgobject-2.0.so.0)
> #11 0x000003ffa9946b78 virEventThreadNew (libvirt.so.0)
> #12 0x000003ff6c2c8bd8 qemuProcessQMPNew (libvirt_driver_qemu.so)
> #13 0x000003ff6c1f57b8 virQEMUCapsInitQMPSingle (libvirt_driver_qemu.so)
> #14 0x000003ff6c1f59ec virQEMUCapsNewForBinaryInternal (libvirt_driver_qemu.so)
> #15 0x000003ff6c1f5c92 virQEMUCapsNewData (libvirt_driver_qemu.so)
> #16 0x000003ffa99510f4 virFileCacheValidate (libvirt.so.0)
> #17 0x000003ffa995133c virFileCacheLookup (libvirt.so.0)
> #18 0x000003ff6c1f5e4a virQEMUCapsCacheLookup (libvirt_driver_qemu.so)
> #19 0x000003ff6c1f621e virQEMUCapsCacheLookupCopy (libvirt_driver_qemu.so)
> #20 0x000003ff6c2c4a44 qemuProcessInit (libvirt_driver_qemu.so)
> #21 0x000003ff6c2c67f2 qemuProcessStart (libvirt_driver_qemu.so)
> #22 0x000003ff6c26358c qemuDomainObjStart.constprop.55 (libvirt_driver_qemu.so)
> #23 0x000003ff6c2639c8 qemuDomainCreateWithFlags (libvirt_driver_qemu.so)
> #24 0x000003ffa9b77902 virDomainCreate (libvirt.so.0)
> #25 0x000002aa34ad2d52 remoteDispatchDomainCreateHelper (libvirtd)
> #26 0x000003ffa9a6094a virNetServerProgramDispatch (libvirt.so.0)
> #27 0x000003ffa9a665b6 virNetServerHandleJob (libvirt.so.0)
> #28 0x000003ffa99a0d32 virThreadPoolWorker (libvirt.so.0)
> #29 0x000003ffa99a02d6 virThreadHelper (libvirt.so.0)
> #30 0x000003ffa85880c6 start_thread (libpthread.so.0)
> #31 0x000003ffa917bd36 thread_start (libc.so.6)
> So my bad, it failed opening an eventfd (in frame #4, g_wakeup_new()), but
> anyway, I don't think we can mitigate this. But I think there was
> another stack trace attached to the BZ where it failed in glib's pipe().
> It doesn't really matter though. It's glib - we accepted its abort policy
> when we adopted it.
Yeah, we're basically doomed with FD exhaustion. If a host got into
this state once, it's likely to do it again, so the only real fix is
to re-configure the host to have sufficient FDs. Anything we do to
cope with FD exhaustion is just a "nice to have", not a real fix.
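
FWIW, the abort itself is easy to demonstrate outside libvirt; a rough
sketch (not the reproducer from the BZ) that starves a process of FDs
and then asks GLib for a new main context:

  #include <glib.h>
  #include <sys/resource.h>

  int main(void)
  {
      /* Keep only stdin/stdout/stderr; any new FD would need to be >= 3,
       * which the limit now forbids, so eventfd()/pipe() fail with EMFILE. */
      struct rlimit rl = { .rlim_cur = 3, .rlim_max = 3 };
      setrlimit(RLIMIT_NOFILE, &rl);

      /* g_wakeup_new() cannot get its eventfd/pipe and GLib calls g_error(),
       * which aborts - there is no error return to handle here. */
      g_main_context_new();

      return 0;
  }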
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|