On Tue, 12 Jun 2018 11:17:15 +0200
Michal Privoznik <mprivozn(a)redhat.com> wrote:
On 06/12/2018 12:36 AM, Eduardo Habkost wrote:
> CCing libvir-list.
>
> On Mon, Jun 11, 2018 at 11:29:24PM +0200, Igor Mammedov wrote:
>> On Mon, 11 Jun 2018 16:06:07 -0300
>> Eduardo Habkost <ehabkost(a)redhat.com> wrote:
>>
>>> On Mon, Jun 11, 2018 at 03:16:25PM +0200, Igor Mammedov wrote:
>>>> On Fri, 8 Jun 2018 10:21:05 -0300
>>>> Eduardo Habkost <ehabkost(a)redhat.com> wrote:
>>>>
>>>>> On Thu, Jun 07, 2018 at 02:00:09PM +0200, Igor Mammedov wrote:
>>>>>> When using --daemonize, the initial lead process will fork a
child and
>>>>>> then wait to be notified that setup is complete via a pipe,
before it
>>>>>> exits. When using --preconfig there is an extra call to
main_loop()
>>>>>> before the notification is done from os_setup_post(). Thus the
parent
>>>>>> process won't exit until the mgmt application connects to
the monitor
>>>>>> and tells QEMU to leave the RUN_STATE_PRECONFIG. The mgmt
application
>>>>>> won't connect to the monitor until daemonizing has completed
though.
>>>>>>
>>>>>> This is a chicken and egg problem, leading to deadlock at
startup.
>>>>>>
>>>>>> The only viable way to fix this is to call os_setup_post()
before
>>>>>> the early main_loop() call when --preconfig is used. This has
the
>>>>>> downside that any errors from this point onwards won't be
handled
>>>>>> well by the mgmt application, because it will think QEMU has
started
>>>>>> successfully, so not be expecting an abrupt exit. Moving as much
user
>>>>>> input validation as possible to before the main_loop() call
might help,
>>>>>> but mgmt application should stop assuming that QEMU has started
>>>>>> successfuly and use other means to collect errors from QEMU
(logfile).
>>>>>>
>>>>>> Signed-off-by: Daniel P. Berrangé <berrange(a)redhat.com>
>>>>>> Signed-off-by: Igor Mammedov <imammedo(a)redhat.com>
>>>>>> ---
>>>>>> v5:
>>>>>> * use original Daniel's patch [1], but addapt it to apply
on top of
>>>>>> "[PATCH v3 1/2] cli: Don't run early event loop if
no --preconfig was specified"
>>>>>> with extra comment and massage commit message a little bit.
>>>>>> v6:
>>>>>> * hide os_setup_post_done flag inside of os_setup_post() as it
was in v4
>>>>>>
>>>>>> CC: berrange(a)redhat.com
>>>>>> CC: mreitz(a)redhat.com
>>>>>> CC: pbonzini(a)redhat.com
>>>>>> CC: ehabkost(a)redhat.com
>>>>>> CC: ldoktor(a)redhat.com
>>>>>> CC: eblake(a)redhat.com
>>>>>> ---
>>>>>> os-posix.c | 6 ++++++
>>>>>> vl.c | 6 ++++++
>>>>>> 2 files changed, 12 insertions(+)
>>>>>>
>>>>>> diff --git a/os-posix.c b/os-posix.c
>>>>>> index 9ce6f74..0246195 100644
>>>>>> --- a/os-posix.c
>>>>>> +++ b/os-posix.c
>>>>>> @@ -309,8 +309,14 @@ void os_daemonize(void)
>>>>>>
>>>>>> void os_setup_post(void)
>>>>>> {
>>>>>> + static bool os_setup_post_done;
>>>>>> int fd = 0;
>>>>>>
>>>>>> + if (os_setup_post_done) {
>>>>>> + return;
>>>>>> + }
>>>>>> + os_setup_post_done = true;
>>>>>> +
>>>>>> if (daemonize) {
>>>>>> if (chdir("/")) {
>>>>>> error_report("not able to chdir to /:
%s", strerror(errno));
>>>>>> diff --git a/vl.c b/vl.c
>>>>>> index fa44138..457ff2a 100644
>>>>>> --- a/vl.c
>>>>>> +++ b/vl.c
>>>>>> @@ -4578,6 +4578,12 @@ int main(int argc, char **argv, char
**envp)
>>>>>> parse_numa_opts(current_machine);
>>>>>>
>>>>>> /* do monitor/qmp handling at preconfig state if requested
*/
>>>>>> + if (!preconfig_exit_requested && is_daemonized())
{
>>>>>> + /* signal parent QEMU to exit, libvirt treats it as a
sign
>>>>>> + * that monitor socket is ready to accept connections
>>>>>> + */
>>>>>> + os_setup_post();
>>>>>> + }
>>>>>
>>>>> I was looking at the daemonize logic, and noticed it we have a
>>>>> huge amount of code between this line and the next
>>>>> os_setup_post() call that could either:
>>>>>
>>>>> * call exit() and/or error_report(); or
>>>> logging would work to the extent mentioned in commit message,
>>>> i.e. it' would work fine when log file is used otherwise it
>>>> errors will go to /dev/null
>>>>
>>>> so it should be more or less fine on this point
>>>
>>> My worry is that most users of error_report() involve an exit()
>>> call too.
>>>
>>> Once we have an active monitor, we must never call exit()
>>> directly. Even qmp_quit() doesn't call exit() directly.
>> Is there any reason why exit() can't be called?
>
> QMP clients don't expect the QMP socket to be closed except when
> using the 'quit' command.
Libvirt views HANGUP on monitor socket as qemu process dying
unexpectedly so it starts cleanup (which involves 'kill -9' of qemu).
So if
we exit(1) there is a chance to get SIGKILL before exit(1) completes.
Do we care about it at this point?
(there are places when QEMU calls exit(1) at runtime on unrecoverable error)
>>>>> * be unable to finish machine initialization
because of
>>>>> chdir("/"), change_root(), or change_process_uid().
>>>> this one really no go.
>>>> I see 2 options here,
>>>>
>>>> * move init code that opens files to early stage (before preconfig
monitor)
>>>> or split it to open files early.
>>>> (I've spotted several obvious places
fwcfg/vnc/replay_start/migration)
>>>> but there might be code somewhere in callbacks that would do it too,
>>>> so it rather risky to go this route.
>>>> (I'd do this anyways one place at the time using sanitizing
>>>> initialization sequence pretext.)
>>>
>>> We might have QMP commands that take file paths as input, so is
>>> this really an option?
>> I'd think that in future we would want to enable object_add in preconfig
>> to create backends at runtime, so yes we can't do chroot at this point
>>
>>
>>>> * split out signaling part that tells parent process to exit into
>>>> separate helper that's called once before/from main_loop().
>>>> This option seems low risk and additionally error output to
>>>> stderr will work as it does currently (until os_setup_post())
>>>
>>> My assumption is that separating the chdir()/stdout/stderr logic
>>> from the fork/daemonize/exit steps wouldn't be possible without
>>> breaking expectations about -daemonize.
>> it's already separated and that's what creates one side of problem.
>
> Is it? Right now '$QEMU -daemonize' will never call exit(0)
> before the child process it spawned did the
> chdir()/stdout/stderr/etc trick.
>
>
>> What I suggest is to leave it as is and move out only
>> len = write(daemon_pipe, &status, 1);
>> part of os_setup_post() to sync with parent process. That shouldn't
>> affect daemonizing flow on QEMU side and would let libvirt reuse parent's
>> exit as sync point to detect moment when monitor is available.
>> (patch is in testing, I'll post it tomorrow if it doesn't break tests)
>
> This will affect the daemonizing flow, won't it? It will make
> QEMU exit(0) before the child does the chdir()/stderr/stdout/etc
> cleanup. A well-behaved daemon shouldn't do this.
>
> This is probably not a problem for libvirt (which only uses
> -daemonize as a sync point for QMP), but possibly a problem for
> other users of -daemonize.
I think the proper behaviour is to exit(0) after chdir()/stderr/... AND
monitor is set up. Then, if any caller started qemu with -preconfig they
know they can start talking on monitor, set up whatever it is they want
and issue 'cont' finally (or what is the right command to exit preconfig
state). This way nothing changes for callers not using -preconfig.
As pointed out
earlier we need -preconfig stay before chdir/chroot/chuid/stderr/stdout
are called, so it would have the same access rights/permissions for configuration
as CLI options.
>> In worst case if we can't do the later in QEMU, mgmt
would have to cope with
>> monitor in preconfig mode without relying on parent exit(0) sync point.
>> (a typical daemon would fork/chroot and co in one place and clients would use
>> other means to detect socket availability other than watching parent process
>> exiting)
>
> Do we really need to make -daemonize and -preconfig work
> together? libvirt uses -daemonize only for its initial
> capability probing, which shouldn't require -preconfig at all.
>
> Even on that case, I wonder why libvirt doesn't simply create a
> server socket and waits for QEMU to connect instead of using
> -daemonize as a sync point.
>
because libvirt views qemu as well behaved daemon. Should anything go
wrong libvirt reads qemu's stderr and reports error to upper layers.
We can
keep daemonizing flow in QEMU as it's now.
But Eduardo's idea about libvirt created socked + letting QEMU connect to it
has a merit. It should fix current deadlock issue with as monitor
won't be depending on lead exit event.
Can we do this way on libvirt side when --preconfig is in use
(it might even be fine for normal flow without -preconfig)?
Michal