On Thu, Apr 30, 2020 at 07:36:54PM +0200, Christian Ehrhardt wrote:
On Thu, Apr 30, 2020 at 7:23 PM Daniel P. Berrangé
<berrange(a)redhat.com> wrote:
>
> On Thu, Apr 30, 2020 at 06:28:08PM +0200, Christian Ehrhardt wrote:
> > On Thu, Apr 30, 2020 at 5:10 PM Daniel P. Berrangé <berrange(a)redhat.com>
wrote:
> > >
> > > On Thu, Apr 30, 2020 at 04:58:25PM +0200, Christian Ehrhardt wrote:
> > > > On Thu, Apr 30, 2020 at 2:51 PM Daniel P. Berrangé
<berrange(a)redhat.com> wrote:
> > > > Well it seems I have a reproducible symptom and a fix, but not the
> > > > explanation why the latter fixes the former.
> > > > I'll need to dive into some debug & analysis myself to
explain it better.
> > > > I'll be back here once I got time to do that in depth check.
> > > >
> > > > Until then whoever is affected (should be everyone) can give it a
> > > > thought as well.
> > > > Repro is as easy as
> > > > One console:
> > > > $ journalctl -f -u libvirt-guests
> > > > Other console:
> > > > $ systemctl stop libvirt-guests
> > > > $ systemctl start libvirt-guests
> > > >
> > > > I see it with 245.4-4ubuntu3 (18.04) I'm not seeing it on
> > > > 237-3ubuntu10.39 (20.04).
> > > > Maybe it is a systemd bug after all?
> > > > I'd be interested to hear if that is/isn't clobbering output
for
> > > > anyone else and what your systemd versions are?
> > >
> > > FWIW, it works correctly on Fedora 31 with systemd 243.
> >
> > Eoan with 242-7ubuntu3.7 is good as well.
> > I might need to try to get some interim versions from somewhere.
>
> I've reproduced on Fedora 33 rawhide with systemd 245 - the first
> place where it lists running guests is screwed up slightly:
>
> Apr 30 17:16:13 libvirt-fedora-rawhide systemd[1]: Stopping Suspend/Resume Running
libvirt Guests...
> Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69903]: Running guests on
default URI:
> Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Runningcore2
> Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69934]: Suspending guests
on default URI...
> Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69892]: SSuspending core1:
...
> Apr 30 17:16:14 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core1:
done
> Apr 30 17:16:14 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core2:
...
> Apr 30 17:16:15 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core2:
done
> Apr 30 17:16:15 libvirt-fedora-rawhide systemd[1]: libvirt-guests.service:
Succeeded.
> Apr 30 17:16:15 libvirt-fedora-rawhide systemd[1]: Stopped Suspend/Resume Running
libvirt Guests.
>
> On resume it is even worse
>
> Apr 30 17:19:40 libvirt-fedora-rawhide systemd[1]: Starting Suspend/Resume Running
libvirt Guests...
> Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70041]: Resuming guests on
default URI...
> Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70030]: R
> Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70048]: R
> Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70048]: esuming guest
core1:
> Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70079]: esum
> Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70030]: e
> Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70086]: e
> Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70086]: esuming guest
core2:
> Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70119]: esum
> Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70030]: e
> Apr 30 17:19:42 libvirt-fedora-rawhide systemd[1]: Finished Suspend/Resume Running
libvirt Guests.
>
>
> It is possible it isn't systemd related - could be other packages that
> are co-incidentally affecting it.
Since launchpad holds all former builds still ready for download I
could easily test a few versions.
I up/downgraded just the following packages on an otherwise unmodified system:
- libnss-systemd
- libpam-systemd
- libsystemd0
- systemd
- systemd-container
- systemd-sysv
Those versions I did check:
245.4-4ubuntu3 bad
245.2-1ubuntu1 bad
244.3-1ubuntu1 bad
244.1-0ubuntu3 bad (bad on retry)
243-3ubuntu1 good (good (on retry)
242-7ubuntu3 good
Retry means that I went and installed from good -> bad -> good and the
behavior was the same.
So none of the installs fixed it to then be good forever. It was
consistently good <244 and bad >=244.
Based on that, a possibly interesting change between v243..v244 was
this:
commit 09d0b46ab61bebafe5bdc1be95ee153dfb13d6bc
Author: Lorenz Bauer <lmb(a)cloudflare.com>
Date: Mon Nov 4 16:35:46 2019 +0000
journal: refresh cached credentials of stdout streams
journald assumes that getsockopt(SO_PEERCRED) correctly identifies the
process on the remote end of the socket. However, this is incorrect
according to man 7 socket:
The returned credentials are those that were in effect at the
time of the call to connect(2) or socketpair(2).
This becomes a problem when a new process inherits the stdout stream
from a parent. First, log messages from the child process will
be attributed to the parent. Second, the struct ucred used by journald
becomes invalid as soon as the parent exits. Further sendmsg calls then
fail with ENOENT. Logs for the child process then vanish from the journal.
Fix this by using recvmsg on the stdout stream, and refreshing the cached
struct ucred if SCM_CREDENTIALS indicate a new process.
but I can't see an obvious mistake the code which would cause the
behaviour we see.
Regards,
Daniel
--
|:
https://berrange.com -o-
https://www.flickr.com/photos/dberrange :|
|:
https://libvirt.org -o-
https://fstop138.berrange.com :|
|:
https://entangle-photo.org -o-
https://www.instagram.com/dberrange :|