On Mon, Sep 22, 2025 at 2:59 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Mon, Sep 22, 2025 at 11:30:46 +0800, Yong Huang wrote:
> On Fri, Sep 19, 2025 at 8:23 PM Peter Krempa <pkrempa@redhat.com> wrote:
>
> > On Fri, Sep 19, 2025 at 17:09:07 +0800, yong.huang@smartx.com wrote:
> > > From: Hyman Huang <yong.huang@smartx.com>
> > >
> > > When saving the domain status in the path of the virDomainObjSave
> > > function, Libvirtd will omit the private data if the format
> > > interface is not provided.
> > >
> > > When the Libvirtd service is starting up and the driver has not
> > > yet been initialized, this behavior is justified.
> > >
> > > However, when the service is shutting down, the driver is being
> > > finalized and the interface is being released, while a migration job
> > > or another job may call qemuDomainSaveStatus and save the
> > > domain status at the same time. In that case, this behavior
> >
> > I had another look and `virQEMUDriverPrivateDataCallbacks` is a global
> > variable in the qemu driver which is just populated and never changed.
> >
> > This struct is then copied (by assignment) into the virDomainXMLOption
> > instance that is used inside the qemu driver. Thus, while this does have
> > a private copy of the function pointers, the virDomainXMLOption
> > instance lives the whole time the virt drivers live, so it's only ever
> > cleared in virStateCleanup, which in the daemons happens right before
> > exit.
> >
> > Now looking at the formatter code (let's use your patch as an example):
> >
> > ---
> >  src/conf/domain_conf.c | 21 ++++++++++++++++++---
> >  1 file changed, 18 insertions(+), 3 deletions(-)
> >
> > diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
> > index 281846dfbe..74af08e584 100644
> > --- a/src/conf/domain_conf.c
> > +++ b/src/conf/domain_conf.c
> > @@ -29542,9 +29542,24 @@ virDomainObjFormat(virDomainObj *obj,
> >                                obj->deprecations[i]);
> >      }
> >
> > -    if (xmlopt->privateData.format &&
> > -        xmlopt->privateData.format(&buf, obj) < 0)
> > -        return NULL;
> > +    if (xmlopt->privateData.format) {
> > +        if (xmlopt->privateData.format(&buf, obj) < 0)
> >
> > Here you see that the code assumes that 'xmlopt' is *not NULL*. Inside
> > of XMLopt you have only pointers that don't change.
> >
> > As such, the 'format' callback can only be NULL if something
> > overwrote the data inside 'xmlopt', which we don't do anywhere because
> > it doesn't make sense.
> >
> > The other possibility is that 'xmlopt' is used after free, thus was
> > randomly overwritten, but that would be a much different kind of bug and
> > your fix wouldn't fix that at all.
> >
> >
> > > causes the XML to be saved without private information (such as
> > > monitor path and qemuCaps), which is required for the Libvirtd
> > > service to manage a VM during startup. As a result, after restarting
> > > the Libvirtd service, a running VM that was previously being migrated
> > > might escape management.
> > >
> > > Thus, when formatting the status XML for a running virtual machine,
> > > we need to presume that the "privateData.format" interface is present.
> > >
> > > Add a thorough check in the virDomainObjSave path to make sure that
> > > private data in the status XML file always exists for the running VM
> > > so that we won't lose it after restarting the Libvirtd service.
> >
> > Now this outlines the implications of losing the private data but
> > doesn't really give any information how this happened.
> >
> > Can you please clarify:
> >  1) What did you do to trigger this bug? What was the state of the host?
> >  2) How did you figure out that what you claim happened?
> >  3) Can you reproduce the bug?
> >
> > If you can reproduce the bug please make sure to collect debug logs of
> > libvirtd, and potentially consider running the daemon inside e.g.
> > valgrind to see if it reproduces with such heavy instrumentation.
> >
> >
> 1. Add a debug print patch to Libvirtd: if privateData.format does not
> exist, log it:

[...]

> @@ -630,9 +631,14 @@ virDomainObjListLoadAllConfigs(virDomainObjListPtr doms,
>              virDomainObjEndAPI(&dom);
>          } else {
>              VIR_ERROR(_("Failed to load config for domain '%s'"),
> entry->d_name);
> +            if (liveStatus && liveStatusErrorFatal) {
> +                ret = -1;
> +                goto end;
> +            }
>          }
>      }
>
> +end:
>      VIR_DIR_CLOSE(dir);
>      virObjectRWUnlock(doms);

[..]


>
> 3. Launch the migration and use "systemctl restart libvirt" to restart
> Libvirtd once after the migration enters the perform phase.

[...]

Okay so my understanding from your description is that an (early
startup) failure in virDomainObjListLoadAllConfigs() (and surrounding
code) can result in the daemon shutting down before the threads handling
the already loaded(? ... impossible to tell with the abbreviated log
below) domains terminate? Right?

Yes, in our productized Libvirt, an early failure in virDomainObjListLoadAllConfigs()
can result in the daemon shutting down.

In upstream Libvirt, the daemon starts up successfully but fails to manage the VM
(virDomainObjListLoadAllConfigs reports an error for that domain because the private
data is missing from the status XML).
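
For reference, this is roughly how the loop behaves upstream (a simplified paraphrase based on
the hunk quoted above, not verbatim source): the failure is only logged and startup continues
without that VM.

    /* Simplified paraphrase of the loop in virDomainObjListLoadAllConfigs()
     * (src/conf/virdomainobjlist.c); abbreviated, not verbatim upstream code. */
    while ((ret = virDirRead(dir, &entry, configDir)) > 0) {
        ...
        if (dom) {
            virDomainObjEndAPI(&dom);
        } else {
            /* Upstream only logs the failure and keeps going, so the
             * running VM is dropped from the list and is no longer
             * managed after the daemon restarts. */
            VIR_ERROR(_("Failed to load config for domain '%s'"),
                      entry->d_name);
        }
    }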

 

Thus the other threads trigger a use-after-free on the driver object?

Anyways I think it's clear now that just checking if the callbacks are
present doesn't make sense.

Additionally there's now an upstream issue https://gitlab.com/libvirt/libvirt/-/issues/814
which seems to claim a use-after-free on a different code path but still
triggered by the cleanup code freeing private data.

Unfortunately I didn't get any logs or backtrace there either.

I'll look into the shutdown code path and see if I can figure it out.

>
>
> 4. Search the log message:
>
> $ cat /var/log/zbs/libvirtd.log |egrep "PrivateData formatter driver does
> not exist|remoteDispatchDomainMigratePerform3Params"
> 2025-09-22 03:06:12.517+0000: 1124258: debug : virThreadJobSet:94 : Thread
> 1124258 (rpc-worker) is now running job
> remoteDispatchDomainMigratePerform3Params

This log indicates that thread 1124258 is now executing remoteDispatchDomainMigratePerform3Params.

> 2025-09-22 03:06:12.517+0000: 1124258: debug :
> remoteDispatchDomainMigratePerform3ParamsHelper:8804 :
> server=0x556317979660 client=0x55631799eff0 msg=0x55631799c010
> rerr=0x7f08c688b9c0 args=0x7f08a800a820 ret=0x7f08a80053b0
> 2025-09-22 03:06:21.959+0000: 1124258: warning : virDomainObjFormat:30190 :
> PrivateData formatter driver does not exist

In the execution path of remoteDispatchDomainMigratePerform3Params, this code is reached and the
warning message is logged, whereas during a successful migration the following warning is never logged:


+    if (!xmlopt->privateData.format) {
+        VIR_WARN("PrivateData formatter driver does not exist");
+    }
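
For context, this is roughly where that warning sits relative to the existing check in
virDomainObjFormat() in our instrumented 6.2.0-based tree (a sketch based on the hunks quoted
earlier; virDomainObjFormat:30190 in the log above is the line of this VIR_WARN in our tree):

    /* Sketch of the instrumented virDomainObjFormat(): the warning fires
     * whenever the formatter callback is gone, and the existing check
     * below then silently skips the private data. */
    if (!xmlopt->privateData.format) {
        VIR_WARN("PrivateData formatter driver does not exist");
    }

    if (xmlopt->privateData.format &&
        xmlopt->privateData.format(&buf, obj) < 0)
        return NULL;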

The following info shows the backtrace of virDomainObjFormat in a successful migration:

#0  virDomainObjFormat (obj=obj@entry=0x7fa3342598e0, xmlopt=0x7fa3341c54b0, flags=flags@entry=313) at ../../src/conf/domain_conf.c:30166
#1  0x00007fa395ae8684 in virDomainObjSave (obj=obj@entry=0x7fa3342598e0, xmlopt=<optimized out>, statusDir=0x7fa33412aec0 "/run/libvirt/qemu") at ../../src/conf/domain_conf.c:30375
#2  0x00007fa37dde816c in qemuDomainObjSaveStatus (driver=driver@entry=0x7fa334115690, obj=obj@entry=0x7fa3342598e0) at ../../src/qemu/qemu_domain.c:9669
#3  0x00007fa37ddefb0d in qemuDomainObjSetJobPhase (driver=driver@entry=0x7fa334115690, obj=obj@entry=0x7fa3342598e0, phase=phase@entry=3) at ../../src/qemu/qemu_domain.c:9727
#4  0x00007fa37de37900 in qemuMigrationJobSetPhase (driver=driver@entry=0x7fa334115690, vm=vm@entry=0x7fa3342598e0, phase=phase@entry=QEMU_MIGRATION_PHASE_PERFORM3)
    at ../../src/qemu/qemu_migration.c:5747
#5  0x00007fa37de41470 in qemuMigrationSrcPerformPeer2Peer3 (flags=<optimized out>, useParams=<optimized out>, bandwidth=0, migParams=0x7fa374011b80, nbdPort=0, migrate_disks_detect_zeroes=0x0,
    migrate_disks=0x0, nmigrate_disks=0, listenAddress=<optimized out>, graphicsuri=0x0, uri=0x7fa37401ccf0 "tcp://10.123.82.12:49152", dname=0x0, persist_xml=0x0, xmlin=<optimized out>,
    vm=0x7fa3342598e0, dconnuri=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", dconn=0x7fa3740148b0, sconn=0x7fa37400eef0, driver=0x7fa334115690) at ../../src/qemu/qemu_migration.c:4523
#6  qemuMigrationSrcPerformPeer2Peer (v3proto=<synthetic pointer>, resource=0, dname=0x0, flags=<optimized out>, migParams=0x7fa374011b80, nbdPort=0, migrate_disks_detect_zeroes=0x0, migrate_disks=0x0,
    nmigrate_disks=0, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dconnuri=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", persist_xml=0x0, xmlin=<optimized out>,
    vm=0x7fa3342598e0, sconn=0x7fa37400eef0, driver=0x7fa334115690) at ../../src/qemu/qemu_migration.c:4835
#7  qemuMigrationSrcPerformJob (driver=driver@entry=0x7fa334115690, conn=conn@entry=0x7fa37400eef0, vm=vm@entry=0x7fa3342598e0,
    xmlin=xmlin@entry=0x7fa374016570 "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<domain type=\"kvm\" xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\"><name>88030027-a5ce-4c19-aacc-5f1decf0b647</name><uuid>6c677d6f-d140-48ca-b4ae-aff7fa7"..., persist_xml=persist_xml@entry=0x0, dconnuri=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", uri=0x7fa374003a30 "tcp://10.123.82.12",
    graphicsuri=0x0, listenAddress=0x0, nmigrate_disks=0, migrate_disks=0x0, migrate_disks_detect_zeroes=0x0, nbdPort=0, migParams=0x7fa374011b80, cookiein=0x0, cookieinlen=0, cookieout=0x7fa3871578c8,
    cookieoutlen=0x7fa3871578bc, flags=143899, dname=0x0, resource=0, v3proto=<optimized out>) at ../../src/qemu/qemu_migration.c:4911
#8  0x00007fa37de4215b in qemuMigrationSrcPerform (driver=driver@entry=0x7fa334115690, conn=0x7fa37400eef0, vm=0x7fa3342598e0,
    xmlin=0x7fa374016570 "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<domain type=\"kvm\" xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\"><name>88030027-a5ce-4c19-aacc-5f1decf0b647</name><uuid>6c677d6f-d140-48ca-b4ae-aff7fa7"..., persist_xml=0x0, dconnuri=dconnuri@entry=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", uri=0x7fa374003a30 "tcp://10.123.82.12", graphicsuri=0x0,
    listenAddress=0x0, nmigrate_disks=0, migrate_disks=0x0, migrate_disks_detect_zeroes=0x0, nbdPort=0, migParams=0x7fa374011b80, cookiein=0x0, cookieinlen=0, cookieout=0x7fa3871578c8,
    cookieoutlen=0x7fa3871578bc, flags=143899, dname=0x0, resource=0, v3proto=true) at ../../src/qemu/qemu_migration.c:5101
#9  0x00007fa37de73f98 in qemuDomainMigratePerform3Params (dom=0x7fa374013bc0, dconnuri=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", params=<optimized out>, nparams=<optimized out>,
    cookiein=0x0, cookieinlen=0, cookieout=0x7fa3871578c8, cookieoutlen=0x7fa3871578bc, flags=143899) at ../../src/qemu/qemu_driver.c:12742
#10 0x00007fa395bde0f8 in virDomainMigratePerform3Params (domain=domain@entry=0x7fa374013bc0, dconnuri=0x7fa374014350 "qemu+tls://10.123.82.12/system?no_verify=1", params=0x7fa374010510, nparams=5,
    cookiein=0x0, cookieinlen=0, cookieout=0x7fa3871578c8, cookieoutlen=0x7fa3871578bc, flags=143899) at ../../src/libvirt-domain.c:4989
#11 0x000055c9014bfb1e in remoteDispatchDomainMigratePerform3Params (server=0x55c9033c4660, msg=0x55c9033ef470, ret=0x7fa3740142c0, args=0x7fa374011ee0, rerr=0x7fa3871579c0, client=<optimized out>)
    at ../../src/remote/remote_daemon_dispatch.c:5736
#12 remoteDispatchDomainMigratePerform3ParamsHelper (server=0x55c9033c4660, client=<optimized out>, msg=0x55c9033ef470, rerr=0x7fa3871579c0, args=0x7fa374011ee0, ret=0x7fa3740142c0)
    at ./remote/remote_daemon_dispatch_stubs.h:8805
#13 0x00007fa395b791ad in virNetServerProgramDispatchCall (msg=0x55c9033ef470, client=0x55c9033ee120, server=0x55c9033c4660, prog=0x55c9033ceb90) at ../../src/rpc/virnetserverprogram.c:430
#14 virNetServerProgramDispatch (prog=0x55c9033ceb90, server=server@entry=0x55c9033c4660, client=0x55c9033ee120, msg=0x55c9033ef470) at ../../src/rpc/virnetserverprogram.c:302
#15 0x00007fa395b7e142 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x55c9033c4660) at ../../src/rpc/virnetserver.c:137
#16 virNetServerHandleJob (jobOpaque=0x55c9033dff40, opaque=0x55c9033c4660) at ../../src/rpc/virnetserver.c:154
#17 0x00007fa395a92330 in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:163
#18 0x00007fa395a91967 in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:233
#19 0x00007fa394ed1f1b in ?? () from /usr/lib64/libpthread.so.0
#20 0x00007fa394de92e0 in clone () from /usr/lib64/libc.so.6

 
> 2025-09-22 03:06:25.141+0000: 1124258: warning : virDomainObjFormat:30190 :
> PrivateData formatter driver does not exist
> 2025-09-22 03:06:25.141+0000: 1124258: warning : virDomainObjFormat:30190 :
> PrivateData formatter driver does not exist
> 2025-09-22 03:06:25.153+0000: 1124258: warning : virDomainObjFormat:30190 :
> PrivateData formatter driver does not exist 
> 2025-09-22 03:06:25.153+0000: 1124258: debug : virThreadJobClear:119 :
> Thread 1124258 (rpc-worker) finished job
> remoteDispatchDomainMigratePerform3Params with ret=-1

This log is so abbreviated that it's useless. Please post the full thing
somewhere.

:( Since we are focusing on the shutdown code of Libvirtd, getting the backtrace is not easy, so I added
the debug patch.

Additionally if you can reproduce this without the patch I'd be
interested in that log as well.

Yes, I reproduced this with Libvirt 6.2.0. The latest upstream version uses the same logic, so I
assume it also has this issue, and reproducing it should not be hard.

>
> The migration rpc-worker routine prints the message
>
> 5. Here we see the status XML:
>
> <domstatus state='running' reason='migrated' pid='1092202'>
>   <taint flag='custom-argv'/>
>   <taint flag='custom-ga-command'/>
>   <domain type='kvm' id='1' xmlns:qemu='
> http://libvirt.org/schemas/domain/qemu/1.0'>
>     <name>88030027-a5ce-4c19-aacc-5f1decf0b647</name>
>     <uuid>6c677d6f-d140-48ca-b4ae-aff7fa7e4c50</uuid>
>     <description>SmartX</description>
>     <maxMemory slots='255' unit='KiB'>4194304000</maxMemory>
> ...
> It does not contain private data.
>
> 6. Finally, in our case, the Libvirtd service cannot start up; and in the
> upstream case, the VM logically escapes management after startup.
>
>
> --
> Best regards



--
Best regards