[libvirt-users] Bug? virGetLastError() returns NULL after virDomainCreate() if Domain running already

Hi, As the subject line indicates, when I'm calling virDomainCreate() to start a defined domain, and that command fails because the domain is already running at that point, virGetLastError() will return NULL, instead of a proper pointer to an error code. Libvirt will, however, print an error to the console: libvir: QEMU error : Requested operation is not valid: domain is already running This looks like a bug to me, and I'm fairly certain that earlier version would return an error code of VIR_ERR_OPERATION_INVALID at that point. Or maybe that was just virDomainDestroy(), and I'm just blindly assuming that virDomainCreate() will work analogously... Anyway, no error code is bad, because I can't tell what happened. This is version 0.9.8. I couldn't find anything about this issue in the changelogs for 0.9.9 and 0.9.10, so I'm assuming the issue has not been addressed since. Guido

On 02/22/2012 07:06 AM, Guido Winkelmann wrote:
Hi,
As the subject line indicates, when I'm calling virDomainCreate() to start a defined domain, and that command fails because the domain is already running at that point, virGetLastError() will return NULL, instead of a proper pointer to an error code. Libvirt will, however, print an error to the console:
libvir: QEMU error : Requested operation is not valid: domain is already running
Any return of NULL without a last error set is a bug. I'm trying to chase this down, and hope to have a patch soon.
This looks like a bug to me, and I'm fairly certain that earlier version would return an error code of VIR_ERR_OPERATION_INVALID at that point. Or maybe that was just virDomainDestroy(), and I'm just blindly assuming that virDomainCreate() will work analogously...
Anyway, no error code is bad, because I can't tell what happened.
If the create xml matches the already running xml, we could perhaps argue that this should return the already-running domain instead of NULL. But that seems like it might be more confusing, as we must not allow a second create attempt with xml that doesn't match what is currently running (you have to use hotplug apis to alter a running domain xml, not re-creation). -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 02/22/2012 11:16 AM, Eric Blake wrote:
On 02/22/2012 07:06 AM, Guido Winkelmann wrote:
Hi,
As the subject line indicates, when I'm calling virDomainCreate() to start a defined domain, and that command fails because the domain is already running at that point, virGetLastError() will return NULL, instead of a proper pointer to an error code. Libvirt will, however, print an error to the console:
libvir: QEMU error : Requested operation is not valid: domain is already running
Any return of NULL without a last error set is a bug. I'm trying to chase this down, and hope to have a patch soon.
I can't reproduce this. I tested on libvirt 0.9.10, starting with an inactive domain 'dom' and using 'virsh dumpxml dom > dom.xml; virsh create dom.xml; virsh create dom.xml', and the second create gave me a proper error message of "Requested operation is not valid: domain is already active as 'dom'". Can you post the code snippet you are using to get this situation? Could it be that you have a different name and/or UUID in the xml from the domain already running (my test obviously reused the same name and uuid from what is already running). -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Am Mittwoch, 22. Februar 2012, 11:26:50 schrieb Eric Blake:
On 02/22/2012 11:16 AM, Eric Blake wrote:
On 02/22/2012 07:06 AM, Guido Winkelmann wrote:
Hi,
As the subject line indicates, when I'm calling virDomainCreate() to start a defined domain, and that command fails because the domain is already running at that point, virGetLastError() will return NULL, instead of a proper pointer to an error code. Libvirt will, however, print an error to the console:
libvir: QEMU error : Requested operation is not valid: domain is already running
Any return of NULL without a last error set is a bug. I'm trying to chase this down, and hope to have a patch soon.
I can't reproduce this. I tested on libvirt 0.9.10, starting with an inactive domain 'dom' and using 'virsh dumpxml dom > dom.xml; virsh create dom.xml; virsh create dom.xml', and the second create gave me a proper error message of "Requested operation is not valid: domain is already active as 'dom'".
I cannot reproduce it with virsh either. The problem occured with the C API, maybe virsh does some extra testing before start a domain? Also note that my case did not involve redefining the domain with a new XML definition, just starting a domain that had already been defined before.
Can you post the code snippet you are using to get this situation?
Unfortunately, the problem occurs inside a fairly large, and unfortunately not Open-Source, C++ program. I will try to reduce it to a minimal testcase as soon as I get around to it.
Could it be that you have a different name and/or UUID in the xml from the domain already running (my test obviously reused the same name and uuid from what is already running).
No. I'm essentially just using virDomainLookupByName() and then calling virDomainCreate() on the returned value. I'm not even touching any uuids. Guido

Am Donnerstag, 23. Februar 2012, 13:40:03 I wrote:
Am Mittwoch, 22. Februar 2012, 11:26:50 schrieb Eric Blake:
Can you post the code snippet you are using to get this situation?
Unfortunately, the problem occurs inside a fairly large, and unfortunately not Open-Source, C++ program. I will try to reduce it to a minimal testcase as soon as I get around to it.
Okay, I wrote a minimal testcase that does just the operations that lead to the problem, only to find out that this won't reproduce the bug either :-\ That means either I'm inadvertently triggering some rare corner case here who knows where in my code, or the bug is really in my code. I don't think it's the latter, though, because the relevant part of my code looks like this: if(virDomainCreate(vserver.get_virdomain().get()) != 0) { // TODO clean up pointer virErrorPtr error = virGetLastError(); if(error != 0) { syslog(LOG_ERR, "Could not start vserver, error code: %d", error->code); // VIR_ERR_OPERATION_INVALID means the vm was already in a started state if(error->code != VIR_ERR_OPERATION_INVALID) throw std::runtime_error("Could not start vserver"); } else { syslog(LOG_ERR, "Could not start vserver, no error code from libvirt"); throw std::runtime_error("Could not start vserver, no error code from libvirt"); } } There's nothing in there between calling virDomainCreate() and virGetLastError(). There are no other threads, either. Anyway, I've fired up gdb now to see if I can find out where the error object gets lost. Guido

Am Freitag, 24. Februar 2012, 18:21:37 schrieb Guido Winkelmann:
Am Donnerstag, 23. Februar 2012, 13:40:03 I wrote:
Am Mittwoch, 22. Februar 2012, 11:26:50 schrieb Eric Blake:
Can you post the code snippet you are using to get this situation?
Unfortunately, the problem occurs inside a fairly large, and unfortunately not Open-Source, C++ program. I will try to reduce it to a minimal testcase as soon as I get around to it.
Okay, I wrote a minimal testcase that does just the operations that lead to the problem, only to find out that this won't reproduce the bug either :-\
That means either I'm inadvertently triggering some rare corner case here who knows where in my code, or the bug is really in my code. I don't think it's the latter, though, because the relevant part of my code looks like this:
if(virDomainCreate(vserver.get_virdomain().get()) != 0) { // TODO clean up pointer virErrorPtr error = virGetLastError(); if(error != 0) { syslog(LOG_ERR, "Could not start vserver, error code: %d", error->code); // VIR_ERR_OPERATION_INVALID means the vm was already in a started state if(error->code != VIR_ERR_OPERATION_INVALID) throw std::runtime_error("Could not start vserver"); } else { syslog(LOG_ERR, "Could not start vserver, no error code from libvirt"); throw std::runtime_error("Could not start vserver, no error code from libvirt"); } }
There's nothing in there between calling virDomainCreate() and virGetLastError(). There are no other threads, either.
Anyway, I've fired up gdb now to see if I can find out where the error object gets lost.
I found out what is happening here: I am using a C++ wrapper class for virDomainPtr, to make it fit better with the RAII principles in that language. That means that, in the above code snippet, even though it's not actually visible because it happens in the constructor and destructor of a temporary object, the virDomainPtr for domain gets copied once, virDomainRef() gets called on it, and after the attempt to start the domain, virDomainFree() gets called. The problem is that virDomainFree() will unconditionally call virResetLastError(). With that info, I could work around the problem, however, I still think that, since virDomainFree() is a reference-counting free, it should only call virResetLastError() when the reference count has actually reached 0. In fact, it should probably even check whether the last error even belongs to the domain being deleted. Guido

On 02/28/2012 03:57 AM, Guido Winkelmann wrote:
There's nothing in there between calling virDomainCreate() and virGetLastError(). There are no other threads, either.
Anyway, I've fired up gdb now to see if I can find out where the error object gets lost.
I found out what is happening here:
I am using a C++ wrapper class for virDomainPtr, to make it fit better with the RAII principles in that language. That means that, in the above code snippet, even though it's not actually visible because it happens in the constructor and destructor of a temporary object, the virDomainPtr for domain gets copied once, virDomainRef() gets called on it, and after the attempt to start the domain, virDomainFree() gets called.
The problem is that virDomainFree() will unconditionally call virResetLastError().
Correct. _Every_ call to a libvirt API resets last error, _on entry_.
With that info, I could work around the problem, however, I still think that, since virDomainFree() is a reference-counting free, it should only call virResetLastError() when the reference count has actually reached 0. In fact, it should probably even check whether the last error even belongs to the domain being deleted.
Not possible - it is not known on entry whether this is the last reference being freed. We've already fixed this sort of bug in libvirt itself in the past, where the code in daemon/remote.c wasn't properly passing errors back to the caller because the daemon glue code was calling intermediate functions like virDomainFree(). Your wrapper function that calls virDomainFree() on the temporary object needs to save the error beforehand and restore it afterwords. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
participants (2)
-
Eric Blake
-
Guido Winkelmann