[Libvir] Bug with libvirt in Xen 3.0.1?

Hi,

I think I may have found a bug in libvirt and wanted to see what people thought. I'm using the stock FC5 installation at the moment (with xen 3.0.1), and the newest version of libvirt. I am noticing that with xen 3.0.1 and newer versions of libvirt, getDomainsID() seems to return bogus values. For example:

[root@test05 ~]# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
libvir: Xen Daemon error : GET operation failed: No such domain 65486
libvir: Xen Daemon error : GET operation failed: No such domain 2986

But, if I use xm, I get what I expect:

[root@test05 ~]# xm list
Name        ID Mem(MiB) VCPUs State  Time(s)
Domain-0     0     1507     2 r-----   387.5
vm1          1      256     1 -b----   151.3
vm2          2      256     1 -b----    87.9

After some digging around in the code, I believe that libvirt is incorrectly identifying the hypervisor as being "old" in xen_internal.c:xenHypervisorInit, and is therefore passing the incorrect parameter structure into the hypervisor when it makes its ioctl in xen_internal.c:xenHypervisorListDomains.

I've tried this same test on a system running xen 3.0.2, and as I expected everything works fine. So, there must be something different about xen 3.0.1 that libvirt is not accounting for.

At this point, I don't really have more time to dig further, but I thought I'd bring up the issue in case someone on this list can offer more insight.

Pete

On Thu, Jul 20, 2006 at 09:16:16AM -0400, pvetere@redhat.com wrote:
Hi, I think I may have found a bug in libvirt and wanted to see what people thought. I'm using the stock FC5 installation at the moment (with xen 3.0.1), and the newest version of libvirt. I am noticing that with xen 3.0.1 and newer versions of libvirt, getDomainsID() seems to return bogus values. For example:
[root@test05 ~]# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
libvir: Xen Daemon error : GET operation failed: No such domain 65486
libvir: Xen Daemon error : GET operation failed: No such domain 2986
But, if I use xm, I get what I expect:
[root@test05 ~]# xm list
Name        ID Mem(MiB) VCPUs State  Time(s)
Domain-0     0     1507     2 r-----   387.5
vm1          1      256     1 -b----   151.3
vm2          2      256     1 -b----    87.9
After some digging around in the code, I believe that libvirt is incorrectly identifying the hypervisor as being "old" in xen_internal.c:xenHypervisorInit, and is therefore passing the incorrect parameter structure into the hypervisor when it makes its ioctl in xen_internal.c:xenHypervisorListDomains.
I assume it's an i386 platform, because the ABI breakage should not show up on x86_64.
I've tried this same test on a system running xen 3.0.2, and as I expected everything works fine. So, there must be something different about xen 3.0.1 that libvirt is not accounting for.
That's possible. If you still have that setup around, could you rerun virsh (as root) under gdb, put a breakpoint in xenHypervisorInit, and check what happens in the first hypervisor call: the values of hc.op and cmd, and the return value (hv_version)? Then check what happens in the second call in that same routine (is it failing too?).
At this point, I don't really have more time to dig further but I thought I'd bring up the issue in case someone on this list can offer more insight.
If you don't have time, that's okay; just bugzilla this so I remember to have a look at the issue. Thanks!

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Quoting Daniel Veillard <veillard@redhat.com>:
On Thu, Jul 20, 2006 at 09:16:16AM -0400, pvetere@redhat.com wrote:
Hi, I think I may have found a bug in libvirt and wanted to see what people thought. I'm using the stock FC5 installation at the moment (with xen 3.0.1), and the newest version of libvirt. I am noticing that with xen 3.0.1 and newer versions of libvirt, getDomainsID() seems to return bogus values.
<snip>
After some digging around in the code, I believe that libvirt is incorrectly identifying the hypervisor as being "old" in xen_internal.c:xenHypervisorInit, and is therefore passing the incorrect parameter structure into the hypervisor when it makes its ioctl in xen_internal.c:xenHypervisorListDomains.
I assume it's an i386 platform, because the ABI breakage should not show up on x86_64.
Yes, that's correct; it's an i386 platform.
I've tried this same test on a system running xen 3.0.2, and as I expected everything works fine. So, there must be something different about xen 3.0.1 that libvirt is not accounting for.
That's possible. If you still have that setup around, could you rerun virsh (as root) under gdb, put a breakpoint in xenHypervisorInit, and check what happens in the first hypervisor call: the values of hc.op and cmd, and the return value (hv_version)? Then check what happens in the second call in that same routine (is it failing too?).
Sure, I can do this. I'll let you know what I find later on today.

Pete

I've tried this same test on a system running xen 3.0.2, and as I expected everything works fine. So, there must be something different about xen 3.0.1 that libvirt is not accounting for.
That's possible. If you still have that setup around, could you rerun virsh (as root) under gdb, put a breakpoint in xenHypervisorInit, and check what happens in the first hypervisor call: the values of hc.op and cmd, and the return value (hv_version)? Then check what happens in the second call in that same routine (is it failing too?).
Sure, I can do this. I'll let you know what I find later on today.
Ok, I had a chance to run libvirt through the debugger as you asked. Here's what I got back:

After the first ioctl:

(gdb) ptype hc
type = struct privcmd_hypercall {
    __u64 op;
    __u64 arg[5];
}
(gdb) print hc.op
$6 = 17
(gdb) print cmd
$7 = 3166208
(gdb) print ret
$8 = -1

Then, after the second ioctl:

(gdb) ptype old_hc
type = struct old_hypercall_struct {
    long unsigned int op;
    long unsigned int arg[5];
}
(gdb) print old_hc.op
$11 = 17
(gdb) print cmd
$12 = 1593344
(gdb) print ret
$13 = 196608

The second ioctl appears to succeed. One additional item of note is that I am running code that was compiled against xen 3.0.2, but is running on 3.0.1. This may be part of the problem.

If there are any other quick tests you'd like me to run, just let me know.

Pete
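The two cmd values above can be read against the usual Linux _IOC encoding of ioctl request numbers, in which bits 16-29 carry the size of the argument structure. The following is only a sketch under that assumption: the helper names and the fixed-width struct models are hypothetical and are not taken from libvirt or Xen. It decodes both numbers and compares the size field with a 64-bit-field model of privcmd_hypercall and a 32-bit model of old_hypercall_struct as an i386 build would lay it out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Field layout of a Linux ioctl request number on x86 (the _IOC scheme):
 * bits 0-7 number, bits 8-15 type, bits 16-29 size, bits 30-31 direction. */
unsigned ioc_nr(unsigned long cmd)   { return (unsigned)(cmd & 0xffu); }
unsigned ioc_type(unsigned long cmd) { return (unsigned)((cmd >> 8) & 0xffu); }
unsigned ioc_size(unsigned long cmd) { return (unsigned)((cmd >> 16) & 0x3fffu); }
unsigned ioc_dir(unsigned long cmd)  { return (unsigned)((cmd >> 30) & 0x3u); }

/* Model of struct privcmd_hypercall as printed by gdb: 64-bit fields. */
struct new_hc_model {
    uint64_t op;
    uint64_t arg[5];
};

/* Hypothetical model of old_hypercall_struct on i386, where
 * "long unsigned int" is 32 bits wide. */
struct old_hc_model_i386 {
    uint32_t op;
    uint32_t arg[5];
};

/* Print the decoded fields of one request number. */
void describe_cmd(const char *label, unsigned long cmd)
{
    printf("%s: cmd=0x%lx type='%c' nr=%u size=%u dir=%u\n",
           label, cmd, (int)ioc_type(cmd), ioc_nr(cmd),
           ioc_size(cmd), ioc_dir(cmd));
}

/* Decode the two values reported in the gdb session and check that the
 * size fields agree with the struct models above. */
void describe_thread_values(void)
{
    describe_cmd("first ioctl ", 3166208UL);
    describe_cmd("second ioctl", 1593344UL);
    printf("sizeof new model:      %zu\n", sizeof(struct new_hc_model));
    printf("sizeof old i386 model: %zu\n", sizeof(struct old_hc_model_i386));
    assert(ioc_size(3166208UL) == sizeof(struct new_hc_model));
    assert(ioc_size(1593344UL) == sizeof(struct old_hc_model_i386));
}
```

Calling describe_thread_values() from a main() shows size fields of 48 and 24 bytes for the first and second request. On x86_64, "long unsigned int" is 64 bits wide, so both layouts would come to 48 bytes there, which seems consistent with the remark that the breakage should not show up on that platform.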

On Thu, Jul 20, 2006 at 04:14:49PM -0400, Peter Vetere wrote:
Ok, I had a chance to run libvirt through the debugger as you asked. Here's what I got back:
After the first ioctl:
(gdb) ptype hc
type = struct privcmd_hypercall {
    __u64 op;
    __u64 arg[5];
}
(gdb) print hc.op
$6 = 17
(gdb) print cmd
$7 = 3166208
(gdb) print ret
$8 = -1
Okay, a failure with the new hypercall structure; that's normal.
Then, after the second ioctl:
(gdb) ptype old_hc
type = struct old_hypercall_struct {
    long unsigned int op;
    long unsigned int arg[5];
}
(gdb) print old_hc.op
$11 = 17
(gdb) print cmd
$12 = 1593344
(gdb) print ret
$13 = 196608
The second ioctl appears to succeed. One additional item of note is
And here it succeeds. Now I have to find out why subsequent hypercalls are failing even though libvirt detected that it was the old interface.
that I am running code that was compiled against xen 3.0.2, but is running on 3.0.1. This may be part of the problem.
Well, I guess compiling against 3.0.1 would not have changed that, but it's another thing to test.
If there are any other quick tests you'd like me to run, just let me know.
I'm afraid that, now that the easy potential error has been ruled out, real debugging will be needed to find out why the other hypercalls are failing. The best course is probably to bugzilla this and add a pointer to this thread for reference (the recompilation against 3.0.1 should be tested too :-)

Thanks again,

Daniel
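The detection sequence discussed in this exchange (try the newer structure first, then fall back to the older one) can be illustrated with a small simulation. This is a sketch only, not libvirt's code: fake_privcmd_ioctl is a hypothetical stand-in for the real ioctl that simply replays the results from the gdb session, and all other names are made up for illustration. It also assumes the returned version word is packed as (major << 16) | minor, which is how I understand Xen's version call to report it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CMD_NEW 3166208UL /* request value seen at the first ioctl */
#define CMD_OLD 1593344UL /* request value seen at the second ioctl */

enum hv_interface { HV_UNKNOWN = 0, HV_OLD, HV_NEW };

/* Hypothetical stand-in for the privcmd ioctl on the Xen 3.0.1 host from
 * the thread: it rejects the newer request and answers the older one with
 * the value gdb showed. */
long fake_privcmd_ioctl(unsigned long cmd)
{
    if (cmd == CMD_OLD)
        return 196608L;
    return -1L;
}

/* Try the newer request first, then the older one. On success the raw
 * version word is stored in *version. */
enum hv_interface probe_interface(long (*do_ioctl)(unsigned long),
                                  long *version)
{
    long ret;

    assert(do_ioctl != NULL && version != NULL);

    ret = do_ioctl(CMD_NEW);
    if (ret >= 0) {
        *version = ret;
        return HV_NEW;
    }
    ret = do_ioctl(CMD_OLD);
    if (ret >= 0) {
        *version = ret;
        return HV_OLD;
    }
    *version = -1;
    return HV_UNKNOWN;
}

/* Assumed packing of the version word: major in the upper 16 bits. */
int version_major(long v) { return (int)((v >> 16) & 0xffff); }
int version_minor(long v) { return (int)(v & 0xffff); }

/* Run the probe against the simulated host and print the outcome. */
void report_probe(void)
{
    long v = 0;
    enum hv_interface kind = probe_interface(fake_privcmd_ioctl, &v);

    printf("interface=%s version=%d.%d\n",
           kind == HV_NEW ? "new" : kind == HV_OLD ? "old" : "unknown",
           version_major(v), version_minor(v));
}
```

Against the simulated host, the probe settles on the old interface, and under the packing assumption 196608 decodes to version 3.0. That matches the observation that detection itself succeeds, so the failures would have to come from the later hypercalls.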

Quoting Daniel Veillard <veillard@redhat.com>: <snip>
I'm afraid now that the easy potential error has been dismissed, it will be a real debug needed to find out why other hypercalls are failing. Best is probably to bugzilla this and add a pointer to this thread for reference (the recompilation against 3.0.1 should be tested too :-)
Ok, bug filed: BZ 199651. I referenced this thread in the bug.

Thanks!

Pete