[PATCH 0 of 2] Move to per-classname connections, without the race

This should fix the behavior Heidi and I were seeing with connect_by_classname() Comments and test feedback appreciated :)

# HG changeset patch # User Dan Smith <danms@us.ibm.com> # Date 1195154883 28800 # Node ID ecc46606e90388217b94916c1eab2ceeb051064a # Parent 670294145fcc1a786594f5110e2248cd4ec0b7ed Initialize libvirt early so that we avoid the backend register race This fixes the errant behavior we see with parallel calls to connect_by_classname(). This behavior was certainly possible with lv_connect(), but I imagine we were getting lucky most of the time. Signed-off-by: Dan Smith <danms@us.ibm.com> diff -r 670294145fcc -r ecc46606e903 src/Virt_ComputerSystem.c --- a/src/Virt_ComputerSystem.c Wed Nov 14 16:18:00 2007 -0800 +++ b/src/Virt_ComputerSystem.c Thu Nov 15 11:28:03 2007 -0800 @@ -688,7 +688,7 @@ Virt_ComputerSystemProvider_Create_Insta const CMPIContext *, CMPIStatus *rc); -CMInstanceMIStub(, Virt_ComputerSystemProvider, _BROKER, CMNoHook); +CMInstanceMIStub(, Virt_ComputerSystemProvider, _BROKER, virInitialize()); static struct method_handler RequestStateChange = { .name = "RequestStateChange",

On Thu, Nov 15, 2007 at 11:28:35AM -0700, Dan Smith wrote:
# HG changeset patch # User Dan Smith <danms@us.ibm.com> # Date 1195154883 28800 # Node ID ecc46606e90388217b94916c1eab2ceeb051064a # Parent 670294145fcc1a786594f5110e2248cd4ec0b7ed Initialize libvirt early so that we avoid the backend register race
Yup, if you don't call Initialize() first multiple parallel connect operations may race trying to do the initialization simultaneously. We actually have locks in libvirt (or rather we reuse the ones from libxml2) so we could try to do some locking in Initialize, but I really prefer a model where the init is clearly separated. Looks fine to me even if I don't understand the details of how that patch works. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Dan Smith wrote:
# HG changeset patch # User Dan Smith <danms@us.ibm.com> # Date 1195154883 28800 # Node ID ecc46606e90388217b94916c1eab2ceeb051064a # Parent 670294145fcc1a786594f5110e2248cd4ec0b7ed Initialize libvirt early so that we avoid the backend register race
This fixes the errant behavior we see with parallel calls to connect_by_classname(). This behavior was certainly possible with lv_connect(), but I imagine we were getting lucky most of the time.
Signed-off-by: Dan Smith <danms@us.ibm.com>
I was also seeing the same issue after applying the connect_by_classname(). I'm using Pegasus with forceProviderProcesses=true. I would see the issue the first call after restarting Pegasus. After applying this patch, I haven't seen the error. I've tried restarting Pegasus about 10 times, no error yet. -- Kaitlin Rupert IBM Linux Technology Center karupert@us.ibm.com

Dan Smith wrote:
# HG changeset patch # User Dan Smith <danms@us.ibm.com> # Date 1195154883 28800 # Node ID ecc46606e90388217b94916c1eab2ceeb051064a # Parent 670294145fcc1a786594f5110e2248cd4ec0b7ed Initialize libvirt early so that we avoid the backend register race
This fixes the errant behavior we see with parallel calls to connect_by_classname(). This behavior was certainly possible with lv_connect(), but I imagine we were getting lucky most of the time.
Signed-off-by: Dan Smith <danms@us.ibm.com>
I was also seeing the same issue after applying the connect_by_classname(). I'm using Pegasus with forceProviderProcesses=true. I would see the issue the first call after restarting Pegasus. After applying this patch, I haven't seen the error. I've tried restarting Pegasus about 10 times, no error yet. :( ... I'm still getting this error and reproducible for each request (not only the first one) and each CIMOM (pegasus and sfcb). Both patches have been applied to the latest tree from repository. FC6, Centrino Duo, 32 bit sfcb 1.2.4, Pegasus tog-pegasus-2.5.2-2.fc6, libvirt-0.3.1-1.fc6,
Kaitlin Rupert wrote: libxml2-2.6.29-1.fc6 But I found the reason for this: I do only have KVM on my machine and the error message is the result of the Xen_ComputerSystem request. A CIMOM does not have the possibility to return different return codes, meaning the error for Xen_ComputerSystem and the instances for KVM_ComputerSystem. In the case where one request returned with an error, this one is reported back to the client and all returned instances are revoked. The attached patch fixes this problem by adding debug messages instead of setting a FAILED status, when trying to connect to libvirt. The current implementation does also handle it like this, with the difference that an attempt to connect to libvirt is not necessary, as lv_connect already checks before, which hypervisor type is available. -- Regards Heidi Eckhart Software Engineer Linux Technology Center - Open Hypervisor heidieck@linux.vnet.ibm.com ************************************************** IBM Deutschland Entwicklung GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschaeftsfuehrung: Herbert Kircher Sitz der Gesellschaft: Boeblingen Registergericht: Amtsgericht Stuttgart, HRB 243294 Try to fix initialization error with connect_by_classname Signed-off-by: Heidi Eckhart <heidieck@linux.vnet.ibm.com> diff -r aeae495924c5 libxkutil/misc_util.c --- a/libxkutil/misc_util.c Thu Nov 15 11:28:04 2007 -0800 +++ b/libxkutil/misc_util.c Fri Nov 16 13:13:21 2007 +0100 @@ -56,11 +56,11 @@ virConnectPtr connect_by_classname(const const char *uri; virConnectPtr conn; + CMSetStatus(s, CMPI_RC_OK); + uri = cn_to_uri(classname); if (!uri) { - CMSetStatusWithChars(broker, s, - CMPI_RC_ERR_FAILED, - "Unable to generate URI from classname"); + CU_DEBUG("Provider is not repsponsibel for %s", classname); return NULL; } @@ -68,9 +68,7 @@ virConnectPtr connect_by_classname(const conn = virConnectOpen(uri); if (!conn) { - CMSetStatusWithChars(broker, s, - CMPI_RC_ERR_FAILED, - "Unable to connect to hypervisor"); + CU_DEBUG("Requested Hypervisor type is not running on this system: %s", uri); return NULL; } diff -r aeae495924c5 src/Virt_ComputerSystem.c --- a/src/Virt_ComputerSystem.c Thu Nov 15 11:28:04 2007 -0800 +++ b/src/Virt_ComputerSystem.c Fri Nov 16 13:16:37 2007 +0100 @@ -343,9 +343,6 @@ static CMPIStatus return_enum_domains(co virConnectPtr conn = NULL; int ret; - if (!provider_is_responsible(_BROKER, reference, &s)) - return s; - conn = connect_by_classname(_BROKER, CLASSNAME(reference), &s); if (conn == NULL) return s;

HE> A CIMOM does not have the possibility to return different return HE> codes, meaning the error for Xen_ComputerSystem and the instances HE> for KVM_ComputerSystem. In the case where one request returned HE> with an error, this one is reported back to the client and all HE> returned instances are revoked. Ah, okay. It seems a little weird to solve this by not returning failure if you try to enumerate a class that can't possibly exist on a particular system. However, I'm not sure there is any reasonable way around this. I'll wrap this into my patch set with some other changes I think I need to make and re-submit to the list. Thanks Heidi! -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
HE> A CIMOM does not have the possibility to return different return HE> codes, meaning the error for Xen_ComputerSystem and the instances HE> for KVM_ComputerSystem. In the case where one request returned HE> with an error, this one is reported back to the client and all HE> returned instances are revoked.
Ah, okay. It seems a little weird to solve this by not returning failure if you try to enumerate a class that can't possibly exist on a particular system. That's how the providers also behave with the lv_connect() approach. However, I'm not sure there is any reasonable way around this.
The question a client is asking is - are there instances of Xen_ComputerSystems on this machine. If there are none, than a result of 0 instances is valid. If the client is interested in the question "why there are none", then he has to ask the VirtualSystemManagementService, if Xen is running or not. If Xen is not running on this machine, then also here no instance for Xen is returned. Then the client knows, that Xen is not available on this machine and all requests against Xen classes will return 0 instances. If VirtualSystemManagementService returns an instance for Xen, then the client knows, that currently no Xen_ComputerSystems are defined, but he can work on this machine to create one/some Xen instances. So the question for a provider is - are there instances of the class I'm responsible for or not ? This one is answered with yes (returning instances) or no (returning no instances). The reason why there are no instances, is resolved by the client itself with the evaluation of the model for existing and non-existing instances. This means for the provider programmer that he has to tolerate a - at a first view - light-weight error reporting. To answer your question - no, there is no reasonable way around this ;). Its the clients responsibility to traverse and interpret the model and instances. -- Regards Heidi Eckhart Software Engineer Linux Technology Center - Open Hypervisor heidieck@linux.vnet.ibm.com ************************************************** IBM Deutschland Entwicklung GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschaeftsfuehrung: Herbert Kircher Sitz der Gesellschaft: Boeblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

HE> If there are none, than a result of 0 instances is valid. If the HE> client is interested in the question "why there are none", then he HE> has to ask the VirtualSystemManagementService, if Xen is running HE> or not. Okay, then I guess we can go forward with converting the rest of the classes as long as we mimic this behavior elsewhere. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

# HG changeset patch # User Dan Smith <danms@us.ibm.com> # Date 1195154884 28800 # Node ID ba709792befea803190657550854716fd0e21aec # Parent ecc46606e90388217b94916c1eab2ceeb051064a [RFC] Transition back to connect_by_classname() Previously, we used the incoming reference for all operations to determine the URI to pass to libvirt during a connect operation. For some reason, we transitioned away from that to a detection-based mechanism, which would not support multiple technologies to be managed from the provider simultaneously. I think that we have enough other infrastructure in place now that this is no longer a problem. As a test, I offer this patch with a few key instances of lv_connect() replaced with connect_by_classname() for discussion. I have tested that enumeration of Xen_ComputerSystem, KVM_ComputerSystem, and CIM_ComputerSystem are exclusive, and do not return any duplicate results. Further, I have confirmed that resolving Xen_SystemDevice against a KVM_ComputerSystem instance does not work, and that no duplicate results are returned in the case of CIM_SystemDevice. If this looks okay to people (and wider testing supports my findings), I propose we start to convert uses of lv_connect() back to per-class connections until we encounter a problem. Comments welcome. Signed-off-by: Dan Smith <danms@us.ibm.com> diff -r ecc46606e903 -r ba709792befe libxkutil/misc_util.c --- a/libxkutil/misc_util.c Thu Nov 15 11:28:03 2007 -0800 +++ b/libxkutil/misc_util.c Thu Nov 15 11:28:04 2007 -0800 @@ -64,6 +64,8 @@ virConnectPtr connect_by_classname(const return NULL; } + CU_DEBUG("Connecting to libvirt with uri `%s'", uri); + conn = virConnectOpen(uri); if (!conn) { CMSetStatusWithChars(broker, s, @@ -334,28 +336,21 @@ bool provider_is_responsible(const CMPIB const CMPIObjectPath *reference, CMPIStatus *status) { - const char *dft_pfx; char *pfx; - bool rc = false; + bool rc = true; CMSetStatus(status, CMPI_RC_OK); pfx = class_prefix_name(CLASSNAME(reference)); - if (STREQC(pfx, "CIM")) + if (STREQC(pfx, "CIM")) { cu_statusf(broker, status, CMPI_RC_ERR_FAILED, "Please exactly specify the class (check CIMOM behavior!): %s", CLASSNAME(reference)); - - dft_pfx = default_prefix(); - if (dft_pfx == NULL) - goto out; - - if (STREQC(pfx, dft_pfx)) - rc = true; - - out: + rc = false; + } + free(pfx); return rc; } diff -r ecc46606e903 -r ba709792befe src/Virt_ComputerSystem.c --- a/src/Virt_ComputerSystem.c Thu Nov 15 11:28:03 2007 -0800 +++ b/src/Virt_ComputerSystem.c Thu Nov 15 11:28:04 2007 -0800 @@ -346,7 +346,7 @@ static CMPIStatus return_enum_domains(co if (!provider_is_responsible(_BROKER, reference, &s)) return s; - conn = lv_connect(_BROKER, &s); + conn = connect_by_classname(_BROKER, CLASSNAME(reference), &s); if (conn == NULL) return s; @@ -386,7 +386,7 @@ static CMPIStatus get_domain(const CMPIO return s; } - conn = lv_connect(_BROKER, &s); + conn = connect_by_classname(_BROKER, CLASSNAME(reference), &s); if (conn == NULL) return s; @@ -609,7 +609,7 @@ static CMPIStatus __state_change(char *n virDomainPtr dom = NULL; virDomainInfo info; - conn = lv_connect(_BROKER, &s); + conn = connect_by_classname(_BROKER, CLASSNAME(ref), &s); if (conn == NULL) goto out; diff -r ecc46606e903 -r ba709792befe src/Virt_SystemDevice.c --- a/src/Virt_SystemDevice.c Thu Nov 15 11:28:03 2007 -0800 +++ b/src/Virt_SystemDevice.c Thu Nov 15 11:28:04 2007 -0800 @@ -61,6 +61,7 @@ static int get_dom_devices(const char *n static int get_dom_devices(const char *name, struct inst_list *list, int type, + const char *host_cn, const char *ns) { virConnectPtr conn = NULL; @@ -68,7 +69,7 @@ static int get_dom_devices(const char *n CMPIStatus s; int ret = 0; - conn = lv_connect(_BROKER, &s); + conn = connect_by_classname(_BROKER, host_cn, &s); if (conn == NULL) goto out; @@ -87,17 +88,19 @@ static int get_dom_devices(const char *n static int get_all_devices(const char *name, struct inst_list *list, - char *ns) + const char *host_cn, + const char *ns) { int i; for (i = 0; i < DEV_TYPE_COUNT; i++) - get_dom_devices(name, list, device_types[i], ns); + get_dom_devices(name, list, device_types[i], host_cn, ns); return i; } static CMPIInstance *host_instance(char *name, + const char *host_cn, const char *ns) { CMPIInstance *inst = NULL; @@ -114,7 +117,7 @@ static CMPIInstance *host_instance(char if ((s.rc != CMPI_RC_OK) || CMIsNullObject(op)) goto out; - conn = lv_connect(_BROKER, &s); + conn = connect_by_classname(_BROKER, host_cn, &s); if (conn == NULL) goto out; @@ -173,9 +176,16 @@ static CMPIStatus sys_to_dev(const CMPIO type = device_type_from_classname(info->result_class); - ret = get_dom_devices(host, list, type, NAMESPACE(ref)); + ret = get_dom_devices(host, + list, + type, + CLASSNAME(ref), + NAMESPACE(ref)); } else { - ret = get_all_devices(host, list, NAMESPACE(ref)); + ret = get_all_devices(host, + list, + CLASSNAME(ref), + NAMESPACE(ref)); } if (ret >= 0) { @@ -218,6 +228,7 @@ static CMPIStatus dev_to_sys(const CMPIO } sys = host_instance(host, + CLASSNAME(ref), NAMESPACE(ref)); if (sys == NULL)
participants (4)
-
Dan Smith
-
Daniel Veillard
-
Heidi Eckhart
-
Kaitlin Rupert