[libvirt] [RFC] Simplifying usage of {Enter, Exit}Monitor and {Begin/End}Job

Hi all. Looking at the qemu driver code, we have a lot of code with the following pattern: if (doSomething() < 0) goto cleanup; if (qemuDomainObjBeginJob() < 0) goto cleanup; if (doSomethingElse() < 0) goto endjob; qemuDomainObjEnterMonitor(); if (qemuMonitorSomething() < 0) { qemuDomainObjExitMonitor(); goto endjob; } if (qemuMonitorSomethingElse() < 0) { qemuDomainObjExitMonitor(); goto endjob; } if (qemuDomainObjExitMonitor() < 0) goto endjob; ... ret = 0; endjob: qemuDomainObjEndJob(); cleanup: ... return ret; Sometimes qemuDomainObjExitMonitor is in its own label instead of being explicitly called on every single error path. Sometimes qemuDomainObjEndJob is called explicitly followed by goto cleanup. In either case, it's pretty easy to get this wrong. It's too easy to jump to a wrong label (especially when moving code around) or forget to call the appropriate exit function before jumping to a label. So what if we make the code more robust by changing who needs to track whether we are in a monitor or have a job. Now it's the caller that tracks it while I think we could teach the job/monitor APIs themselves to track the state: bool inJob = false; bool inMonitor = false; if (doSomething() < 0) goto cleanup; if (qemuDomainObjBeginJob(..., &inJob) < 0) goto cleanup; if (doSomethingElse() < 0) goto cleanup; qemuDomainObjEnterMonitor(..., &inMonitor); if (qemuMonitorSomething() < 0) goto cleanup; if (qemuMonitorSomethingElse() < 0) goto cleanup; if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) goto cleanup; ... ret = 0; cleanup: if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) ret = -1; qemuDomainObjEndJob(..., &inJob); ... return ret; It's not a lot shorter or longer but it saves us from jumping to different labels and it makes sure we always exit the monitor and end the job. So is it worth the effort or do you thing it's even worse than before or do you have any other ideas? Thanks, Jirka

On Thu, Oct 22, 2015 at 01:47:28PM +0200, Jiri Denemark wrote:
Hi all.
Looking at the qemu driver code, we have a lot of code with the following pattern:
if (doSomething() < 0) goto cleanup;
if (qemuDomainObjBeginJob() < 0) goto cleanup;
if (doSomethingElse() < 0) goto endjob;
qemuDomainObjEnterMonitor();
if (qemuMonitorSomething() < 0) { qemuDomainObjExitMonitor(); goto endjob; }
if (qemuMonitorSomethingElse() < 0) { qemuDomainObjExitMonitor(); goto endjob; }
if (qemuDomainObjExitMonitor() < 0) goto endjob;
...
ret = 0;
endjob: qemuDomainObjEndJob();
cleanup: ... return ret;
Sometimes qemuDomainObjExitMonitor is in its own label instead of being explicitly called on every single error path. Sometimes qemuDomainObjEndJob is called explicitly followed by goto cleanup. In either case, it's pretty easy to get this wrong. It's too easy to jump to a wrong label (especially when moving code around) or forget to call the appropriate exit function before jumping to a label.
So what if we make the code more robust by changing who needs to track whether we are in a monitor or have a job. Now it's the caller that tracks it while I think we could teach the job/monitor APIs themselves to track the state:
bool inJob = false; bool inMonitor = false;
if (doSomething() < 0) goto cleanup;
if (qemuDomainObjBeginJob(..., &inJob) < 0) goto cleanup;
if (doSomethingElse() < 0) goto cleanup;
qemuDomainObjEnterMonitor(..., &inMonitor);
if (qemuMonitorSomething() < 0) goto cleanup;
if (qemuMonitorSomethingElse() < 0) goto cleanup;
if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) goto cleanup;
...
ret = 0;
cleanup: if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) ret = -1; qemuDomainObjEndJob(..., &inJob); ... return ret;
It's not a lot shorter or longer but it saves us from jumping to different labels and it makes sure we always exit the monitor and end the job.
So is it worth the effort or do you thing it's even worse than before or do you have any other ideas?
It's good and similar to what I have though of few times as well. I have just two ideas for making it even better. Firstly, there is one thing we need to fix and that is that bunch of APIs should have a job and they don't. Basically every API shold have a job, even if it's just a QEMU_JOB_QUERY, that's why we have masks that say which jobs are allowed when. So what if we had one function that fetches the domain object and sets a job there? That would solve few mishaps and made the code a bit shorter. Another thing is that since you know whether to call ExitMonitor and EndJob, you can incorporate them into one function, like there was the qemuDomObjEndAPI() if you put the inMonitor and inJob variables somewhere accessible. Later on, if we introduce job control everywhere, we could make it part of virDomainObjEndAPI() and the starting function would be also driver-agnostic. But that's for the future (and we all know the future was yesterday). Anyway, I think it's a good idea and with some polish (lowercase 'p') it'll rock.
Thanks,
Jirka
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Thu, Oct 22, 2015 at 01:47:28PM +0200, Jiri Denemark wrote:
Hi all.
Looking at the qemu driver code, we have a lot of code with the following pattern:
if (doSomething() < 0) goto cleanup;
if (qemuDomainObjBeginJob() < 0) goto cleanup;
if (doSomethingElse() < 0) goto endjob;
qemuDomainObjEnterMonitor();
if (qemuMonitorSomething() < 0) { qemuDomainObjExitMonitor(); goto endjob; }
if (qemuMonitorSomethingElse() < 0) { qemuDomainObjExitMonitor(); goto endjob; }
if (qemuDomainObjExitMonitor() < 0) goto endjob;
...
ret = 0;
endjob: qemuDomainObjEndJob();
cleanup: ... return ret;
Sometimes qemuDomainObjExitMonitor is in its own label instead of being explicitly called on every single error path. Sometimes qemuDomainObjEndJob is called explicitly followed by goto cleanup. In either case, it's pretty easy to get this wrong. It's too easy to jump to a wrong label (especially when moving code around) or forget to call the appropriate exit function before jumping to a label.
So what if we make the code more robust by changing who needs to track whether we are in a monitor or have a job. Now it's the caller that tracks it while I think we could teach the job/monitor APIs themselves to track the state:
bool inJob = false; bool inMonitor = false;
if (doSomething() < 0) goto cleanup;
if (qemuDomainObjBeginJob(..., &inJob) < 0) goto cleanup;
if (doSomethingElse() < 0) goto cleanup;
qemuDomainObjEnterMonitor(..., &inMonitor);
if (qemuMonitorSomething() < 0) goto cleanup;
if (qemuMonitorSomethingElse() < 0) goto cleanup;
if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) goto cleanup;
...
ret = 0;
cleanup: if (qemuDomainObjExitMonitor(..., &inMonitor) < 0) ret = -1; qemuDomainObjEndJob(..., &inJob); ... return ret;
It's not a lot shorter or longer but it saves us from jumping to different labels and it makes sure we always exit the monitor and end the job.
So is it worth the effort or do you thing it's even worse than before or do you have any other ideas?
On a related topic, we don't have great error reporting in the (usually unlikely) scenario that we get a stuck job / timeout. I've long thought it could be desirable to record some metadata when we start jobs, such as the __FUNC__ of the method which started the job, so when we report an error we can include that info as a diagnostic aid. This would have to be against the qemuDomainObjPrivPtr struct. THis makes me think that using the separate bool inJob/inMonitor stack variables is not required. We could just add int threadid; bool inJob; bool inMonitor; const char *jobfunc; to qemuDomainObjPrivPtr. That way you don't need to modify the Enter/Exit functions to add extra arguments - we just track everything internally. When exiting, we'd compare against the threadid, to make sure we don't accidentally relaase a different thread's job. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Oct 22, 2015 at 13:52:29 +0100, Daniel P. Berrange wrote: ...
On a related topic, we don't have great error reporting in the (usually unlikely) scenario that we get a stuck job / timeout. I've long thought it could be desirable to record some metadata when we start jobs, such as the __FUNC__ of the method which started the job, so when we report an error we can include that info as a diagnostic aid.
Do you mean something like virsh # resume cd error: Failed to resume domain cd error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSuspend) This was implemented by v1.2.13-295-gb79f25e
This would have to be against the qemuDomainObjPrivPtr struct. THis makes me think that using the separate bool inJob/inMonitor stack variables is not required.
We could just add
int threadid; bool inJob; bool inMonitor; const char *jobfunc;
to qemuDomainObjPrivPtr. That way you don't need to modify the Enter/Exit functions to add extra arguments - we just track everything internally. When exiting, we'd compare against the threadid, to make sure we don't accidentally relaase a different thread's job.
Yeah, as long as we can make sure threadid is unique and stable: /* These next two functions are for debugging only, since they are not * guaranteed to give unique values for distinct threads on all * architectures, nor are the two functions guaranteed to give the same * value for the same thread. */ unsigned long long virThreadSelfID(void); unsigned long long virThreadID(virThreadPtr thread); so far we avoided using thread IDs for anything critical. Jirka

On Thu, Oct 22, 2015 at 03:23:49PM +0200, Jiri Denemark wrote:
On Thu, Oct 22, 2015 at 13:52:29 +0100, Daniel P. Berrange wrote: ...
On a related topic, we don't have great error reporting in the (usually unlikely) scenario that we get a stuck job / timeout. I've long thought it could be desirable to record some metadata when we start jobs, such as the __FUNC__ of the method which started the job, so when we report an error we can include that info as a diagnostic aid.
Do you mean something like
virsh # resume cd error: Failed to resume domain cd error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSuspend)
This was implemented by v1.2.13-295-gb79f25e
This would have to be against the qemuDomainObjPrivPtr struct. THis makes me think that using the separate bool inJob/inMonitor stack variables is not required.
We could just add
int threadid; bool inJob; bool inMonitor; const char *jobfunc;
to qemuDomainObjPrivPtr. That way you don't need to modify the Enter/Exit functions to add extra arguments - we just track everything internally. When exiting, we'd compare against the threadid, to make sure we don't accidentally relaase a different thread's job.
Yeah, as long as we can make sure threadid is unique and stable:
/* These next two functions are for debugging only, since they are not * guaranteed to give unique values for distinct threads on all * architectures, nor are the two functions guaranteed to give the same * value for the same thread. */ unsigned long long virThreadSelfID(void); unsigned long long virThreadID(virThreadPtr thread);
so far we avoided using thread IDs for anything critical.
So technically that comment is correct, since we're casting the pthread_t type to an unsigned long long, also you cannot directly compare pthread_t == pthread_t. POSIX does however provide pthread_equal() to allow you to compare whether 2 pthread_t values point to the same thread. http://man7.org/linux/man-pages/man3/pthread_self.3.html http://man7.org/linux/man-pages/man3/pthread_equal.3.html Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (3)
-
Daniel P. Berrange
-
Jiri Denemark
-
Martin Kletzander