[libvirt] RFC: Exposing "ready" bool (of `query-block-jobs`) or QMP BLOCK_JOB_READY event

Backround --------- For QEMU block device jobs, the "ready" boolean field (part of QMP `query-block-jobs`) was introduced in commit ef6dbf1 (available in QEMU v2.2.0 or above): http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ef6dbf1e4 -- blockjob: Add "ready" field "When a block job signals readiness, this is currently reported only through QMP. If qemu wants to use block jobs for internal tasks, there needs to be another way to correctly detect when a block job may be completed. For this reason, introduce a bool "ready" which is set when the block job may be completed." And, libvirt was fixed to use the above field in this commit (available in libvirt v1.2.18 or above): http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu: Update state of block job to READY only if it actually is ready RFC --- Currently libvirt block APIs (& consequently higher-level applications like Nova which use these APIs) rely on polling for job completion via virDomainGetBlockJobInfo(), which uses QMP `query-block-jobs`, and waits for QEMU to report "offset" == "len", which translates to libvirt "cur" == "end". Based on this, libvirt can take an action (whether to gracefully abort, or pivot to the copy in case of a COPY job). Since QEMU reports the "ready": true field (followed by a BLOCK_JOB_READY QMP event). It would be helpful if libvirt expose this via an API, so upper layers could instead use that, rather than polling. Problem scenario ---------------- When virDomainBlockRebase() is invoked to start a copy job, then aborting the said copy operation with virDomainBlockJobAbort() + flag VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race condition (due to the way the virDomainGetBlockJobInfo() reports the job status) where the pivot operation fails. Race condition window ~~~~~~~~~~~~~~~~~~~~~ libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true. So Eric Blake suggests: QEMU 2.0 or 1.x probably had a synchronous setup where you could never observer cur==end on a non-ready job. But I don't remember enough history to point to when QEMU switched jobs to be a bit more asynchronous. Maybe there was no qemu regression - maybe it was BECAUSE of other block-job additions in 2.2 that offset==len was no longer reliable. I don't know that for sure. But what it DOES sound like is that IF qemu reports "ready": false, offset==len is not reliable, and libvirt should be taught to fudge that. And hopefully, QEMU too old to report "ready:" at all is reliable with regards to offset==len, because that's all we have to go by. For now, I filed this upstream libvirt bug: https://bugzilla.redhat.com/show_bug.cgi?id=1382165 -- virDomainGetBlockJobInfo: Adjust job reporting based on QEMU stats & the "ready" field of `query-block-jobs` However, exposing the "ready" boolean from QMP `query-block-jobs` might be worth considering. -- /kashyap

On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote:
Backround ---------
For QEMU block device jobs, the "ready" boolean field (part of QMP `query-block-jobs`) was introduced in commit ef6dbf1 (available in QEMU v2.2.0 or above):
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ef6dbf1e4 -- blockjob: Add "ready" field
"When a block job signals readiness, this is currently reported only through QMP. If qemu wants to use block jobs for internal tasks, there needs to be another way to correctly detect when a block job may be completed.
For this reason, introduce a bool "ready" which is set when the block job may be completed."
And, libvirt was fixed to use the above field in this commit (available in libvirt v1.2.18 or above):
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu: Update state of block job to READY only if it actually is ready [1]
RFC ---
Currently libvirt block APIs (& consequently higher-level applications like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from qemu which are also exposed as libvirt events.
virDomainGetBlockJobInfo(), which uses QMP `query-block-jobs`, and waits for QEMU to report "offset" == "len", which translates to libvirt "cur" == "end". Based on this, libvirt can take an action (whether to gracefully abort, or pivot to the copy in case of a COPY job).
Again false. Internally we honor the ready state and applications like virsh have code in place that waits for the async events and act after receiving them.
Since QEMU reports the "ready": true field (followed by a BLOCK_JOB_READY QMP event). It would be helpful if libvirt expose this via an API, so upper layers could instead use that, rather than polling.
We expose the state of the copy job in the XML and forward the READY event from qemu to the users. A running copy job exposes itself in the xml as: <disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk> While the ready copy job is exposed as: <disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy' ready='yes'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk> Additionally we have anyncrhronous events that are emitted once qemu notifies us that the block job has reached sync state or finished. Libvirt uses the event to switch to the ready state. The documentation suggests that block jobs should listen to the events and act accordingly only after receiving the event.
Problem scenario ----------------
When virDomainBlockRebase() is invoked to start a copy job, then aborting the said copy operation with virDomainBlockJobAbort() + flag VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race condition (due to the way the virDomainGetBlockJobInfo() reports the job status) where the pivot operation fails.
Again, this should trigger the block job event and that should be used as the marker to perform the pivot operation. $ virsh event --all --loop event 'block-job-2' for domain hp: Block Copy for hda ready <- after the sync phase is reached event 'block-job-2' for domain hp: Block Copy for hda completed <- after successfully pivoting Applications should act only after receiving such event.
Race condition window ~~~~~~~~~~~~~~~~~~~~~
libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's done by the commit you've mentioned above.
QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true.
The problem is that you are polling the block job info which correctly reports that all data was properly copied and you are inferring the block job state from that data. I'm against deliberately reporting false data in the block info structure. The application should register handlers for the block job events and act only if it receives such event. Additionally you may want to check that the state is correct in the XML. The current block job information structure can't be extended unfortunately. Please look at the virsh handling of the block jobs for inspiration since we have example code doing the right thing here. Peter

On Thu, Oct 06, 2016 at 01:34:59PM +0200, Peter Krempa wrote:
On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote:
[...]
And, libvirt was fixed to use the above field in this commit (available in libvirt v1.2.18 or above):
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu: Update state of block job to READY only if it actually is ready [1]
RFC ---
Currently libvirt block APIs (& consequently higher-level applications like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from qemu which are also exposed as libvirt events.
Err, you're right. I know you mean the below enum: enum virConnectDomainEventBlockJobStatus { VIR_DOMAIN_BLOCK_JOB_COMPLETED = 0 VIR_DOMAIN_BLOCK_JOB_FAILED = 1 VIR_DOMAIN_BLOCK_JOB_CANCELED = 2 VIR_DOMAIN_BLOCK_JOB_READY = 3 VIR_DOMAIN_BLOCK_JOB_LAST = 4 } Which are used internally in the event processing code in qemuBlockJobEventProcess() [libvirt/src/qemu/qemu_blockjob.c] that is used to update the disk mirroring state based on events from QEMU.
virDomainGetBlockJobInfo(), which uses QMP `query-block-jobs`, and waits for QEMU to report "offset" == "len", which translates to libvirt "cur" == "end". Based on this, libvirt can take an action (whether to gracefully abort, or pivot to the copy in case of a COPY job).
Again false. Internally we honor the ready state
Oops, I do realize that libvirt honors seen the 'ready' boolean -- I even mentioned the libvirt commit message (eae5924) from you in the beginning, which handles it.
and applications like virsh have code in place that waits for the async events and act after receiving them.
Yep, I remember Eric mentioning the other day that `virsh` is setup to capture libvirt events. So I briefly went through the virshBlockJobWait() in tools/virsh-domain.c where you see that: [...] /* Fallback behaviour is only needed if one or both callbacks could not * be registered */ if (data->cb_id < 0 || data->cb_id2 < 0) { /* If the block job vanishes, synthesize a COMPLETED event */ if (result == 0) { ret = VIR_DOMAIN_BLOCK_JOB_COMPLETED; break; } /* If the block job hits 100%, wait a little while for a possible * event from libvirt unless both callbacks could not be registered * in order to synthesize our own READY event */ if (info.end == info.cur && ((data->cb_id < 0 && data->cb_id2 < 0) || --retries == 0)) { ret = VIR_DOMAIN_BLOCK_JOB_READY; break; } } [...]
Since QEMU reports the "ready": true field (followed by a BLOCK_JOB_READY QMP event). It would be helpful if libvirt expose this via an API, so upper layers could instead use that, rather than polling.
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
Yep, I see that in the QMP traffic: ----- 2016-10-04 15:54:52.333+0000: 30550: info : qemuMonitorIOProcess:423 : QEMU_MONITOR_IO_PROCESS: mon=0x7fa43000ee60 buf={"timestamp": {"seconds": 1475596492, "microseconds": 333414}, "event": "BLOCK_JOB_READY", "data": {"device": "drive-virtio-disk1", "len": 1073741824, "offset": 1073741824, "speed": 0, "type": "mirror"}} -----
A running copy job exposes itself in the xml as:
<disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk>
While the ready copy job is exposed as:
<disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy' ready='yes'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk>
Thanks for the XML snippets.
Additionally we have anyncrhronous events that are emitted once qemu notifies us that the block job has reached sync state or finished. Libvirt uses the event to switch to the ready state.
The documentation suggests that block jobs should listen to the events and act accordingly only after receiving the event.
Problem scenario ----------------
When virDomainBlockRebase() is invoked to start a copy job, then aborting the said copy operation with virDomainBlockJobAbort() + flag VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race condition (due to the way the virDomainGetBlockJobInfo() reports the job status) where the pivot operation fails.
Again, this should trigger the block job event and that should be used as the marker to perform the pivot operation.
$ virsh event --all --loop event 'block-job-2' for domain hp: Block Copy for hda ready <- after the sync phase is reached event 'block-job-2' for domain hp: Block Copy for hda completed <- after successfully pivoting
Applications should act only after receiving such event.
Okay, Eric also has suggested
Race condition window ~~~~~~~~~~~~~~~~~~~~~
libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's done by the commit you've mentioned above.
Right, because I've independently tested this with libvirt and have seen the correct behavior. I meant to write: s/libvirt finds/Nova libvirt driver finds/.
QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true.
The problem is that you are polling the block job info which correctly reports that all data was properly copied and you are inferring the block job state from that data.
Indeed. We should listen to block job events.
I'm against deliberately reporting false data in the block info structure.
Yes, you're not alone in that, Nova developer (Matt Booth, CCed) also expressed a similar concern, albeit not as strongly as you.
The application should register handlers for the block job events and act only if it receives such event. Additionally you may want to check that the state is correct in the XML. The current block job information structure can't be extended unfortunately.
Okay, we sort of came to the similar conclusion (in a discussion on the upstream Nova IRC channel) to rely on events, rather than polling for cur/end values.
Please look at the virsh handling of the block jobs for inspiration since we have example code doing the right thing here.
Yes, looking already at them. Thanks! -- /kashyap

On 10/06/2016 06:34 AM, Peter Krempa wrote:
Currently libvirt block APIs (& consequently higher-level applications like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from qemu which are also exposed as libvirt events.
Libvirt is not the one deciding when to issue the pivot command, Nova is. Right now, Nova is polling (rather than waiting for events), and its polling is solely conditional on cur==end rather than on the XML addition of ready='true'.
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
I was not aware of that when I was chatting on IRC yesterday; that's useful to know, because virDomainGetBlockJobInfo() is NOT exposing that information at the moment.
The documentation suggests that block jobs should listen to the events and act accordingly only after receiving the event.
Yes, but the documentation ALSO states that waiting for cur==end is SUPPOSED to work. And it doesn't.
~~~~~~~~~~~~~~~~~~~~~
libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's done by the commit you've mentioned above.
In other words, Nova sees cur==end, and requests the pivot, but libvirt is rejecting Nova's request because 'ready' is not true yet; and Nova then gives up rather than trying again.
QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true.
The problem is that you are polling the block job info which correctly reports that all data was properly copied and you are inferring the block job state from that data.
But the problem here is that qemu is NOT accurately reporting data - it is reporting cur==end with the promise that they are only equal if the job is stable, WHEN THE JOB IS NOT STABLE.
I'm against deliberately reporting false data in the block info structure.
We are NOT falsifying any information, any more than we are falsifying information by changing cur/end to 0/1 when ready:false and qemu reported 0/0. (see commit 988218ca).
The application should register handlers for the block job events and act only if it receives such event. Additionally you may want to check that the state is correct in the XML. The current block job information structure can't be extended unfortunately.
Yes, changing Nova to use event handlers is a good idea. But I'm ALSO in favor of fixing libvirt to work around the qemu bug, by intentionally munging the output to state cur<end (even if qemu reported cur==end) if qemu reports ready:false. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Thu, Oct 06, 2016 at 09:25:26AM -0500, Eric Blake wrote:
On 10/06/2016 06:34 AM, Peter Krempa wrote:
[...]
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
I was not aware of that when I was chatting on IRC yesterday; that's useful to know, because virDomainGetBlockJobInfo() is NOT exposing that information at the moment.
That is what this RFC was asking to consider -- whether an [I think it has to be a new one] API should report.
The documentation suggests that block jobs should listen to the events and act accordingly only after receiving the event.
Yes, but the documentation ALSO states that waiting for cur==end is SUPPOSED to work. And it doesn't.
Yes.
libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's done by the commit you've mentioned above.
In other words, Nova sees cur==end, and requests the pivot, but libvirt is rejecting Nova's request because 'ready' is not true yet; and Nova then gives up rather than trying again.
Indeed ^ (I made this correction in my other response.)
QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true.
The problem is that you are polling the block job info which correctly reports that all data was properly copied and you are inferring the block job state from that data.
But the problem here is that qemu is NOT accurately reporting data - it is reporting cur==end with the promise that they are only equal if the job is stable, WHEN THE JOB IS NOT STABLE.
That's precisely the source of the confusion for Nova here.
I'm against deliberately reporting false data in the block info structure.
We are NOT falsifying any information, any more than we are falsifying information by changing cur/end to 0/1 when ready:false and qemu reported 0/0. (see commit 988218ca).
Indeed, it seems inconsistent to allow it in one case (like the above commit ID) to adjust (& _not_ falsify, as you accurately point out) libvirt reporting, but not the other case (cur==end, "ready": false case when cur != 0).
The application should register handlers for the block job events and act only if it receives such event. Additionally you may want to check that the state is correct in the XML. The current block job information structure can't be extended unfortunately.
Yes, changing Nova to use event handlers is a good idea. But I'm ALSO in favor of fixing libvirt to work around the qemu bug, by intentionally munging the output to state cur<end (even if qemu reported cur==end) if qemu reports ready:false.
Given the above, I've re-opened the bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1382165#c3 -- /kashyap

On 10/06/2016 10:06 AM, Kashyap Chamarthy wrote:
On Thu, Oct 06, 2016 at 09:25:26AM -0500, Eric Blake wrote:
On 10/06/2016 06:34 AM, Peter Krempa wrote:
[...]
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
I was not aware of that when I was chatting on IRC yesterday; that's useful to know, because virDomainGetBlockJobInfo() is NOT exposing that information at the moment.
That is what this RFC was asking to consider -- whether an [I think it has to be a new one] API should report.
I think we've answered this one - we don't need a new API, because parsing the domain XML already provides the current 'ready':true/false status from qemu. The existing virDomainGetBlockJobInfo() can't be extended, but it can be fixed to quit reporting cur==end when ready:false.
Given the above, I've re-opened the bug here:
Thanks. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Thu, Oct 06, 2016 at 15:51:38 -0500, Eric Blake wrote:
On 10/06/2016 10:06 AM, Kashyap Chamarthy wrote:
On Thu, Oct 06, 2016 at 09:25:26AM -0500, Eric Blake wrote:
On 10/06/2016 06:34 AM, Peter Krempa wrote:
[...]
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
I was not aware of that when I was chatting on IRC yesterday; that's useful to know, because virDomainGetBlockJobInfo() is NOT exposing that information at the moment.
That is what this RFC was asking to consider -- whether an [I think it has to be a new one] API should report.
I think we've answered this one - we don't need a new API, because parsing the domain XML already provides the current 'ready':true/false status from qemu.
The existing virDomainGetBlockJobInfo() can't be extended, but it can be fixed to quit reporting cur==end when ready:false.
Yes, I agree about this one (although I don't really like it [1]), but this one will actually fix software not listening for events without any change. Any new API would not help since the apps would need to change anyways thus can use the current correct approach right away even with older libvirt versions. Peter [1]: I'm expecting users to start complaining: "Why is this last byte of my image taking so long to copy after the rest copied pretty quickly".

On 10/07/2016 02:09 AM, Peter Krempa wrote:
The existing virDomainGetBlockJobInfo() can't be extended, but it can be fixed to quit reporting cur==end when ready:false.
Yes, I agree about this one (although I don't really like it [1]), but this one will actually fix software not listening for events without any change.
Any new API would not help since the apps would need to change anyways thus can use the current correct approach right away even with older libvirt versions.
Peter
[1]: I'm expecting users to start complaining: "Why is this last byte of my image taking so long to copy after the rest copied pretty quickly".
And our response is "We never promised that cur and end are bytes, only rough status indicators. And we don't know why qemu is taking so long - move the bug report to them" - if the user can even see this state long enough for it to bother them. (Nova is hitting it, because it is a software-triggered reaction time, not a human in the mix; my understanding is that it is still at most a fraction of a second where we'd even have to do the fudging). -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/06/2016 10:25 AM, Eric Blake wrote:
On 10/06/2016 06:34 AM, Peter Krempa wrote:
Currently libvirt block APIs (& consequently higher-level applications like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from qemu which are also exposed as libvirt events.
Libvirt is not the one deciding when to issue the pivot command, Nova is. Right now, Nova is polling (rather than waiting for events), and its polling is solely conditional on cur==end rather than on the XML addition of ready='true'.
We expose the state of the copy job in the XML and forward the READY event from qemu to the users.
I was not aware of that when I was chatting on IRC yesterday; that's useful to know, because virDomainGetBlockJobInfo() is NOT exposing that information at the moment.
The documentation suggests that block jobs should listen to the events and act accordingly only after receiving the event.
Yes, but the documentation ALSO states that waiting for cur==end is SUPPOSED to work. And it doesn't.
~~~~~~~~~~~~~~~~~~~~~
libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's done by the commit you've mentioned above.
In other words, Nova sees cur==end, and requests the pivot, but libvirt is rejecting Nova's request because 'ready' is not true yet; and Nova then gives up rather than trying again.
QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true.
The problem is that you are polling the block job info which correctly reports that all data was properly copied and you are inferring the block job state from that data.
But the problem here is that qemu is NOT accurately reporting data - it is reporting cur==end with the promise that they are only equal if the job is stable, WHEN THE JOB IS NOT STABLE.
Do we really promise that in QEMU? I guess since jobs have existed since before the ready field I guess we do...
I'm against deliberately reporting false data in the block info structure.
We are NOT falsifying any information, any more than we are falsifying information by changing cur/end to 0/1 when ready:false and qemu reported 0/0. (see commit 988218ca).
The application should register handlers for the block job events and act only if it receives such event. Additionally you may want to check that the state is correct in the XML. The current block job information structure can't be extended unfortunately.
Yes, changing Nova to use event handlers is a good idea. But I'm ALSO in favor of fixing libvirt to work around the qemu bug, by intentionally munging the output to state cur<end (even if qemu reported cur==end) if qemu reports ready:false.
participants (4)
-
Eric Blake
-
John Snow
-
Kashyap Chamarthy
-
Peter Krempa