On Thu, Oct 06, 2016 at 01:34:59PM +0200, Peter Krempa wrote:
On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote:
[...]
> And, libvirt was fixed to use the above field in this commit
(available
> in libvirt v1.2.18 or above):
>
>
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu:
> Update state of block job to READY only if it actually is ready [1]
>
> RFC
> ---
>
> Currently libvirt block APIs (& consequently higher-level applications
> like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from
qemu which are also exposed as libvirt events.
Err, you're right. I know you mean the below enum:
enum virConnectDomainEventBlockJobStatus {
VIR_DOMAIN_BLOCK_JOB_COMPLETED = 0
VIR_DOMAIN_BLOCK_JOB_FAILED = 1
VIR_DOMAIN_BLOCK_JOB_CANCELED = 2
VIR_DOMAIN_BLOCK_JOB_READY = 3
VIR_DOMAIN_BLOCK_JOB_LAST = 4
}
Which are used internally in the event processing code in
qemuBlockJobEventProcess() [libvirt/src/qemu/qemu_blockjob.c] that is
used to update the disk mirroring state based on events from QEMU.
> virDomainGetBlockJobInfo(), which uses QMP `query-block-jobs`,
and
> waits for QEMU to report "offset" == "len", which translates to
> libvirt "cur" == "end". Based on this, libvirt can take an
action
> (whether to gracefully abort, or pivot to the copy in case of a COPY
> job).
Again false. Internally we honor the ready state
Oops, I do realize that libvirt honors seen the 'ready' boolean -- I
even mentioned the libvirt commit message (eae5924) from you in the
beginning, which handles it.
and applications like
virsh have code in place that waits for the async events and act after
receiving them.
Yep, I remember Eric mentioning the other day that `virsh` is setup to
capture libvirt events. So I briefly went through the
virshBlockJobWait() in tools/virsh-domain.c where you see that:
[...]
/* Fallback behaviour is only needed if one or both callbacks could not
* be registered */
if (data->cb_id < 0 || data->cb_id2 < 0) {
/* If the block job vanishes, synthesize a COMPLETED event */
if (result == 0) {
ret = VIR_DOMAIN_BLOCK_JOB_COMPLETED;
break;
}
/* If the block job hits 100%, wait a little while for a possible
* event from libvirt unless both callbacks could not be registered
* in order to synthesize our own READY event */
if (info.end == info.cur &&
((data->cb_id < 0 && data->cb_id2 < 0) || --retries ==
0)) {
ret = VIR_DOMAIN_BLOCK_JOB_READY;
break;
}
}
[...]
> Since QEMU reports the "ready": true field (followed
by a
> BLOCK_JOB_READY QMP event). It would be helpful if libvirt expose this
> via an API, so upper layers could instead use that, rather than polling.
We expose the state of the copy job in the XML and forward the READY
event from qemu to the users.
Yep, I see that in the QMP traffic:
-----
2016-10-04 15:54:52.333+0000: 30550: info : qemuMonitorIOProcess:423 :
QEMU_MONITOR_IO_PROCESS: mon=0x7fa43000ee60 buf={"timestamp":
{"seconds": 1475596492, "microseconds": 333414}, "event":
"BLOCK_JOB_READY", "data": {"device":
"drive-virtio-disk1", "len":
1073741824, "offset": 1073741824, "speed": 0, "type":
"mirror"}}
-----
A running copy job exposes itself in the xml as:
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<source
file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
<backingStore/>
<mirror type='file' file='/tmp/ble.img' format='raw'
job='copy'>
<format type='raw'/>
<source file='/tmp/ble.img'/>
</mirror>
[...]
</disk>
While the ready copy job is exposed as:
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<source
file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
<backingStore/>
<mirror type='file' file='/tmp/ble.img' format='raw'
job='copy' ready='yes'>
<format type='raw'/>
<source file='/tmp/ble.img'/>
</mirror>
[...]
</disk>
Thanks for the XML snippets.
Additionally we have anyncrhronous events that are emitted once qemu
notifies us that the block job has reached sync state or finished.
Libvirt uses the event to switch to the ready state.
The documentation suggests that block jobs should listen to the events
and act accordingly only after receiving the event.
> Problem scenario
> ----------------
>
> When virDomainBlockRebase() is invoked to start a copy job, then
> aborting the said copy operation with virDomainBlockJobAbort() + flag
> VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race
> condition (due to the way the virDomainGetBlockJobInfo() reports the job
> status) where the pivot operation fails.
Again, this should trigger the block job event and that should be used
as the marker to perform the pivot operation.
$ virsh event --all --loop
event 'block-job-2' for domain hp: Block Copy for hda ready <- after the sync
phase is reached
event 'block-job-2' for domain hp: Block Copy for hda completed <- after
successfully pivoting
Applications should act only after receiving such event.
Okay, Eric also has suggested
> Race condition window
> ~~~~~~~~~~~~~~~~~~~~~
>
> libvirt finds cur==end AND sends a pivot request, all in the window
> before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's
done by the commit you've mentioned above.
Right, because I've independently tested this with libvirt and have seen
the correct behavior. I meant to write: s/libvirt finds/Nova libvirt
driver finds/.
> QMP `query-block-jobs` command's response, indicating that
the job has
> actually completed], however the pivot request fails because it requires
> "ready": true.
The problem is that you are polling the block job info which correctly
reports that all data was properly copied and you are inferring the
block job state from that data.
Indeed. We should listen to block job events.
I'm against deliberately reporting false data in the block info
structure.
Yes, you're not alone in that, Nova developer (Matt Booth, CCed) also
expressed a similar concern, albeit not as strongly as you.
The application should register handlers for the block job events
and
act only if it receives such event. Additionally you may want to check
that the state is correct in the XML. The current block job information
structure can't be extended unfortunately.
Okay, we sort of came to the similar conclusion (in a discussion on the
upstream Nova IRC channel) to rely on events, rather than polling for
cur/end values.
Please look at the virsh handling of the block jobs for inspiration
since we have example code doing the right thing here.
Yes, looking already at them. Thanks!
--
/kashyap