On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote:
Backround
---------
For QEMU block device jobs, the "ready" boolean field (part of QMP
`query-block-jobs`) was introduced in commit ef6dbf1 (available in QEMU
v2.2.0 or above):
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ef6dbf1e4 --
blockjob: Add "ready" field
"When a block job signals readiness, this is currently reported only
through QMP. If qemu wants to use block jobs for internal tasks,
there needs to be another way to correctly detect when a block job
may be completed.
For this reason, introduce a bool "ready" which is set when the
block job may be completed."
And, libvirt was fixed to use the above field in this commit (available
in libvirt v1.2.18 or above):
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu:
Update state of block job to READY only if it actually is ready [1]
RFC
---
Currently libvirt block APIs (& consequently higher-level applications
like Nova which use these APIs) rely on polling for job completion via
libvirt is _not_ polling the data. Libvirt relies on the events from
qemu which are also exposed as libvirt events.
virDomainGetBlockJobInfo(), which uses QMP `query-block-jobs`, and
waits for QEMU to report "offset" == "len", which translates to
libvirt
"cur" == "end". Based on this, libvirt can take an action (whether
to
gracefully abort, or pivot to the copy in case of a COPY job).
Again false. Internally we honor the ready state and applications like
virsh have code in place that waits for the async events and act after
receiving them.
Since QEMU reports the "ready": true field (followed by a
BLOCK_JOB_READY QMP event). It would be helpful if libvirt expose this
via an API, so upper layers could instead use that, rather than polling.
We expose the state of the copy job in the XML and forward the READY
event from qemu to the users.
A running copy job exposes itself in the xml as:
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
<backingStore/>
<mirror type='file' file='/tmp/ble.img' format='raw'
job='copy'>
<format type='raw'/>
<source file='/tmp/ble.img'/>
</mirror>
[...]
</disk>
While the ready copy job is exposed as:
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
<backingStore/>
<mirror type='file' file='/tmp/ble.img' format='raw'
job='copy' ready='yes'>
<format type='raw'/>
<source file='/tmp/ble.img'/>
</mirror>
[...]
</disk>
Additionally we have anyncrhronous events that are emitted once qemu
notifies us that the block job has reached sync state or finished.
Libvirt uses the event to switch to the ready state.
The documentation suggests that block jobs should listen to the events
and act accordingly only after receiving the event.
Problem scenario
----------------
When virDomainBlockRebase() is invoked to start a copy job, then
aborting the said copy operation with virDomainBlockJobAbort() + flag
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race
condition (due to the way the virDomainGetBlockJobInfo() reports the job
status) where the pivot operation fails.
Again, this should trigger the block job event and that should be used
as the marker to perform the pivot operation.
$ virsh event --all --loop
event 'block-job-2' for domain hp: Block Copy for hda ready <- after the sync
phase is reached
event 'block-job-2' for domain hp: Block Copy for hda completed <- after
successfully pivoting
Applications should act only after receiving such event.
Race condition window
~~~~~~~~~~~~~~~~~~~~~
libvirt finds cur==end AND sends a pivot request, all in the window
before QEMU would have sent "ready": true field [emitted as part of the
This is not true. Libvirt checks that the mirror is actually ready. It's
done by the commit you've mentioned above.
QMP `query-block-jobs` command's response, indicating that the
job has
actually completed], however the pivot request fails because it requires
"ready": true.
The problem is that you are polling the block job info which correctly
reports that all data was properly copied and you are inferring the
block job state from that data.
I'm against deliberately reporting false data in the block info
structure.
The application should register handlers for the block job events and
act only if it receives such event. Additionally you may want to check
that the state is correct in the XML. The current block job information
structure can't be extended unfortunately.
Please look at the virsh handling of the block jobs for inspiration
since we have example code doing the right thing here.
Peter