On 01/26/2012 02:13 PM, Luiz Capitulino wrote:
On Thu, 26 Jan 2012 20:41:13 +0100
Michal Privoznik<mprivozn(a)redhat.com> wrote:
> On 26.01.2012 20:35, Luiz Capitulino wrote:
>> On Thu, 26 Jan 2012 08:18:03 -0700
>> Eric Blake<eblake(a)redhat.com> wrote:
>>
>>> [adding qemu-devel]
>>>
>>> On 01/26/2012 07:46 AM, Daniel P. Berrange wrote:
>>>>> One thing that you'll probably notice is this
>>>>> 'set-support-level' command. Basically, it tells GA which qemu
>>>>> version it is running on. Ideally, this should be done as soon as
>>>>> GA starts up. However, that cannot be determined from the outside
>>>>> world, as GA doesn't emit any events yet.
>>>>> Ideally^2, this command should be left out, as it should be qemu
>>>>> itself that tells its own agent this kind of information.
>>>>> Anyway, I was going to call this command in qemuProcess{Startup,
>>>>> Reconnect,Attach}, but it won't work. We need to un-pause guest
>>>>> CPUs so the guest can boot and start GA, but that implies
>>>>> returning from qemuProcess*.
>>>>>
>>>>> So I am setting this just before the 'guest-suspend' command, as
>>>>> there is one more thing about GA: it is unable to remember anything
>>>>> across a restart (of the GA process). This has, by the way, exposed
>>>>> a flaw in our current code with FS freeze & thaw. If we freeze guest
>>>>> FS, and somebody restarts GA, a simple FS thaw will not succeed, as
>>>>> GA thinks the FS are not frozen. But that's a different cup of tea.
>>>>>
>>>>> Because of what is written above, we need to call set-level
>>>>> on every suspend.
>>>>
>>>>
>>>> IMHO all this says that the 'set-level' command is a conceptually,
>>>> unfixably broken design & should be killed in QEMU before it turns
>>>> into an even bigger mess.
>>
>> Can you elaborate on this? Michal and I talked on irc about making the
>> compatibility level persistent, would that help?
>>
>>>> Once we're in a situation where we need to call 'set-level' prior
>>>> to every single invocation, you might as well just allow the QEMU
>>>> version number to be passed in directly as an arg to the command
>>>> you are running, thus avoiding this horrificness.
>>>
>>> Qemu folks, would you care to chime in on this?
>>>
>>> Exactly how is the set-level command supposed to work? As I understand
>>> it, the goal is that if the guest has qemu-ga 1.1 installed, but is
>>> being run by qemu 1.0, then we want to ensure that any guest agent
>>> command supported by qemu-ga 1.1 but requiring features of qemu not
>>> present in qemu 1.0 will be properly rejected.
>>
>> Not exactly. The default support level of qemu-ga is qemu 1.0. This means
>> that by default qemu-ga will only support qemu 1.0 features, even when
>> running on qemu 2.0. The set-support-level command allows you to declare
>> that qemu 2.0 features are supported.
>>
>> Note that this is only about specific features that depend on host support,
>> like S3 suspend which is known to be buggy in current and old qemu.
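The version-gating semantics described above could be sketched roughly as follows. This is only an illustration of the idea, not the actual qemu-ga implementation; the function names and the 1.1 threshold for suspend are assumptions for the sake of the example:

```python
# Illustrative sketch of the proposed support-level gating in the agent.
# Names and version thresholds are hypothetical, not qemu-ga's real API.

support_level = (1, 0)  # default: agent assumes qemu 1.0 features only

def set_support_level(major, minor):
    """Host side (e.g. libvirt) declares which qemu version the agent
    is running on, unlocking newer host-dependent features."""
    global support_level
    support_level = (major, minor)

def guest_suspend_ram():
    """S3 suspend is cited in the thread as buggy on current/old qemu,
    so the agent would refuse it below a declared level of 1.1."""
    if support_level < (1, 1):
        raise RuntimeError("guest-suspend-ram needs support level >= 1.1")
    return "suspended"
```

The point of the default being (1, 0) is that a guest agent newer than its host qemu fails safe: host-dependent commands stay rejected until something explicitly raises the level.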
>>
>>> But whose job is it to tell the guest agent what version of qemu is
>>> running? Based on the above conversation, it looks like the current
>>> qemu implementation does not do any handshaking on its own when the
>>> guest agent first comes alive, which means that you are forcing the work
>>> on the management app (libvirt). And this is inherently racy - if the
>>> guest is allowed to restart its qemu-ga process at will, and each
>>> restart of that guest process triggers a need to redo the handshake,
>>> then libvirt can never reliably know what version the agent is running at.
>>
>> Making the set-support-level persistent would solve it, wouldn't it?
>
> Yes and no. We still need an event for when GA comes to life. If
> anybody tries to write something to a GA which is not running (and for
> the purpose of this scenario, assume it never will be), like
> 'set-support-level', and waits for an answer (which will never come),
> they will be blocked indefinitely. However, if they write it after the
> first event comes, everything is OK.
What if the event never reaches libvirt?
This problem is a lot more general and is not limited to the
set-support-level command. Maybe adding shutdown & start events can serve as
good hints, but they won't fix the problem.
Yah, startup events are a good indicator for issuing the guest-sync
sequence (we had them at one point, and planned to re-add them for QMP
integration; since libvirt is taking on this role for now, it might
make sense to re-add them now), but once that sequence is issued the agent
can still be manually stopped, or the guest-sync sequence itself can
time out.
And there's no way to reliably send a stop indicator. Maybe we could capture
shutdown events, but not consistently enough that we can base the
protocol on it (the agent might get pkill -9'd, for instance, and
virtio-serial doesn't currently plumb a guest-side disconnect through to the
chardev front-end, so you'd never know).
So, the only indication you'll ever get that your "session" ended is
either a timeout or, if we add it, a startup event. In either case the
response is to issue the reset sequence.
The way it would need to work with resets is: every time a command times
out, you:
1) Report the timeout error to the libvirt client/management app. Set
guest_agent_available = 0, such that further libvirt calls that depend
on it return "not currently available", or something to that effect.
2) Issue guest-sync with a new unique session id.
3) Read a JSON object/response:
   - if you time out, goto 2
   - if your response doesn't have the session id you're expecting,
     repeat 3) (it may be a response to a previous guest-sync RPC that
     you timed out on prematurely, but you can't just wait indefinitely,
     since it may never arrive)
4) Set guest_agent_available = 1 and proceed with normal operation until
the next timeout (or startup event, if we add one).
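The four-step reset sequence could be sketched roughly as below. This is a toy sketch of the caller's side, not libvirt's actual implementation: the transport is abstracted as send/recv callables, and the monotonically increasing session id is just one way to get a unique id per attempt:

```python
import itertools
import json

_session_ids = itertools.count(1)  # one choice of unique-id scheme

def resync(send, recv_line, max_attempts=3):
    """Sketch of the reset sequence: issue guest-sync with a fresh
    session id and discard stale responses until the matching id
    comes back. recv_line returns the next line from the agent,
    or None on timeout. The caller is expected to have already
    marked the agent unavailable (step 1) before calling this."""
    for _ in range(max_attempts):
        session_id = next(_session_ids)  # step 2: new unique id
        send(json.dumps({"execute": "guest-sync",
                         "arguments": {"id": session_id}}) + "\n")
        while True:  # step 3: read responses
            line = recv_line()
            if line is None:
                break  # timed out: goto step 2 with a new id
            resp = json.loads(line)
            if resp.get("return") == session_id:
                return True  # step 4: caller marks agent available
            # otherwise it's a stale response to an earlier
            # guest-sync we gave up on; keep reading
    return False  # still unreachable; agent stays unavailable
```

The key detail is the stale-response check: a reply carrying an old id is silently consumed rather than treated as the answer, which is exactly why each guest-sync needs a fresh id.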
IMHO, the best way to solve this is to issue the guest-sync command with
a timeout. If you get no answer, then try again later.