On Thu, Feb 19, 2015 at 11:07:46 +0000, Daniel Berrange wrote:
On Thu, Feb 19, 2015 at 12:02:27PM +0100, Peter Krempa wrote:
> Previously when a domain would get stuck in a domain job due to a
> programming mistake we'd report the following control state:
>
> $ virsh domcontrol domain
> occupied (1424343406.150s)
>
> The timestamp is invalid as the monitor was not entered for that domain.
> We can use that to detect that the domain has an active job and report a
> better error instead:
>
> $ virsh domcontrol domain
> error: internal (locking) error
I don't really think that is much better as it still doesn't give us any
clue as to how/why we got into this broken state.
Well, yes, it doesn't give a clue how we got into a broken state but
fixes the output in case we are already in the broken state for the
virDomainGetControlInfo API.
In the example above you can see that the API returns state "occupied"
which usually means that the domain is in a monitor call and the elapsed
time is totally incorrect as it's calculated as CURRENT_TIME -
monitorStartTime. In case the domain is _not_ in a monitor call, the
monitorStartTime variable is set to 0 so we basically state that the
domain is in a monitor call since start of the epoch.
I think it'd be desirable to modify our JobEnter APIs so that they get
passed the __FUNC__ __FILE__ and __LINE__ of the code which starts the
job. Then include this info when we report a job error.
I agree though that this is necessary. Jiri set out to improve the error
reporting a while ago in a very similar manner but still didn't get to
finish it AFAIK.
Peter