On 09/21/2015 05:09 AM, Shivaprasad bhat wrote:
Thanks John for the comments.
On Fri, Sep 18, 2015 at 10:34 PM, John Ferlan <jferlan(a)redhat.com> wrote:
>
>
> On 09/14/2015 10:44 AM, Shivaprasad G Bhat wrote:
>> Tunnelled migration can hang if the destination qemu exits despite all the
>> ABI checks. This happens whenever the destination qemu exits before the
>> complete transfer is noticed by source qemu. The savevm state checks at
>> runtime can fail at destination and cause qemu to error out.
>> The source qemu can't notice it as the EPIPE is not propagated to it.
>> The qemuMigrationIOFunc() notices the stream being broken from virStreamSend()
>> and it cleans up the stream alone. The qemuMigrationWaitForCompletion() would
>> never get to 100% transfer completion.
>> The qemuMigrationWaitForCompletion() never breaks out either, since
>> the ssh connection to the destination is healthy, and the source qemu also thinks
>> the migration is ongoing as the FD it transfers to is never
>> closed or broken. So, the migration will hang forever. Even Ctrl-C on the
>> virsh migrate wouldn't be honoured. Close the source side FD when there is
>> an error in the stream. That way, the source qemu updates itself and
>> qemuMigrationWaitForCompletion() notices the failure.
>>
>> Close the FD for all kinds of errors to be sure. The error message is not
>> copied for EPIPE so that the destination error is copied instead later.
>>
>> Note:
>> Reproducible with repeated migrations between Power hosts running in different
>> subcores-per-core modes.
>>
>> Changes from v1 -> v2:
>> VIR_FORCE_CLOSE() was called twice for this use case, which would log
>> unnecessary warnings. So, move the fd close to qemuMigrationIOFunc()
>> so that there are no unnecessary duplicate attempts. (Would this trigger
>> a Coverity error? I don't have a setup to check.)
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.vnet.ibm.com>
>> ---
>> src/qemu/qemu_migration.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
>> index ff89ab5..9602fb2 100644
>> --- a/src/qemu/qemu_migration.c
>> +++ b/src/qemu/qemu_migration.c
>> @@ -4012,6 +4012,7 @@ static void qemuMigrationIOFunc(void *arg)
>> if (virStreamFinish(data->st) < 0)
>> goto error;
>>
>> + VIR_FORCE_CLOSE(data->sock);
>> VIR_FREE(buffer);
>>
>> return;
>> @@ -4029,7 +4030,11 @@ static void qemuMigrationIOFunc(void *arg)
>> }
>>
>> error:
>> - virCopyLastError(&data->err);
>> + /* Let the source qemu know that the transfer can't continue anymore.
>> + * Don't copy the error for EPIPE as the destination has the actual error.
>> + */
>> + VIR_FORCE_CLOSE(data->sock);
>> + if (!virLastErrorIsSystemErrno(EPIPE))
>> + virCopyLastError(&data->err);
>> virResetLastError();
>> VIR_FREE(buffer);
>> }
>> @@ -4397,7 +4402,6 @@ qemuMigrationRun(virQEMUDriverPtr driver,
>> if (iothread && qemuMigrationStopTunnel(iothread, ret < 0) < 0)
>> ret = -1;
>> }
>> - VIR_FORCE_CLOSE(fd);
>
> ^^^
>
> This causes Coverity to claim a RESOURCE_LEAK
>
> Feels like this was a mistake edit...
>
The VIR_FORCE_CLOSE() inside qemuMigrationIOFunc() alone is sufficient.
Having this again here would lead to a warning in the logs. I too thought
Coverity would complain. Is there a way to force Coverity to ignore a warning?

Typically a marker of sorts is used, such as:

/* coverity[leaked_handle] <some extra comment explaining why> */

Although I'm not sure that's the best way to handle this either.
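If it came to that, it would presumably end up in qemuMigrationRun(), placed
directly above whichever line the RESOURCE_LEAK report points at, so treat
the placement here as illustrative only:

    /* fd is now owned by the iothread; qemuMigrationIOFunc() closes it */
    /* coverity[leaked_handle] */
    if (iothread && qemuMigrationStopTunnel(iothread, ret < 0) < 0)
        ret = -1;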

The problem I see, though, is that this is an "error path" issue, and the
question is when (or whether) it's safe/required to close fd/io->sock/data->sock.

Your commit comment notes that the issue is seen on a fairly specific
event (virStreamSend failure) for a specific type of migration. As I
read the code, that failure jumps to error (as does virStreamFinish). So
now you have a fairly specific set of instances which perhaps should
cause qemuMigrationWaitForCompletion to fail. The fix you have
proposed is to close data->sock (io->sock, fd). However, your
proposal is a larger hammer than necessary. I assume closing data->sock
causes WaitForCompletion to fail (perhaps differently) and that's why you
chose it.

Going back to the beginning of qemuMigrationRun, setting the 'fd' and
using StartTunnel seems to rely on MIGRATION_FWD_DIRECT, but if a tunnel
is created, an (ugh) 'iothread' variable is non-NULL. What if 'iothread'
were passed to qemuMigrationWaitForCompletion, and then on to
qemuMigrationCompleted, which would check whether iothread is non-NULL
and whether some new flag in _qemuMigrationIOThread is true? If the flag
were true, that would cause a failure, so WaitForCompletion would return
error status. The only place the flag gets set to true would be in
qemuMigrationIOFunc when the code jumps to error (e.g. add a bool
stream_abort field, set data->stream_abort = true in "error:", then check
iothread && iothread->stream_abort and force the error; see the sketch
below).
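Something along these lines, as a rough untested sketch rather than a real
patch (the existing struct fields and the existing error handling are
elided; the error message and the exact way the failure is reported are
only illustrative):

    struct _qemuMigrationIOThread {
        /* ... existing fields (thread, st, sock, err, ...) ... */
        bool stream_abort;      /* new: set when the tunnel stream breaks */
    };

    static void qemuMigrationIOFunc(void *arg)
    {
        qemuMigrationIOThreadPtr data = arg;
        ...
     error:
        /* tell the waiter the tunnel is dead instead of closing data->sock */
        data->stream_abort = true;
        /* ... existing error handling (virCopyLastError() etc.) ... */
    }

    /* in qemuMigrationCompleted() (reached from
     * qemuMigrationWaitForCompletion), assuming iothread gets plumbed
     * through as an extra argument: */
        if (iothread && iothread->stream_abort) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("migration tunnel stream aborted on source"));
            goto error;   /* bail out the same way the other checks do */
        }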

The only question then in my mind is whether this should also be done for
the virStreamFinish failure path, which also jumps to error.

Doing this means you shouldn't need the VIR_FORCE_CLOSE(data->sock) for
either the normal or the error path. Does this seem like a reasonable
approach, and does it solve the issue you're facing?
John
Thanks and Regards,
Shivaprasad
> The rest of the patch looks reasonable; however, I'm far from the expert
> in these matters.
>
> John
>>
>> if (priv->job.completed) {
>> qemuDomainJobInfoUpdateTime(priv->job.completed);
>>