Re: [libvirt] [Qemu-devel] [PATCH v2 0/3] HMP/snapshot changes - do not use ID anymore

Cc: libvir-list for review of the compatibility argument below.

Daniel Henrique Barboza <danielhb413@gmail.com> writes:
Hey David,
On 9/21/18 9:29 AM, Dr. David Alan Gilbert wrote:
* Daniel Henrique Barboza (danielhb413@gmail.com) wrote:
changes in v2:
- removed the "RFC" marker;
- added a new patch (patch 2) that removes bdrv_snapshot_delete_by_id_or_name from the code;
- made changes in patch 1 as suggested by Murilo;
- previous patch set link: https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04658.html
It is not uncommon to see bugs being opened by testers that attempt to create VM snapshots using HMP. It turns out that "0" and "1" are quite common snapshot names and they trigger a lot of bugs. I gave an example in the commit message of patch 1, but to sum up here: QEMU treats the input of savevm/loadvm/delvm sometimes as 'ID', sometimes as 'name'. It is documented as such, but this can lead to strange situations.
Given that it is strange for an API to treat a single parameter as two different fields, inadvertently interpreting it as one or the other, and that removing the ID field entirely would be too drastic, my idea here is to keep the ID field for internal control but not let the user set it.
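The ambiguity described above is easier to see in code. Here is a minimal sketch (hypothetical types and lookup helpers, not QEMU's actual implementation): with snapshots tagged "0" and "1", whose auto-assigned IDs are "1" and "2", an ID-or-name lookup of "1" resolves to the snapshot *tagged* "0" (its ID matches first), while a tag-only lookup resolves to the snapshot tagged "1":

```c
#include <string.h>

/* Hypothetical snapshot record: both fields are plain strings. */
typedef struct {
    const char *id;   /* auto-assigned: "1", "2", ... */
    const char *tag;  /* user-chosen name */
} Snapshot;

/* Old-style lookup: the argument may match EITHER the ID or the tag. */
static const Snapshot *find_by_id_or_tag(const Snapshot *list, int n,
                                         const char *str)
{
    int i;
    for (i = 0; i < n; i++) {
        if (strcmp(list[i].id, str) == 0 || strcmp(list[i].tag, str) == 0) {
            return &list[i];
        }
    }
    return NULL;
}

/* Proposed-style lookup: the argument is a tag, nothing else. */
static const Snapshot *find_by_tag(const Snapshot *list, int n,
                                   const char *tag)
{
    int i;
    for (i = 0; i < n; i++) {
        if (strcmp(list[i].tag, tag) == 0) {
            return &list[i];
        }
    }
    return NULL;
}
```

With tag-only lookup the numeric-name clash simply cannot occur, because the internal ID is never compared against user input.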
I guess there's room for discussion about whether to consider this an API change or not. It doesn't affect users of HMP and it doesn't affect Libvirt, but it simplifies the meaning of the parameters of savevm/loadvm/delvm.

Can you clarify a couple of things:

a) What is it about libvirt's use that means it's OK? Does it never use numeric tags?
I am glad you asked because my understanding in how Libvirt was dealing with snapshots was wrong, and I just looked into it further to answer your question. Luckily, this series fixes the situation for Libvirt as well.
I was misled by the fact that Libvirt does not show the same symptoms of this problem that we see in QEMU, but the bug is there. Here's a quick test with Libvirt using "0" and "1" as snapshot names, on a VM named "dhb" with no snapshots, with QEMU 2.12 and Libvirt 4.0.0:
- create the "0" snapshot:
$ sudo virsh snapshot-create-as --name 0 dhb
Domain snapshot 0 created
$ sudo virsh snapshot-list dhb
 Name                 Creation Time             State
------------------------------------------------------------
 0                    2018-09-24 15:47:56 -0400 running
$ sudo virsh qemu-monitor-command dhb --hmp info snapshots
List of snapshots present on all disks:
ID        TAG                 VM SIZE                DATE       VM CLOCK
--        0                      405M 2018-09-24 15:47:56   00:04:20.867
- create the "1" snapshot. Here, Libvirt shows both snapshots with snapshot-list, but the "0" snapshot was erased inside QEMU:
$ sudo virsh snapshot-create-as --name 1 dhb
Domain snapshot 1 created

$ sudo virsh snapshot-list dhb
 Name                 Creation Time             State
------------------------------------------------------------
 0                    2018-09-24 15:47:56 -0400 running
 1                    2018-09-24 15:50:09 -0400 running
$ sudo virsh qemu-monitor-command dhb --hmp info snapshots
List of snapshots present on all disks:
ID        TAG                 VM SIZE                DATE       VM CLOCK
--        1                      404M 2018-09-24 15:50:10   00:05:36.226
This is where I stopped checking out Libvirt at first, concluding wrongly that it was immune to the bug.
Libvirt does not throw an error when trying to apply snapshot 0. In fact, it acts like everything went fine:
$ sudo virsh snapshot-revert dhb 0
$ echo $?
0
Is that a libvirt bug?
Reverting back to snapshot "1" works as intended, restoring the VM state from when it was created.
(perhaps this is something we should let Libvirt be aware of ...)
This series fixes this behavior because the snapshot 0 isn't erased from QEMU, allowing Libvirt to successfully restore it.
b) After this series are you always guaranteed to be able to fix any existing oddly named snapshots?
Oddly named snapshots that already exist can be affected by the semantic change in loadvm and delvm: the user can no longer load/delete them by snapshot ID, only by tag. Aside from that, I don't see any side effects of this patch series on existing snapshots.
Do all snapshots have a tag that is unique within their image? Even snapshots created by old versions of QEMU?

On Tue, Oct 09, 2018 at 19:34:43 +0200, Markus Armbruster wrote:
Is that a libvirt bug?
Probably yes. The error handling from HMP sucks and can't really be fixed in all cases. This is for the handler which calls "loadvm":

    if (strstr(reply, "No block device supports snapshots") != NULL) {
        virReportError(VIR_ERR_OPERATION_INVALID, "%s",
                       _("this domain does not have a device to load snapshots"));
        goto cleanup;
    } else if (strstr(reply, "Could not find snapshot") != NULL) {
        virReportError(VIR_ERR_OPERATION_INVALID,
                       _("the snapshot '%s' does not exist, and was not loaded"),
                       name);
        goto cleanup;
    } else if (strstr(reply, "Snapshots not supported on device") != NULL) {
        virReportError(VIR_ERR_OPERATION_INVALID, "%s", reply);
        goto cleanup;
    } else if (strstr(reply, "Could not open VM state file") != NULL) {
        virReportError(VIR_ERR_OPERATION_FAILED, "%s", reply);
        goto cleanup;
    } else if (strstr(reply, "Error") != NULL &&
               strstr(reply, "while loading VM state") != NULL) {
        virReportError(VIR_ERR_OPERATION_FAILED, "%s", reply);
        goto cleanup;
    } else if (strstr(reply, "Error") != NULL &&
               strstr(reply, "while activating snapshot on") != NULL) {
        virReportError(VIR_ERR_OPERATION_FAILED, "%s", reply);
        goto cleanup;
    }

If the above does not match the reported error, we report success. The same problem can happen with any of the 5 [1] HMP command handlers we still have.

Note that a similar abomination exists for the code which calls "savevm".

[1] We technically have 6 HMP handlers, but "cpu_set" is not used if you have a QEMU younger than 3 years. Soon "drive_add" and "drive_del" will also be replaced, so we'll be stuck with the 3 internal snapshot ones.
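The failure mode Peter describes can be reduced to a few lines. A simplified sketch of the pattern (illustrative only, not libvirt's actual code): any reply text that matches none of the known substrings falls through to "success", so every new or unanticipated QEMU error message is silently swallowed:

```c
#include <string.h>

/* Classify an HMP reply by scanning for known error substrings,
 * falling through to "no error" when nothing matches. */
static int hmp_reply_is_error(const char *reply)
{
    static const char *const known[] = {
        "No block device supports snapshots",
        "Could not find snapshot",
        "Snapshots not supported on device",
        "Could not open VM state file",
    };
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (strstr(reply, known[i]) != NULL) {
            return 1;  /* recognized error */
        }
    }
    /* Anything else -- including an error message the list does not
     * anticipate -- is indistinguishable from success. */
    return 0;
}
```

The sketch makes the structural problem visible: correctness depends on the caller enumerating every error string QEMU might ever emit.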
Reverting back to snapshot "1" works as intended, restoring the VM state when it was created.
(perhaps this is something we should let Libvirt be aware of ...)
The error message from qemu which was wrongly ignored by libvirt can be found in the libvirtd debug log. It would be helpful if you could provide it.
Do all snapshots have a tag that is unique within their image? Even snapshots created by old versions of QEMU?
I remember there was a discussion regarding problems with collisions between the name and the ID of a snapshot when dealing with the loadvm/delvm commands, but I can't seem to find it currently. Note that libvirt plainly issues loadvm/delvm $SNAPSHOTNAME. If the name is numeric it might clash. Similarly for inactive VMs via qemu-img.

* Peter Krempa (pkrempa@redhat.com) wrote:
Probably yes. The error handling from HMP sucks and can't really be fixed in all cases. This is for the handler which calls "loadvm":
[...]
If the above does not match the reported error, we report success. The same problem can happen with any of the 5 [1] HMP command handlers we still have.
Note that a similar abomination is also for the code which calls "savevm".
Would the following small qemu change make life a little safer:

diff --git a/hmp.c b/hmp.c
index 576253a01f..0729a8c7ed 100644
--- a/hmp.c
+++ b/hmp.c
@@ -62,7 +62,7 @@ static void hmp_handle_error(Monitor *mon, Error **errp)
 {
     assert(errp);
     if (*errp) {
-        error_report_err(*errp);
+        error_reportf_err(*errp, "Error: ");
     }
 }

changing:

  No block device supports snapshots

to:

  Error: No block device supports snapshots

so you could check for "Error:" at the start and know that you've hit some unknown error?

Dave
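If such a prefix were guaranteed, the open-ended substring matching on the client side could collapse to a single check. A sketch, assuming the "Error: " prefix from the patch above (the function name is illustrative, not libvirt code):

```c
#include <string.h>

/* With a uniform "Error: " prefix on all HMP error output, a client
 * can detect *any* failure without enumerating every message. */
static int reply_has_error_prefix(const char *reply)
{
    return strncmp(reply, "Error: ", strlen("Error: ")) == 0;
}
```

Specific known messages could still be matched afterwards to produce friendlier diagnostics, but an unmatched error would no longer masquerade as success.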
[1] We technically have 6 HMP handlers, but "cpu_set" is not used if you have a QEMU younger than 3 years. Soon "drive_add" and "drive_del" will also be replaced, so we'll be stuck with the 3 internal snapshot ones.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On Thu, Oct 11, 2018 at 13:06:14 +0100, Dr. David Alan Gilbert wrote:
[...]
changing:
No block device supports snapshots
to:
Error: No block device supports snapshots
so you could check for Error: at the start and know that you've hit some unknown error?
That certainly would help, provided that the message is not localized in any way. I'd be very glad to see it also for 'savevm' and 'delvm'. It's a shame that drive_add and drive_del don't have that either, but they will soon become unused.
participants (3)
- Dr. David Alan Gilbert
- Markus Armbruster
- Peter Krempa