Re: [libvirt] [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

On 11/10/2011 02:55 AM, Avi Kivity wrote:
On 11/09/2011 07:35 PM, Anthony Liguori wrote:
On 11/09/2011 11:02 AM, Avi Kivity wrote:
On 11/09/2011 06:39 PM, Anthony Liguori wrote:
Migration with qcow2 is not a supported feature for 1.0. Migration is only supported with raw images using coherent shared storage[1].
[1] NFS is only coherent with close-to-open which right now is not good enough for migration.
Say what?
Due to block format probing, we read at least the first sector of the disk during start up.
Strictly going by what NFS guarantees, since we don't open on the destination *after* a close on the source, we aren't guaranteed to see what's written by the source.
In practice, because of block format probing, unless we're using cache=none, the first sector can be out of sync with the source on the destination. If you use cache=none on a Linux client with at least a Linux NFS server, you should be relatively safe.
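To make the probing point concrete, here is a minimal standalone sketch of what format probing amounts to; it is an illustration of the idea only, not QEMU's actual bdrv_open() code. The destination decides how to interpret the whole image from whatever bytes it happens to read from the first sector at startup, which is exactly what a stale NFS page can corrupt.

    /* Illustration only: "probe" an image by reading its first sector and
     * checking for the qcow2 magic.  If an NFS destination sees a stale
     * first sector here, the image can be misinterpreted. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static const char *probe_format(const char *path)
    {
        uint8_t buf[512];
        int fd = open(path, O_RDONLY);

        if (fd < 0 || read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            if (fd >= 0) {
                close(fd);
            }
            return "unknown";
        }
        close(fd);

        /* qcow2 images start with the magic bytes 'Q' 'F' 'I' 0xfb. */
        return memcmp(buf, "QFI\xfb", 4) == 0 ? "qcow2" : "raw";
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <image>\n", argv[0]);
            return 1;
        }
        printf("%s probes as: %s\n", argv[1], probe_format(argv[1]));
        return 0;
    }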
IMO, this should be a release blocker. qemu 1.0 only supporting migration on enterprise storage?
If we have to delay the release for a month to get it right, we should. Not that I think we have to.
Adding libvirt to the discussion.

What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) are initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).

I know a monitor can run like this as I've done it before, but some of the commands will not behave as expected, so it's pretty important to be comfortable with what commands are actually being used in this mode.

Regards,

Anthony Liguori

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) is initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).
The way migration works with libvirt wrt QEMU interactions is now as follows:

1. Destination
   Run qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
   Set netdev link states
   Set balloon target

2. Source
   Set migration speed
   Set migration max downtime
   Run migrate command (detached)
   while 1
       Query migration status
       if status is failed or success
           break;

3. Destination
   If final status was success
       Run 'cont' in monitor
   else
       kill QEMU process

4. Source
   If final status was success and 'cont' on dest succeeded
       kill QEMU process
   else
       Run 'cont' in monitor

In older libvirt, the bits from step 4 would actually take place at the end of step 2. This meant we could end up with no QEMU on either the source or dest, if starting CPUs on the dest QEMU failed for some reason.

We would still really like to have a 'query-migrate' command for the destination, so that we can confirm that the destination has consumed all incoming migrate data successfully, rather than just blindly starting CPUs and hoping for the best.

Regards,

Daniel

On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:
On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) is initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).
The way migration works with libvirt wrt QEMU interactions is now as follows
1. Destination
   Run qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
Since RHEL carries Juan's patch, and Juan's patch doesn't handle disk passwords gracefully, how does libvirt cope with that? Regards, Anthony Liguori
   Set netdev link states
   Set balloon target
2. Source
   Set migration speed
   Set migration max downtime
   Run migrate command (detached)
   while 1
       Query migration status
       if status is failed or success
           break;
3. Destination
   If final status was success
       Run 'cont' in monitor
   else
       kill QEMU process
4. Source
   If final status was success and 'cont' on dest succeeded
       kill QEMU process
   else
       Run 'cont' in monitor
In older libvirt, the bits from step 4, would actually take place at the end of step 2. This meant we could end up with no QEMU on either the source or dest, if starting CPUs on the dest QEMU failed for some reason.
We would still really like to have a 'query-migrate' command for the destination, so that we can confirm that the destination has consumed all incoming migrate data successfully, rather than just blindly starting CPUs and hoping for the best.
Regards, Daniel

On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:
On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:
On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) is initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).
The way migration works with libvirt wrt QEMU interactions is now as follows
1. Destination
   Run qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
Since RHEL carries Juan's patch, and Juan's patch doesn't handle disk passwords gracefully, how does libvirt cope with that?
No idea, that's the first I've heard of any patch that causes problems with passwords in QEMU.

Daniel

On 11/10/2011 02:06 PM, Daniel P. Berrange wrote:
On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:
On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:
On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) is initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).
The way migration works with libvirt wrt QEMU interactions is now as follows
1. Destination
   Run qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
Since RHEL carries Juan's patch, and Juan's patch doesn't handle disk passwords gracefully, how does libvirt cope with that?
No idea, that's the first I've heard of any patch that causes problems with passwords in QEMU.
My guess is that migration with a password protected qcow2 file isn't a common test-case. Regards, Anthony Liguori
Daniel

On 11/10/2011 12:27 PM, Anthony Liguori wrote:
On 11/10/2011 02:55 AM, Avi Kivity wrote:
If we have to delay the release for a month to get it right, we should. Not that I think we have to.
Adding libvirt to the discussion.
What does libvirt actually do in the monitor prior to migration completing on the destination? The least invasive way of doing delayed open of block devices is probably to make -incoming create a monitor and run a main loop before the block devices (and full device model) is initialized. Since this isolates the changes strictly to migration, I'd feel okay doing this for 1.0 (although it might need to be in the stable branch).
This won't work. libvirt needs things to be initialized. Plus, once loadvm gets to loading the device model, the device model (and BDSes) need to be fully initialized.

I think I've convinced myself that without proper clustered shared storage, cache=none is a hard requirement. That goes for iSCSI and NFS. I don't see a way to do migration safely with NFS and there's no way to really solve the page cache problem with iSCSI.

Even with the reopen, it's racing against the close on the source. If you look at Daniel's description of what libvirt is doing and then compare that to Juan's patches, there's a race condition regarding whether the source gets closed before the reopen happens. cache=none seems to be the only way to solve this.

Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.

Regards,

Anthony Liguori
I know a monitor can run like this as I've done it before but some of the commands will not behave as expected so it's pretty important to be comfortable with what commands are actually being used in this mode.
Regards,
Anthony Liguori
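As a rough illustration of the "block level flush cache interface" Anthony proposes above, the shape could be something like the following sketch. Every name here (the BlockDriver callback field, bdrv_invalidate_cache, the incoming-migration hook) is an assumption made for illustration, not the interface that existed in QEMU at the time.

    /* Hypothetical sketch of a per-format cache invalidation hook. */

    typedef struct BlockDriverState BlockDriverState;

    typedef struct BlockDriver {
        const char *format_name;
        /* Drop metadata cached in memory and re-read it from the image,
         * so a migration destination sees what the source last wrote. */
        void (*bdrv_invalidate_cache)(BlockDriverState *bs);
        /* ... other callbacks ... */
    } BlockDriver;

    struct BlockDriverState {
        BlockDriver *drv;
        /* ... */
    };

    void bdrv_invalidate_cache(BlockDriverState *bs)
    {
        if (bs->drv && bs->drv->bdrv_invalidate_cache) {
            bs->drv->bdrv_invalidate_cache(bs);
        }
    }

    /* Called on the destination once the incoming migration stream has
     * been fully read, before the CPUs are started. */
    void incoming_migration_complete(BlockDriverState **all_bs, int n)
    {
        for (int i = 0; i < n; i++) {
            bdrv_invalidate_cache(all_bs[i]);
        }
    }

The point of the hook is that each format decides for itself what "flush the cache" means, rather than forcing a file-level close/open.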

Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.

If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().

Kevin

On 11/11/2011 04:15 AM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
Just to be clear, reopen only addresses image format migration. It does not address NFS migration since it doesn't guarantee close-to-open semantics.

The problem I have with the reopen patches is that they introduce regressions and change the semantics for a management tool. If you look at the libvirt workflow with encrypted disks, it would break with the reopen patches.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used. Given that, I don't think this is a candidate for 1.0. Regards, Anthony Liguori
Kevin

Am 11.11.2011 15:03, schrieb Anthony Liguori:
On 11/11/2011 04:15 AM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
Just to be clear, reopen only addresses image format migration. It does not address NFS migration since it doesn't guarantee close-to-open semantics.
Yes. But image formats are the only thing that is really completely broken today. For NFS etc. we can tell users to use cache=none/directsync and they will be good. There is no such option that makes image formats safe.
The problem I have with the reopen patches are that they introduce regressions and change at semantics for a management tool. If you look at the libvirt workflow with encrypted disks, it would break with the reopen patches.
Yes, this is nasty. But on the other hand: Today migration is broken for all qcow2 images, with the reopen it's only broken for encrypted ones. Certainly an improvement, even though there's still a bug left.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Given that, I don't think this is a candidate for 1.0.
Nobody says it's a regression, but it's a bad bug and you're blocking a solution for it for over a year now because the solution isn't perfect enough in your eyes. :-( Kevin

On 11/11/2011 08:29 AM, Kevin Wolf wrote:
Am 11.11.2011 15:03, schrieb Anthony Liguori:
On 11/11/2011 04:15 AM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
Just to be clear, reopen only addresses image format migration. It does not address NFS migration since it doesn't guarantee close-to-open semantics.
Yes. But image formats are the only thing that is really completely broken today. For NFS etc. we can tell users to use cache=none/directsync and they will be good. There is no such option that makes image formats safe.
The problem I have with the reopen patches are that they introduce regressions and change at semantics for a management tool. If you look at the libvirt workflow with encrypted disks, it would break with the reopen patches.
Yes, this is nasty. But on the other hand: Today migration is broken for all qcow2 images, with the reopen it's only broken for encrypted ones. Certainly an improvement, even though there's still a bug left.
This sounds like a good thing to work through in the next release.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Given that, I don't think this is a candidate for 1.0.
Nobody says it's a regression, but it's a bad bug and you're blocking a solution for it for over a year now because the solution isn't perfect enough in your eyes. :-(
This patch was posted a year ago. Feedback was provided and there was never any follow up[1]. I've never Nack'd this approach. I can't see how I was blocking this since I never even responded in the thread. If this came in before soft freeze, I wouldn't have objected if you wanted to go in this direction.

This is not a bug fix, this is a new feature. We're long past feature freeze. It's not a simple and obvious fix either. It only partially fixes the problem and introduces other problems. It's not a good candidate for making an exception at this stage in the release.

[1] http://mid.gmane.org/cover.1294150511.git.quintela@redhat.com

Regards,

Anthony Liguori
Kevin

Am 11.11.2011 15:35, schrieb Anthony Liguori:
On 11/11/2011 08:29 AM, Kevin Wolf wrote:
Am 11.11.2011 15:03, schrieb Anthony Liguori:
On 11/11/2011 04:15 AM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
Just to be clear, reopen only addresses image format migration. It does not address NFS migration since it doesn't guarantee close-to-open semantics.
Yes. But image formats are the only thing that is really completely broken today. For NFS etc. we can tell users to use cache=none/directsync and they will be good. There is no such option that makes image formats safe.
The problem I have with the reopen patches are that they introduce regressions and change at semantics for a management tool. If you look at the libvirt workflow with encrypted disks, it would break with the reopen patches.
Yes, this is nasty. But on the other hand: Today migration is broken for all qcow2 images, with the reopen it's only broken for encrypted ones. Certainly an improvement, even though there's still a bug left.
This sounds like a good thing to work through in the next release.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Given that, I don't think this is a candidate for 1.0.
Nobody says it's a regression, but it's a bad bug and you're blocking a solution for it for over a year now because the solution isn't perfect enough in your eyes. :-(
This patch was posted a year ago. Feedback was provided and there was never any follow up[1]. I've never Nack'd this approach. I can't see how I was blocking this since I never even responded in the thread. If this came in before soft freeze, I wouldn't have objected if you wanted to go in this direction.
This is not a bug fix, this is a new feature. We're long past feature freeze. It's not a simple and obvious fix either. It only partially fixes the problem and introduces other problems. It's not a good candidate for making an exception at this stage in the release.
[1] http://mid.gmane.org/cover.1294150511.git.quintela@redhat.com
Then please send a fix that fails migration with non-raw images. Not breaking images silently during migration is critical for 1.0, IMO. Kevin

On 11/11/2011 08:44 AM, Kevin Wolf wrote:
Am 11.11.2011 15:35, schrieb Anthony Liguori:
This is not a bug fix, this is a new feature. We're long past feature freeze. It's not a simple and obvious fix either. It only partially fixes the problem and introduces other problems. It's not a good candidate for making an exception at this stage in the release.
[1] http://mid.gmane.org/cover.1294150511.git.quintela@redhat.com
Then please send a fix that fails migration with non-raw images. Not breaking images silently during migration is critical for 1.0, IMO.
I sent a quick series. If you want to do things differently for the blocker layer, like add a field to BlockDriver to indicate whether migration is supported and register the blocker in the core code, feel free to do that.

Regards,

Anthony Liguori
Kevin
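For reference, the "register the blocker in the core code" idea can be pictured roughly as below. The function names and the string-based interface are simplifications for illustration, not the signatures used in the series that was actually posted.

    /* Sketch: a format that cannot migrate safely registers a reason, and
     * migration refuses to start while any blocker is registered. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct Blocker {
        char *reason;
        struct Blocker *next;
    } Blocker;

    static Blocker *blockers;

    void migrate_add_blocker(const char *reason)
    {
        Blocker *b = malloc(sizeof(*b));
        b->reason = strdup(reason);
        b->next = blockers;
        blockers = b;
    }

    /* Called at the start of the 'migrate' command. */
    int migrate_check_blockers(void)
    {
        if (blockers) {
            fprintf(stderr, "migration blocked: %s\n", blockers->reason);
            return -1;
        }
        return 0;
    }

    /* A format driver that doesn't support live migration registers a
     * blocker when an image is opened. */
    void example_format_open(void)
    {
        migrate_add_blocker("image format does not support live migration");
    }

This is the "fail migration with non-raw images" behaviour Kevin asks for: instead of silently corrupting the image, the migrate command errors out.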

On 11/11/2011 04:03 PM, Anthony Liguori wrote:
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Was there ever a statement to that effect? It was never clear to me and I doubt it was clear to anyone.
Given that, I don't think this is a candidate for 1.0.
Let's just skip 1.0 and do 1.1 instead.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

On 11/12/2011 04:27 AM, Avi Kivity wrote:
On 11/11/2011 04:03 PM, Anthony Liguori wrote:
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Was there ever a statement to that effect? It was never clear to me and I doubt it was clear to anyone.
You literally reviewed a patch whose subject was "block: allow migration to work with image files"[1] that explained in gory detail what the problem was.

[1] http://mid.gmane.org/4C8CAD7C.5020102@redhat.com
Given that, I don't think this is a candidate for 1.0.
Let's just skip 1.0 and do 1.1 instead.
Let's stop being overly dramatic. You know as well as anyone that image format support up until the coroutine conversion has had enough problems that no one could practically be using them in a production environment. Live migration is an availability feature. Up until the 1.0 release, if you cared about availability and correctness, you would not be using an image format. Regards, Anthony Liguori

On 11/12/2011 03:39 PM, Anthony Liguori wrote:
On 11/12/2011 04:27 AM, Avi Kivity wrote:
On 11/11/2011 04:03 PM, Anthony Liguori wrote:
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Was there ever a statement to that effect? It was never clear to me and I doubt it was clear to anyone.
You literally reviewed a patch who's subject was "block: allow migration to work with image files"[1] that explained in gory detail what the problem was.
Isn't a patch fixing a problem with migrating image files a statement that we do support migrating image files?
Given that, I don't think this is a candidate for 1.0.
Let's just skip 1.0 and do 1.1 instead.
Let's stop being overly dramatic. You know as well as anyone that image format support up until the coroutine conversion has had enough problems that no one could practically be using them in a production environment.
They are used in production environments.
Live migration is an availability feature. Up until the 1.0 release, if you cared about availability and correctness, you would not be using an image format.
Nevertheless, people who care about both availability and correctness do use image formats. In reality, migration and image formats are critical features for virtualization workloads. Pretending they're not makes the 1.0 release a joke.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

On 11/12/2011 08:43 AM, Avi Kivity wrote:
On 11/12/2011 03:39 PM, Anthony Liguori wrote:
On 11/12/2011 04:27 AM, Avi Kivity wrote:
On 11/11/2011 04:03 PM, Anthony Liguori wrote:
I don't view not supporting migration with image formats as a regression as it's never been a feature we've supported. While there might be confusion about support around NFS, I think it's always been clear that image formats cannot be used.
Was there ever a statement to that effect? It was never clear to me and I doubt it was clear to anyone.
You literally reviewed a patch who's subject was "block: allow migration to work with image files"[1] that explained in gory detail what the problem was.
Isn't a patch fixing a problem with migrating image files a statement that we do support migrating image files?
You know, we could go 9 rounds about this and it really doesn't matter. For 1.0, I feel very strongly that we cannot change the semantics of migration with raw as dramatically as has been proposed. That's a huge regression risk.

But since live migration with qcow2 has never worked, there's really not a lot of harm in adding something that makes qcow2 with live migration work better than it does right now. I just sent out a series that does this. It's Kevin's original idea, since actually reopening the file doesn't help anything.

The only thing that's different from what I expect Kevin would want is that this is restricted to qcow2. I want this restriction for 1.0. Once 1.1 opens up, if Kevin wants to promote that code to generic block layer code, that's fine with me.

It's also included as part of the migration blockers series so that we are very explicit about when we don't support migration for given image formats.

Regards,

Anthony Liguori
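A qcow2-restricted invalidation along these lines could look roughly like the sketch below. The structure fields and layout are simplified assumptions (byte-swapping and the other metadata caches are omitted); this is not the code in the posted series.

    /* Simplified sketch of a qcow2-only "invalidate caches" step: forget
     * the cached L1 table and re-read it from the file, without reopening
     * the file descriptor. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct QcowState {
        int       fd;
        uint64_t *l1_table;   /* cached L1 table */
        size_t    l1_size;    /* number of L1 entries */
        off_t     l1_offset;  /* where the L1 table lives in the file */
    } QcowState;

    /* Called on the incoming side after the migration stream is consumed,
     * before the guest's CPUs are started. */
    int qcow2_invalidate_cache(QcowState *s)
    {
        ssize_t len = s->l1_size * sizeof(uint64_t);

        /* Drop metadata cached before/during migration... */
        free(s->l1_table);
        s->l1_table = malloc(len);
        if (!s->l1_table) {
            return -1;
        }
        /* ...and re-read it so we see the source's final writes. */
        if (pread(s->fd, s->l1_table, len, s->l1_offset) != len) {
            return -1;
        }
        return 0;
    }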

On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?

What's wrong with just delaying the open?

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

Am 12.11.2011 11:25, schrieb Avi Kivity:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
Not sure what results and what computation you mean, but let me clarify a bit about bdrv_reopen:

The main purpose of bdrv_reopen() is to change flags, for example toggle O_SYNC during runtime in order to allow the guest to toggle WCE. This doesn't necessarily mean a close()/open() sequence if there are other means to change the flags, like fcntl() (or even using other protocols than files).

The idea here was to extend this to invalidate all caches if some specific flag is set. As you don't change any other flag, this will usually not be a reopen on a lower level.

If we need to use open() though, and it fails (this is really the only "different" result that comes to mind) then bdrv_reopen() would fail and the old fd would stay in use. Migration would have to fail, but I don't think this case is ever needed for reopening after migration.
What's wrong with just delaying the open?
Nothing, except that with today's code it's harder to do. Kevin
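The flag-change path Kevin describes can be sketched like this for the file protocol. The helper name and the exact set of fcntl-changeable flags are illustrative assumptions, and real code would also have to preserve the other status flags; it is a sketch of the idea, not block/raw-posix code.

    /* Flip flags in place when the OS allows it, fall back to a real
     * open() otherwise, and keep the old descriptor if that open() fails. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Status flags Linux lets us change with F_SETFL on an open fd. */
    #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_DIRECT | O_NOATIME)

    int raw_reopen(int *fd, const char *path, int old_flags, int new_flags)
    {
        if (((old_flags ^ new_flags) & ~SETFL_MASK) == 0) {
            /* Only fcntl-changeable bits differ (e.g. toggling O_DIRECT);
             * no close()/open() is needed. */
            return fcntl(*fd, F_SETFL, new_flags & SETFL_MASK);
        }

        /* Something like O_SYNC changed, so a real open() is needed.  Open
         * the new fd first: if it fails, the caller keeps using *fd. */
        int new_fd = open(path, new_flags);
        if (new_fd < 0) {
            return -1;
        }
        close(*fd);
        *fd = new_fd;
        return 0;
    }

This also shows why a failed reopen is recoverable: the old descriptor is only closed after the new one has been obtained.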

On Mon, Nov 14, 2011 at 10:58:16AM +0100, Kevin Wolf wrote:
Am 12.11.2011 11:25, schrieb Avi Kivity:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
Not sure what results and what computation you mean, but let me clarify a bit about bdrv_reopen:
The main purpose of bdrv_reopen() is to change flags, for example toggle O_SYNC during runtime in order to allow the guest to toggle WCE. This doesn't necessarily mean a close()/open() sequence if there are other means to change the flags, like fcntl() (or even using other protocols than files).
The idea here was to extend this to invalidate all caches if some specific flag is set. As you don't change any other flag, this will usually not be a reopen on a lower level.
If we need to use open() though, and it fails (this is really the only "different" result that comes to mind) then bdrv_reopen() would fail and the old fd would stay in use. Migration would have to fail, but I don't think this case is ever needed for reopening after migration.
What's wrong with just delaying the open?
Nothing, except that with today's code it's harder to do.
Kevin
It seems cleaner, though, doesn't it? -- MST

On 11/14/2011 11:58 AM, Kevin Wolf wrote:
Am 12.11.2011 11:25, schrieb Avi Kivity:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
Not sure what results and what computation you mean,
Result = open succeeded. Computation = anything that derives from the image, like size, or reading some stuff to guess CHS or something.
but let me clarify a bit about bdrv_reopen:
The main purpose of bdrv_reopen() is to change flags, for example toggle O_SYNC during runtime in order to allow the guest to toggle WCE. This doesn't necessarily mean a close()/open() sequence if there are other means to change the flags, like fcntl() (or even using other protocols than files).
The idea here was to extend this to invalidate all caches if some specific flag is set. As you don't change any other flag, this will usually not be a reopen on a lower level.
If we need to use open() though, and it fails (this is really the only "different" result that comes to mind)
(yes)
then bdrv_reopen() would fail and the old fd would stay in use. Migration would have to fail, but I don't think this case is ever needed for reopening after migration.
Okay.
What's wrong with just delaying the open?
Nothing, except that with today's code it's harder to do.
This has never stopped us (though it may delay us).

--
error compiling committee.c: too many arguments to function

On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to roll back to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.

Daniel

On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.

On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'.

Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.

Regards,

Daniel

Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?

I'm asking because for avoiding the former, things like access() could be enough, whereas for the latter we'd have to do a full open.

Kevin
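The distinction between the two failure classes can be illustrated with a small sketch; precheck_image() is a hypothetical helper used here for illustration, not a proposed QEMU function.

    /* Configuration mistakes (missing file, bad permissions, read-only
     * mount) can be caught cheaply with access(), while problems inside
     * the image only show up on a full open and probe. */
    #include <stdio.h>
    #include <unistd.h>

    int precheck_image(const char *path, int writable)
    {
        int mode = R_OK | (writable ? W_OK : 0);

        if (access(path, mode) != 0) {
            perror(path);   /* ENOENT, EACCES, EROFS, ... */
            return -1;
        }
        /* The image exists and is accessible, but nothing is known yet
         * about whether its contents (e.g. a qcow2 header) are valid. */
        return 0;
    }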

On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori: > Live migration with qcow2 or any other image format is just not going to work > right now even with proper clustered storage. I think doing a block level flush > cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead.

Daniel

On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote: > Am 10.11.2011 22:30, schrieb Anthony Liguori: >> Live migration with qcow2 or any other image format is just not going to work >> right now even with proper clustered storage. I think doing a block level flush >> cache interface and letting block devices decide how to do it is the best approach. > > I would really prefer reusing the existing open/close code. It means > less (duplicated) code, is existing code that is well tested and doesn't > make migration much of a special case. > > If you want to avoid reopening the file on the OS level, we can reopen > only the topmost layer (i.e. the format, but not the protocol) for now > and in 1.1 we can use bdrv_reopen(). >
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start any, since QEMU is likely to just report them to the guest OS instead.
Daniel
Do you run qemu with -S, then give a 'cont' command to start it?

On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: > On 11/11/2011 12:15 PM, Kevin Wolf wrote: >> Am 10.11.2011 22:30, schrieb Anthony Liguori: >>> Live migration with qcow2 or any other image format is just not going to work >>> right now even with proper clustered storage. I think doing a block level flush >>> cache interface and letting block devices decide how to do it is the best approach. >> >> I would really prefer reusing the existing open/close code. It means >> less (duplicated) code, is existing code that is well tested and doesn't >> make migration much of a special case. >> >> If you want to avoid reopening the file on the OS level, we can reopen >> only the topmost layer (i.e. the format, but not the protocol) for now >> and in 1.1 we can use bdrv_reopen(). >> > > Intuitively I dislike _reopen style interfaces. If the second open > yields different results from the first, does it invalidate any > computations in between? > > What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start any, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Yes

Daniel

On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote: > On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: >> On 11/11/2011 12:15 PM, Kevin Wolf wrote: >>> Am 10.11.2011 22:30, schrieb Anthony Liguori: >>>> Live migration with qcow2 or any other image format is just not going to work >>>> right now even with proper clustered storage. I think doing a block level flush >>>> cache interface and letting block devices decide how to do it is the best approach. >>> >>> I would really prefer reusing the existing open/close code. It means >>> less (duplicated) code, is existing code that is well tested and doesn't >>> make migration much of a special case. >>> >>> If you want to avoid reopening the file on the OS level, we can reopen >>> only the topmost layer (i.e. the format, but not the protocol) for now >>> and in 1.1 we can use bdrv_reopen(). >>> >> >> Intuitively I dislike _reopen style interfaces. If the second open >> yields different results from the first, does it invalidate any >> computations in between? >> >> What's wrong with just delaying the open? > > If you delay the 'open' until the mgmt app issues 'cont', then you loose > the ability to rollback to the source host upon open failure for most > deployed versions of libvirt. We only fairly recently switched to a five > stage migration handshake to cope with rollback when 'cont' fails. > > Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start any, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Yes
Daniel
Probably in an attempt to improve reliability :) So this is in fact unrelated to migration. So we can either ignore this bug (assuming no distros ship cutting edge qemu with an old libvirt), or special-case -S and do an open/close cycle on startup.

On Mon, Nov 14, 2011 at 01:51:40PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: > On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote: >> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: >>> On 11/11/2011 12:15 PM, Kevin Wolf wrote: >>>> Am 10.11.2011 22:30, schrieb Anthony Liguori: >>>>> Live migration with qcow2 or any other image format is just not going to work >>>>> right now even with proper clustered storage. I think doing a block level flush >>>>> cache interface and letting block devices decide how to do it is the best approach. >>>> >>>> I would really prefer reusing the existing open/close code. It means >>>> less (duplicated) code, is existing code that is well tested and doesn't >>>> make migration much of a special case. >>>> >>>> If you want to avoid reopening the file on the OS level, we can reopen >>>> only the topmost layer (i.e. the format, but not the protocol) for now >>>> and in 1.1 we can use bdrv_reopen(). >>>> >>> >>> Intuitively I dislike _reopen style interfaces. If the second open >>> yields different results from the first, does it invalidate any >>> computations in between? >>> >>> What's wrong with just delaying the open? >> >> If you delay the 'open' until the mgmt app issues 'cont', then you loose >> the ability to rollback to the source host upon open failure for most >> deployed versions of libvirt. We only fairly recently switched to a five >> stage migration handshake to cope with rollback when 'cont' fails. >> >> Daniel > > I guess reopen can fail as well, so this seems to me to be an important > fix but not a blocker.
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start any, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Probably in an attempt to improve reliability :)
Not really. We can't simply let QEMU start its own CPUs, because there are various tasks that need performing after the migration transfer finishes, but before the CPUs are allowed to be started, e.g.:

 - Finish 802.1Qb{g,h} (VEPA) network port profile association on target
 - Release leases for any resources associated with the source QEMU via a configured lock manager (eg sanlock)
 - Acquire leases for any resources associated with the target QEMU via a configured lock manager (eg sanlock)

Daniel

On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'.
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start at all, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Yes
Daniel
OK, so let's go back one step now - how is this related to 'rollback to source host'? -- MST

On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'.
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start at all, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Yes
OK, so let's go back one step now - how is this related to 'rollback to source host'?
In the old libvirt migration protocol, by the time we run 'cont' on the destination, the source QEMU has already been killed off, so there's nothing to resume on failure. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Mon, Nov 14, 2011 at 11:58:14AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'.
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start at all, since QEMU is likely to just report them to the guest OS instead.
Do you run qemu with -S, then give a 'cont' command to start it?
Yes
OK, so let's go back one step now - how is this related to 'rollback to source host'?
In the old libvirt migration protocol, by the time we run 'cont' on the destination, the source QEMU has already been killed off, so there's nothing to resume on failure.
Daniel
I see. So again there are two solutions I see:
1. ignore old libvirt as it can't restart source reliably anyway
2. open files when migration is completed (after startup, but before cont)
-- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file?
Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start at all, since QEMU is likely to just report them to the guest OS instead.
If started correctly, QEMU should not report them to the guest OS, but pause instead. -- Gleb.

On Mon, Nov 14, 2011 at 11:08:02AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'.
Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Regards, Daniel
IIUC, the 'cont' that we were discussing is the startup of the VM at the destination after migration completes. A failure results in migration failure, which libvirt has been able to handle since forever. In case of the 'cont' command on the source upon migration failure, qemu was running there previously, so its configuration is likely OK. Am I confused? If not, libvirt seems unaffected.
-- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Delayed open isn't a panacea. With the series I sent, we should be able to migrate with a qcow2 file on coherent shared storage.
There are two other cases that we care about: migration with nfs cache!=none and direct attached storage with cache!=none.
With NFS, whether the open is deferred matters less than whether the open happens after the close on the source. To fix NFS cache!=none, we would have to do a bdrv_close() before sending the last byte of migration data and make sure that we bdrv_open() after receiving the last byte of migration data. The problem with this IMHO is it creates a large window where no one has the file open and you're critically vulnerable to losing your VM.
I'm much more in favor of a smarter caching policy. If we can fcntl() our way to O_DIRECT on NFS, that would be fairly interesting. I'm not sure if this is supported today but it's something we could look into adding in the kernel. That way we could force NFS to O_DIRECT during migration which would solve this problem robustly.
Deferred open doesn't help with direct attached storage. There simply is no guarantee that there isn't data in the page cache. Again, I think defaulting DAS to cache=none|directsync is what makes the most sense here. We can even add a migration blocker for DAS with cache=on. If we can do dynamic toggling of the cache setting, then that's pretty friendly at the end of the day.
Regards, Anthony Liguori
Daniel
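For concreteness, a minimal sketch (not QEMU code; the helper name and error handling are illustrative) of what "fcntl() our way to O_DIRECT" could look like on Linux. F_SETFL can toggle O_DIRECT on an already-open fd, though whether the NFS client honoured that at the time is exactly the open question above, and once the flag is set all I/O on the fd must be suitably aligned.

  #define _GNU_SOURCE           /* for O_DIRECT */
  #include <fcntl.h>

  /* Try to force O_DIRECT on an fd for the duration of migration.
   * Returns 0 on success, -1 if the kernel/filesystem refuses. */
  static int force_odirect(int fd)
  {
      int flags = fcntl(fd, F_GETFL);
      if (flags < 0) {
          return -1;
      }
      /* Callers must align buffers, offsets and lengths after this. */
      return fcntl(fd, F_SETFL, flags | O_DIRECT);
  }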

Anthony Liguori <anthony@codemonkey.ws> wrote:
On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().
Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.
Delayed open isn't a panacea. With the series I sent, we should be able to migrate with a qcow2 file on coherent shared storage.
There are two other cases that we care about: migration with nfs cache!=none and direct attached storage with cache!=none
With NFS, whether the open is deferred matters less than whether the open happens after the close on the source. To fix NFS cache!=none, we would have to do a bdrv_close() before sending the last byte of migration data and make sure that we bdrv_open() after receiving the last byte of migration data.
The problem with this IMHO is it creates a large window where no one has the file open and you're critically vulnerable to losing your VM.
A Red Hat NFS guru told us that fsync() on the source + open() after that on the target is enough. But anyway, it still depends on nothing else having the file open on the target.
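To make that ordering explicit, a small sketch in plain POSIX terms (paths, fds and function names are placeholders, not QEMU's actual code): close-to-open consistency only promises that an open() which happens after the writer's close() sees the writer's data, so the conservative sequence is fsync + close on the source before the destination opens; the claim above is that the fsync alone may already be enough.

  #include <fcntl.h>
  #include <unistd.h>

  /* Source host: run before telling the destination it may take over. */
  static void source_handoff(int fd)
  {
      fsync(fd);    /* push dirty data and metadata to the NFS server */
      close(fd);    /* close-to-open: later opens now see our writes */
  }

  /* Destination host: run only after the source has handed off. */
  static int destination_open(const char *path)
  {
      return open(path, O_RDWR);   /* revalidates against the server */
  }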
I'm much more in favor of a smarter caching policy. If we can fcntl() our way to O_DIRECT on NFS, that would be fairly interesting. I'm not sure if this is supported today but it's something we could look into adding in the kernel. That way we could force NFS to O_DIRECT during migration which would solve this problem robustly.
We would need O_DIRECT on the target during migration; I agree that that would work.
Deferred open doesn't help with direct attached storage. There simply is no guarantee that there isn't data in the page cache.
Yeap, I asked the clustered filesystem people how they fixed the problem, because clustered filesystems have this problem too, right? After lots of arm twisting, I got the ioctl(BLKFLSBUF,...), but that only works:
- on Linux
- on some block devices
So, we are back to square one.
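For reference, the ioctl in question looks roughly like this (a standalone sketch, not QEMU code). BLKFLSBUF flushes the host's buffer cache for a block device; it is Linux-only, works only on block devices, and typically needs CAP_SYS_ADMIN, which is part of why it doesn't generalise.

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>      /* BLKFLSBUF */

  int main(int argc, char **argv)
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s /dev/some-block-device\n", argv[0]);
          return 1;
      }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0 || ioctl(fd, BLKFLSBUF, 0) < 0) {
          perror("BLKFLSBUF");
          return 1;
      }
      close(fd);
      return 0;
  }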
Again, I think defaulting DAS to cache=none|directsync is what makes the most sense here.
I think it is the only sane solution. Otherwise, we need to write the equivalent of a lock manager, to know _who_ has the storage, and distributed lock managers are a mess :-(
We can even add a migration blocker for DAS with cache=on. If we can do dynamic toggling of the cache setting, then that's pretty friendly at the end of the day.
That could fix the problem also. At the moment that we start migration, we do an fsync() + switch to O_DIRECT for all filesystems. As you said, time for implementing fcntl(O_DIRECT). Later, Juan.

On 11/15/2011 07:20 AM, Juan Quintela wrote:
Again, I think defaulting DAS to cache=none|directsync is what makes the most sense here.
I think it is the only sane solution. Otherwise, we need to write the equivalent of a lock manager, to know _who_ has the storage, and distributed lock managers are a mess :-(
We can even add a migration blocker for DAS with cache=on. If we can do dynamic toggling of the cache setting, then that's pretty friendly at the end of the day.
That could fix the problem also. At the moment that we start migration, we do an fsync() + switch to O_DIRECT for all filesystems.
As you said, time for implementing fcntl(O_DIRECT).
Yeah, I think this ends up being a very elegant solution. We always open block devices with O_DIRECT to start with. That ensures reads go directly to disk if it's DAS, or result in NFS protocol reads. As long as we fsync on the source (and we do), then we're okay. For cache=write{back,through}, we would then just fcntl() away O_DIRECT as soon as we start the guest. Then we can start doing reads through the page cache. Regards, Anthony Liguori
Later, Juan.
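A sketch of that lifecycle (function names are placeholders, not QEMU's block layer): open with O_DIRECT so nothing is read through the destination's page cache until migration completes, then clear the flag with F_SETFL once the guest starts if the user asked for cache=writeback or cache=writethrough.

  #define _GNU_SOURCE           /* for O_DIRECT */
  #include <fcntl.h>

  /* Destination start-up: keep the page cache out of the picture. */
  static int open_for_incoming(const char *path)
  {
      return open(path, O_RDWR | O_DIRECT);
  }

  /* At 'cont', relax back to cached I/O if that is what was requested. */
  static int relax_cache_mode(int fd)
  {
      int flags = fcntl(fd, F_GETFL);
      if (flags < 0) {
          return -1;
      }
      return fcntl(fd, F_SETFL, flags & ~O_DIRECT);
  }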
participants (7)
- Anthony Liguori
- Avi Kivity
- Daniel P. Berrange
- Gleb Natapov
- Juan Quintela
- Kevin Wolf
- Michael S. Tsirkin