[libvirt] libvirt(-java): How to make vm.getDomain().getJobInfo() thread safe?

Hi,

this is kind of a follow-up to an older question/discussion: https://www.redhat.com/archives/libvir-list/2010-July/msg00267.html

As a result of that, I use a second thread to monitor the live migration and take action (setting maxdowntime to a value that fits the situation) when necessary.

Although I call getJobInfo() at a fairly low frequency (once a second), problems occur frequently, roughly every 10th to 15th live migration. They range from exceptions saying the domain is not running anymore to complete JVM crashes -> http://pastebin.com/jT6sXubu

Recovery from the exceptions doesn't seem to work perfectly either, as they seem to prevent connections to a host from being shut down properly because there are still open references.

Of course, in every iteration of my monitoring thread I check whether the domain object is non-null, whether it is still active, whether the jobInfo is available yet, etc. But since I cannot synchronize with vm.migrate(), there is still a reasonable chance that migrate() invalidates the current domain just while I'm accessing it, no matter what I do.

Am I missing something, or is that correct? Any ideas how to solve this reliably? Is there some experience from virt-manager, where (in my rather old version) I assume the domain is at least read for CPU etc. stats during live migration...

regards,
thomas
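A minimal sketch of the monitoring pattern described above, assuming a separate thread that polls once a second and adjusts the maximum downtime. MigrationJob, MigrationMonitor and the downtime policy (remaining data divided by the rate observed over the last interval, clamped to 30 to 2000 ms) are hypothetical; with libvirt-java the two interface methods would presumably be backed by Domain.getJobInfo() and Domain.migrateSetMaxDowntime(). The interface only exists so the sketch compiles without a hypervisor.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the parts of a libvirt-java Domain used by the monitor. */
interface MigrationJob {
    /** Bytes still to be transferred (assumed unit). */
    long dataRemaining() throws Exception;

    /** Requested maximum downtime in milliseconds. */
    void setMaxDowntime(long ms) throws Exception;
}

final class MigrationMonitor {
    private final MigrationJob job;
    private long lastRemaining = -1;   // previous sample, -1 means none yet
    private long lastDowntimeMs = -1;  // last value pushed to the job
    private volatile boolean stopped = false;

    MigrationMonitor(MigrationJob job) { this.job = job; }

    /** Hypothetical policy: downtime needed to send the rest at the observed rate, clamped. */
    static long chooseMaxDowntimeMs(long remainingBytes, long bytesPerSecond, long minMs, long maxMs) {
        if (bytesPerSecond <= 0) return maxMs; // no progress measured: allow the maximum
        long ms = remainingBytes * 1000 / bytesPerSecond;
        return Math.max(minMs, Math.min(maxMs, ms));
    }

    /** One polling step; meant to run once a second on the monitoring thread. */
    void tick() {
        if (stopped) return;
        try {
            long remaining = job.dataRemaining();
            if (lastRemaining >= 0) {
                long rate = lastRemaining - remaining; // bytes moved during the last 1 s interval
                long downtime = chooseMaxDowntimeMs(remaining, rate, 30, 2000);
                if (downtime != lastDowntimeMs) {
                    job.setMaxDowntime(downtime);
                    lastDowntimeMs = downtime;
                }
            }
            lastRemaining = remaining;
        } catch (Exception e) {
            // The job ended or the domain disappeared underneath us: stop polling, don't retry.
            stopped = true;
        }
    }

    boolean isStopped() { return stopped; }

    /** Schedules tick() every second; the caller must shutdown() the returned executor. */
    ScheduledExecutorService start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::tick, 0, 1, TimeUnit.SECONDS);
        return ses;
    }
}
```

Any exception from the probe is treated as "the job is over" and stops polling, so a failed call is not retried against a domain that migrate() may already have invalidated. This does not by itself remove the race discussed in this thread; keeping the policy in a pure static method just makes it testable without a hypervisor.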

On 09/14/2011 02:20 PM, Thomas Treutner wrote:
complete JVM crashes -> http://pastebin.com/jT6sXubu
With this message on the CLI: java: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.

On 09/16/2011 10:16 AM, Thomas Treutner wrote:
With this message on the CLI:
java: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.
Different kind of error: http://pastebin.com/ujFd4Lw9

On Wed, Sep 14, 2011 at 02:20:46PM +0200, Thomas Treutner wrote:
Of course, in every iteration of my monitoring thread I check whether the domain object is non-null, whether it is still active, whether the jobInfo is available yet, etc. But since I cannot synchronize with vm.migrate(), there is still a reasonable chance that migrate() invalidates the current domain just while I'm accessing it, no matter what I do.
At the C level every API in libvirt is threadsafe. The only key point is that if you use objects (e.g. virDomainPtr) from multiple threads, you ought to hold an extra reference on them (virDomainRef) per thread, to ensure that one thread does not delete an object that is in use by the other thread.

At the Java level, this reference handling ought to be working automatically, so you wouldn't need to do anything special to safely do migration with 2 threads as you describe. So I don't really have any explanation for what you see.

Daniel
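The per-thread reference rule can be modelled in a few lines of Java. RefCountedDomain is a hypothetical class, not libvirt's implementation: ref() stands in for virDomainRef(), free() for virDomainFree(), and the object counts as released only when the last holder drops its reference.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical model of virDomainRef()/virDomainFree() semantics:
 * the underlying object is released only when the last holder lets go.
 */
final class RefCountedDomain {
    private final AtomicInteger refs = new AtomicInteger(1); // the creator's reference
    private volatile boolean released = false;
    private final String name;

    RefCountedDomain(String name) { this.name = name; }

    /** Analogue of virDomainRef(): take an extra reference, e.g. before handing it to another thread. */
    RefCountedDomain ref() {
        while (true) {
            int n = refs.get();
            if (n == 0) throw new IllegalStateException("already freed");
            if (refs.compareAndSet(n, n + 1)) return this; // increment only while still alive
        }
    }

    /** Analogue of virDomainFree(): drop one reference and release on the last one. */
    void free() {
        int n = refs.decrementAndGet();
        if (n == 0) released = true;
        else if (n < 0) throw new IllegalStateException("double free");
    }

    boolean isReleased() { return released; }

    /** Any use after the final free() is an error, like touching a freed virDomainPtr. */
    String name() {
        if (released) throw new IllegalStateException("use after free");
        return name;
    }
}
```

A thread that is handed the object calls ref() before it starts and free() when it is done, so whichever thread finishes last performs the actual release and the other thread can never see a freed object.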

On 09/16/2011 10:44 AM, Daniel P. Berrange wrote:
At the C level every API in libvirt is threadsafe. The only key point is that if you use objects (e.g. virDomainPtr) from multiple threads, you ought to hold an extra reference on them (virDomainRef) per thread, to ensure that one thread does not delete an object that is in use by the other thread.
At the Java level, this reference handling ought to be working automatically, so you wouldn't need to do anything special to safely do migration with 2 threads as you describe. So I don't really have any explanation for what you see.
I think that pointed me in the right direction, thanks. I'm now using an additional, temporary domain (Java) object in the monitoring thread. The tests haven't been running for long yet, but it looks promising. It's been a while since I poked around in libvirt-java, but could it be that a reference is held per Java domain object, rather than per thread?

regards,
thomas
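The workaround of giving the monitoring thread its own short-lived domain object can be sketched with try-with-resources, so the extra handle is always released even when a probe throws. DomainHandle, the lookup function and withOwnHandle are hypothetical; with libvirt-java the lookup would presumably be Connect.domainLookupByUUIDString(...) and the release Domain.free(), but that mapping is an assumption, not something confirmed in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-thread handle standing in for a separately looked-up libvirt-java Domain. */
final class DomainHandle implements AutoCloseable {
    private final String uuid;
    private final AtomicInteger openHandles; // shared counter, only used to detect leaks
    private boolean closed = false;          // handle is used by one thread only

    DomainHandle(String uuid, AtomicInteger openHandles) {
        this.uuid = uuid;
        this.openHandles = openHandles;
        openHandles.incrementAndGet();
    }

    String uuid() { return uuid; }

    /** Stand-in for Domain.free() (assumed mapping); idempotent so a double close is harmless. */
    @Override
    public void close() {
        if (!closed) {
            closed = true;
            openHandles.decrementAndGet();
        }
    }
}

final class MonitorWithOwnHandle {
    /** Looks up a fresh handle, runs one probe with it, and always releases it afterwards. */
    static <T> T withOwnHandle(String uuid,
                               Function<String, DomainHandle> lookup,
                               Function<DomainHandle, T> probe) {
        try (DomainHandle h = lookup.apply(uuid)) {
            return probe.apply(h);
        }
    }
}
```

The open-handle counter is only there to check that nothing leaks, which corresponds to the "still open references" symptom described in the first message.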