On 02/14/2017 08:13 AM, Blair Bethwaite wrote:
> Hi all,
>
> In IRC last night Dan helpfully confirmed my analysis of an issue we are
> seeing attempting to launch high-memory KVM guests backed by hugepages...
>
> In this case the guests have 240GB of memory allocated from two host NUMA
> nodes to two guest NUMA nodes. The trouble is that allocating the
> hugepage-backed qemu process seems to take longer than the 30s
> QEMU_JOB_WAIT_TIME, and so libvirt then most unhelpfully kills the barely
> spawned guest. Dan said there was currently no workaround available, so
> I'm now looking at building a custom libvirt which sets
> QEMU_JOB_WAIT_TIME=60s.
I don't think I understand this. Who is running the other job? I mean,
I'd expect qemu to fail to create the socket and thus hit the 30s timeout
in qemuMonitorOpenUnix().
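For reference, the constant being discussed is a compile-time value in the
QEMU driver (I believe in src/qemu/qemu_domain.c, though the exact location
should be checked against your tree). The custom build described above would
be roughly this one-line change:

```diff
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
-#define QEMU_JOB_WAIT_TIME (1000ull * 30)
+#define QEMU_JOB_WAIT_TIME (1000ull * 60)
```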
> I have two related questions:
>
> 1) will this change have any untoward side-effects?
Since this timeout is shared with other jobs, you might have to wait a
bit longer for an API to return with an error if a domain is stuck and
unresponsive.
> 2) if not, then is there any reason not to change it in master until a
> better solution comes along (or possibly better, alter
> qemuDomainObjBeginJobInternal to give a domain start job a little longer
> compared to other jobs)?
It's a trade-off between the "responsiveness" of a libvirt API and being
able to talk to a qemu which is under heavy load. From libvirt's POV we
are unable to tell whether qemu is doing something or is stuck (e.g.
looping endlessly). So far, we have felt that 30 seconds is a good choice,
but I don't mind being proven wrong.
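The per-job variant suggested in question 2 would be a small change where
qemuDomainObjBeginJobInternal computes its deadline. A sketch only: whether
keying on the async job is the right discriminator here, and whether
QEMU_ASYNC_JOB_START is the correct enum value in your version, would need
checking against the sources.

```diff
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ qemuDomainObjBeginJobInternal
-    then = now + QEMU_JOB_WAIT_TIME;
+    /* hypothetical: give the domain start job extra headroom */
+    if (asyncJob == QEMU_ASYNC_JOB_START)
+        then = now + QEMU_JOB_WAIT_TIME * 2;
+    else
+        then = now + QEMU_JOB_WAIT_TIME;
```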
Michal