|
This post was updated on .
We are pretty sure we have tracked down one of the reasons for channels being terminated in the middle of builds.
As seen in a few JIRA issues: hudson.util.IOException2: Failed to join the process at hudson.Proc$RemoteProc.join(Proc.java:312) at hudson.Launcher$ProcStarter.join(Launcher.java:280) at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83) [...] Caused by: hudson.remoting.ChannelClosedException: channel is already closed at hudson.remoting.Channel.send(Channel.java:412) at hudson.remoting.Request.call(Request.java:105) at hudson.remoting.Channel.call(Channel.java:555) [...] Our build system have 100+ slaves. Every now and then we lose important builds due to the connections being lost to seemingly random build machines. We have tried various keep alive options to ensure that the connection to the build machine is always kept open. With no real success. For us, this is a critical flaw. But what we have seen lately, is that a random number of slaves lose their connection whenever the GC starts/stops working. This is a reproducible error. The reason seems to be that the GC takes 90+ seconds to complete during heavy system load. During this time the GC freezes the Hudson process and cannot receive the ping being sent through the PingThread executed in Launcher.java or Engine.java. When the GC is done, the ping (if executed prior to the GC starts working) directly times out and the channel is closed unconditionally. I guess this is the reason Hudson forcibly terminates the channel. If not, the closed channel is still directly related to the GC taking 60sec+ to complete its work. Probably the same issue that can be seen in this JIRA: http://issues.hudson-ci.org/browse/HUDSON-6566 Can we please get a way to optionally disable the ping, both from master side and slave side? At least for verification purposes. Br Gustaf Lundh - Sony Ericsson |
|
- John On 7/7/10 4:20 AM, "Lundh, Gustaf" <Gustaf.Lundh@...> wrote: We are pretty sure we have tracked down one of the reasons for channels being terminated in the middle of builds. |
|
In reply to this post by GLundh
I've made a few patches in my Hudson instance to deal with this
(increase ping timeout, reduce JUnit history). Try building it from http://github.com/huybrechts/hudson/tree/production (it is based on source of May 28). The cobertura plugin is also really good at eating all your memory. Either disable it, or patch it to disable history graphs. Have a look at your build results too: e.g. do you really need those 50.000 low priority FindBugs warnings that nobody ever looks at ? Tom On Wed, Jul 7, 2010 at 1:20 PM, Lundh, Gustaf <[hidden email]> wrote: > We are pretty sure we have tracked down one of the reasons for channels > being terminated in the middle of builds. > > > > As seen in a few JIRA issues: > > > > hudson.util.IOException2: Failed to join the process > at hudson.Proc$RemoteProc.join(Proc.java:312) > at hudson.Launcher$ProcStarter.join(Launcher.java:280) > at > hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83) > > [...] > > Caused by: hudson.remoting.ChannelClosedException: channel is already closed > at hudson.remoting.Channel.send(Channel.java:412) > at hudson.remoting.Request.call(Request.java:105) > at hudson.remoting.Channel.call(Channel.java:555) > > [...] > > > > Our build system have 100+ slaves. Every now and then we lose important > builds due to the connections being lost to seemingly random build machines. > > > > We have tried various keep alive options to ensure that the connection to > the build machine is always kept open. With no real success. For us, this is > a critical flaw. > > > > But what we have seen lately, is that a random number of slaves lose their > connection whenever the GC starts/stops working. This is a reproducible > error. > > > > The reason seems to be that the GC takes 90+ seconds to complete during > heavy system load. During this time the GC freezes the Hudson process and > cannot receive the ping being sent through the PingThread executed in > Launcher.java or Engine.java. When the GC is done, the ping (if executed > prior to the GC starts working) directly times out and the channel is closed > unconditionally. I guess this is the reason Hudson forcibly terminates the > channel. If not, the closed channel is still directly related to the GC > taking 60sec+ to complete its work. > > > > Probably the same issue that can be seen in this JIRA: > > http://issues.hudson-ci.org/browse/HUDSON-6566 > > > > Can we please get a way to optionally disable the ping, both from master > side and slave side? At least for verification purposes. > > > > Br > > Gustaf Lundh - Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by John Pederzolli
Hi, John! No, currently we only set Max/Min Heap to decent values and run the VM with -XX:+UseParallelGC. As I understand the Sun JavaVM documentation, the Concurrent GC still will pause the application threads during the major collections. And from the VM FAQs I have also gotten the impression that the Concurrent GC also brings a quite hard performance penalty. However, I will of course investigate your suggestion further and hopefully be able to give it a try for evaluation purposes in our environment. Thanks for the suggestion! Further on, I would also like to take this post as an opportunity, to add my question from from the JIRA issue to the mail thread: Is there a possibility that the channel can get so _congested_ from one (or many) ongoing heavy tasks (file copy/searching/parsing/server load/etc), that the PingThread cannot complete its request on the same channel - and therefor actually terminate an healthy connection? Best Regards Gustaf Lundh - Sony Ericsson |
|
On 7/7/10 1:48 PM, "GLundh" <[hidden email]> wrote:
> Further on, I would also like to take this post as an opportunity, to add my > question from from the JIRA issue to the mail thread: > > Is there a possibility that the channel can get so _congested_ from one (or > many) ongoing heavy tasks (file copy/searching/parsing/server load/etc), > that the PingThread cannot complete its request on the same channel - and > therefor actually terminate an healthy connection? In my experience, this is very possible. I diagnosed as much as part of this: http://issues.hudson-ci.org/browse/HUDSON-5413 -- Dean > > Best Regards > Gustaf Lundh - Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by Tom Huybrechts
Thank you! I will download the code and give it a run. How delightful that you actually had a commit with the following message: "increase slave connection timeout to 4 minutes, to account for long full GCs on server" Still, let us continue the discussion. I though feel we need a more permanent fix in the official trunk. Any other core developers that want to comment on the issue? Since this ping issue brings more problems than it solves, how about disabling ping by default and use the old --ping option to enable it...? (This counts for both launcher.java and engine.java) Or at least let us increase the timeout to 4-5min. I'd be happy to submit a patch if someone can OK this. Br Gustaf Lundh - Sony Ericsson |
|
In reply to this post by Dean Yu
Hi, Dean!
Thanks for the interesting link. It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working. But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread. Suggestion: A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch. Br Gustaf Lundh - Sony Ericsson -----Original Message----- From: Dean Yu [mailto:[hidden email]] Sent: den 7 juli 2010 22:55 To: [hidden email]; Lundh, Gustaf Subject: Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. On 7/7/10 1:48 PM, "GLundh" <[hidden email]> wrote: > Further on, I would also like to take this post as an opportunity, to add my > question from from the JIRA issue to the mail thread: > > Is there a possibility that the channel can get so _congested_ from one (or > many) ongoing heavy tasks (file copy/searching/parsing/server load/etc), > that the PingThread cannot complete its request on the same channel - and > therefor actually terminate an healthy connection? In my experience, this is very possible. I diagnosed as much as part of this: http://issues.hudson-ci.org/browse/HUDSON-5413 -- Dean > > Best Regards > Gustaf Lundh - Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
For us, this issue is very critical.
I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default: Launcher.java: 86 // no-op, but left for backward compatibility 87 @Option(name="-ping") 88 public boolean ping = true; // <- Switch to false Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds. This _really_ needs to be fixed asap. Is everyone OK with me reverting this behavior? Kohsuke? Br Gustaf Lundh - Team Hudson @ Sony Ericsson -----Original Message----- From: Lundh, Gustaf [mailto:[hidden email]] Sent: den 8 juli 2010 09:58 To: Dean Yu; [hidden email] Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. Hi, Dean! Thanks for the interesting link. It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working. But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread. Suggestion: A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch. Br Gustaf Lundh - Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Hello Gustaf,
"IOException2/ChannelClosedException" issue has been bothering us for a couple of months now. I'm glad you found a potential fix. I'll look forward to this patch. Thanks, Amar -----Original Message----- From: Lundh, Gustaf [mailto:[hidden email]] Sent: Friday, July 09, 2010 5:21 AM To: [hidden email] Cc: Kohsuke Kawaguchi Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. For us, this issue is very critical. I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default: Launcher.java: 86 // no-op, but left for backward compatibility 87 @Option(name="-ping") 88 public boolean ping = true; // <- Switch to false Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds. This _really_ needs to be fixed asap. Is everyone OK with me reverting this behavior? Kohsuke? Br Gustaf Lundh - Team Hudson @ Sony Ericsson -----Original Message----- From: Lundh, Gustaf [mailto:[hidden email]] Sent: den 8 juli 2010 09:58 To: Dean Yu; [hidden email] Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. Hi, Dean! Thanks for the interesting link. It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working. But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread. Suggestion: A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch. Br Gustaf Lundh - Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by GLundh
+1 - the ping is a nice feature but not worth the pain of these issues.
A.
On Fri, Jul 9, 2010 at 5:21 AM, Lundh, Gustaf <[hidden email]> wrote: For us, this issue is very critical. |
|
In reply to this post by GLundh
I rather see an approach where the ping is retained, but the timeout
exception can be treated as non-fatal. In the case that a ping timeout is not treated as fatal, the Launcher should keep a running count of the number of failed pings and the start time for the outage. In an ideal world, these numbers would then be available via JMX for monitoring by a third system that watched the health of both the master and all slaves. -md begin Lundh, Gustaf quotation: > For us, this issue is very critical. > > I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default: > > Launcher.java: > 86 // no-op, but left for backward compatibility > 87 @Option(name="-ping") > 88 public boolean ping = true; // <- Switch to false > > Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds. > > This _really_ needs to be fixed asap. > > Is everyone OK with me reverting this behavior? Kohsuke? > > Br > Gustaf Lundh - Team Hudson @ Sony Ericsson > > > -----Original Message----- > From: Lundh, Gustaf [mailto:[hidden email]] > Sent: den 8 juli 2010 09:58 > To: Dean Yu; [hidden email] > Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. > > Hi, Dean! > > Thanks for the interesting link. > > It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working. > > > But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread. > > Suggestion: > A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch. > > Br > Gustaf Lundh - Sony Ericsson > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by GLundh
I'm committing this change for next week's release.. hope this will be
helpful: 1) change defaults from ping every 5 min with 1 min timeout, to ping every 10 min with 4 min timeout. 2) add hudson.remoting.Launcher.pingTimeoutSec and hudson.remoting.Launcher.pingIntervalSec system properties to override defaults. Setting either to <= 0 disables the ping thread. - Alan Lundh, Gustaf wrote: > For us, this issue is very critical. > > I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default: > > Launcher.java: > 86 // no-op, but left for backward compatibility > 87 @Option(name="-ping") > 88 public boolean ping = true; // <- Switch to false > > Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds. > > This _really_ needs to be fixed asap. > > Is everyone OK with me reverting this behavior? Kohsuke? > > Br > Gustaf Lundh - Team Hudson @ Sony Ericsson > > > -----Original Message----- > From: Lundh, Gustaf [mailto:[hidden email]] > Sent: den 8 juli 2010 09:58 > To: Dean Yu; [hidden email] > Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. > > Hi, Dean! > > Thanks for the interesting link. > > It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working. > > > But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread. > > Suggestion: > A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch. > > Br > Gustaf Lundh - Sony Ericsson > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Fantastic!
Thanks for the quick fix! Let us know where we should send the bottle of wine. :) Br Gustaf Lundh @ Sony Ericsson
|
| Powered by Nabble | Edit this page |
