|
Our slaves continuously locks up, making our build-cluster unstable. Jobs on these slaves never starts the actual building process and cannot be aborted. This is a critical issue.
I see on the tracker and the mailing list that a decent number of people is running into this issue. So I'm starting this new thread, in hope to bring a new and more active discussion on how the issue should be approach and solved. And also to avoid fragmenting the discussion over the various different issues on the tracker that touches on this behavior. I would say the issue is so critical that it needs to be brought up again and re-prioritized. Since we know it can actually can bring down full building environments. First. Lets take a look at the effects of the issue. ---------------------------------------------------- 1. When a job is started on a slave that has gone into a hanged state, this is the only console output: "Building remotely on slave001" - Nothing more will happen. Stuck forever. The stuck jobs cannot be aborted. Disconnecting a slave (or even removing it) does not help. The slave is stuck. 2. Due to the deadlock, scripts cannot be run on the slaves through the http://<server>/computer/<slave>/script This is of course making debugging attempts harder. 3. Thread dumps from the slave can not be served through the UI either. No pages can be served if there is any type communication with the affected slave through the Request.call(). This has a quite nasty side effect. For instance when users trying to browse a workspace allocated on a hanged slave, the http-request hangs. If a few users is using the build-system, it wont take long until all HTTP-serving threads are occupied and the UI finally hangs fully. The same issue goes with all UI calls that communicates with the slave. Liek when issuing a slave threadDump. 4. In certain circumstances, whole plug-ins / features (SCM Polling?) in Hudson can get locked down during a remote call. For instance when a singleton:ed plug-in is calling a method like hudson.EnvVars.getRemote(EnvVars.java:202) against a deadlocked node. When the plug-in hangs, snowballing can occur, spreading hanged jobs on all slaves. E.g. all jobs using that plug-in can get locked down/stucked. 5. When a slave hangs, it hangs forever. This is interesting, as we will soon see that the effected streams uses TIMED_WAIT. There is no way to break free, except trying to manually interrupting waiting threads (which of course is risky and not even remotely a solution). 6. Issues and discussions of interest: http://issues.hudson-ci.org/browse/HUDSON-5977 http://issues.hudson-ci.org/browse/HUDSON-5413 http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-plugin-td2304264.html http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td1597996.html http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html Next. Let's take a look at the affected threads: ------------------------------------------------ The locked slaves will always have a reader thread looking like this: Channel reader thread: slave001 "Channel reader thread: slave001" Id=611 Group=main TIMED_WAITING on [B@35c5a763 at java.lang.Object.wait(Native Method) - waiting on [B@35c5a763 at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122) at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164) at hudson.remoting.Channel$ReaderThread.run(Channel.java:884) E.g. a "FastPipedOutputStream.write(FastPipedOutputStream.java:147)" that never completes. Interestingly enough, it is a TIMED_WAIT defined to 10 minutes. But it never seems to be able to get out of the FastPipedOutputStream.write()-method. The tasks/jobs getting stuck always looks similiar: Executor #0 for slave001 : executing job-matrix » var1,var2,var3 #459 / waiting for hudson.remoting.Channel@408b8462:slave001 "Executor #0 for slave001 : executing job-matrix » var1,var2,var3 #459 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=80 Group=main TIMED_WAITING on hudson.remoting.UserRequest@85e26c5 at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.UserRequest@85e26c5 at hudson.remoting.Request.call(Request.java:122) at hudson.remoting.Channel.call(Channel.java:555) at hudson.FilePath.act(FilePath.java:742) at hudson.FilePath.act(FilePath.java:735) at hudson.FilePath.mkdirs(FilePath.java:801) at hudson.model.AbstractProject.checkout(AbstractProject.java:1036) at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479) at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411) at hudson.model.Run.run(Run.java:1258) at hudson.matrix.MatrixRun.run(MatrixRun.java:130) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:129) The thread is never able to complete the call to hudson.remoting.Request.call(Request.java:122). Interestingly enough a 30min timeout is also defined in the hudson.remoting.Request.call(Request.java:122)-method. However, this does not seem to help the thread to break free at all. I should probably also mention that not only regular jobs will lock up. All monitoring threads will also lock up: "Monitoring thread for Architecture started on Wed Aug 11 14:36:58 CEST 2010 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=5254 Group=main TIMED_WAITING on hudson.remoting.UserRequest@2f27f05f at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.UserRequest@2f27f05f at hudson.remoting.Request.call(Request.java:122) at hudson.remoting.Channel.call(Channel.java:555) at hudson.node_monitors.ArchitectureMonitor$DescriptorImpl.monitor(ArchitectureMonitor.java:44) at hudson.node_monitors.ArchitectureMonitor$DescriptorImpl.monitor(ArchitectureMonitor.java:41) at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200) "Monitoring thread for Clock Difference started on Wed Aug 11 14:36:58 CEST 2010 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=5255 Group=main TIMED_WAITING on hudson.remoting.UserRequest@2b150330 at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.UserRequest@2b150330 at hudson.remoting.Request.call(Request.java:122) at hudson.remoting.Channel.call(Channel.java:555) at hudson.model.Slave.getClockDifference(Slave.java:225) at hudson.node_monitors.ClockMonitor$1.monitor(ClockMonitor.java:53) at hudson.node_monitors.ClockMonitor$1.monitor(ClockMonitor.java:49) at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200) ... Similiar traces can also be found for all served webpages that makes calls to the slave. I fetched a few thread dumps from the hanged slaves using jstack. All dumps from locked slaves looks the same: ----------------- THREAD DUMP STARTS (SCROLL DOWN) ------------------ 2010-08-11 16:02:05 Full thread dump Java HotSpot(TM) Server VM (17.0-b16 mixed mode): "Attach Listener" daemon prio=10 tid=0x08527400 nid=0x90e waiting on condition [0x00000000] java.lang.Thread.State: RUNNABLE "pool-1-thread-105" prio=10 tid=0x08525800 nid=0x56d waiting for monitor entry [0x758fb000] java.lang.Thread.State: BLOCKED (on object monitor) at hudson.remoting.Request.call(Request.java:101) - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel) at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:160) at hudson.remoting.$Proxy5.fetch2(Unknown Source) at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) - locked <0x7c441120> (a hudson.remoting.RemoteClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:112) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1575) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1732) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.UserRequest.deserialize(UserRequest.java:178) at hudson.remoting.UserRequest.perform(UserRequest.java:98) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) "pool-1-thread-104" prio=10 tid=0x08523c00 nid=0x3d0 waiting for monitor entry [0x7594c000] java.lang.Thread.State: BLOCKED (on object monitor) at hudson.remoting.Request$2.run(Request.java:281) - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) [Cutting out "pool-1-thread-56" to "pool-1-thread-103". Looking exactly the same] "pool-1-thread-55" prio=10 tid=0x083dcc00 nid=0x4009 waiting for monitor entry [0x77809000] java.lang.Thread.State: BLOCKED (on object monitor) at hudson.remoting.Request$2.run(Request.java:281) - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) "pool-1-thread-54" prio=10 tid=0x083d6400 nid=0x3e36 runnable [0x77753000] java.lang.Thread.State: RUNNABLE at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) - locked <0x7c4926a0> (a java.io.BufferedOutputStream) at java.io.PrintStream.write(PrintStream.java:430) - locked <0x7c4403f8> (a java.io.PrintStream) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1809) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1294) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1146) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326) at hudson.remoting.Channel.send(Channel.java:417) - locked <0x7c4403a0> (a hudson.remoting.Channel) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:112) - locked <0x7c7df988> (a hudson.remoting.ProxyOutputStream) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:100) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked <0x7c7df9a0> (a java.io.BufferedOutputStream) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72) - locked <0x7c7dfbc8> (a java.util.zip.GZIPOutputStream) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) - locked <0x7c7dfbe8> (a hudson.util.io.TarArchiver$1) at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410) at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388) at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337) at hudson.util.io.TarArchiver.visit(TarArchiver.java:103) at hudson.util.DirScanner$Glob.scan(DirScanner.java:104) at hudson.FilePath.writeToTar(FilePath.java:1543) at hudson.FilePath.access$800(FilePath.java:159) at hudson.FilePath$33.invoke(FilePath.java:1484) at hudson.FilePath$33.invoke(FilePath.java:1481) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) "Ping thread for channel hudson.remoting.Channel@14da8f4:channel" daemon prio=10 tid=0x084ae400 nid=0x35fc waiting for monitor entry [0x7785a000] java.lang.Thread.State: BLOCKED (on object monitor) at hudson.remoting.Channel.send(Channel.java:411) - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel) at hudson.remoting.Request.callAsync(Request.java:170) at hudson.remoting.Channel.callAsync(Channel.java:582) at hudson.remoting.PingThread.ping(PingThread.java:99) at hudson.remoting.PingThread.run(PingThread.java:81) "Channel reader thread: channel" prio=10 tid=0x08488800 nid=0x35fb runnable [0x778ab000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:199) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x7c4410a8> (a java.io.BufferedInputStream) at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249) at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542) at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.Channel$ReaderThread.run(Channel.java:869) "Low Memory Detector" daemon prio=10 tid=0x77a04800 nid=0x35f9 runnable [0x00000000] java.lang.Thread.State: RUNNABLE "CompilerThread1" daemon prio=10 tid=0x77a02800 nid=0x35f8 waiting on condition [0x00000000] java.lang.Thread.State: RUNNABLE "CompilerThread0" daemon prio=10 tid=0x77a00800 nid=0x35f7 waiting on condition [0x00000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x083a0000 nid=0x35f6 runnable [0x00000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x0838fc00 nid=0x35f5 waiting for monitor entry [0x77c7b000] java.lang.Thread.State: BLOCKED (on object monitor) at hudson.remoting.Channel.send(Channel.java:411) - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel) at hudson.remoting.RemoteInvocationHandler.finalize(RemoteInvocationHandler.java:197) at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83) at java.lang.ref.Finalizer.access$100(Finalizer.java:14) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160) "Reference Handler" daemon prio=10 tid=0x0838b000 nid=0x35f4 in Object.wait() [0x77ccc000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x7c443068> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x7c443068> (a java.lang.ref.Reference$Lock) "main" prio=10 tid=0x082ea000 nid=0x35ee in Object.wait() [0xb69ea000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x7c4403a0> (a hudson.remoting.Channel) at java.lang.Object.wait(Object.java:485) at hudson.remoting.Channel.join(Channel.java:650) - locked <0x7c4403a0> (a hudson.remoting.Channel) at hudson.remoting.Launcher.main(Launcher.java:419) at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365) at hudson.remoting.Launcher.run(Launcher.java:205) at hudson.remoting.Launcher.main(Launcher.java:167) "VM Thread" prio=10 tid=0x08388800 nid=0x35f3 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x082f1400 nid=0x35ef runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x082f2c00 nid=0x35f0 runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x082f4000 nid=0x35f1 runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x082f5800 nid=0x35f2 runnable "VM Periodic Task Thread" prio=10 tid=0x77a0e800 nid=0x35fa waiting on condition JNI global references: 1079 ----------------- THREAD DUMP ENDS ------------------ The interesting part is the locked <0x7c4403a0> (hudson.remoting.Channel). Locked in: "pool-1-thread-54" and for always hanged at: hudson.remoting.Channel.send(Channel.java:417) Funny enough <0x7c4403a0> is also locked in "main" prio=10 tid=0x082ea000 nid=0x35ee in Object.wait() [0xb69ea000]. But I'm sure there is a reasonable explanation for this. I guess it is just how jstack presents previously acquired locks. So I was wondering. Are there any core expert availible to take a look at the deadlock? Is there currently an ongoing plan or discussion on how to solve the clogging issues? Has the channel issue been touched upon on IRC? Any hints on how to continue? There is only so much debugging you can do on a production environment. I'm willing to spend some time on this issue, but some help is needed on the way. Whats the way of going forward? Introducing more timers, to atleast keep the slaves locking up forever? How about starting to draft out som QoS mechanism? Br Gustaf Lundh - Sony Ericsson |
|
> <original post by GLundh. UPDATED>
> Our jobs/slaves continuously locks up, making our build-cluster unstable. > > I see on the tracker and the mailing list that a decent number of people is running > into this issue. > > So I'm starting this new thread, in hope to bring a new and more active discussion > on how the issue should be approach and solved. And also to avoid > fragmenting > the discussion over the various different issues on the tracker that > touches on this behavior. > > I would say the issue is so critical that it needs to be brought up again and > re-prioritized. Since we know it can actually can bring down full building environments. > > Issues and discussions of interest: > http://issues.hudson-ci.org/browse/HUDSON-5977 > http://issues.hudson-ci.org/browse/HUDSON-5413 > > http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-plugin-td2304264.html > http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td1597996.html > http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html > </original post> I have put in some time to investigate the issue and also providing a proof-of-concept solution for finally fixing this bad bug. Let me present a typical case where this congestion issue appear and lock a job/slave: 1. Job is started. 2. The executor thread on the master sends a Request to the slave, issuing some stream operations. Example: Unzip:ing a saved workspace to the slave using the Clone Workspace SCM plugin: (From the Master's ThreadDump) "Executor #0 for test : executing b #19 / waiting for hudson.remoting.Channel@47338:test" prio=10 tid=0x75b28800 nid=0x702e in Object.wait() [0x75769000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x7cebedd8> (a hudson.remoting.UserRequest) at hudson.remoting.Request.call(Request.java:122) - locked <0x7cebedd8> (a hudson.remoting.UserRequest) at hudson.remoting.Channel.call(Channel.java:555) at hudson.FilePath.act(FilePath.java:742) at hudson.FilePath.act(FilePath.java:735) at hudson.FilePath.unzip(FilePath.java:415) at hudson.FileSystemProvisioner$Default$WorkspaceSnapshotImpl.restoreTo(FileSystemProvisioner.java:227) at hudson.plugins.cloneworkspace.CloneWorkspaceSCM$Snapshot.restoreTo(CloneWorkspaceSCM.java:344) at hudson.plugins.cloneworkspace.CloneWorkspaceSCM.checkout(CloneWorkspaceSCM.java:126) at hudson.model.AbstractProject.checkout(AbstractProject.java:1038) at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479) at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411) at hudson.model.Run.run(Run.java:1258) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:129) 3. This Request is received by the slaves ChannelReaderThread. As most other Requests, it's being executed in its own thread, leaving the ChannelReaderThread free for new incoming Requests. Since this particular Request involves a stream being written to the slave, ProxyOutputStream.Chunks are being sent rapidly to the slave. These chunks are being written to a buffer (FastPipedOutputStream) which is consumed by the original Request (working in its own thread - remember?). 4. Now. The main problem is that the incoming chunks are being written and executed on the ChannelReaderThread (not in a seperate thread like usual remote Requests). When the stream buffer is full, it waits and blocks the ChannelReaderThread on the slave: (From the Slave's ThreadDump) "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147) - locked <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122) at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) The problem is that the original Request now stumbles upon a new class, not yet available on the slave, (for instance hudson.util.IOUtils which is needed to unzip a stream) and the remote class loader is invoked. Remember that the original request was responsible for consuming the incoming chunks. Now it's taking a break to do a request of a new class. E.g. the Slave is requesting the class to be sent from the Master: (From the Slave's ThreadDump) "pool-1-thread-1 / waiting for hudson.remoting.Channel@86fe26:channel" prio=10 tid=0x08d32c00 nid=0x7330 in Object.wait() [0x7745c000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest) at hudson.remoting.Request.call(Request.java:122) - locked <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest) at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:160) at hudson.remoting.$Proxy5.fetch2(Unknown Source) at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) at hudson.FilePath.unzip(FilePath.java:476) at hudson.FilePath.access$000(FilePath.java:159) at hudson.FilePath$2.invoke(FilePath.java:417) at hudson.FilePath$2.invoke(FilePath.java:415) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Notice the jump to the RemoteClassLoader after the "at hudson.FilePath.unzip(FilePath.java:476)". This is where a new class was stumbled upon (IOUtils). So a request is being sent to the master, asking for the missing class. The master sends it back for the slave to load it. But! As we remember, the ChannelReaderThread is already stuck, trying to write chunks to the already full stream buffer. Hence the Response, containing the class we asked for (in this case IOUtils) cannot be retrieved: (From the Slave's ThreadDump) "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147) - locked <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122) at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) And we cannot consume from the buffer, making way for the Class Response, since the thread responsible for consuming the stream is waiting for this non existing class. Well. You get the point. ---------- All "hanged jobs" ThreadDumps I have seen so far, points to this exact deadlock. A stuck ChannelReaderThread due to a full FastPipedOutputStream buffer - which cannot be consumed because the consuming thread is waiting for a class to be fetched. ---------- This issue would of course not exist, if the ProxyOutputStreams Chunk/EOF/Unexport/Flush was executed on its own thread (not blocking the ChannelReaderThread). So I did a proof-of-concept solution, adding a ExecutorService on the channel, responsible for handling the ProxyStreamOutput commands in a FIFO manner (newSingleThreadExecutor()). It seems to work perfectly fine. I even minimized the FastPipedInputStreams buffer from 65k to 256 bytes, to make sure it did not congest even in critical situations. I do understand that the core developers may want to take their own approach to solving the issue. Maybe refining the patch for performance or similar. But since this is a critical issue, I would really like for someone to soon take a look at the proof-of-concept solution, and produce a solid patch to be integrated on the core. I can also do it myself, if I can get some code reviews and a little discussion around the issue and a suitable final solution. Best Regards Gustaf Lundh - Team Hudson @ Sony Ericsson |
|
And here is the patch.
Br Gustaf Lundh -----Original Message----- From: Lundh, Gustaf [mailto:[hidden email]] Sent: den 2 september 2010 14:04 To: [hidden email] Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! > <original post by GLundh. UPDATED> > Our jobs/slaves continuously locks up, making our build-cluster unstable. > > I see on the tracker and the mailing list that a decent number of people is running > into this issue. > > So I'm starting this new thread, in hope to bring a new and more active discussion > on how the issue should be approach and solved. And also to avoid > fragmenting > the discussion over the various different issues on the tracker that > touches on this behavior. > > I would say the issue is so critical that it needs to be brought up again and > re-prioritized. Since we know it can actually can bring down full building environments. > > Issues and discussions of interest: > http://issues.hudson-ci.org/browse/HUDSON-5977 > http://issues.hudson-ci.org/browse/HUDSON-5413 > > http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-plugin-td2304264.html > http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td1597996.html > http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html > </original post> I have put in some time to investigate the issue and also providing a proof-of-concept solution for finally fixing this bad bug. Let me present a typical case where this congestion issue appear and lock a job/slave: 1. Job is started. 2. The executor thread on the master sends a Request to the slave, issuing some stream operations. Example: Unzip:ing a saved workspace to the slave using the Clone Workspace SCM plugin: (From the Master's ThreadDump) "Executor #0 for test : executing b #19 / waiting for hudson.remoting.Channel@47338:test" prio=10 tid=0x75b28800 nid=0x702e in Object.wait() [0x75769000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x7cebedd8> (a hudson.remoting.UserRequest) at hudson.remoting.Request.call(Request.java:122) - locked <0x7cebedd8> (a hudson.remoting.UserRequest) at hudson.remoting.Channel.call(Channel.java:555) at hudson.FilePath.act(FilePath.java:742) at hudson.FilePath.act(FilePath.java:735) at hudson.FilePath.unzip(FilePath.java:415) at hudson.FileSystemProvisioner$Default$WorkspaceSnapshotImpl.restoreTo(FileSystemProvisioner.java:227) at hudson.plugins.cloneworkspace.CloneWorkspaceSCM$Snapshot.restoreTo(CloneWorkspaceSCM.java:344) at hudson.plugins.cloneworkspace.CloneWorkspaceSCM.checkout(CloneWorkspaceSCM.java:126) at hudson.model.AbstractProject.checkout(AbstractProject.java:1038) at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479) at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411) at hudson.model.Run.run(Run.java:1258) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:129) 3. This Request is received by the slaves ChannelReaderThread. As most other Requests, it's being executed in its own thread, leaving the ChannelReaderThread free for new incoming Requests. Since this particular Request involves a stream being written to the slave, ProxyOutputStream.Chunks are being sent rapidly to the slave. These chunks are being written to a buffer (FastPipedOutputStream) which is consumed by the original Request (working in its own thread - remember?). 4. Now. The main problem is that the incoming chunks are being written and executed on the ChannelReaderThread (not in a seperate thread like usual remote Requests). When the stream buffer is full, it waits and blocks the ChannelReaderThread on the slave: (From the Slave's ThreadDump) "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147) - locked <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122) at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) The problem is that the original Request now stumbles upon a new class, not yet available on the slave, (for instance hudson.util.IOUtils which is needed to unzip a stream) and the remote class loader is invoked. Remember that the original request was responsible for consuming the incoming chunks. Now it's taking a break to do a request of a new class. E.g. the Slave is requesting the class to be sent from the Master: (From the Slave's ThreadDump) "pool-1-thread-1 / waiting for hudson.remoting.Channel@86fe26:channel" prio=10 tid=0x08d32c00 nid=0x7330 in Object.wait() [0x7745c000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest) at hudson.remoting.Request.call(Request.java:122) - locked <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest) at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:160) at hudson.remoting.$Proxy5.fetch2(Unknown Source) at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) at hudson.FilePath.unzip(FilePath.java:476) at hudson.FilePath.access$000(FilePath.java:159) at hudson.FilePath$2.invoke(FilePath.java:417) at hudson.FilePath$2.invoke(FilePath.java:415) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Notice the jump to the RemoteClassLoader after the "at hudson.FilePath.unzip(FilePath.java:476)". This is where a new class was stumbled upon (IOUtils). So a request is being sent to the master, asking for the missing class. The master sends it back for the slave to load it. But! As we remember, the ChannelReaderThread is already stuck, trying to write chunks to the already full stream buffer. Hence the Response, containing the class we asked for (in this case IOUtils) cannot be retrieved: (From the Slave's ThreadDump) "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147) - locked <0xad849f60> (a [B) at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122) at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) And we cannot consume from the buffer, making way for the Class Response, since the thread responsible for consuming the stream is waiting for this non existing class. Well. You get the point. ---------- All "hanged jobs" ThreadDumps I have seen so far, points to this exact deadlock. A stuck ChannelReaderThread due to a full FastPipedOutputStream buffer - which cannot be consumed because the consuming thread is waiting for a class to be fetched. ---------- This issue would of course not exist, if the ProxyOutputStreams Chunk/EOF/Unexport/Flush was executed on its own thread (not blocking the ChannelReaderThread). So I did a proof-of-concept solution, adding a ExecutorService on the channel, responsible for handling the ProxyStreamOutput commands in a FIFO manner (newSingleThreadExecutor()). It seems to work perfectly fine. I even minimized the FastPipedInputStreams buffer from 65k to 256 bytes, to make sure it did not congest even in critical situations. I do understand that the core developers may want to take their own approach to solving the issue. Maybe refining the patch for performance or similar. But since this is a critical issue, I would really like for someone to soon take a look at the proof-of-concept solution, and produce a solid patch to be integrated on the core. I can also do it myself, if I can get some code reviews and a little discussion around the issue and a suitable final solution. Best Regards Gustaf Lundh - Team Hudson @ Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Hi,
any feedback from Hudson core developers? Personally, it seems to me as a good fix of this issue, but I would appreciate any comment on this from core developers, what they think about it. Thanks Vojtech > And here is the patch. > > Br > Gustaf Lundh > > -----Original Message----- > From: Lundh, Gustaf [mailto:[hidden email]] > Sent: den 2 september 2010 14:04 > To: [hidden email] > Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to > pipe clogging. PROPOSED SOLUTION! > > > <original post by GLundh. UPDATED> > > Our jobs/slaves continuously locks up, making our build-cluster unstable. > > > > I see on the tracker and the mailing list that a decent number of people > > is running into this issue. > > > > So I'm starting this new thread, in hope to bring a new and more active > > discussion on how the issue should be approach and solved. And also to > > avoid fragmenting > the discussion over the various different issues on > > the tracker that touches on this behavior. > > > > I would say the issue is so critical that it needs to be brought up again > > and re-prioritized. Since we know it can actually can bring down full > > building environments. > > > > Issues and discussions of interest: > > http://issues.hudson-ci.org/browse/HUDSON-5977 > > http://issues.hudson-ci.org/browse/HUDSON-5413 > > > > http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM- > > plugin-td2304264.html > > http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td > > 1597996.html > > http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html > > </original post> > > I have put in some time to investigate the issue and also providing a > proof-of-concept solution for finally fixing this bad bug. > > Let me present a typical case where this congestion issue appear and lock a > job/slave: > > 1. Job is started. > > 2. The executor thread on the master sends a Request to the slave, issuing > some stream operations. Example: Unzip:ing a saved workspace to the slave > using the Clone Workspace SCM plugin: > > (From the Master's ThreadDump) > > "Executor #0 for test : executing b #19 / waiting for > hudson.remoting.Channel@47338:test" prio=10 tid=0x75b28800 nid=0x702e in > Object.wait() [0x75769000] java.lang.Thread.State: TIMED_WAITING (on > object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x7cebedd8> (a hudson.remoting.UserRequest) > at hudson.remoting.Request.call(Request.java:122) > - locked <0x7cebedd8> (a hudson.remoting.UserRequest) > at hudson.remoting.Channel.call(Channel.java:555) > at hudson.FilePath.act(FilePath.java:742) > at hudson.FilePath.act(FilePath.java:735) > at hudson.FilePath.unzip(FilePath.java:415) > at > hudson.FileSystemProvisioner$Default$WorkspaceSnapshotImpl.restoreTo(FileS > ystemProvisioner.java:227) at > hudson.plugins.cloneworkspace.CloneWorkspaceSCM$Snapshot.restoreTo(CloneWo > rkspaceSCM.java:344) at > hudson.plugins.cloneworkspace.CloneWorkspaceSCM.checkout(CloneWorkspaceSCM > .java:126) at > hudson.model.AbstractProject.checkout(AbstractProject.java:1038) at > hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479) > at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411) > at hudson.model.Run.run(Run.java:1258) > at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) > at hudson.model.ResourceController.execute(ResourceController.java:88) > at hudson.model.Executor.run(Executor.java:129) > > 3. This Request is received by the slaves ChannelReaderThread. > > As most other Requests, it's being executed in its own thread, leaving the > ChannelReaderThread free for new incoming Requests. > > Since this particular Request involves a stream being written to the slave, > ProxyOutputStream.Chunks are being sent rapidly to the slave. These chunks > are being written to a buffer (FastPipedOutputStream) which is consumed by > the original Request (working in its own thread - remember?). > > 4. Now. The main problem is that the incoming chunks are being written and > executed on the ChannelReaderThread (not in a seperate thread like usual > remote Requests). When the stream buffer is full, it waits and blocks the > ChannelReaderThread on the slave: > > > (From the Slave's ThreadDump) > > "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in > Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on > object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xad849f60> (a [B) > at > hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147 > ) - locked <0xad849f60> (a [B) > at > hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122 > ) at > hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164 > ) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) > > The problem is that the original Request now stumbles upon a new class, not > yet available on the slave, (for instance hudson.util.IOUtils which is > needed to unzip a stream) and the remote class loader is invoked. > > Remember that the original request was responsible for consuming the > incoming chunks. Now it's taking a break to do a request of a new class. > E.g. the Slave is requesting the class to be sent from the Master: > > (From the Slave's ThreadDump) > > "pool-1-thread-1 / waiting for hudson.remoting.Channel@86fe26:channel" > prio=10 tid=0x08d32c00 nid=0x7330 in Object.wait() [0x7745c000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xad8a0300> (a > hudson.remoting.RemoteInvocationHandler$RPCRequest) at > hudson.remoting.Request.call(Request.java:122) > - locked <0xad8a0300> (a > hudson.remoting.RemoteInvocationHandler$RPCRequest) at > hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.jav > a:160) at hudson.remoting.$Proxy5.fetch2(Unknown Source) > at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122) > at java.lang.ClassLoader.loadClass(ClassLoader.java:307) > - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) > at java.lang.ClassLoader.loadClass(ClassLoader.java:252) > at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) > - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader) > at hudson.FilePath.unzip(FilePath.java:476) > at hudson.FilePath.access$000(FilePath.java:159) > at hudson.FilePath$2.invoke(FilePath.java:417) > at hudson.FilePath$2.invoke(FilePath.java:415) > at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) > at hudson.remoting.UserRequest.perform(UserRequest.java:114) > at hudson.remoting.UserRequest.perform(UserRequest.java:48) > at hudson.remoting.Request$2.run(Request.java:270) > at > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor. > java:886) at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java > :908) at java.lang.Thread.run(Thread.java:619) > > Notice the jump to the RemoteClassLoader after the "at > hudson.FilePath.unzip(FilePath.java:476)". This is where a new class was > stumbled upon (IOUtils). > > So a request is being sent to the master, asking for the missing class. The > master sends it back for the slave to load it. > > But! As we remember, the ChannelReaderThread is already stuck, trying to > write chunks to the already full stream buffer. Hence the Response, > containing the class we asked for (in this case IOUtils) cannot be > retrieved: > > (From the Slave's ThreadDump) > > "Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in > Object.wait() [0x774fe000] java.lang.Thread.State: TIMED_WAITING (on > object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xad849f60> (a [B) > at > hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147 > ) - locked <0xad849f60> (a [B) > at > hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122 > ) at > hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164 > ) at hudson.remoting.Channel$ReaderThread.run(Channel.java:890) > > And we cannot consume from the buffer, making way for the Class Response, > since the thread responsible for consuming the stream is waiting for this > non existing class. Well. You get the point. > > ---------- > All "hanged jobs" ThreadDumps I have seen so far, points to this exact > deadlock. A stuck ChannelReaderThread due to a full FastPipedOutputStream > buffer - which cannot be consumed because the consuming thread is waiting > for a class to be fetched. ---------- > > This issue would of course not exist, if the ProxyOutputStreams > Chunk/EOF/Unexport/Flush was executed on its own thread (not blocking the > ChannelReaderThread). > > So I did a proof-of-concept solution, adding a ExecutorService on the > channel, responsible for handling the ProxyStreamOutput commands in a FIFO > manner (newSingleThreadExecutor()). > > It seems to work perfectly fine. I even minimized the FastPipedInputStreams > buffer from 65k to 256 bytes, to make sure it did not congest even in > critical situations. > > I do understand that the core developers may want to take their own > approach to solving the issue. Maybe refining the patch for performance or > similar. But since this is a critical issue, I would really like for > someone to soon take a look at the proof-of-concept solution, and produce > a solid patch to be integrated on the core. I can also do it myself, if I > can get some code reviews and a little discussion around the issue and a > suitable final solution. > > Best Regards > Gustaf Lundh - Team Hudson @ Sony Ericsson --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
you might have to wait a week or so, as Kohsuke is traveling now..
Glad to see some work on this.. maybe sometime I can remove the workaround code in copyartifact ;-) Vojtech Juranek wrote: > Hi, > any feedback from Hudson core developers? Personally, it seems to me as a good > fix of this issue, but I would appreciate any comment on this from core > developers, what they think about it. > Thanks > Vojtech > > >> And here is the patch. >> >> Br >> Gustaf Lundh >> >> -----Original Message----- >> From: Lundh, Gustaf [mailto:[hidden email]] >> Sent: den 2 september 2010 14:04 >> To: [hidden email] >> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to >> pipe clogging. PROPOSED SOLUTION! >> >> >>> <original post by GLundh. UPDATED> >>> Our jobs/slaves continuously locks up, making our build-cluster unstable. >>> >>> I see on the tracker and the mailing list that a decent number of people >>> is running into this issue. >>> >>> So I'm starting this new thread, in hope to bring a new and more active >>> discussion on how the issue should be approach and solved. And also to >>> avoid fragmenting > the discussion over the various different issues on >>> the tracker that touches on this behavior. >>> >>> I would say the issue is so critical that it needs to be brought up again >>> and re-prioritized. Since we know it can actually can bring down full >>> building environments. >>> >>> Issues and discussions of interest: >>> http://issues.hudson-ci.org/browse/HUDSON-5977 >>> http://issues.hudson-ci.org/browse/HUDSON-5413 >>> >>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM- >>> plugin-td2304264.html >>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td >>> 1597996.html >>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html >>> </original post> >>> >> I have put in some time to investigate the issue and also providing a >> proof-of-concept solution for finally fixing this bad bug. >> >> Let me present a typical case where this congestion issue appear and lock a >> job/slave: >> >> 1. Job is started. >> >> 2. The executor thread on the master sends a Request to the slave, issuing >> some stream operations. Example: Unzip:ing a saved workspace to the slave >> using the Clone Workspace SCM plugin: >> > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Hi, Alan.
Yes. He sure is out and traveling. I actually met him in the Copenhagen user meet-up last Monday :) (Btw, a big thanks to Kohsuke and Nokia for hosting a great meet-up!) I did not had the chance to discuss the issue that much (quite a few people was there), but he mentioned that he had started some work on the piping issue himself. If he is planning a larger and more time demanding rework of the channel/remoting functionality, perhaps a solution like this could work until the already ongoing work is ready for primetime. Because the severity of the issue, and with not much happening, we are currently thinking of forking Hudson internally to get the patch in place. Running internal forks is really not a great situation to be in. Br Gustaf -----Original Message----- From: Alan Harder [mailto:[hidden email]] Sent: den 8 september 2010 17:26 To: [hidden email] Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! you might have to wait a week or so, as Kohsuke is traveling now.. Glad to see some work on this.. maybe sometime I can remove the workaround code in copyartifact ;-) Vojtech Juranek wrote: > Hi, > any feedback from Hudson core developers? Personally, it seems to me as a good > fix of this issue, but I would appreciate any comment on this from core > developers, what they think about it. > Thanks > Vojtech > > >> And here is the patch. >> >> Br >> Gustaf Lundh >> >> -----Original Message----- >> From: Lundh, Gustaf [mailto:[hidden email]] >> Sent: den 2 september 2010 14:04 >> To: [hidden email] >> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to >> pipe clogging. PROPOSED SOLUTION! >> >> >>> <original post by GLundh. UPDATED> >>> Our jobs/slaves continuously locks up, making our build-cluster unstable. >>> >>> I see on the tracker and the mailing list that a decent number of people >>> is running into this issue. >>> >>> So I'm starting this new thread, in hope to bring a new and more active >>> discussion on how the issue should be approach and solved. And also to >>> avoid fragmenting > the discussion over the various different issues on >>> the tracker that touches on this behavior. >>> >>> I would say the issue is so critical that it needs to be brought up again >>> and re-prioritized. Since we know it can actually can bring down full >>> building environments. >>> >>> Issues and discussions of interest: >>> http://issues.hudson-ci.org/browse/HUDSON-5977 >>> http://issues.hudson-ci.org/browse/HUDSON-5413 >>> >>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM- >>> plugin-td2304264.html >>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td >>> 1597996.html >>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html >>> </original post> >>> >> I have put in some time to investigate the issue and also providing a >> proof-of-concept solution for finally fixing this bad bug. >> >> Let me present a typical case where this congestion issue appear and lock a >> job/slave: >> >> 1. Job is started. >> >> 2. The executor thread on the master sends a Request to the slave, issuing >> some stream operations. Example: Unzip:ing a saved workspace to the slave >> using the Clone Workspace SCM plugin: >> > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Out of curiosity, what is the use case that makes this a severe issue
for you? Just wondering if a workaround similar to what copyartifact does would help.. In the case of copyartifact, I forced a small size operation to get any classloading to happen when there is only a small amount of traffic in the other direction.. then when I start the potentially large operation, no classloading happens to cause the pipe clogging.. - Alan Lundh, Gustaf wrote: > Hi, Alan. > > Yes. He sure is out and traveling. I actually met him in the Copenhagen user meet-up last Monday :) (Btw, a big thanks to Kohsuke and Nokia for hosting a great meet-up!) > > I did not had the chance to discuss the issue that much (quite a few people was there), but he mentioned that he had started some work on the piping issue himself. > > If he is planning a larger and more time demanding rework of the channel/remoting functionality, perhaps a solution like this could work until the already ongoing work is ready for primetime. > > Because the severity of the issue, and with not much happening, we are currently thinking of forking Hudson internally to get the patch in place. Running internal forks is really not a great situation to be in. > > Br > Gustaf > > > > -----Original Message----- > From: Alan Harder [mailto:[hidden email]] > Sent: den 8 september 2010 17:26 > To: [hidden email] > Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! > > you might have to wait a week or so, as Kohsuke is traveling now.. > > Glad to see some work on this.. maybe sometime I can remove the > workaround code in copyartifact ;-) > > > > Vojtech Juranek wrote: > >> Hi, >> any feedback from Hudson core developers? Personally, it seems to me as a good >> fix of this issue, but I would appreciate any comment on this from core >> developers, what they think about it. >> Thanks >> Vojtech >> >> >> >>> And here is the patch. >>> >>> Br >>> Gustaf Lundh >>> >>> -----Original Message----- >>> From: Lundh, Gustaf [mailto:[hidden email]] >>> Sent: den 2 september 2010 14:04 >>> To: [hidden email] >>> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to >>> pipe clogging. PROPOSED SOLUTION! >>> >>> >>> >>>> <original post by GLundh. UPDATED> >>>> Our jobs/slaves continuously locks up, making our build-cluster unstable. >>>> >>>> I see on the tracker and the mailing list that a decent number of people >>>> is running into this issue. >>>> >>>> So I'm starting this new thread, in hope to bring a new and more active >>>> discussion on how the issue should be approach and solved. And also to >>>> avoid fragmenting > the discussion over the various different issues on >>>> the tracker that touches on this behavior. >>>> >>>> I would say the issue is so critical that it needs to be brought up again >>>> and re-prioritized. Since we know it can actually can bring down full >>>> building environments. >>>> >>>> Issues and discussions of interest: >>>> http://issues.hudson-ci.org/browse/HUDSON-5977 >>>> http://issues.hudson-ci.org/browse/HUDSON-5413 >>>> >>>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM- >>>> plugin-td2304264.html >>>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td >>>> 1597996.html >>>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html >>>> </original post> >>>> >>>> >>> I have put in some time to investigate the issue and also providing a >>> proof-of-concept solution for finally fixing this bad bug. >>> >>> Let me present a typical case where this congestion issue appear and lock a >>> job/slave: >>> >>> 1. Job is started. >>> >>> 2. The executor thread on the master sends a Request to the slave, issuing >>> some stream operations. Example: Unzip:ing a saved workspace to the slave >>> using the Clone Workspace SCM plugin: >>> >>> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by Alan Harder-2
> you might have to wait a week or so, as Kohsuke is traveling now..
thanks for info! (in fact, I was wondering why nobody comments on this issue...:-) > Glad to see some work on this.. maybe sometime I can remove the > workaround code in copyartifact ;-) > > Vojtech Juranek wrote: > > Hi, > > any feedback from Hudson core developers? Personally, it seems to me as a > > good fix of this issue, but I would appreciate any comment on this from > > core developers, what they think about it. > > Thanks > > Vojtech > > > >> And here is the patch. > >> > >> Br > >> Gustaf Lundh > >> > >> -----Original Message----- > >> From: Lundh, Gustaf [mailto:[hidden email]] > >> Sent: den 2 september 2010 14:04 > >> To: [hidden email] > >> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due > >> to pipe clogging. PROPOSED SOLUTION! > >> > >>> <original post by GLundh. UPDATED> > >>> Our jobs/slaves continuously locks up, making our build-cluster > >>> unstable. > >>> > >>> I see on the tracker and the mailing list that a decent number of > >>> people is running into this issue. > >>> > >>> So I'm starting this new thread, in hope to bring a new and more active > >>> discussion on how the issue should be approach and solved. And also to > >>> avoid fragmenting > the discussion over the various different issues on > >>> the tracker that touches on this behavior. > >>> > >>> I would say the issue is so critical that it needs to be brought up > >>> again and re-prioritized. Since we know it can actually can bring down > >>> full building environments. > >>> > >>> Issues and discussions of interest: > >>> http://issues.hudson-ci.org/browse/HUDSON-5977 > >>> http://issues.hudson-ci.org/browse/HUDSON-5413 > >>> > >>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SC > >>> M- plugin-td2304264.html > >>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-t > >>> d 1597996.html > >>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html > >>> </original post> > >> > >> I have put in some time to investigate the issue and also providing a > >> proof-of-concept solution for finally fixing this bad bug. > >> > >> Let me present a typical case where this congestion issue appear and > >> lock a job/slave: > >> > >> 1. Job is started. > >> > >> 2. The executor thread on the master sends a Request to the slave, > >> issuing some stream operations. Example: Unzip:ing a saved workspace to > >> the slave > > >> using the Clone Workspace SCM plugin: > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by Alan Harder-2
I have seen various plug-ins and functionality that triggers this issue.
Like the Emma publisher (been a while since that one hanged a job though), during SCM pulling routines, Triggering plug-ins and other occasions. Your fix works great for the CopyArtifact plugin (awesome plug-in BTW). However, since it is hard to predict the next remote call to hang the job/slave, it seems more appropriate for us to fix the underlying issue than applying/multiplying the fix all over :) Br Gustaf -----Original Message----- From: Alan Harder [mailto:[hidden email]] Sent: den 8 september 2010 17:49 To: [hidden email] Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! Out of curiosity, what is the use case that makes this a severe issue for you? Just wondering if a workaround similar to what copyartifact does would help.. In the case of copyartifact, I forced a small size operation to get any classloading to happen when there is only a small amount of traffic in the other direction.. then when I start the potentially large operation, no classloading happens to cause the pipe clogging.. - Alan Lundh, Gustaf wrote: > Hi, Alan. > > Yes. He sure is out and traveling. I actually met him in the Copenhagen user meet-up last Monday :) (Btw, a big thanks to Kohsuke and Nokia for hosting a great meet-up!) > > I did not had the chance to discuss the issue that much (quite a few people was there), but he mentioned that he had started some work on the piping issue himself. > > If he is planning a larger and more time demanding rework of the channel/remoting functionality, perhaps a solution like this could work until the already ongoing work is ready for primetime. > > Because the severity of the issue, and with not much happening, we are currently thinking of forking Hudson internally to get the patch in place. Running internal forks is really not a great situation to be in. > > Br > Gustaf > > > > -----Original Message----- > From: Alan Harder [mailto:[hidden email]] > Sent: den 8 september 2010 17:26 > To: [hidden email] > Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! > > you might have to wait a week or so, as Kohsuke is traveling now.. > > Glad to see some work on this.. maybe sometime I can remove the > workaround code in copyartifact ;-) > > > > Vojtech Juranek wrote: > >> Hi, >> any feedback from Hudson core developers? Personally, it seems to me as a good >> fix of this issue, but I would appreciate any comment on this from core >> developers, what they think about it. >> Thanks >> Vojtech >> >> >> >>> And here is the patch. >>> >>> Br >>> Gustaf Lundh >>> >>> -----Original Message----- >>> From: Lundh, Gustaf [mailto:[hidden email]] >>> Sent: den 2 september 2010 14:04 >>> To: [hidden email] >>> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to >>> pipe clogging. PROPOSED SOLUTION! >>> >>> >>> >>>> <original post by GLundh. UPDATED> >>>> Our jobs/slaves continuously locks up, making our build-cluster unstable. >>>> >>>> I see on the tracker and the mailing list that a decent number of people >>>> is running into this issue. >>>> >>>> So I'm starting this new thread, in hope to bring a new and more active >>>> discussion on how the issue should be approach and solved. And also to >>>> avoid fragmenting > the discussion over the various different issues on >>>> the tracker that touches on this behavior. >>>> >>>> I would say the issue is so critical that it needs to be brought up again >>>> and re-prioritized. Since we know it can actually can bring down full >>>> building environments. >>>> >>>> Issues and discussions of interest: >>>> http://issues.hudson-ci.org/browse/HUDSON-5977 >>>> http://issues.hudson-ci.org/browse/HUDSON-5413 >>>> >>>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM- >>>> plugin-td2304264.html >>>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td >>>> 1597996.html >>>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html >>>> </original post> >>>> >>>> >>> I have put in some time to investigate the issue and also providing a >>> proof-of-concept solution for finally fixing this bad bug. >>> >>> Let me present a typical case where this congestion issue appear and lock a >>> job/slave: >>> >>> 1. Job is started. >>> >>> 2. The executor thread on the master sends a Request to the slave, issuing >>> some stream operations. Example: Unzip:ing a saved workspace to the slave >>> using the Clone Workspace SCM plugin: >>> >>> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Lundh, Gustaf wrote: > Your fix works great for the CopyArtifact plugin (awesome plug-in BTW). However, since it is hard to predict the next remote call to hang the job/slave, it seems more appropriate for us to fix the underlying issue than applying/multiplying the fix all over :) > yup.. was just wondering if some workaround code might be easier to maintain for a little while than a forked Hudson... - Alan --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Is this patch attached to an issue in JIRA yet?
While it's distinctly possible that Kohsuke's working on a rewrite of aspects of the remoting logic, that's no reason not to fix a bug in the near term. I'm also curious to see if this might have anything to do with/any similarity to the ongoing problem with extreme slowness in artifact archiving - I'm not near familiar enough with the remoting code to be able to tell, but it seems like these could be related.
In any case, I'd say that if Kohsuke doesn't have an issue with the code itself when he gets back online, I'd strongly support getting this into core. A.
On Wed, Sep 8, 2010 at 9:22 AM, Alan Harder <[hidden email]> wrote:
|
|
In reply to this post by Alan Harder-2
I see the same issue during the post-build artifact publishing phase. I saw your patch for the plugin and tried a similar patch for the that case. It seems to work, but I agree with Gustaf that the underlying issue should be fixed.
I don't understand your reticence. Who said anything about a forked Hudson? This is a serious bug that often requires a restart of the whole farm. -P ----- Original Message ----- From: "Alan Harder" <[hidden email]> To: [hidden email] Sent: Wednesday, September 8, 2010 12:22:14 PM Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! Lundh, Gustaf wrote: > Your fix works great for the CopyArtifact plugin (awesome plug-in > BTW). However, since it is hard to predict the next remote call to > hang the job/slave, it seems more appropriate for us to fix the > underlying issue than applying/multiplying the fix all over :) > yup.. was just wondering if some workaround code might be easier to maintain for a little while than a forked Hudson... - Alan --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Paul Weiss wrote: > I don't understand your reticence. Who said anything about a forked Hudson? > > This is a serious bug that often requires a restart of the whole farm. > Perhaps you read the thread too quickly.. Gustaf did, in reference to working around the problem until this bug is fixed. I was offering a suggestion towards that end, nothing more. I have no objection to the bug being fixed. > -P > > ----- Original Message ----- > From: "Alan Harder" <[hidden email]> > To: [hidden email] > Sent: Wednesday, September 8, 2010 12:22:14 PM > Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION! > > Lundh, Gustaf wrote: > >> Your fix works great for the CopyArtifact plugin (awesome plug-in >> BTW). However, since it is hard to predict the next remote call to >> hang the job/slave, it seems more appropriate for us to fix the >> underlying issue than applying/multiplying the fix all over :) >> >> > > yup.. was just wondering if some workaround code might be easier to > maintain for a little while than a forked Hudson... > > - Alan > > > --------------------------------------------------------------------- To > unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
Yes. This is correct. I was merely trying to emphasize how critical this issue is for our environment. And if we cannot get the issue patched, we have to fork Hudson internally at our company. And maintaining a private fork with exclusive patches is really not a preferable situation. Kohsuke has now commented on the issue accordingly: "I'm testing a patch similar to what's proposed by GLundh, except it does throttling to avoid unbounded growth of the queue. Plan to integrate it to the trunk soon, but I'm bit cautious as the remoting affects many things. If there are other people willing to test the modification, I'll post the modified binary." I hope we can get this one into core really soon. And that the code is posted on the SVN. Would really like to help out testing it. Br Gustaf Lundh |
|
Administrator
|
My plan was to cut the RC branch today and then merge back those
changes to the trunk. I agree with the importance of the fix. 2010/9/15 GLundh <[hidden email]>: > > > Alan Harder-2 wrote: >> >> Paul Weiss wrote: >>> I don't understand your reticence. Who said anything about a forked >>> Hudson? >>> >>> This is a serious bug that often requires a restart of the whole farm. >>> >> >> Perhaps you read the thread too quickly.. Gustaf did, in reference to >> working around the problem until this bug is fixed. I was offering a >> suggestion towards that end, nothing more. I have no objection to the >> bug being fixed. >> > > Yes. This is correct. > > I was merely trying to emphasize how critical this issue is for our > environment. And if we cannot get the issue patched, we have to fork Hudson > internally at our company. And maintaining a private fork with exclusive > patches is really not a preferable situation. > > Kohsuke has now commented on the issue accordingly: > > "I'm testing a patch similar to what's proposed by GLundh, except it does > throttling to avoid unbounded growth of the queue. Plan to integrate it to > the trunk soon, but I'm bit cautious as the remoting affects many things. > > If there are other people willing to test the modification, I'll post the > modified binary." > > I hope we can get this one into core really soon. And that the code is > posted on the SVN. Would really like to help out testing it. > > Br > Gustaf Lundh > > -- > View this message in context: http://hudson.361315.n4.nabble.com/Discussion-regarding-HUDSON-5977-Dead-lock-condition-due-to-pipe-clogging-And-more-tp2322497p2540718.html > Sent from the Hudson dev mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > -- Kohsuke Kawaguchi --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
Might make sense to put the changes on a branch for now, so we can build it at ci.hudson-labs.org and get some real testing in on it before it gets to trunk - just a thought.
A.
On Wed, Sep 15, 2010 at 8:28 AM, Kohsuke Kawaguchi <[hidden email]> wrote: My plan was to cut the RC branch today and then merge back those |
|
In reply to this post by Alan Harder-2
> Because the severity of the issue, and with not much happening, we are
> currently thinking of forking Hudson internally to get the patch in > place. Running internal forks is really not a great situation to be in. Why so he did. My apologies. Although, admittedly, when I read this initially I took it to mean "running a patched version", which I do all the time, not actually publishing a new fork for the world to see. In any event, I'm glad this now appears to be converging. -P On Wed, 15 Sep 2010 10:42:53 -0400, Alan Harder <[hidden email]> wrote: > > > Paul Weiss wrote: >> I don't understand your reticence. Who said anything about a forked >> Hudson? >> >> This is a serious bug that often requires a restart of the whole farm. >> > > Perhaps you read the thread too quickly.. Gustaf did, in reference to > working around the problem until this bug is fixed. I was offering a > suggestion towards that end, nothing more. I have no objection to the > bug being fixed. > > >> -P >> >> ----- Original Message ----- >> From: "Alan Harder" <[hidden email]> >> To: [hidden email] >> Sent: Wednesday, September 8, 2010 12:22:14 PM >> Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due >> to pipe clogging. PROPOSED SOLUTION! >> >> Lundh, Gustaf wrote: >> >>> Your fix works great for the CopyArtifact plugin (awesome plug-in >>> BTW). However, since it is hard to predict the next remote call to >>> hang the job/slave, it seems more appropriate for us to fix the >>> underlying issue than applying/multiplying the fix all over :) >>> >>> >> >> yup.. was just wondering if some workaround code might be easier to >> maintain for a little while than a forked Hudson... >> >> - Alan >> >> >> --------------------------------------------------------------------- To >> unsubscribe, e-mail: [hidden email] >> For additional commands, e-mail: [hidden email] >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [hidden email] >> For additional commands, e-mail: [hidden email] >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
|
In reply to this post by Kohsuke Kawaguchi
Since the patch, things seem to actually have gone worse. Archiving hangs more often than ever, but we also see hangs during other phases. Sometimes in the middle of a build. A very typical stack-dump looks like this: ----------------------------- Example 1, Hang on archiving: ----------------------------- On master: ----------- Channel reader thread: slave001 "Channel reader thread: slave001" Id=742 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:199) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked java.io.BufferedInputStream@6c191b43 at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249) at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542) at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.Channel$ReaderThread.run(Channel.java:925) Executor #0 for slave001 : executing Nightly_Project #18 "Executor #0 for slave001 : executing Nightly_Project #18" Id=214 Group=main TIMED_WAITING on [B@5f91f3e1 at java.lang.Object.wait(Native Method) - waiting on [B@5f91f3e1 at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:171) at hudson.util.HeadBufferingStream.read(HeadBufferingStream.java:61) at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:221) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141) at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92) at org.apache.tools.tar.TarBuffer.readBlock(TarBuffer.java:257) at org.apache.tools.tar.TarBuffer.readRecord(TarBuffer.java:223) at hudson.org.apache.tools.tar.TarInputStream.read(TarInputStream.java:345) at java.io.FilterInputStream.read(FilterInputStream.java:90) at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025) at org.apache.commons.io.IOUtils.copy(IOUtils.java:999) at hudson.util.IOUtils.copy(IOUtils.java:33) at hudson.FilePath.readFromTar(FilePath.java:1565) at hudson.FilePath.copyRecursiveTo(FilePath.java:1491) at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:117) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19) at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601) at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580) at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558) at hudson.model.Build$RunnerImpl.post2(Build.java:157) at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528) at hudson.model.Run.run(Run.java:1303) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:137) Pipe writer thread: slave001 "Pipe writer thread: slave001" Id=1071 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42218847 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42218847 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) stderr copier for remote agent on slave001 "stderr copier for remote agent on slave001" Id=314 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:177) at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60) Slave: ------ Thread Dump Channel reader thread: channel "Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:199) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked java.io.BufferedInputStream@10fbabc at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249) at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542) at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.Channel$ReaderThread.run(Channel.java:925) main "main" Id=1 Group=main WAITING on hudson.remoting.Channel@bfea1d at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.Channel@bfea1d at java.lang.Object.wait(Object.java:485) at hudson.remoting.Channel.join(Channel.java:706) at hudson.remoting.Launcher.main(Launcher.java:419) at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365) at hudson.remoting.Launcher.run(Launcher.java:205) at hudson.remoting.Launcher.main(Launcher.java:167) Ping thread for channel hudson.remoting.Channel@bfea1d:channel "Ping thread for channel hudson.remoting.Channel@bfea1d:channel" Id=10 Group=main TIMED_WAITING at java.lang.Thread.sleep(Native Method) at hudson.remoting.PingThread.run(PingThread.java:86) Pipe writer thread: channel "Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1d934a6 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1d934a6 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) pool-1-thread-43 "pool-1-thread-43" Id=308 Group=main WAITING on hudson.remoting.PipeWindow$Real@192aa08 at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.PipeWindow$Real@192aa08 at java.lang.Object.wait(Object.java:485) at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked java.io.BufferedOutputStream@8bd2f5 at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72) - locked java.util.zip.GZIPOutputStream@19d6ea2 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) - locked hudson.util.io.TarArchiver$1@1beddbb at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410) at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388) at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337) at hudson.util.io.TarArchiver.visit(TarArchiver.java:103) at hudson.util.DirScanner$Glob.scan(DirScanner.java:104) at hudson.FilePath.writeToTar(FilePath.java:1543) at hudson.FilePath.access$800(FilePath.java:159) at hudson.FilePath$33.invoke(FilePath.java:1484) at hudson.FilePath$33.invoke(FilePath.java:1481) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@1cf2917 pool-1-thread-59 "pool-1-thread-59" Id=496 Group=main WAITING on hudson.remoting.PipeWindow$Real@19b2abb at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.PipeWindow$Real@19b2abb at java.lang.Object.wait(Object.java:485) at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked java.io.BufferedOutputStream@d6bfd4 at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72) - locked java.util.zip.GZIPOutputStream@2a0635 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) - locked hudson.util.io.TarArchiver$1@19aea1a at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410) at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388) at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337) at hudson.util.io.TarArchiver.visit(TarArchiver.java:103) at hudson.util.DirScanner$Glob.scan(DirScanner.java:104) at hudson.FilePath.writeToTar(FilePath.java:1543) at hudson.FilePath.access$800(FilePath.java:159) at hudson.FilePath$33.invoke(FilePath.java:1484) at hudson.FilePath$33.invoke(FilePath.java:1481) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@17f8f55 pool-1-thread-62 "pool-1-thread-62" Id=503 Group=main WAITING on hudson.remoting.PipeWindow$Real@1ec2d3b at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.PipeWindow$Real@1ec2d3b at java.lang.Object.wait(Object.java:485) at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked java.io.BufferedOutputStream@b3e1b4 at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72) - locked java.util.zip.GZIPOutputStream@1eebddc at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) - locked hudson.util.io.TarArchiver$1@c51653 at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410) at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388) at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337) at hudson.util.io.TarArchiver.visit(TarArchiver.java:103) at hudson.util.DirScanner$Glob.scan(DirScanner.java:104) at hudson.FilePath.writeToTar(FilePath.java:1543) at hudson.FilePath.access$800(FilePath.java:159) at hudson.FilePath$33.invoke(FilePath.java:1484) at hudson.FilePath$33.invoke(FilePath.java:1481) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@13caab8 pool-1-thread-73 "pool-1-thread-73" Id=518 Group=main RUNNABLE at sun.management.ThreadImpl.dumpThreads0(Native Method) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374) at hudson.Functions.getThreadInfos(Functions.java:775) at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:75) at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:71) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@d941af Finalizer "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@1ac9043 at java.lang.Object.wait(Native Method) - waiting on java.lang.ref.ReferenceQueue$Lock@1ac9043 at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) Reference Handler "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@332be0 at java.lang.Object.wait(Native Method) - waiting on java.lang.ref.Reference$Lock@332be0 at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) Signal Dispatcher "Signal Dispatcher" Id=4 Group=system RUNNABLE ------------------------------------ Example 2, Hang on Emma publisher: ------------------------------------ Master: -------- Channel reader thread: slave002 "Channel reader thread: slave002" Id=848 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:199) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked java.io.BufferedInputStream@20ec980d at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249) at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542) at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.Channel$ReaderThread.run(Channel.java:925) Executor #0 for slave002 : executing Nightly_Project_2 #19 / waiting for hudson.remoting.Channel@1291a472:slave002 "Executor #0 for slave002 : executing Nightly_Project_2 #19 / waiting for hudson.remoting.Channel@1291a472:slave002" Id=381 Group=main TIMED_WAITING on hudson.remoting.UserRequest@60459d77 at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.UserRequest@60459d77 at hudson.remoting.Request.call(Request.java:122) at hudson.remoting.Channel.call(Channel.java:607) at hudson.FilePath.act(FilePath.java:742) at hudson.FilePath.act(FilePath.java:735) at hudson.FilePath.copyTo(FilePath.java:1360) at hudson.FilePath.copyTo(FilePath.java:1335) at hudson.plugins.emma.EmmaPublisher.saveCoverageReports(EmmaPublisher.java:91) at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:123) at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36) at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601) at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580) at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558) at hudson.model.Build$RunnerImpl.post2(Build.java:157) at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528) at hudson.model.Run.run(Run.java:1303) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:137) Pipe writer thread: slave002 "Pipe writer thread: slave002" Id=971 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6577a5f4 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6577a5f4 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) stderr copier for remote agent on slave002 "stderr copier for remote agent on slave002" Id=414 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:177) at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60) Slave: ------ Thread Dump /bin/sh -xe /tmp/hudson5168499638544882529.sh: stdout copier "/bin/sh -xe /tmp/hudson5168499638544882529.sh: stdout copier" Id=325 Group=main WAITING on hudson.remoting.PipeWindow$Real@1f61ce1 at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.PipeWindow$Real@1f61ce1 at java.lang.Object.wait(Object.java:485) at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103) at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:110) at hudson.util.StreamCopyThread.run(StreamCopyThread.java:61) Channel reader thread: channel "Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native) at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:199) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked java.io.BufferedInputStream@507660 at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249) at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542) at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at hudson.remoting.Channel$ReaderThread.run(Channel.java:925) main "main" Id=1 Group=main WAITING on hudson.remoting.Channel@bfea1d at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.Channel@bfea1d at java.lang.Object.wait(Object.java:485) at hudson.remoting.Channel.join(Channel.java:706) at hudson.remoting.Launcher.main(Launcher.java:419) at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365) at hudson.remoting.Launcher.run(Launcher.java:205) at hudson.remoting.Launcher.main(Launcher.java:167) Ping thread for channel hudson.remoting.Channel@bfea1d:channel "Ping thread for channel hudson.remoting.Channel@bfea1d:channel" Id=10 Group=main TIMED_WAITING at java.lang.Thread.sleep(Native Method) at hudson.remoting.PingThread.run(PingThread.java:86) Pipe writer thread: channel "Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@e20b3f at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@e20b3f at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) pool-1-thread-134 "pool-1-thread-134" Id=837 Group=main RUNNABLE at sun.management.ThreadImpl.dumpThreads0(Native Method) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374) at hudson.Functions.getThreadInfos(Functions.java:775) at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:75) at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:71) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@1481207 pool-1-thread-83 "pool-1-thread-83" Id=548 Group=main WAITING on hudson.remoting.PipeWindow$Real@fd8c0e at java.lang.Object.wait(Native Method) - waiting on hudson.remoting.PipeWindow$Real@fd8c0e at java.lang.Object.wait(Object.java:485) at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133) at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108) at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103) at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:110) at hudson.Util.copyStream(Util.java:418) at hudson.FilePath$30.invoke(FilePath.java:1365) at hudson.FilePath$30.invoke(FilePath.java:1360) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899) at hudson.remoting.UserRequest.perform(UserRequest.java:114) at hudson.remoting.UserRequest.perform(UserRequest.java:48) at hudson.remoting.Request$2.run(Request.java:270) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@1e368ec Finalizer "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@8bb582 at java.lang.Object.wait(Native Method) - waiting on java.lang.ref.ReferenceQueue$Lock@8bb582 at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) Reference Handler "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@397fa0 at java.lang.Object.wait(Native Method) - waiting on java.lang.ref.Reference$Lock@397fa0 at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) Signal Dispatcher "Signal Dispatcher" Id=4 Group=system RUNNABLE Best Regards Gustaf Lundh |
|
I see similar failures. Multiple hangs, mostly in archiving, but
sometimes it appears to hang when writing to stdout. -P On Tue, 21 Sep 2010 04:53:03 -0400, GLundh <[hidden email]> wrote: > Kohsuke Kawaguchi wrote: >> >> My plan was to cut the RC branch today and then merge back those >> changes to the trunk. >> >> I agree with the importance of the fix. >> >> -- >> Kohsuke Kawaguchi >> > Since the patch, things seem to actually have gone worse. Archiving hangs > more often than ever, but we also see hangs during other phases. > Sometimes > in the middle of a build. --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
| Powered by Nabble | Edit this page |
