60+ hours... is this normal?

jamie
Joined: 23 Jun 11
Posts: 2
Credit: 23,076
RAC: 9
Message 2050 - Posted: 25 Jun 2012, 23:21:13 UTC

I thought that these longer tasks were supposed to end at 48 hours? I don't mind letting it go but I am curious if there is a problem.

i7-2600 @ 3.4 GHz

Tom
Volunteer moderator
Project administrator
Project developer
Joined: 23 Jun 08
Posts: 303
Credit: 105,388
RAC: 0
Message 2051 - Posted: 25 Jun 2012, 23:52:46 UTC - in response to Message 2050.
Last modified: 26 Jun 2012, 0:00:21 UTC

Workunits should definitely NOT take that long. For a quick fix, I suggest either aborting the task or restarting the client. We have seen this issue of workunits getting "stuck" in the past, but have not yet narrowed down the cause. Some other users are currently experiencing similar symptoms (on their Windows hosts).

See http://mindmodeling.org/forum_thread.php?id=557&nowrap=true#1941 or http://mindmodeling.org/forum_thread.php?id=569 for related threads.

And as always, I will post updates as soon as they become available. Thanks for your contributions to the project!!!

jamie
Joined: 23 Jun 11
Posts: 2
Credit: 23,076
RAC: 9
Message 2053 - Posted: 26 Jun 2012, 2:21:21 UTC - in response to Message 2051.

I suspended the task and rebooted the Windows 7 PC (it had been up for a few months). When I relaunched BOINC, the tasks cleared and either errored out or were marked invalid with "Completed, too late to validate".

vaughan
Joined: 27 Jan 08
Posts: 2
Credit: 191,746
RAC: 24
Message 2054 - Posted: 8 Jul 2012, 3:40:05 UTC

Tom, I still get some tasks that get stuck at 100% on Win 7 64-bit.

Should we just abort them if they run > 1 hour? Most tasks complete in 15-25 minutes on a Q6600.

Jack.Harris
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 24 Apr 07
Posts: 499
Credit: 626,319
RAC: 27
Message 2055 - Posted: 8 Jul 2012, 18:49:44 UTC - in response to Message 2054.
Last modified: 8 Jul 2012, 18:50:38 UTC

vaughan - my 4-year-old MacBook takes about an hour to crunch a single WU

Model Identifier: MacBook5,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 3 MB
Memory: 4 GB
Bus Speed: 1.07 GHz

Each WU could have some (but not much) variance in crunch time.

However, I would first try to pause and unpause the task, and if that doesn't work, try to shut down and restart BOINC -- so that the process will have to recover from the checkpoint.

I don't want you to lose any credit unjustly -- hopefully the above actions will 'unstick' things.

If not ... and it has been running for multiple hours -- abort.

Cheers and thanks for all the crunching help,
Jack
____________
MindModeling@Home is fun

Gehirn
Joined: 29 Jul 09
Posts: 1
Credit: 11,945
RAC: 0
Message 2059 - Posted: 10 Jul 2012, 7:16:35 UTC

I have some never-ending tasks. I abort them.
The normal runtime on my computer is less than half an hour.
Greetings.

John P. Myers
Joined: 14 Apr 10
Posts: 2
Credit: 10,130
RAC: 0
Message 2060 - Posted: 11 Jul 2012, 6:32:03 UTC - in response to Message 2059.

I have some never-ending tasks. I abort them.
The normal runtime on my computer is less than half an hour.
Greetings.


Yeah, that's fine if you're checking your computer every 30 minutes. I just had to abort 3 more WUs that were at 8, 10 and 11 hours of runtime. Wasted half my day, basically. This is ridiculous. Is it really too much to ask to get this fixed, or at the very least to add a line in the code like:
IF CurRuntime > 3600 THEN CALL ForceCompletion
or whatever equivalent that does the same thing.

I like the project but I can't be wasting my CPU time and electricity like this. I'm sorry.
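
For illustration, the one-hour cap proposed in this post could be sketched roughly as follows. This is only a sketch under stated assumptions: `Task`, `tasks_over_limit` and the sample task names are hypothetical, and nothing here uses the real BOINC client or project code. It only shows the comparison the pseudocode describes.

```python
from dataclasses import dataclass

# One-hour cap, matching the 3600 seconds in the pseudocode above.
MAX_RUNTIME_SECONDS = 3600


@dataclass
class Task:
    """Hypothetical stand-in for a running work unit."""
    name: str
    elapsed_seconds: float


def tasks_over_limit(tasks, limit=MAX_RUNTIME_SECONDS):
    """Return the names of tasks whose elapsed time exceeds the limit."""
    return [t.name for t in tasks if t.elapsed_seconds > limit]


# Sample data loosely based on the runtimes mentioned in the post.
tasks = [
    Task("wu_a", 20 * 60),      # a normal 20-minute task
    Task("wu_b", 8 * 3600),     # stuck for 8 hours
    Task("wu_c", 11 * 3600),    # stuck for 11 hours
]

overdue = tasks_over_limit(tasks)
print(overdue)
```

What to do with an overdue task (abort it, force completion, or restart from a checkpoint) is a separate decision that the sketch deliberately leaves open.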

Tex1954
Joined: 12 Jun 12
Posts: 14
Credit: 203,400
RAC: 97
Message 2061 - Posted: 11 Jul 2012, 22:10:12 UTC
Last modified: 11 Jul 2012, 22:50:12 UTC

I discovered that with 100% repeatability, if the task is suspended and restarts for ANY reason on my systems, with keep tasks in memory NOT SELECTED, then the tasks will go into overtime while using ZERO CPU time and never finish properly. I've proved this on 3 systems. I have NOT tested this with Keep Suspended tasks in memory option set.

A Boinc Stop/Start when all the other tasks are finished will then cause the task to report, but likely get some error. Also, any other project that bumps/suspends an active mindmodel task causes the same problem... ANY reason I can find that the task starts/stops/starts causes the problem; this includes running CPU benchmarks...

Naturally, stopping/starting boinc once the task has started causes the same problem on ALL tasks already started. However, suspending UN-STARTED tasks and letting the active tasks finish before rebooting or restarting BOINC works fine...


:)

PS: Sometimes they do it when other project tasks start WITHOUT suspending an MM task... soooo, something strange in exit land on the MM tasks...

I had one task go for 1.5 hrs at 100% done with zero CPU used. I suspended the waiting new tasks, restarted BOINC, and now that task reported invalid signature... same as the others.


Win7-Compaq

51 MindModeling@Beta 7/11/2012 5:44:03 PM Restarting task MindModeling-175-4ffc4ade1b69a_0 using ccl_wrap version 175 (sse2) in slot 2
52 MindModeling@Beta 7/11/2012 5:44:16 PM Computation for task MindModeling-175-4ffc4ade1b69a_0 finished
53 MindModeling@Beta 7/11/2012 5:44:25 PM Started upload of MindModeling-175-4ffc4ade1b69a_0_0
54 MindModeling@Beta 7/11/2012 5:44:27 PM [error] Error reported by file upload server: invalid signature
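
The symptom reported in this post (elapsed time keeps growing while the task uses zero CPU time) can be turned into a simple stall check. The sketch below is an assumption-laden illustration, not project code: `is_stalled`, its thresholds and the sample readings are all hypothetical, and how the (wall-clock, CPU) readings are collected is left out.

```python
def is_stalled(samples, min_wall_seconds=600, min_cpu_delta=1.0):
    """Decide whether a task looks stalled.

    samples: list of (wall_seconds, cpu_seconds) readings in time order.
    A task is treated as stalled when at least `min_wall_seconds` of
    wall-clock time passed between the first and last reading while the
    CPU time advanced by less than `min_cpu_delta` seconds.
    Both thresholds are arbitrary illustrative values.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    (wall_start, cpu_start) = samples[0]
    (wall_end, cpu_end) = samples[-1]
    wall_elapsed = wall_end - wall_start
    cpu_used = cpu_end - cpu_start
    return wall_elapsed >= min_wall_seconds and cpu_used < min_cpu_delta


# 15 minutes of wall-clock time, almost no CPU time: looks stalled.
print(is_stalled([(0, 100.0), (900, 100.2)]))
# 15 minutes of wall-clock time, CPU time advancing normally: healthy.
print(is_stalled([(0, 100.0), (900, 950.0)]))
```

A check like this only flags the zero-CPU case described here; tasks that hang while still using 100% CPU would need a progress-based check instead.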

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 50,952
RAC: 0
Message 2064 - Posted: 12 Jul 2012, 17:41:37 UTC - in response to Message 2061.

I discovered that with 100% repeatability, if the task is suspended and restarts for ANY reason on my systems, with keep tasks in memory NOT SELECTED, then the tasks will go into overtime while using ZERO CPU time and never finish properly.


Actually, way back in 2008(!) a suspend/resume bug was found, and never fixed AFAIK. There was a problem with the checkpointing meaning that only the last part of the run of the WU was reported, meaning the credit calculation was way too low.

Maybe it was fixed, but introduced this bug.

I never used to use 'keep tasks in memory', but this and another project at the time needed that option selected to run properly, so I've left it enabled since then.

FWIW, I'm not sure that there's much of a downside to keeping tasks in memory, other than that it uses more swap space. Maybe it's important if you have your swap on an SSD?

Cheers,

Al.

Tex1954
Joined: 12 Jun 12
Posts: 14
Credit: 203,400
RAC: 97
Message 2065 - Posted: 13 Jul 2012, 17:35:52 UTC - in response to Message 2064.
Last modified: 13 Jul 2012, 17:36:25 UTC

I don't mind setting the flag to keep things in memory except when I am running tasks that use over 2Gig Mem each.

Other tasks I run use 1.2Gig Mem each and with a 6 core CPU, one runs out of RAM quickly... and Windows really sucks when it starts thrashing memory off the virtual drive...

Sigh... so, in most cases, I just leave it off...

Nevertheless, this is a Beta project... so things like checkpoint/save state file bugs are to be expected...

No worries... I'm sure the developers will fix it.

:)

AMDave
Joined: 6 Feb 08
Posts: 2
Credit: 100,161
RAC: 0
Message 2076 - Posted: 24 Jul 2012, 8:11:29 UTC

FYI
I am seeing numerous WU's stuck at various % completion for 8.5+ hours and rising, in the current run.

Occurring on Win7-64 and Linux-64, Intel and AMD
BOINC 7.0.28 (current stable)

I have aborted the WUs now as the current advice and observed norm has been 20 mins to 1 hour on these machines.

Do we have advice that the current run has moved away from the norm and to expect much longer WU's?

Tom
Volunteer moderator
Project administrator
Project developer
Joined: 23 Jun 08
Posts: 303
Credit: 105,388
RAC: 0
Message 2082 - Posted: 24 Jul 2012, 14:39:20 UTC - in response to Message 2076.

For the current batch of workunits (i.e. jobs prefixed with "veksler"), see http://mindmodeling.org/forum_thread.php?id=554&nowrap=true#2080 regarding variable workunit runtimes. We've since pulled these jobs from the system. For reference, these workunits are prefixed with MindModeling-188, MindModeling-190, and MindModeling-192. The number on the end corresponds to an internal job ID.

The other issue mentioned on this thread, regarding hung processes at 100% completion, is still under investigation. We really appreciate the help you're providing, especially Tex1954, who gave us steps to reproduce the bug! Unfortunately, when tested in our dev environment under the conditions outlined, we were only able to reproduce the problem 5-10% of the time. This has made it difficult to debug and to know with certainty whether the bug has been resolved. We have implemented a fix, however, and will be pushing it out shortly (I'll let you know when exactly). As always, your continued feedback helps us immensely when addressing these issues.

Thanks!!

Gary Wilson
Joined: 25 Nov 08
Posts: 50
Credit: 883,268
RAC: 238
Message 2084 - Posted: 24 Jul 2012, 18:10:33 UTC - in response to Message 2082.

As an FYI, a number of the 191 tasks are also experiencing long run times > 1 hour. I've been aborting anything that runs over 1 hour while its remaining time is still increasing (the remaining time is usually over an hour as well). I've got a few that I'm going to let run on my Win7 Q6600 machine just to see if they ever complete or if, like the earlier tasks, they run 4+ hours.

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 50,952
RAC: 0
Message 2085 - Posted: 24 Jul 2012, 18:28:11 UTC - in response to Message 2084.

As an FYI, a number of the 191 tasks are also experiencing long run times > 1 hour


Yes, I have the same thing. Quite a few 191s crunching for 7-12 hours, still using 100% CPU, but looking to complete way after deadline. I've aborted a bunch that hadn't started, and some that were clearly going to run far too long. For educational (comedy?) purposes, I've left a few to run. I think they might complete, even if after deadline. Might learn something. Might not. Whatever :)

Cheers,

Al.

fractal
Joined: 26 Jul 08
Posts: 3
Credit: 53,562
RAC: 2
Message 2086 - Posted: 24 Jul 2012, 19:29:42 UTC

I have a couple of 191's that went high priority after lots of hours. I did not write down exactly how many. I found this thread and shut down boinc to restart it. Boinc exited but the worker threads were still running according to ps, so I rebooted the machine.

Both work units restarted from 0 on reboot and are currently running high priority. I will give them an hour or two then abort them.

Gary Wilson
Joined: 25 Nov 08
Posts: 50
Credit: 883,268
RAC: 238
Message 2087 - Posted: 24 Jul 2012, 20:31:07 UTC - in response to Message 2086.
Last modified: 24 Jul 2012, 20:57:27 UTC

A couple of the ones I had on the Q6600 finished in about 2 hours. I have one, though, that is at 3:09 elapsed and 2:45 remaining (h:m), with both increasing at about the same rate. Going to abort that one.

On the ones that finished, the remaining time kept increasing, then suddenly dropped 15-30 minutes, then started increasing again. As long as it falls off a cliff farther than it climbs back up before falling again, it eventually finishes.

Edit:
I also found a number of the 1.77*.exe applications left running in the background, with a number of them still consuming CPU. Checking the boincmgr event log, it's filled with "app reporting negative CPU:-1.000000" error messages. Presumably this is because all those background tasks were taking CPU as well as the ones BOINC thought it was managing. So the amount of CPU actually available to BOINC was zero (all consumed by stuck apps). But Windows just happily keeps scheduling them to run. Probably the same on Linux as well.

Maybe this explains the longer run times as well since the elapsed and remaining times kept increasing at 1 second intervals, but clearly, it can't get 1 second of actual work done in that timeframe due to the other apps also consuming CPU.
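
The sawtooth in the remaining-time estimate described in this post (it climbs, then drops by a larger amount, then climbs again) can be checked with simple arithmetic: the estimate only reaches zero if each drop is larger than the climb before it. The sketch below is a toy model with hypothetical names and made-up numbers, not a description of how BOINC computes its estimate.

```python
def cycles_to_finish(remaining_min, climb_min, drop_min, max_cycles=1000):
    """Count climb-then-drop cycles until the estimate reaches zero.

    Each cycle the estimate first climbs by `climb_min` minutes and then
    drops by `drop_min` minutes. Returns the number of cycles needed, or
    None if the estimate never converges (drop not larger than climb, or
    the cycle limit is hit).
    """
    if drop_min <= climb_min:
        return None  # the estimate never shrinks on balance
    cycles = 0
    while remaining_min > 0 and cycles < max_cycles:
        remaining_min += climb_min   # estimate creeps upward
        remaining_min -= drop_min    # then falls off the cliff
        cycles += 1
    return cycles if remaining_min <= 0 else None


# Starts at 60 minutes, climbs 10 and drops 25 per cycle (net -15).
print(cycles_to_finish(60, 10, 25))
# Climbs 20 but only drops 15 per cycle: never finishes.
print(cycles_to_finish(60, 20, 15))
```

In the first case the net change is 15 minutes per cycle, so the estimate converges; in the second the task would be a candidate for aborting.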

Gary Wilson
Joined: 25 Nov 08
Posts: 50
Credit: 883,268
RAC: 238
Message 2088 - Posted: 24 Jul 2012, 21:04:17 UTC - in response to Message 2087.

Just as a test, I aborted a long running task on Linux and it leaves behind the extra 1.77 process. So I now have 5 tasks - ah, it's a dual core - running. Since I've had to abort a few tasks on this machine, I suspect these have just been building up. I had to do that more on the Windows machine, so the problem was actually worse there.
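
One way to spot leftover worker processes like these on Linux is to look for workers whose parent is no longer the BOINC client. The sketch below parses text in the format printed by `ps -eo pid,ppid,comm` and reports matching processes that have been reparented to PID 1. Everything here is an assumption for illustration: `find_orphans`, the process name `mm_worker` and the sample listing are hypothetical, and on some systems orphans are adopted by a process other than PID 1, so treat the result as a hint rather than proof.

```python
def find_orphans(ps_output, name_fragment):
    """Return PIDs of processes matching `name_fragment` whose parent is PID 1.

    ps_output: text in the format produced by `ps -eo pid,ppid,comm`,
    including the header line.
    """
    orphans = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # malformed or empty line
        pid, ppid, comm = parts
        if name_fragment in comm and ppid == "1":
            orphans.append(int(pid))
    return orphans


# Hypothetical listing: one worker still owned by the client (PID 2101),
# two workers left behind and reparented to PID 1.
SAMPLE_PS = """\
  PID  PPID COMMAND
    1     0 init
 2101     1 boinc
 2150  2101 mm_worker
 2207     1 mm_worker
 2290     1 mm_worker
"""

print(find_orphans(SAMPLE_PS, "mm_worker"))
```

The sample text is passed in as a string so the parsing can be checked without running `ps`; on a real machine the same function could be fed the command's actual output.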

fractal
Joined: 26 Jul 08
Posts: 3
Credit: 53,562
RAC: 2
Message 2089 - Posted: 24 Jul 2012, 23:08:45 UTC - in response to Message 2086.
Last modified: 24 Jul 2012, 23:12:21 UTC

I have a couple of 191's that went high priority after lots of hours. I did not write down exactly how many. I found this thread and shut down boinc to restart it. Boinc exited but the worker threads were still running according to ps, so I rebooted the machine.

Both work units restarted from 0 on reboot and are currently running high priority. I will give them an hour or two then abort them.

Both units finished after a system reboot that caused them to start from scratch. They both finished in under an hour runtime. One of them previously ran for over ten hours and was using 1G of RAM and was hung at 75%. Both of the units had been aborted three other times.

edit: I should probably mention this is on a 64 bit linux machine.

Tex1954
Joined: 12 Jun 12
Posts: 14
Credit: 203,400
RAC: 97
Message 2090 - Posted: 25 Jul 2012, 1:58:10 UTC - in response to Message 2089.
Last modified: 25 Jul 2012, 2:13:18 UTC

I have a couple of 191's that went high priority after lots of hours. I did not write down exactly how many. I found this thread and shut down boinc to restart it. Boinc exited but the worker threads were still running according to ps, so I rebooted the machine.

Both work units restarted from 0 on reboot and are currently running high priority. I will give them an hour or two then abort them.

Both units finished after a system reboot that caused them to start from scratch. They both finished in under an hour runtime. One of them previously ran for over ten hours and was using 1G of RAM and was hung at 75%. Both of the units had been aborted three other times.

edit: I should probably mention this is on a 64 bit linux machine.


I have the same problem I discovered... 6 WU's going way too long and stuck..


MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500dedafb1b51_1 21:14:39 (21:10:07) 87.500 99.64 03:01:44 7/24/2012 11:37:13 PM Running High P. Linux-Compaq
MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500e563ba5995_1 15:56:51 (15:53:15) 87.500 99.62 02:16:25 7/25/2012 4:58:35 AM Running High P. Linux-Compaq
MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500dd87e36702_0 20:55:07 (20:50:35) 37.500 99.64 01d,02:21:10 7/24/2012 11:37:13 PM Running High P. Linux-Compaq
MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500e5b311a770_0 17:17:07 (17:13:20) 37.500 99.63 21:46:32 7/25/2012 3:32:59 AM Running High P. Linux-Compaq
MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500e14ed6828e_4 15:51:46 (15:48:11) 37.500 99.62 19:59:00 7/25/2012 4:58:35 AM Running High P. Linux-Compaq
MindModeling@Beta 1.77 ACT-R cognitive modeling environment leveraging Clozure Common Lisp (sse2) MindModeling-191-500dd7fd3a125_0 19:46:02 (19:41:38) 25.000 99.63 01d,10:17:03 7/25/2012 1:03:28 AM Running High P. Linux-Compaq


I will try the reboot thing...

:)

Update: Reboot worked. Looks like there is some sync problem that leaves some old stuff running in memory or something... I have no real idea, but this is Linux-64b 3.0.23 running on a 1055T processor.

Gary Wilson
Joined: 25 Nov 08
Posts: 50
Credit: 883,268
RAC: 238
Message 2091 - Posted: 25 Jul 2012, 4:26:34 UTC - in response to Message 2090.

There are still quite a few tasks that have very long run times. I suspect that in the morning I will have most cores stuck running these and wind up having to kill them.



Copyright © 2013 MindModeling.org