Message boards : Number crunching : New task mostly resulting in computation errors

Profile Jack.Harris
Project administrator
Project developer
Project scientist
Joined: 24 Apr 07
Posts: 507
Credit: 761,261
RAC: 0
Message 2042 - Posted: 23 Jun 2012, 19:21:21 UTC - in response to Message 2041.

That file that it can't find is an error log.
We should probably produce an empty error log when no errors occur.
That would keep BOINC from producing this warning message when it can't find and rename a missing error log on good work units.

It is an annoying message that looks like an error but the results are still good.
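The "always produce an error log" idea above could look something like this. It is only a minimal sketch in Python, not the project's actual wrapper code; the function name `ensure_error_log` and the file name in the usage line are made up for illustration.

```python
from pathlib import Path


def ensure_error_log(path):
    """Create an empty error log if the run produced none.

    An existing log is left untouched (touch() never truncates), so real
    error output is preserved.
    """
    log = Path(path)
    log.parent.mkdir(parents=True, exist_ok=True)
    log.touch(exist_ok=True)
    return log


if __name__ == "__main__":
    # Hypothetical file name; call this just before the task reports completion.
    ensure_error_log("error.log")
```

With a step like this at the end of every run, the rename would always find a file, so good work units would no longer trigger the warning.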

Thanks for all the crunching; it is very much appreciated!
____________
MindModeling@Home is fun

Profile ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 100,251
RAC: 0
Message 2043 - Posted: 23 Jun 2012, 20:21:05 UTC - in response to Message 2042.

It is an annoying message that looks like an error but the results are still good.


Thanks for the quick feedback, especially at the weekend! I'll keep on crunching :)

Cheers,

Al.


Profile KPX
Joined: 7 Feb 08
Posts: 2
Credit: 149,799
RAC: 0
Message 2044 - Posted: 24 Jun 2012, 6:52:59 UTC

So how long are the units supposed to run? They have an initial estimate of 15 minutes, and one of them now runs for 20+ hours!!! What should I do, abort? Isn't it a waste of computer time? Or is it doing something relevant? Please respond, I'll do as advised.

Profile Jack.Harris
Project administrator
Project developer
Project scientist
Joined: 24 Apr 07
Posts: 507
Credit: 761,261
RAC: 0
Message 2045 - Posted: 24 Jun 2012, 11:20:08 UTC - in response to Message 2044.

Definitely abort.
Something must have gone very wrong for that WU if it is at 20+ hours.

Thanks for the crunching support and sorry for the rogue WU.

____________
MindModeling@Home is fun

Tex1954
Joined: 12 Jun 12
Posts: 52
Credit: 2,525,800
RAC: 33
Message 2046 - Posted: 24 Jun 2012, 14:29:55 UTC - in response to Message 2045.

Definitely abort.
Something must have gone very wrong for that WU if it is at 20+ hours.

Thanks for the crunching support and sorry for the rogue WU.


I had 2 work units sort of stop overnight. One had run 12 hours and one over 4 hours. They were NOT using any CPU resources!

Sooo, I exited BOINC Manager and restarted it, and both WUs started again, ran a few seconds, then reported complete without error.

Windows 7 64b OS, BOINC Manager 7.0.28, AMD 1055T, 8-Gig RAM.

Looks like there might be a problem when it completes and doesn't exit/report properly.

:D

Profile mimeq
Joined: 29 Mar 11
Posts: 2
Credit: 12,953
RAC: 0
Message 2047 - Posted: 24 Jun 2012, 14:43:47 UTC - in response to Message 2046.


I had 2 work units sort of stop overnight. One had run 12 hours and one over 4 hours. They were NOT using any CPU resources!


I have the same problem:



http://mindmodeling.org/beta/results.php?hostid=20605&offset=0&show_names=0&state=5&appid=

Tex1954
Joined: 12 Jun 12
Posts: 52
Credit: 2,525,800
RAC: 33
Message 2048 - Posted: 24 Jun 2012, 14:48:06 UTC - in response to Message 2047.
Last modified: 24 Jun 2012, 15:11:42 UTC


I had 2 work units sort of stop overnight. One had run 12 hours and one over 4 hours. They were NOT using any CPU resources!


I have the same problem:



Woopsy, looks like they may have reported, but something went wrong...


Win7-Compaq

158 MindModeling@Beta 6/24/2012 9:37:36 AM Started upload of MindModeling-164-4fe635dd6b516_0_0
159 MindModeling@Beta 6/24/2012 9:37:37 AM [error] Error reported by file upload server: invalid signature
160 MindModeling@Beta 6/24/2012 9:37:37 AM Giving up on upload of MindModeling-164-4fe635dd6b516_0_0: permanent upload error
161 MindModeling@Beta 6/24/2012 9:37:40 AM Sending scheduler request: To report completed tasks.
162 MindModeling@Beta 6/24/2012 9:37:40 AM Reporting 1 completed tasks, not requesting new tasks
163 MindModeling@Beta 6/24/2012 9:37:42 AM Scheduler request completed
164 MindModeling@Beta 6/24/2012 9:37:51 AM Computation for task MindModeling-164-4fe63556f2e86_0 finished
165 MindModeling@Beta 6/24/2012 9:37:51 AM Starting task MindModeling-164-4fe63551dc7ed_0 using ccl_wrap version 175 (sse2) in slot 4
166 MindModeling@Beta 6/24/2012 9:37:58 AM Started upload of MindModeling-164-4fe63556f2e86_0_0
167 MindModeling@Beta 6/24/2012 9:37:59 AM [error] Error reported by file upload server: invalid signature
168 MindModeling@Beta 6/24/2012 9:37:59 AM Giving up on upload of MindModeling-164-4fe63556f2e86_0_0: permanent upload error

However, looks like the task actually reported correctly at some point... found them in the valid tasks display... sigh.. really weird...

http://mindmodeling.org/beta/result.php?resultid=556949
http://mindmodeling.org/beta/result.php?resultid=556841

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2049 - Posted: 24 Jun 2012, 20:25:47 UTC - in response to Message 2048.

I've had several stuck tasks recently (oddly, none overnight). It seems to occur less now with the longer-running tasks, but it still happens. As in previous batches, this problem only occurs on Windows machines. Once the problem starts, the task just reports "failed to create shared mem segment" once a second.

Profile Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2052 - Posted: 26 Jun 2012, 0:11:39 UTC - in response to Message 2049.
Last modified: 26 Jun 2012, 0:12:44 UTC

The "shared mem segment" message is misleading. Our BOINC wrapper prints it (in a loop) whenever it is unable to create a shared memory segment for the screensaver app, regardless of whether the WU is "stuck" or not. This error message will be removed from newer versions of our application. For now, though, seeing it repeat may suggest that the BOINC wrapper is waiting for its hung subprocess to exit, or is failing to recognize that the subprocess exited cleanly.
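To illustrate the "waiting for a hung subprocess" case in general terms, here is a minimal sketch in plain Python. It is not the project's wrapper and does not use any BOINC API; the helper name `run_with_deadline` and the timeout values are assumptions chosen for the example.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it had to be killed.

    A parent that waits with no deadline will wait forever on a hung
    child; waiting with a timeout lets it notice and clean up instead.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the killed child so it does not linger
        return None


if __name__ == "__main__":
    # A child that exits at once, then one that sleeps past its deadline.
    print(run_with_deadline([sys.executable, "-c", "pass"], 30))
    print(run_with_deadline([sys.executable, "-c", "import time; time.sleep(60)"], 1))
```

A real wrapper would need a deadline generous enough for legitimate long-running tasks, which is why this is only a sketch of the idea.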

More to come.

Tex1954
Joined: 12 Jun 12
Posts: 52
Credit: 2,525,800
RAC: 33
Message 2056 - Posted: 9 Jul 2012, 1:01:42 UTC
Last modified: 9 Jul 2012, 1:03:03 UTC

Maybe off topic, but would this server change have anything to do with the stats websites not picking up the points?

On BOINCStats and AllProjectStats I show no progress for the last few days... but here, I've about doubled what I had..

Anyways, I found a driver bug in two systems that caused some errors due to network access and updated them. Since then, plus two BIOS updates, it looks like things are moving smoothly now...

Had a couple errors after BIOS updates because I forgot to set some voltages correctly... That may have been the cause of the "stuck" tasks as well since I no longer have them... errors or sticky tasks...

:D

Tex1954
Joined: 12 Jun 12
Posts: 52
Credit: 2,525,800
RAC: 33
Message 2057 - Posted: 9 Jul 2012, 8:22:50 UTC

Welp, as of this morning, looks like the stats sites picked up the points... don't know why the delay...

:D

Profile Jack.Harris
Project administrator
Project developer
Project scientist
Joined: 24 Apr 07
Posts: 507
Credit: 761,261
RAC: 0
Message 2058 - Posted: 9 Jul 2012, 22:36:58 UTC - in response to Message 2057.
Last modified: 9 Jul 2012, 22:38:14 UTC

I saw last night that the stats weren't being pushed.
Evidently our 'dump_db' process had died (probably related to a disk becoming full) and the dead process left a lock file that blocked other iterations of dump_db from being executed.

dump_db controls writing out stats to file so that they can be collected by external services.

Anyway -- The dead lock file -- that was causing a deadlock -- has been removed and things should flow again.

Today we also dealt with the issue causing the filesystem to fill, so this shouldn't happen again soon.
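For anyone curious how a dead process can block later runs through a leftover lock file, and one common way around it, here is a minimal sketch in Python. It is not how dump_db itself handles locking; the function names are hypothetical, and the liveness check assumes a POSIX system (do not use `os.kill(pid, 0)` this way on Windows).

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(lock_path):
    """Take the lock and return True, or return False if it cannot be taken."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(lock_path) as f:
                owner = int(f.read().strip())
        except (ValueError, FileNotFoundError):
            return False  # unreadable or just removed: skip this run
        if pid_alive(owner):
            return False  # a live process really does hold the lock
        try:
            os.remove(lock_path)  # owner is dead, so the lock is stale
        except FileNotFoundError:
            pass
        return acquire_lock(lock_path)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def release_lock(lock_path):
    """Remove the lock file at the end of a run."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Recording the owner's PID in the lock file is what lets the next run tell a stale lock from a live one. The sketch still has a small race between removing a stale lock and re-creating it, and a lock file left empty (for example because the disk was full when it was written) is deliberately left for a human to inspect.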

Sorry for the delay and thanks for the post -- posts sometimes get these things noticed quicker.

Cheers and happy crunching
Jack
____________
MindModeling@Home is fun

Tex1954
Joined: 12 Jun 12
Posts: 52
Credit: 2,525,800
RAC: 33
Message 2062 - Posted: 11 Jul 2012, 22:19:13 UTC - in response to Message 2058.
Last modified: 11 Jul 2012, 22:30:16 UTC

I saw last night that the stats weren't being pushed.
Evidently our 'dump_db' process had died (probably related to a disk becoming full) and the dead process left a lock file that blocked other iterations of dump_db from being executed.

dump_db controls writing out stats to file so that they can be collected by external services.

Anyway -- The dead lock file -- that was causing a deadlock -- has been removed and things should flow again.

Today we also dealt with the issue causing the filesystem to fill, so this shouldn't happen again soon.

Sorry for the delay and thanks for the post -- posts sometimes get these things noticed quicker.

Cheers and happy crunching
Jack


Oh please! No sorries! It's a BETA project and you are keeping everybody updated and doing your best!

I think I can speak for all of us and say YOU ARE DOING A GREAT JOB!!!!

Thanks for your hard work!

:)
