Message boards : Number crunching : New task mostly resulting in computation errors

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2004 - Posted: 10 May 2012, 0:30:32 UTC

It looks like the new batch of work is resulting in mostly computation errors.

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 100,251
RAC: 0
Message 2005 - Posted: 10 May 2012, 5:52:54 UTC - in response to Message 2004.

Not 'mostly' but ALL - Not one single successful WU here, but 932 errors!

Al.

Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2008 - Posted: 11 May 2012, 13:47:11 UTC

Correct. There was a bug in one of the jobs we submitted (workunits with the prefix MindModeling-137), but all workunits for that job have since been cancelled. The more recent job, MindModeling-140, should be far more stable.

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2009 - Posted: 12 May 2012, 13:57:17 UTC - in response to Message 2008.

Yep, that batch went a whole lot better!

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2012 - Posted: 19 May 2012, 4:09:14 UTC - in response to Message 2009.

The new batch appears to be going to computation errors on Windows and Mac. My Linux machines appear to be completing the tasks, though I have noticed one error. However, on Windows and Mac, the tasks only run for 4 seconds or so before erroring.

philip-in-hongkong
Joined: 25 Apr 08
Posts: 4
Credit: 320,897
RAC: 2
Message 2013 - Posted: 19 May 2012, 8:59:45 UTC - in response to Message 2012.
Last modified: 19 May 2012, 9:01:20 UTC

I got a few WUs. Some ended with errors, but some, like this one http://mindmodeling.org/beta/workunit.php?wuid=299728, have run for 4+ hours and still show 0% - 10% progress. How long does a WU usually take to finish, or are they stuck?

Philip

Jack.Harris
Project administrator
Project developer
Project scientist
Joined: 24 Apr 07
Posts: 507
Credit: 761,261
RAC: 0
Message 2014 - Posted: 19 May 2012, 12:33:04 UTC - in response to Message 2013.

The new batch should only have been sent to Linux users -- any WUs that went to Windows / Mac users were sent in error. We apologize for the mistake. We are updating our submission interface and automatic testing environment to keep this from happening again.

Sorry again for the bad Windows and Mac WUs.
____________
MindModeling@Home is fun

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2015 - Posted: 19 May 2012, 15:45:45 UTC - in response to Message 2014.

NP, it is a beta project after all. But I guess the WUs weren't too useful to you since it looks like a lot of them went 4 and out and were abandoned. Are you going to resubmit the run or just use the ones that worked?

mickydl*
Joined: 26 May 11
Posts: 5
Credit: 259,544
RAC: 0
Message 2025 - Posted: 5 Jun 2012, 6:27:40 UTC

I get nothing but errors. So far 20 valid and over 100 invalid :(

mickydl*

Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2028 - Posted: 14 Jun 2012, 15:19:15 UTC - in response to Message 2025.

mickydl*,

I checked the application you are running against other hosts running the same version, and couldn't find any systemic issues or patterns emerging that seem to be affecting other users in general. The application itself is written in Lisp (CCL), and the error message that is consistent across all your workunits is the following:

> Error of type SIMPLE-ERROR: Operation not permitted
> While executing: CCL::GET-DESCRIPTOR-FOR, in process listener(1).


This indicates a permissions issue (see http://trac.clozure.com/ccl/ticket/371 for a pseudo-reference). Could the permissions on your BOINC folders be set in a way that prevents our application from running? You may want to check those directories to be sure. I also suggest resetting the project, which will re-download all of the input files and let you start from scratch.

It also appears to only affect your AMD machine. That may be the next place to look...
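
A permissions check like the one suggested above can be scripted. The following is only a minimal sketch: it walks a directory tree and lists regular files that have no execute bit set at all. The function name is hypothetical, and the directory to scan is whatever your BOINC project folder happens to be; nothing here is part of the MindModeling application.

```python
import os
import stat


def files_without_exec_bit(root):
    """Return sorted paths of regular files under root that have no execute bit set."""
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            if stat.S_ISREG(mode) and not mode & any_exec:
                missing.append(path)
    return sorted(missing)
```

Data files will naturally show up in the result, so the output is only a list of candidates to inspect by hand. Any wrapper or program file that appears in it is worth a closer look.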

mickydl*
Joined: 26 May 11
Posts: 5
Credit: 259,544
RAC: 0
Message 2030 - Posted: 14 Jun 2012, 19:59:26 UTC - in response to Message 2028.

Hi Tom,

I checked the privileges. As far as I can see all executables have the execute-permission.

-rwxr-xr-x 1 boinc users 5878620 8. Feb 21:54 1.37_wrapper_6.12_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 611115 8. Feb 21:47 1.37_x86_64-pc-linux-gnu_ccl
-rwxr-xr-x 1 boinc users 611115 11. Mai 20:51 1.75_x86_64-pc-linux-gnu_ccl
-rwxr-xr-x 1 boinc users 8891654 11. Mai 20:51 mm_wrapper_1.09_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 11214861 11. Mai 20:51 mm_wrapper_graphics_1.09_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 1432128 11. Mai 20:51 x86_64-pc-linux-gnu_7za

Looking at the error message more closely, I noticed that the problem seems to occur after an attempt to execute an unzip. I don't suppose you supply the unzip executable as part of your project (at least I can't find one in the project folder), so you probably rely on unzip being present on the target machine.
Could you tell me what exactly you are trying to execute there (the exact name, like unzip or gunzip) and whether you have any hard-coded paths in that call? I have had a similar problem in another project that was trying to execute an unzip with a hard-coded path in the call.

Thanks,
mickydl*

Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2031 - Posted: 15 Jun 2012, 16:12:05 UTC - in response to Message 2030.
Last modified: 15 Jun 2012, 16:13:07 UTC

We use a file compression utility known as 7za (see http://linux.die.net/man/1/7za), and we deploy it as part of the application.

-rwxr-xr-x 1 boinc users 1432128 11. Mai 20:51 x86_64-pc-linux-gnu_7za


According to your output, the actual unzipping completes:

About to execute (MM-UNZIP xSHJ1961.zip)
Finished executing (MM-UNZIP xSHJ1961.zip)


It's immediately afterward that the job fails. It's in this phase that we'd expect to run an executable from the unzipped directory. It's as if the 7za unzip function is not extracting the right permissions and we're unable to navigate to the program. Why this is happening only for you and no one else is not yet clear to me...
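
One way to narrow this down is to look at the permission bits recorded inside the archive itself, before any extraction tool touches it. The sketch below uses Python's standard `zipfile` module, not 7za, and the function names are hypothetical. For entries created on a Unix system, the upper 16 bits of `external_attr` hold the stored file mode. Whether 7za restores those bits on extraction is not something this sketch tests.

```python
import stat
import zipfile


def stored_modes(zip_path):
    """Map each non-directory member to the Unix permission bits stored for it."""
    modes = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # High 16 bits of external_attr carry st_mode for Unix-created entries.
            modes[info.filename] = (info.external_attr >> 16) & 0o7777
    return modes


def members_without_exec(zip_path):
    """Return sorted member names whose stored mode lacks the owner-execute bit."""
    return sorted(
        name for name, mode in stored_modes(zip_path).items()
        if not mode & stat.S_IXUSR
    )
```

If the program inside the work unit archive shows up in `members_without_exec`, the archive never recorded an execute bit for it. If it does not, the bit is stored and the extraction step is the next thing to examine.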

mickydl*
Joined: 26 May 11
Posts: 5
Credit: 259,544
RAC: 0
Message 2032 - Posted: 15 Jun 2012, 18:37:57 UTC - in response to Message 2031.

I'll try to get hold of a work unit the next time you have work available. Then I can unpack the files manually and check if they are linked against a library that is missing on my machine.

Michael
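
The library check described above can be done with `ldd`, which prints `=> not found` next to any shared library it cannot resolve. A rough sketch, assuming `ldd` is on the PATH; the function names are made up for illustration and the path passed in is whatever binary you unpacked:

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on one binary and return the libraries it could not resolve."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

An empty list from `check_binary` means `ldd` resolved everything it looked for, which would point back toward permissions rather than a missing dependency.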

Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2033 - Posted: 15 Jun 2012, 21:12:59 UTC - in response to Message 2032.

Great! If you create a file called "debug" in the MindModeling project directory you can suspend the application for all future MindModeling workunits. Removing the debug file will cause any suspended workunits to resume. It's a simple loop that checks if the file exists, and if so, sleeps.

This should allow you to explore the files a little bit easier.
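
For readers wondering what such a loop looks like, here is a sketch in Python. The real application is written in Lisp, so this is not the project's code. The function name, the polling interval, and the injectable `sleep` argument are assumptions made for illustration.

```python
import os
import time


def wait_while_debug_file_exists(project_dir, poll_seconds=5.0, sleep=time.sleep):
    """Block for as long as a file named 'debug' exists in project_dir.

    Returns the number of times the loop slept before the file disappeared.
    """
    marker = os.path.join(project_dir, "debug")
    polls = 0
    while os.path.exists(marker):
        sleep(poll_seconds)  # file still present: pause, then check again
        polls += 1
    return polls
```

Creating the file suspends the caller at its next check, and deleting it lets the loop fall through, which matches the suspend and resume behaviour described above.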

Gary Wilson
Joined: 25 Nov 08
Posts: 56
Credit: 3,099,843
RAC: 41
Message 2036 - Posted: 22 Jun 2012, 12:49:56 UTC - in response to Message 2033.

It looks like the Windows tasks are still having the thread deadlock issue as I had 3 that were stuck overnight, so only one core was running this morning.

Also, I have one Linux machine that produces only comp errors but right at the beginning. Some kind of file issue. But the machine runs other projects just fine.

Other than that, it's great to see a huge task batch come through.

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 100,251
RAC: 0
Message 2037 - Posted: 22 Jun 2012, 17:51:31 UTC

The new huge batch is good news. Well, sort of ...

I think your database is getting clogged up with so many results coming in. I'm unable to access my tasks list or even view my computers via my account page. Well, maybe if I waited for more than 10 minutes it would work, but I get easily bored ;)

Anyway, from what I saw earlier today all of my WUs are running to completion and validating (all XP 32-bit at the moment).

My only problem is the short deadline forcing panic mode on all hosts and limiting the amount of work I'm able to get (no work fetch when stuck in high priority) and also blocking other projects. Given the size of this batch and the length of time it's going to take to complete, maybe the deadline could be increased to 2 or 3 days?

Anyway, nice to have some work to crunch :)

Cheers,

Al.

TRuEQ & TuVaLu
Joined: 17 Dec 10
Posts: 1
Credit: 447,924
RAC: 0
Message 2038 - Posted: 22 Jun 2012, 19:48:57 UTC

Most of my tasks validated ok
win 32 vista
win 64 vista
on 3 computers

Tom
Volunteer moderator
Joined: 23 Jun 08
Posts: 490
Credit: 238,767
RAC: 0
Message 2039 - Posted: 22 Jun 2012, 21:37:24 UTC - in response to Message 2037.

> I think your database is getting clogged up with so many results coming in. I'm unable to access my tasks list or even view my computers via my account page. Well, maybe if I waited for more than 10 minutes it would work, but I get easily bored ;)


We noticed performance issues as well, and have taken the following steps to address the issue:
(1) We doubled the size of each workunit to alleviate some of the download/upload traffic.
(2) We eliminated some internal database queries that were bottlenecking the system.

> Anyway, from what I saw earlier today all of my WUs are running to completion and validating (all XP 32-bit at the moment).


Excellent!

> My only problem is the short deadline forcing panic mode on all hosts and limiting the amount of work I'm able to get (no work fetch when stuck in high priority) and also blocking other projects. Given the size of this batch and the length of time it's going to take to complete, maybe the deadline could be increased to 2 or 3 days?


I've temporarily increased the task deadline to 2 days for all newly created workunits. Since we don't generate all the workunits for a job up front, you should start downloading these longer workunits very soon (but maybe not immediately). We'll have to be more dynamic about this in the future. Currently, we hard code the delay_bound value, so all workunits have the same deadline despite the size of the job. In the future, we'll set the deadline as a function of the total estimated runtime of a job.

-Tom
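
To make the idea concrete, here is one possible shape for such a function. The thread gives no formula, so this is only a guess. The function name, the percentage, and the one-day and seven-day clamps are all assumptions. Only `delay_bound` itself, the workunit deadline in seconds, comes from the discussion above.

```python
DAY_SECONDS = 86_400


def delay_bound_seconds(total_estimated_runtime_s, percent_of_runtime=10,
                        minimum_s=DAY_SECONDS, maximum_s=7 * DAY_SECONDS):
    """Scale the deadline with the job's estimated runtime, within fixed limits.

    All parameters are illustrative defaults, not values used by the project.
    """
    scaled = total_estimated_runtime_s * percent_of_runtime // 100
    return min(max(scaled, minimum_s), maximum_s)
```

With these made-up defaults, a job estimated at 20 days gets a 2-day deadline. Very small jobs never drop below 1 day, and very large ones never exceed 7 days.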

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 100,251
RAC: 0
Message 2040 - Posted: 22 Jun 2012, 22:20:50 UTC - in response to Message 2039.


> > Anyway, from what I saw earlier today all of my WUs are running to completion and validating (all XP 32-bit at the moment).
>
> Excellent!


Now that I can see my results page again I can confirm that I've had no errors at all :) Just one WU pending at the moment. Nice!

> I've temporarily increased the task deadline to 2 days for all newly created workunits. Since we don't generate all the workunits for a job up front, you should start downloading these longer workunits very soon (but maybe not immediately). We'll have to be more dynamic about this in the future. Currently, we hard code the delay_bound value, so all workunits have the same deadline despite the size of the job. In the future, we'll set the deadline as a function of the total estimated runtime of a job.


Good news. I don't think I've got to the longer deadline WUs yet, but it will be neat when the deadlines adjust to suit the volume of work. Should make things much easier for the crunchers. I don't mind short bursts of 'urgent' work, but days of it at a time can be a bit tricky to manage :)

Looking at the percentage complete of the current batch I'm looking forward to a week of solid crunching :)

Cheers,

Al.

ChertseyAl
Joined: 13 Mar 08
Posts: 30
Credit: 100,251
RAC: 0
Message 2041 - Posted: 23 Jun 2012, 12:16:35 UTC

I'm seeing a lot of errors in my message logs like this:

23/06/12 12:10:53|MindModeling@Beta|Starting MindModeling-164-4fe57dac18ac5_0
23/06/12 12:10:56|MindModeling@Beta|Starting task MindModeling-164-4fe57dac18ac5_0 using ccl_wrap version 175
23/06/12 12:46:19|MindModeling@Beta|[error] Can't rename output file MindModeling-164-4fe57dac18ac5_0_1
23/06/12 12:46:20|MindModeling@Beta|Computation for task MindModeling-164-4fe57dac18ac5_0 finished
23/06/12 12:46:20|MindModeling@Beta|Starting MindModeling-164-4fe57e070ce58_0
23/06/12 12:46:24|MindModeling@Beta|Starting task MindModeling-164-4fe57e070ce58_0 using ccl_wrap version 175
23/06/12 12:46:27|MindModeling@Beta|Started upload of MindModeling-164-4fe57dac18ac5_0_0
23/06/12 12:46:31|MindModeling@Beta|Finished upload of MindModeling-164-4fe57dac18ac5_0_0

The WU for that one was:

http://mindmodeling.org/beta/workunit.php?wuid=432068
http://mindmodeling.org/beta/result.php?resultid=517531

I'm getting loads of these, but they are difficult to find as I'm crunching so many :)

I don't see any errored or invalid WUs in my tasks list, so presumably it's not important?
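
Since these entries are hard to spot by eye, a small filter over the saved message log can pull them out. This is a sketch that assumes the log lines use the `timestamp|project|message` layout shown in the excerpt above. The function names are invented for the example.

```python
def parse_log_line(line):
    """Split 'timestamp|project|message' into a 3-tuple, or return None if malformed."""
    parts = line.strip().split("|", 2)
    return tuple(parts) if len(parts) == 3 else None


def error_entries(lines, marker="[error]"):
    """Return (timestamp, message) for every line whose message starts with marker."""
    hits = []
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue  # skip anything that is not in the expected layout
        timestamp, _project, message = parsed
        if message.startswith(marker):
            hits.append((timestamp, message))
    return hits
```

Feeding it the lines of a saved log file, for example via `open(path).readlines()`, returns just the `[error]` entries with their timestamps, which makes it easier to match each one to a task name.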

Cheers,

Al.



Copyright © 2020 MindModeling.org