New task mostly resulting in computation errors |
Message boards : Number crunching : New task mostly resulting in computation errors
| Author | Message |
|---|---|
|
It looks like the new batch of work is resulting in mostly computation errors. | |
| ID: 2004 · Rating: 0 · rate:
| |
|
Not 'mostly' but ALL - Not one single successful WU here, but 932 errors! | |
| ID: 2005 · Rating: 0 · rate:
| |
|
Correct. There was a bug in one of the jobs we submitted (workunits with the prefix MindModeling-137), but all workunits for that job have since been cancelled. The more recent job, MindModeling-140, should be far more stable. | |
| ID: 2008 · Rating: 0 · rate:
| |
|
Yep, that batch went a whole lot better! | |
| ID: 2009 · Rating: 0 · rate:
| |
|
The new batch appears to be going to computation errors on Windows and Mac. My Linux machines appear to be completing the tasks, though I have noticed one error. However, on Windows and Mac, the tasks only run for 4 seconds or so before erroring. | |
| ID: 2012 · Rating: 0 · rate:
| |
|
I got a few WUs. Some ended with errors, but some, like this one (http://mindmodeling.org/beta/workunit.php?wuid=299728), have run for 4+ hours and still show 0% - 10% progress. How long does a WU usually take to finish, or are they stuck? | |
| ID: 2013 · Rating: 0 · rate:
| |
|
The new batch should have been sent only to Linux users; any WUs that went to Windows / Mac users were sent in error. We apologize for the mistake. We are updating our submission interface and automatic testing environment to keep this from happening again. | |
| ID: 2014 · Rating: 0 · rate:
| |
|
NP, it is a beta project after all. But I guess the WUs weren't too useful to you since it looks like a lot of them went 4 and out and were abandoned. Are you going to resubmit the run or just use the ones that worked? | |
| ID: 2015 · Rating: 0 · rate:
| |
|
I get nothing but errors. So far 20 valid and over 100 invalid :( | |
| ID: 2025 · Rating: 0 · rate:
| |
|
mickydl*, > Error of type SIMPLE-ERROR: Operation not permitted > While executing: CCL::GET-DESCRIPTOR-FOR, in process listener(1). This indicates a permissions issue (see http://trac.clozure.com/ccl/ticket/371 for a pseudo-reference). Could the permissions in your BOINC folders be set in a way that prevents our application from running? You may want to check those directories to be sure. I also suggest resetting the project, which lets you re-download all of the input files and start from scratch. The problem also appears to affect only your AMD machine. That may be the next place to look... | |
| ID: 2028 · Rating: 0 · rate:
| |
|
Hi Tom,
-rwxr-xr-x 1 boinc users 5878620 8. Feb 21:54 1.37_wrapper_6.12_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 611115 8. Feb 21:47 1.37_x86_64-pc-linux-gnu_ccl
-rwxr-xr-x 1 boinc users 611115 11. Mai 20:51 1.75_x86_64-pc-linux-gnu_ccl
-rwxr-xr-x 1 boinc users 8891654 11. Mai 20:51 mm_wrapper_1.09_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 11214861 11. Mai 20:51 mm_wrapper_graphics_1.09_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc users 1432128 11. Mai 20:51 x86_64-pc-linux-gnu_7za
Looking at the error message more closely, I noticed that the problem seems to occur after an attempt to execute an unzip. I don't suppose you supply the unzip executable as part of your project (at least I can't find one in the project folder), so you probably rely on unzip being present on the target machine. Could you tell me what exactly you are trying to execute there (the exact name, like unzip or gunzip) and whether you have any hard-coded paths in that call? I have had a similar problem in another project that was trying to execute an unzip with a hard-coded path in the call.
Thanks, mickydl* | |
| ID: 2030 · Rating: 0 · rate:
| |
|
We use a file compression utility known as 7za (see http://linux.die.net/man/1/7za), and we deploy it as part of the application.
-rwxr-xr-x 1 boinc users 1432128 11. Mai 20:51 x86_64-pc-linux-gnu_7za
According to your output, the actual unzipping completes:
About to execute (MM-UNZIP xSHJ1961.zip)
Finished executing (MM-UNZIP xSHJ1961.zip)
It's immediately afterward that the job fails. It's in this phase that we'd expect to run an executable from the unzipped directory. It's as if the 7za unzip function is not extracting the right permissions and we're unable to navigate to the program. Why this is happening only for you and no one else is not yet clear to me... | |
| ID: 2031 · Rating: 0 · rate:
| |
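One way to test the "wrong permissions after extraction" theory from the post above is to list the files in an unpacked workunit directory that lack the owner execute bit. The following is a minimal sketch, not part of the MindModeling application: the function name `files_missing_exec_bit` and the directory you pass to it are assumptions made for illustration, and which files are actually supposed to be executable is something only the project can say.

```python
import os
import stat


def files_missing_exec_bit(root):
    """Return a sorted list of regular files under root without the owner execute bit.

    Hypothetical helper for illustration; it only reads file modes and
    does not change anything on disk.
    """
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            # S_IXUSR is the "owner may execute" permission bit.
            if not mode & stat.S_IXUSR:
                missing.append(path)
    return sorted(missing)
```

Calling `files_missing_exec_bit()` on the directory where the workunit archive was unpacked would show whether the program the wrapper tries to launch is among the non-executable files.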
|
I'll try to get hold of a work unit the next time you have work available. Then I can unpack the files manually and check if they are linked against a library that is missing on my machine. | |
| ID: 2032 · Rating: 0 · rate:
| |
|
Great! If you create a file called "debug" in the MindModeling project directory you can suspend the application for all future MindModeling workunits. Removing the debug file will cause any suspended workunits to resume. It's a simple loop that checks if the file exists, and if so, sleeps. | |
| ID: 2033 · Rating: 0 · rate:
| |
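The debug-file mechanism described in the post above (check whether the file exists and, if so, sleep) can be sketched as follows. This is an illustration only, written in Python rather than in the application's own code; the function name, the one-second polling interval, and the injectable `sleep` parameter are assumptions, and only the file name "debug" comes from the post.

```python
import os
import time


def wait_while_debug_file_exists(project_dir, poll_seconds=1.0, sleep=time.sleep):
    """Block for as long as a file named "debug" exists in project_dir.

    Returns the number of times the loop slept. The polling interval and
    the injectable sleep function are illustrative assumptions.
    """
    debug_path = os.path.join(project_dir, "debug")
    polls = 0
    while os.path.exists(debug_path):
        sleep(poll_seconds)  # file still present: stay suspended
        polls += 1
    return polls
```

With a loop like this, creating the file suspends the work at the next check, and deleting it lets the work continue after at most one polling interval.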
|
It looks like the Windows tasks are still having the thread deadlock issue as I had 3 that were stuck overnight, so only one core was running this morning. | |
| ID: 2036 · Rating: 0 · rate:
| |
|
The new huge batch is good news. Well, sort of ... | |
| ID: 2037 · Rating: 0 · rate:
| |
|
Most of my tasks validated ok | |
| ID: 2038 · Rating: 0 · rate:
| |
> I think your database is getting clogged up with so many results coming in. I'm unable to access my tasks list or even view my computers via my account page. Well, maybe if I waited for more than 10 minutes it would work, but I get easily bored ;)
We noticed performance issues as well, and have taken the following steps to address the issue: (1) We doubled the size of each workunit to alleviate some of the download/upload traffic. (2) We eliminated some internal database queries that were bottlenecking the system.
> Anyway, from what I saw earlier today all of my WUs are running to completion and validating (all XP 32-bit at the moment).
Excellent!
> My only problem is the short deadline forcing panic mode on all hosts and limiting the amount of work I'm able to get (no work fetch when stuck in high priority) and also blocking other projects. Given the size of this batch and the length of time it's going to take to complete, maybe the deadline could be increased to 2 or 3 days?
I've temporarily increased the task deadline to 2 days for all newly created workunits. Since we don't generate all the workunits for a job up front, you should start downloading these longer workunits very soon (but maybe not immediately). We'll have to be more dynamic about this in the future. Currently, we hard code the delay_bound value, so all workunits have the same deadline despite the size of the job. In the future, we'll set the deadline as a function of the total estimated runtime of a job.
-Tom | |
| ID: 2039 · Rating: 0 · rate:
| |
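The post above does not say what the deadline function will look like. Purely as an illustration of the idea of deriving delay_bound from a job's estimated total runtime instead of hard-coding it, here is one possible shape: a fixed fraction of the estimate, clamped between a floor and a ceiling. Every number and name below (the 10% fraction, the 1-day floor, the 3-day ceiling, `delay_bound_for_job`) is an assumption, not the project's actual policy.

```python
SECONDS_PER_DAY = 86400


def delay_bound_for_job(estimated_job_runtime_s,
                        fraction=0.1,
                        min_s=SECONDS_PER_DAY,
                        max_s=3 * SECONDS_PER_DAY):
    """Return a per-workunit deadline in seconds for a job.

    Hypothetical policy: take a fraction of the job's estimated total
    runtime, then clamp it so short jobs still get at least min_s and
    huge jobs never exceed max_s.
    """
    proposed = round(fraction * estimated_job_runtime_s)
    return min(max(proposed, min_s), max_s)
```

Under these assumed defaults, a job estimated at 20 days of total runtime would get the 2-day deadline mentioned in the post, while very small jobs would keep a 1-day deadline.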
Now that I can see my results page again I can confirm that I've had no errors at all :) Just one WU pending at the moment. Nice! I've temporarily increased the task deadline to 2 days for all newly created workunits. Since we don't generate all the workunits for a job up front, you should start downloading these longer workunits very soon (but maybe not immediately). We'll have to be more dynamic about this in the future. Currently, we hard code the delay_bound value, so all workunits have the same deadline despite the size of the job. In the future, we'll set the deadline as a function of the total estimated runtime of a job. Good news. I don't think I've got to the longer deadline WUs yet, but it will be neat when the deadlines adjust to suit the volume of work. Should make things much easier for the crunchers. I don't mind short bursts of 'urgent' work, but days of it at a time can be a bit tricky to manage :) Looking at the percentage complete of the current batch I'm looking forward to a week of solid crunching :) Cheers, Al. | |
| ID: 2040 · Rating: 0 · rate:
| |
|
I'm seeing a lot of errors in my message logs like this: | |
| ID: 2041 · Rating: 0 · rate:
| |