Message boards : Number crunching : Many computation errors.

[AF>EDLS]zOU
Joined: 16 Nov 15
Posts: 3
Credit: 75,986
RAC: 0
Message 3575 - Posted: 17 Nov 2015, 21:27:56 UTC - in response to Message 3574.

Thank you.

This is worrying, as I am seeing roughly 1 error for every 3 valid tasks...

Valid (1577) · Invalid (0) · Error (491)

adrianxw
Joined: 12 Jul 15
Posts: 6
Credit: 1,349
RAC: 0
Message 3578 - Posted: 18 Nov 2015, 7:20:53 UTC

It is more than 4 months since I reported this, and the problem, or problems, are not going away. Looking at the problems is not the same as fixing them. There is something fundamentally wrong with your application.

[AF>EDLS]zOU
Joined: 16 Nov 15
Posts: 3
Credit: 75,986
RAC: 0
Message 3580 - Posted: 20 Nov 2015, 7:45:30 UTC - in response to Message 3578.

It is more than 4 months since I reported this, and the problem, or problems, are not going away. Looking at the problems is not the same as fixing them. There is something fundamentally wrong with your application.

Well, this is a beta :)

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3581 - Posted: 20 Nov 2015, 20:10:31 UTC

Hello [AF>EDLS]zOU and adrianxw,

We apologize if it seems like we are not resolving these issues with work-units; however, we have multiple modellers running a variety of distinct models. We do our best to catch these problems during the testing phase, but, occasionally, errors do make it out into distribution.

In this particular case, a new model iteration submitted by a collaborator caused an unexpected error to occur. We have found the issue, suspended the job, and are working with the modeller to resolve the situation.

Thanks for supporting us and happy crunching,

Brandon

adrianxw
Joined: 12 Jul 15
Posts: 6
Credit: 1,349
RAC: 0
Message 3724 - Posted: 6 Apr 2016, 9:15:02 UTC
Last modified: 6 Apr 2016, 9:48:47 UTC

I re-enabled this project to see how things were. After a while it downloaded a work unit (13990155) and started to run. The estimated run time was 1:55:50. Just short of two hours in, that jumped to 4d 02:10:56, increasing by about 50 seconds per update. The progress is, and has been, 2.000% since I started watching it this morning. I have set "no new tasks", but will leave this running to see what happens.

In the time it has taken me to write this, it has increased to 4d 08:55:45.

Edit:

It completed and reported before I could get the actual runtime; the results page says 9,232.04. Waiting for the validation state. Nothing odd in the log. On the plus side, it DID complete and did not crash.
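The jump described above is consistent with a remaining-time estimate driven by the reported fraction done: once the fraction stalls at 2.000%, the projected total keeps growing with elapsed time. Below is a minimal sketch of that arithmetic in Python; it is only an illustration, not the BOINC client's exact algorithm, which blends in other information as well.

```python
# Rough sketch of a fraction-done based remaining-time estimate.
# Illustration only; the real BOINC client uses a more elaborate formula.

def remaining_estimate(elapsed_seconds, fraction_done):
    """Project remaining run time from elapsed time and reported fraction done."""
    if fraction_done <= 0:
        return float("inf")
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done

# Stuck at 2.000% after roughly two hours of run time:
print(remaining_estimate(2 * 3600, 0.02) / 86400)  # ~4.1 days remaining

# With fraction_done pinned at 2%, each extra second of elapsed time adds
# 0.98 / 0.02 = 49 seconds to the estimate, close to the ~50 seconds per
# update observed above.
```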

adrianxw
Joined: 12 Jul 15
Posts: 6
Credit: 1,349
RAC: 0
Message 3725 - Posted: 6 Apr 2016, 10:34:46 UTC
Last modified: 6 Apr 2016, 10:34:46 UTC

Too late to edit, so just to add: this system is running Windows 10.

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3726 - Posted: 6 Apr 2016, 13:31:38 UTC
Last modified: 6 Apr 2016, 13:32:00 UTC

Hello adrianxw,

Thanks for letting me know about the abnormal run times. I will take a look at the job.

Thanks for your support and happy crunching,

Brandon

StevanHP
Joined: 22 Feb 16
Posts: 1
Credit: 141,098
RAC: 0
Message 3730 - Posted: 10 Apr 2016, 17:04:46 UTC
Last modified: 10 Apr 2016, 17:04:46 UTC

Getting only computation errors:

Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)
</message>
<stderr_txt>
wrapper: starting
Setting up task
Checking if I should run task...
1460303728 - Task Done Found at C:\ProgramData\BOINC/projects/mindmodeling.org\python2.7_1.1_done
Skipping task
Setting up task
17:55:28 (11016): wrapper: running 7za.exe (x -y crisp-20160222-1-1-g875a42b.zip)
Setting up task
17:55:29 (11016): wrapper: running C:\ProgramData\BOINC/projects/mindmodeling.org\python2.7_1.1_env\bin\python2.7.exe (mm_python_bridge.py crisp-20160222-1-1-g875a42b antisaccade run_mm_new 2)

Acquiring IV names: alpha attn_mean timer_mean labile_mean cue_timer_rate gap_timer_rate timer_states labile_stdev attn_stdev cue_cancel_prob gap_cancel_prob mm_samples mm_sample_offset
Running IV Values: 0.40 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.0358064516129
Running IV Values: 0.45 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.0516129032258
Running IV Values: 0.50 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.0674193548387
Running IV Values: 0.55 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.0832258064516
Running IV Values: 0.60 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.0990322580645
Running IV Values: 0.65 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.114838709677
Running IV Values: 0.70 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.13064516129
Running IV Values: 0.75 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.146451612903
Running IV Values: 0.80 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.162258064516
Running IV Values: 0.85 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.178064516129
Running IV Values: 0.90 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.193870967742
Running IV Values: 0.95 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.209677419355
Running IV Values: 1.00 0.200 0.220 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.225483870968
Running IV Values: 0.00 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.241290322581
Running IV Values: 0.05 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.257096774194
Running IV Values: 0.10 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.272903225806
Running IV Values: 0.15 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.288709677419
Running IV Values: 0.20 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.304516129032
Running IV Values: 0.25 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.320322580645
Running IV Values: 0.30 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.336129032258
Running IV Values: 0.35 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.351935483871
Running IV Values: 0.40 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.367741935484
Running IV Values: 0.45 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.383548387097
Running IV Values: 0.50 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.39935483871
Running IV Values: 0.55 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.415161290323
Running IV Values: 0.60 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.430967741935
Running IV Values: 0.65 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.446774193548
Running IV Values: 0.70 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.462580645161
Running IV Values: 0.75 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.478387096774
Running IV Values: 0.80 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.494193548387
Running IV Values: 0.85 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.51
Running IV Values: 0.90 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.525806451613
Running IV Values: 0.95 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.541612903226
Running IV Values: 1.00 0.100 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.557419354839
Running IV Values: 0.00 0.110 0.230 0.220 1.05 1.00 40 16 16 0 0
Updating Fraction Done to: 0.573225806452
Running IV Values: 0.05 0.110 0.230 0.220 1.05 1.00 40 16 16 0 0Exception MemoryError: MemoryError() in 'garbage collection' ignored
Fatal Python error: unexpected exception during garbage collection

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
app exit status: 0x3
18:39:29 (11016): called boinc_finish

</stderr_txt>
]]>

What is going on and how can I fix this?

P.S. That's on my Win 10 desktop. The Win 7 laptop seems to be crunching just fine.

Gunde
Joined: 8 Feb 15
Posts: 26
Credit: 7,169,421
RAC: 0
Message 3731 - Posted: 10 Apr 2016, 22:53:18 UTC
Last modified: 10 Apr 2016, 22:53:18 UTC

Got a few MemoryErrors too for the Native Python v2.7 Application (Windows Only) v1.10 (sse2):

Exit status 195 (0xc3) EXIT_CHILD_FAILED

Exception MemoryError: MemoryError() in 'garbage collection' ignored
Fatal Python error: unexpected exception during garbage collection

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
app exit status: 0x3
23:59:36 (122408): called boinc_finish

And:

Exit status 195 (0xc3) EXIT_CHILD_FAILED

Traceback (most recent call last):
File "mm_python_bridge.py", line 564, in <module>
debug("Running parameter combination " + str(variableNamesToStringValues) + "failed: " + str(e), True)
MemoryError
app exit status: 0x1
00:26:57 (24380): called boinc_finish
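The second traceback above fails inside the debug() call at line 564 of mm_python_bridge.py, i.e. inside the error handler itself. A likely reading is that the original failure was already a MemoryError, and building the log message (str() of the parameter dictionary plus several string concatenations) needed another allocation that also failed. The sketch below only illustrates that pattern; apart from the names visible in the traceback (debug, variableNamesToStringValues), everything in it is hypothetical, since the real mm_python_bridge.py code is not shown here.

```python
# Illustrative sketch only: a guess at the failure pattern the traceback
# suggests, not the project's actual code.
import sys

def debug(message, flush=False):
    # Hypothetical stand-in for the project's logger; writing the message
    # still needs a small allocation of its own.
    sys.stdout.write(message + "\n")
    if flush:
        sys.stdout.flush()

def run_combination(params):
    # Hypothetical stand-in for one model run; simulate the host running
    # out of memory partway through the batch.
    raise MemoryError

variableNamesToStringValues = {"alpha": "0.55", "attn_mean": "0.130"}  # example values

try:
    run_combination(variableNamesToStringValues)
except Exception as e:
    # If the process is genuinely out of memory, building this message can
    # allocate enough to raise a *second* MemoryError inside the handler,
    # which is why the traceback points at the debug() line rather than at
    # the model code itself.
    debug("Running parameter combination " + str(variableNamesToStringValues)
          + "failed: " + str(e), True)
```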

TyphooN [Gridcoin]
Joined: 27 May 14
Posts: 1
Credit: 852,377
RAC: 0
Message 3735 - Posted: 13 Apr 2016, 20:54:40 UTC - in response to Message 3731.
Last modified: 13 Apr 2016, 20:54:40 UTC

I am getting many errors similar to this work unit. If an admin could take a look at my account, I have plenty of errored tasks to look at, all failing in the same way. This is happening on both Intel and AMD systems. When the crash happens, Python crashes and all I can do is close it; after that the WU has a computation error. Right now my account shows 89 error tasks and 364 valid tasks, so the percentage of failed to valid tasks is relatively high. I hadn't had this problem on this project until recently, so I suspect something in one of the apps has changed.

https://mindmodeling.org//result.php?resultid=14154979

Running IV Values: 0.50 0.130 0.160 0.100 1.55 1.10 40 16 16 0 0
Updating Fraction Done to: 0.873548387097
Running IV Values: 0.55 0.130 0.160 0.100 1.55 1.10 40 16 16 0 0Traceback (most recent call last):
File "mm_python_bridge.py", line 564, in <module>
debug("Running parameter combination " + str(variableNamesToStringValues) + "failed: " + str(e), True)
MemoryError
app exit status: 0x1
10:58:00 (3716): called boinc_finish

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3736 - Posted: 18 Apr 2016, 14:28:26 UTC
Last modified: 18 Apr 2016, 14:28:26 UTC

Hey Guys,

TyphooN, taking a look at the jobs that failed on your system, the ones that failed seemed to use a lot of RAM compared to the others, so I wonder if it could be a memory issue. However, your work units are looking a lot better today.

47an, looking at your machines, it looks like the same issue; the work units that use a larger amount of memory are causing problems for you as well.

Neither of your computers is seeing as many (or any) failed work units today, but I will work with the modeler to resolve this memory issue.

Thank you guys for your support and happy crunching,

Brandon

Gunde
Joined: 8 Feb 15
Posts: 26
Credit: 7,169,421
RAC: 0
Message 3737 - Posted: 18 Apr 2016, 23:51:53 UTC - in response to Message 3736.
Last modified: 18 Apr 2016, 23:51:53 UTC

Thanks for looking this up. I did a reset a few days ago and that didn't help, so I stopped some hosts in case the hardware was bad.
I thought this problem could be related to DDR4 and ECC memory, but I fetched a few new tasks and they all came back valid. So I will continue and see whether other applications could be causing this problem.

You could be right about the increased memory use for a few work units. I have been running a few compute sticks with Atom x5 CPUs, and those showed no errors, but on closer look they got a popup that memory was too low and Python crashed. They have 2 GB available for BOINC and were running 4 tasks.
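With roughly 2 GB available to BOINC and 4 tasks running at once, each task gets on the order of 500 MB, far less than the heavier work units in this batch appear to need. One common BOINC-side workaround is to cap concurrent tasks for this project with an app_config.xml in the project folder; the snippet below is only a sketch, and the short application name in it is a guess based on the file names in the stderr log (the real name should be taken from client_state.xml or the project's applications page).

```xml
<!-- Sketch only: place this file in the projects/mindmodeling.org folder.
     The <name> value is a guess; take the real short app name from client_state.xml. -->
<app_config>
  <app>
    <name>python2.7</name>
    <max_concurrent>2</max_concurrent>  <!-- run at most 2 of these tasks at once -->
  </app>
</app_config>
```

After saving the file, Options > Read config files in the BOINC Manager applies it without restarting the client.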

It might be possible in the near future to separate those batches and applications. Most batches for the Python 2.7 (all platforms) app have been working great; those could run full-time, with new applications, or applications with new settings, marked as beta.
That would make it easier to come back with feedback and to run those tasks in a more controlled environment, while holding back other tasks in the meantime. It can take days before anyone notices a crash and works out why it happens. I am looking forward to seeing the (mt) tasks again.


Keep up the good work and the responsiveness; it's a great project.

Gunde
Joined: 8 Feb 15
Posts: 26
Credit: 7,169,421
RAC: 0
Message 3758 - Posted: 26 May 2016, 17:39:33 UTC
Last modified: 26 May 2016, 17:39:33 UTC

Hi again Brandon,

We got a few of these memory errors again at the end of the job.
Looking at the host, all tasks used 1892 MB while running; it looks like they finished but ended with an error. Here is one of them:

https://mindmodeling.org//workunit.php?wuid=24620477

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3760 - Posted: 27 May 2016, 15:20:15 UTC
Last modified: 27 May 2016, 15:20:15 UTC

Hey 47an,

I think you are correct; it was the large amount of memory that each of the work units in the last batch for this job required. I will work with the modeler to try to resolve this so you don't run into this issue again.

Thanks for bringing this to my attention.

Thanks for your support and happy crunching,

Brandon

Derion
Joined: 22 Nov 15
Posts: 30
Credit: 1,144,661
RAC: 3,441
Message 3768 - Posted: 3 Jun 2016, 10:03:56 UTC
Last modified: 3 Jun 2016, 10:03:56 UTC

Hmm, the issue is still happening; every task ends with an error.

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3770 - Posted: 3 Jun 2016, 13:25:48 UTC
Last modified: 3 Jun 2016, 13:26:44 UTC

Hey Derion,

This is an issue with the new Julia app. The job has been cancelled and we are working on a fix for it.

We will let you guys know when we are done.

Thanks for letting me know,

Brandon

KPX
Joined: 7 Feb 08
Posts: 2
Credit: 149,799
RAC: 0
Message 3772 - Posted: 3 Jun 2016, 23:13:12 UTC
Last modified: 3 Jun 2016, 23:13:12 UTC

It's true that most (not all) of the sse2 units end with error, but the mt units I was getting a day or two ago seemed to work fine. Why not use those?

Michael
Joined: 27 May 16
Posts: 3
Credit: 2,228
RAC: 0
Message 3797 - Posted: 9 Jul 2016, 18:51:41 UTC - in response to Message 3772.
Last modified: 9 Jul 2016, 18:51:41 UTC

Nearly all WUs end with "Computation Error".
Intel Core i5-4690K with 8 GB RAM, Windows 10 64-bit, GeForce GTX 770.

Brandon
Project administrator
Project developer
Project tester
Joined: 5 Jan 15
Posts: 256
Credit: 1,456,117
RAC: 0
Message 3799 - Posted: 10 Jul 2016, 1:43:33 UTC
Last modified: 10 Jul 2016, 1:43:33 UTC

Hi Michael,

I took a look at the error on your work units. I don't know German very well, but it looks like it says it cannot find the model file when it tries to run. Could there possibly be a permissions issue on your computer for the folder that file is in?

Best, Brandon

Michael
Joined: 27 May 16
Posts: 3
Credit: 2,228
RAC: 0
Message 3807 - Posted: 11 Jul 2016, 22:05:18 UTC - in response to Message 3799.
Last modified: 11 Jul 2016, 22:05:18 UTC

Hi Brandon,
thank you for your reply; my English is also not very good.
For some WUs credit is granted. Other projects also work fine. Hmm, I will try to locate the folder the model file is in.
