1)
Message boards :
Number crunching :
Full Node Native R WU's only running single core
(Message 4614)
Posted 9 Mar 2021 by marmot
My 16-thread machine is running very quietly this afternoon. On checking the process manager, I found that your Native R v2.15.1 Application - Full Node (Cross Platform) WUs are using only 1 thread instead of the 16 that a Full Node task should use. Is there any fix we can apply in app_config.xml?
2)
RAC disappeared on Oct 11
(Message 3949)
Posted 21 Oct 2016 by marmot
My RAC dropped from 10,167 to under 200 on October 11. Why would it plummet like that in a single day?
3)
Immediate error on new tasks (Native Python v2.7 Application v1.11)
(Message 3929)
Posted 7 Oct 2016 by marmot
What's the ETA on a patch? It's 12:30 PM CST here; I just noticed something fishy and came here to find that the error count has passed 2,500 so far.
4)
One WU says it will take 4+ days to complete - normal time is < 45 minutes
(Message 3677)
Posted 7 Feb 2016 by marmot
The 2.5 hour timeout was actually put in probably more than a year ago as that has long been an issue on Windows machines. So the task isn't really finishing all the way, it's just hitting a self-imposed time limit to keep them from running forever. Just let them run. They will all stop at 2.5 hours anyway and you get credit for them.
But this limit isn't working. I have a WU at 3 hours 31 minutes and it reports 7 days remaining. The server gives a timed-out message, and even after clicking Update on the local machine the work unit is neither completed nor cancelled server-side. I will have to abort this WU manually. My machine is still working on it, yet it has already been given to someone else's machine: http://mindmodeling.org//workunit.php?wuid=21923780
5)
The Deadline is too short !!!
(Message 3676)
Posted 7 Feb 2016 by marmot
This is happening to me today. I have several WUs that the machine requested and needed 3 hours to do; it got 2 done, and then the server aborted the rest (timed out) while it was still working on them. The deadline was about 3 to 5 hours on about 20 WUs.
The machine was still working on these (it had been busy with an earlier one that was already past its short deadline), but I got this in the logs:
12826265 21911390 6 Feb 2016, 23:46:44 UTC 7 Feb 2016, 6:19:28 UTC Timed out - no response 0.00 0.00 --- Native Python v2.7 Application (Windows Only) v1.10 (sse2)
12826370 21903657 6 Feb 2016, 23:45:53 UTC 7 Feb 2016, 6:18:37 UTC Timed out - no response 0.00 0.00 --- Native Python v2.7 Application (Windows Only) v1.10 (sse2)
Looks like the last 50+ WUs all have 5 to 6 hour deadlines.
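The gap between the two timestamps in each log row above can be checked with a short calculation. This is only a sketch: it assumes the first timestamp in a row is the time the task was sent and the second is its deadline, and the names `LOG_TIME_FORMAT` and `deadline_window` are made up for illustration.

```python
from datetime import datetime

# Timestamp layout copied from the log rows, e.g. "6 Feb 2016, 23:46:44 UTC".
# %b relies on English month abbreviations (the default C locale).
LOG_TIME_FORMAT = "%d %b %Y, %H:%M:%S UTC"


def deadline_window(sent, deadline):
    """Return the time between a task being sent and its deadline."""
    sent_at = datetime.strptime(sent, LOG_TIME_FORMAT)
    due_at = datetime.strptime(deadline, LOG_TIME_FORMAT)
    return due_at - sent_at


# (first ID in the row, assumed sent time, assumed deadline)
rows = [
    ("12826265", "6 Feb 2016, 23:46:44 UTC", "7 Feb 2016, 6:19:28 UTC"),
    ("12826370", "6 Feb 2016, 23:45:53 UTC", "7 Feb 2016, 6:18:37 UTC"),
]

windows = {row_id: deadline_window(sent, due) for row_id, sent, due in rows}
for row_id, window in windows.items():
    print(row_id, window)
```

Under that reading, both rows come out to a window of 6 hours 32 minutes 44 seconds.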
6)
WU remains in RAM even when BOINC.EXE settings deny this behavior
(Message 3669)
Posted 4 Feb 2016 by marmot
I'll take a look at this and let you know if I find anything. Have you seen the same thing?
I have now verified this on another 2 machines, across Windows Vista, 7, 8 and 10. The WU will not leave RAM when BOINC as a whole is suspended, nor when MindModeling is specifically suspended. Up to 900 MB of working RAM can stay locked and unreclaimed unless you use a process manager to reclaim it. Of course, using a process manager to kill the WU ends in a computation error and lost time. This is bad behavior for an application that shares a machine with other projects or with the user's non-BOINC applications.
7)
WU RAM Private BYTEs/Working Set growing until machines hits swapfile crisis
(Message 3662)
Posted 27 Jan 2016 by marmot
I've been thinking this over. Let's say there are 4 WUs currently processing in RAM. If the WUs detect that the computer has at most 3 GB of RAM available on a system with 4 GB of physical memory, the running WUs could send their current Private Bytes figures to each other. They could poll each other for current progress and give the one furthest along the go-ahead to use all 1,069 MB of RAM it needs, while reducing the Private Bytes of the other 3 WUs.
Claiming all workspace at once would be detrimental to throughput on low-RAM systems for MindModeling WUs. But the WUs need to poll each other, because when all 4 WUs reach their peak workspace requirements together (and that's likely, given that they would all start at the same time when the BOINC client first starts up), they bring a low-RAM system to its knees and leave it gasping for breath.
BTW, I now have a hard drive failing on a 4 GB system that was running MindModeling last week. I didn't notice that it had been crippled by a swap crisis for days at a time (it's so quiet), and the drive is now clunking and steadily losing sectors. I would never have run this project if I'd known that each WU ultimately needs 1 GB, or at least I could have limited it to 2 max concurrent. You really need to give a warning about RAM requirements in the preferences section!
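The polling idea described above might look roughly like the sketch below. It is not how the MindModeling application works; it only illustrates the proposal. The `Task` class, the `plan_memory` function, the progress values and the rule of splitting the remainder evenly are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress: float  # fraction complete, 0.0 to 1.0 (hypothetical values below)
    peak_mb: int     # memory the task needs at its peak


def plan_memory(tasks, available_mb):
    """Grant the furthest-along task its full peak and split the rest evenly."""
    if not tasks:
        return {}
    leader = max(tasks, key=lambda task: task.progress)
    grants = {leader.name: min(leader.peak_mb, available_mb)}
    others = [task for task in tasks if task is not leader]
    remaining = available_mb - grants[leader.name]
    share = remaining // len(others) if others else 0
    for task in others:
        # No task is granted more than its own peak requirement.
        grants[task.name] = min(task.peak_mb, share)
    return grants


# 4 WUs, 3 GB available, 1,069 MB peak each (figures from the post).
tasks = [
    Task("wu_a", 0.80, 1069),
    Task("wu_b", 0.50, 1069),
    Task("wu_c", 0.40, 1069),
    Task("wu_d", 0.20, 1069),
]
grants = plan_memory(tasks, available_mb=3 * 1024)
print(grants)
```

With these numbers the furthest-along WU gets its full 1,069 MB and the other three are capped at 667 MB each, which keeps the total under the 3,072 MB available.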
8)
WU remains in RAM even when BOINC.EXE settings deny this behavior
(Message 3659)
Posted 26 Jan 2016 by marmot
"Leave non-GPU tasks in memory while suspended" make sure you don't have that checked on.
That box is not checked. I was careful and verified that before posting this bug report.
9)
One WU says it will take 4+ days to complete - normal time is < 45 minutes
(Message 3654)
Posted 25 Jan 2016 by marmot
Yeah, I just aborted a WU like this. It was suspended in RAM after the machine ran out of all available memory (swap + physical), and I think it's related to the growing memory issue.
10)
WU remains in RAM even when BOINC.EXE settings deny this behavior
(Message 3653)
Posted 25 Jan 2016 by marmot
If you suspend the client through BOINC Manager, or if MindModeling WUs get suspended because of a 30 or 60 minute work swap, they remain in RAM, taking up resources that another project needs, even though the setting that allows work to remain in RAM is specifically unchecked in BOINC Manager. Verified on 3 machines.
11)
WU RAM Private BYTEs/Working Set growing until machines hits swapfile crisis
(Message 3652)
Posted 25 Jan 2016 by marmot
Another machine had this WU app version sitting suspended in RAM. I'm not sure whether BOINC.EXE or Windows decided to suspend it when the machine ran out of RAM and swap space, but the WU was taking up a slot and sitting idle. I unsuspended it and it reported 2 hours of progress and 4 days 7 hours until completion. The completion date kept lengthening, so I aborted that one.
12)
WU RAM Private BYTEs/Working Set growing until machines hits swapfile crisis
(Message 3650)
Posted 25 Jan 2016 by marmot
I was noticing degradation in the turnaround time of WUs from a few machines, along with heavy swapfile usage, so I had to spend the last week trying to diagnose the issue. It turned out to come from two different projects. Citizengrid Gibbs WUs are variable in size, with a few coming down that are 950 MB apiece. I thought that the issue was solved after app_config.xml and work fetch changes, but it wasn't...
I got desperate enough to just sit and watch Process Explorer on my HP-8560p (Win 7 SP1) for 30 to 45 minutes (while listening to the radio) and noticed this project's Python 2.7 WUs (\python2.7_1.1_env\bin\python2.7.exe" mm_python_bridge.py crisp-20160110-1-14-g350165a antisaccade run_mm 2) gradually growing their Private Bytes and Working Set. I had noticed this before but didn't dream that it would steadily grow from 45 MB up to 968 MB.
Is this a memory leak or intentional?
If it's intentional, it would be great if the preferences gave a warning about the maximum WU size somewhere, so we can set our app_config.xml limits on the number of WUs accepted in advance, or opt out of particular WU apps knowing that they require too much RAM for certain machines.
What would be highly desirable for us volunteer crunchers, as we allocate our RAM and computer resources, is for MindModeling WUs to claim their maximum Private Bytes from the beginning rather than growing them as the WU progresses.
This kind of swapfile crisis can be especially degrading to machines that use only solid state drives, so knowing the amount of RAM that WUs require helps people protect their hardware investments as they volunteer their equipment.
EDIT: Maybe this is a memory leak, as the same application on this Dell m6500 (Win 7, no Windows updates beyond SP1) is holding steady at Working Sets of 37 MB, 35 MB and 31 MB on 3 different WUs. But the same growing RAM behavior is currently occurring on the 1090t (Win X, patched to last month) machine in 3 separate WUs.
EDIT2: I watched the WU on the 1090t grow until it hit a 1,069 MB working set and then crashed. It was the WU that was reporting 27 minutes of progress and 19 minutes remaining at 80% progress in BOINC Manager. I'll check to see if it is reported as an error in the client's data set. The WU is MindModeling_4959_327356a61defe3e31, and it is reported as valid and successful even though it dropped out at 80% completion and grew too large to stay in memory. If they are all going to grow to 1,070 MB, then I'll have to limit machines to <project_max_concurrent>2</project_max_concurrent> to protect the hard drives.
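As a rough illustration of where a limit of 2 could come from, the sketch below divides an assumed amount of usable RAM by the 1,070 MB peak mentioned above and writes the result into a minimal app_config.xml body. The 3 GB figure is an assumption made for the example, not something measured on these machines, and `max_concurrent` and `build_app_config` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def max_concurrent(available_mb, peak_mb_per_task):
    """How many tasks fit in RAM if each one grows to its peak (at least 1)."""
    return max(1, available_mb // peak_mb_per_task)


def build_app_config(limit):
    """Return a minimal app_config.xml body that caps concurrent tasks."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(limit)
    return ET.tostring(root, encoding="unicode")


# Assumed: about 3 GB usable for BOINC; 1,070 MB peak per WU (from the post).
limit = max_concurrent(available_mb=3 * 1024, peak_mb_per_task=1070)
config_xml = build_app_config(limit)
print(config_xml)
```

With those inputs the limit works out to 2, since 3,072 MB holds two 1,070 MB tasks but not three.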