Questions and Answers : Windows : Hanging Process

ebahapo
Joined: 31 Jan 08
Posts: 12
Credit: 22,898
RAC: 0
Message 1272 - Posted: 18 Mar 2009, 17:09:44 UTC
Last modified: 18 Mar 2009, 17:15:20 UTC

I've noticed that MM WUs often get stuck on my systems. It looks like the wrapper loses track of the worker application, which seems to finish normally. The wrapper then fails to report progress and status to the BOINC client, so the WU keeps occupying a slot until well past its deadline (e.g., this WU).

I suspect that it happens when an MM WU is suspended but kept in memory. It seems that the worker application keeps running while the wrapper, being suspended, misses the worker's completion signal.
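To make the hypothesis above concrete, here is a minimal Python sketch of a supervisor that is paused while its child finishes. It is not the MM wrapper's code; the function name `exit_code_after_pause` and the `pause` parameter are made up for illustration. It only shows that a child's exit status stays available to a parent that asks for it later, so a supervisor that queries the status should not lose the completion just because it was paused. If the real wrapper relies on a one-time notification instead, that would be consistent with the behaviour described, but that is a guess.

```python
import subprocess
import sys
import time


def exit_code_after_pause(cmd, pause=0.5):
    """Start a child, stay idle for `pause` seconds, then collect its exit code.

    The idle period stands in for a suspended supervisor (an assumption
    made for illustration; this is not how BOINC suspends tasks).
    """
    child = subprocess.Popen(cmd)
    # The child may finish while the supervisor is idle.
    time.sleep(pause)
    # wait() returns the stored exit status even if the child ended earlier.
    return child.wait()


if __name__ == "__main__":
    # A short-lived child that exits with status 3.
    worker = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(exit_code_after_pause(worker))
```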

Killing the wrapper frees the slot, but the problem recurs with about 10% of the WUs.

Please advise.
____________

ebahapo
Message 1274 - Posted: 18 Mar 2009, 18:36:02 UTC
Last modified: 18 Mar 2009, 18:42:51 UTC

Here's an update on something I just observed with two MM WUs, one running and the other suspended in memory.

While I was watching the system with BOINCView and Process Explorer, the suspended WU suddenly vanished from Process Explorer (all of its processes: the wrapper, the worker and the watchdog), yet both BOINCView and the official BOINC Manager still report it as active.

I wonder if the watchdog killed the suspended WU even though it was suspended...

Here's the suspended WU's error output:

ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6
ACTR: Trace 8
ACTR: Trace 9
ACTR: Trace 10 -- Lisp Running
ACTR: Trace 11 -- Watchdog Running (if Win32)
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting

I'll wait for the running WU to finish to see what happens to the suspended one.
____________

ebahapo
Message 1275 - Posted: 18 Mar 2009, 18:54:55 UTC
Last modified: 18 Mar 2009, 18:55:41 UTC

Now that the running WU has finished, the suspended WU was restarted, as its error output shows:

ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6
ACTR: Trace 8
ACTR: Trace 9
ACTR: Trace 10 -- Lisp Running
ACTR: Trace 11 -- Watchdog Running (if Win32)
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6
ACTR: Trace 8
ACTR: Trace 9
ACTR: Trace 10 -- Lisp Running
ACTR: Trace 11 -- Watchdog Running (if Win32)

Notice that the usual start-up log was appended to the existing error output.
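Since each restart appends a fresh start-up log to the same error output, the number of restarts can be read off the text. The Python sketch below counts start-ups and heartbeat timeouts and reports the last trace number reached in the most recent run. The marker strings are copied from the output above; the function name `summarize_stderr` and the returned keys are made up for illustration, and treating `boinc_init_options complete` as the start of a run is an assumption.

```python
import re

# Marker strings copied from the error output in this thread.
START_MARKER = "ACTR: boinc_init_options complete"
HEARTBEAT_MARKER = "No heartbeat from core client for 30 sec - exiting"


def summarize_stderr(text):
    """Count start-ups and heartbeat timeouts in a task's error output."""
    starts = text.count(START_MARKER)
    timeouts = text.count(HEARTBEAT_MARKER)
    # Only look at what was logged after the most recent start-up marker.
    last_run = text.rsplit(START_MARKER, 1)[-1] if starts else ""
    traces = [int(n) for n in re.findall(r"ACTR: Trace (\d+)", last_run)]
    return {
        "starts": starts,
        "heartbeat_timeouts": timeouts,
        "last_trace": max(traces) if traces else None,
    }
```

It works whether the messages are on separate lines or run together on one line, because it searches for substrings rather than splitting on line breaks.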
____________

ebahapo
Message 1276 - Posted: 18 Mar 2009, 19:15:00 UTC

After restarting, the suspended WU finished successfully.

I'll keep an eye out for hung wrappers.
____________

ebahapo
Message 1277 - Posted: 18 Mar 2009, 23:18:13 UTC

Here's the error output of a stuck process:

ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6
ACTR: Trace 8
ACTR: Trace 9
ACTR: Trace 10 -- Lisp Running
ACTR: Trace 11 -- Watchdog Running (if Win32)
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6
ACTR: Trace 8
ACTR: Trace 9
ACTR: Trace 10 -- Lisp Running
ACTR: Trace 11 -- Watchdog Running (if Win32)
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
ACTR: boinc_init_options complete
ACTR: boinc_get_init_data(actr_aid) complete
ACTR: Trace 1
ACTR: Trace 2
ACTR: Trace 3
ACTR: Trace 4
ACTR: Trace 5
ACTR: Trace 6

I'll leave it alone and see how the BOINC client deals with it.

PS: this is the WU in question.


____________

ebahapo
Message 1279 - Posted: 19 Mar 2009, 22:07:04 UTC

It looks like the zombie processes are eventually purged.

____________

ebahapo
Message 1281 - Posted: 20 Mar 2009, 21:19:37 UTC - in response to Message 1279.
Last modified: 20 Mar 2009, 21:20:01 UTC

It looks like the zombie processes are eventually purged.

Well, not really. I have a couple of dangling MM processes, one for 24h, the other for 48h. They keep two files in their BOINC slots:

  • boinc_lockfile
  • stderr.txt


The latter is updated every few seconds with a new line:

No heartbeat from core client for 30 sec - exiting

Consequently, the file just keeps growing. One is at 4 MB, the other at 8 MB.
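For anyone who wants to spot such slots without opening each file by hand, here is a rough Python sketch. It assumes each slot is a subfolder containing a `stderr.txt`, as described above; the function name `find_stuck_slots`, the `min_timeouts` threshold and its default of 10 are hypothetical choices for illustration, not anything BOINC provides.

```python
from pathlib import Path

# Prefix of the message that the stuck processes keep appending.
HEARTBEAT_MARKER = "No heartbeat from core client"


def find_stuck_slots(slots_root, min_timeouts=10):
    """Return (slot name, stderr size in bytes, timeout count) for suspect slots.

    A slot is reported when its stderr.txt holds at least `min_timeouts`
    heartbeat messages (an arbitrary threshold chosen for this sketch).
    """
    stuck = []
    for stderr_path in sorted(Path(slots_root).glob("*/stderr.txt")):
        with stderr_path.open(errors="replace") as f:
            timeouts = sum(1 for line in f if HEARTBEAT_MARKER in line)
        if timeouts >= min_timeouts:
            size = stderr_path.stat().st_size
            stuck.append((stderr_path.parent.name, size, timeouts))
    return stuck
```

Pass it the slots folder inside your BOINC data directory. Running it twice a few minutes apart and comparing the sizes shows whether a file is still growing.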

Bottom line: I cannot afford to run this project.

Please advise.
____________

Copyright © 2023 MindModeling.org