PsPlatform does not retrieve logs, and log recovery process always exits before the running job finishes

Hello @dbeltran,

Autosubmit Version

4.1.11, master

Summary

Found while testing !475, using a PsPlatform with a single job A that runs sleep 3600.

The logs are created on the remote platform, but they are never transferred to the local computer (possibly related to #1469). Furthermore, since I had a breakpoint exactly where recover_platform_job_logs exits with Log.info(f"{identifier} Exiting."), I could confirm that the process exits before task A finishes.

Finally, looking at watch -n 2 pgrep --list-full autosubmit, the log recovery process briefly became defunct (a zombie) and was later reaped (no biggie, IMHO).
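For context on the defunct state mentioned above, here is a minimal sketch of how a child process becomes a zombie and is then reaped. It is not Autosubmit code. The helper name run_and_reap is hypothetical, and the zombie window only exists on POSIX systems.

```python
import subprocess
import sys
import time


def run_and_reap(delay=0.2):
    """Start a short-lived child, leave it unreaped for a moment, then reap it."""
    # The child exits almost immediately.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # On POSIX, until the parent collects the exit status, the kernel keeps
    # a "defunct" entry for the child (what pgrep/ps show as a zombie).
    time.sleep(delay)
    # wait() collects the exit status, which removes the defunct entry.
    return child.wait()


if __name__ == "__main__":
    print("child exit status:", run_and_reap())
```

A brief defunct entry like this is harmless as long as the parent eventually waits on the child, which matches the "later reaped" observation.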

Assigned to 4.1.12, but we can postpone this one since the slurm and local platforms are working.

Steps to reproduce

EXPERIMENT:
   DATELIST: 20221101 #Startdate
   MEMBERS: fc0
   CHUNKSIZEUNIT: month
   # SPLITSIZEUNIT: day
   NUMCHUNKS: 2
   CHUNKSIZE: 2
   #SPLITSIZE: 1
   CHUNKINIT: ''
   SPLITPOLICY: flexible
   CALENDAR: standard
JOBS:
  A:
    FILE: wait_1_hour.sh
    RUNNING: chunk
    PLATFORM: fake-ssh
PLATFORMS:
  fake-ssh:
    TYPE: ps
    HOST: fake-ssh
    SCRATCH_DIR: /tmp
    USER: autosubmit

The remote platform is launched with the command below. The keys were generated beforehand.

$ docker run --rm --name=openssh-server --hostname=fake-ssh -e PUID=1000 -e PGID=1000 -e TZ=Etc/UTC -e PUBLIC_KEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAKttlfnSMiQNWz65F9QtQwN4dFLY+G3a66aDqUx5qrd root@9bbf3b8e3f0d" -e SUDO_ACCESS=false -e PASSWORD_ACCESS=false -e USER_PASSWORD=autosubmit -e USER_NAME=autosubmit -e LOG_STDOUT= -p 2222:2222 lscr.io/linuxserver/openssh-server:latest

Note that I am using the code without modifications (yesterday I was toying with timeouts, but today I have no changes in master).

What is the current bug behavior?

No logs are transferred, no error message is shown, and the log recovery process exits before the task finishes running.

What is the expected correct behavior?

Logs are transferred, and the log recovery process runs (I think) alongside the main scheduler process, stopping at the same time or later: once the task whose logs it is monitoring has finished and the platform has nothing else to run.
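To make the expected lifecycle concrete, here is a minimal sketch of a recovery loop that exits only after the scheduler has finished and no job is still waiting for its logs. It is an illustration under assumptions, not Autosubmit's implementation. The names recovery_loop, scheduler_done and fetch_logs are hypothetical, and a thread stands in for the separate log recovery process.

```python
import queue
import threading


def recovery_loop(jobs, scheduler_done, fetch_logs, poll_seconds=0.05):
    """Hypothetical worker: retrieve logs until the scheduler is done AND the queue is empty."""
    recovered = []
    while True:
        # Read the flag before polling, so a job queued just before the
        # flag was set is still picked up by the get() below.
        done = scheduler_done.is_set()
        try:
            job = jobs.get(timeout=poll_seconds)
        except queue.Empty:
            if done:
                break  # scheduler finished and nothing is left to recover
            continue  # scheduler still running: keep waiting for jobs
        fetch_logs(job)  # assumption: copies the remote logs to the local machine
        recovered.append(job)
    return recovered


if __name__ == "__main__":
    jobs = queue.Queue()
    scheduler_done = threading.Event()
    fetched = []
    worker = threading.Thread(
        target=recovery_loop, args=(jobs, scheduler_done, fetched.append)
    )
    worker.start()
    jobs.put("A")         # job A finishes and is queued for log retrieval
    scheduler_done.set()  # only now may the worker exit
    worker.join()
    print(fetched)
```

The point of the sketch is the exit condition: an empty queue alone is not a reason to stop, because a long-running job such as A has not produced its final logs yet.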

Relevant logs and/or screenshots (if applicable)

NA

Any other relevant information (if applicable)

()