PsPlatform does not retrieve logs, and log recovery process always exits before the running job finishes
Hello @dbeltran,
Autosubmit Version
4.1.11, master
Summary
Found while testing !475, using a PsPlatform with a single job A that runs sleep 3600.
The logs are created on the remote platform, but they are never transferred to the local computer (related to #1469, perhaps?). Furthermore, since I had a breakpoint exactly what the recover_platform_job_logs exits with Log.info(f"{identifier} Exiting."), I could confirm that the process exits before the task A finishes.
Finally, looking at watch -n 2 pgrep --list-full autosubmit, the log recovery became defunct/zombie briefly, and was later reaped (no biggie, IMHO).
Assigned 4.1.12, but we can postpone this one as slurm/local are working.
Steps to reproduce
EXPERIMENT:
DATELIST: 20221101 #Startdate
MEMBERS: fc0
CHUNKSIZEUNIT: month
# SPLITSIZEUNIT: day
NUMCHUNKS: 2
CHUNKSIZE: 2
#SPLITSIZE: 1
CHUNKINIT: ''
SPLITPOLICY: flexible
CALENDAR: standard
JOBS:
A:
FILE: wait_1_hour.sh
RUNNING: chunk
PLATFORM: fake-ssh
PLATFORMS:
fake-ssh:
TYPE: ps
HOST: fake-ssh
SCRATCH_DIR: /tmp
USER: autosubmit
The remote platform is launched with $ docker run --rm --name=openssh-server --hostname=fake-ssh -e PUID=1000 -e PGID=1000 -e TZ=Etc/UTC -e PUBLIC_KEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAKttlfnSMiQNWz65F9QtQwN4dFLY+G3a66aDqUx5qrd root@9bbf3b8e3f0d" -e SUDO_ACCESS=false -e PASSWORD_ACCESS=false -e USER_PASSWORD=autosubmit -e USER_NAME=autosubmit -e LOG_STDOUT= -p 2222:2222 lscr.io/linuxserver/openssh-server:latest. The keys were generated before.
Note that I am using the code without modifications (yesterday I was toying with timeouts, but today I have no changes in master).
What is the current bug behavior?
No logs transferred, no error message, the log recovery exited before the task finished running.
What is the expected correct behavior?
Logs transferred, and the log recovery process was (I think) expected to be executed alongside with the main scheduler process, stopping at the same time or later when the task it's monitoring logs finishes, and the platform has nothing else to run.
Relevant logs and/or screenshots(if applicable)
NA
Any other relevant information(if applicable)
()