Autosubmit is unable to run on BSC machines except bsceshub.

Hello @dbeltran and @bdepaula,

I have an issue, posted in hpcissues on Slack by @cmouchel, that I think is better followed up here. Basically, I cannot run my experiments due to a database-related error; the error code is given below. Because it appeared after the esarchive quota issue, I'm not sure whether it is specific to my experiments or something more general.

Autosubmit Version

3.15.19

Expid affected (If applicable)

a82c a82d a82e a82f

Which task has issues? Where is the log (If applicable)

All of them (the whole experiment)

Summary

Since last weekend (esarchive quota issue) I have been unable to restart my experiments.

Steps to reproduce

autosubmit run EXPID

with EXPID=a82[cdef]

What is the current bug behavior?

The experiments do not run.

What is the expected correct behavior?

The experiments run.

Relevant logs and/or screenshots (if applicable)

From the autosubmit01 and autosubmit02 machines:

autosubmit_a82c.conf OK
Configuration files OK

 [CRITICAL] We have detected that there is another Autosubmit instance using the experiment
. Stop other Autosubmit instances that are using the experiment or delete autosubmit.lock file located on tmp folder [eCode=7000]

Any other relevant information (if applicable)

there is no autosubmit.lock file in the folder
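
A minimal sketch of how the absence of the lock file could be double-checked across the whole experiment folder rather than a single subfolder. The helper name `find_lock_files` and its parameters are hypothetical and not part of Autosubmit; a real call would point at the experiment folder (for example a82c under /esarchive/autosubmit).

```python
import os
import tempfile


def find_lock_files(exp_root, lock_name="autosubmit.lock"):
    """Return the sorted paths of every file called lock_name under exp_root."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(exp_root):
        if lock_name in filenames:
            matches.append(os.path.join(dirpath, lock_name))
    return sorted(matches)


# Demo on a throwaway directory tree (hypothetical layout, not a real experiment).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "tmp"))
open(os.path.join(demo_root, "tmp", "autosubmit.lock"), "w").close()
demo_hits = find_lock_files(demo_root)
```

An empty list for the real experiment folder would support the statement above; any hit would show where a leftover lock still lives.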
From the bsceshub it does try to run (at least it spends more time testing), but it fails due to lack of memory once it is running, or even before. For example:

Job a82d_20180724_047_11_SIM is COMPLETED
Job a82d_20180724_047_11_SIM finished at 2024-12-05 15:15:12
[ERROR] Trace: [Errno 12] Cannot allocate memory
 [CRITICAL] There is a bug in the code, please contact via gitlab [eCode=7070]

jescriba@bscearth343:/esarchive/autosubmit> tail a82c/mywd/nohup.out
load impi/2021.10.0 (PATH, MANPATH, LD_LIBRARY_PATH)
load mkl/2023.2.0 (LD_LIBRARY_PATH)
load UCX/1.15.0 (PATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH,
CPLUS_INCLUDE_PATH) 
load bsc/1.0 (PATH, MANPATH)
[eCode=6006]
[marenostrum5] Connection successful to host glogin1.bsc.es,glogin2.bsc.es,glogin3.bsc.es,glogin4.bsc.es
[ERROR] Trace:  
 [CRITICAL] This seems like a bug in the code, please contact AS developers [eCode=7070]
More info at https://autosubmit.readthedocs.io/en/v3.15.0/faq.html

I've tried without success:

autosubmit recovery -f -s (works, but it does not change the outcome when I then try to run)
autosubmit create
autosubmit dbfix ( Error: near line [all lines]: database is locked )
autosubmit pklfix
manually and temporarily cleaning/removing/renaming the pkl and tmp folders
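
Since dbfix reports "database is locked", one way to probe whether something still holds a lock is sketched below. It assumes the experiment database is an SQLite file, and the function name `is_sqlite_locked` is hypothetical (not an Autosubmit API). It only tries to take and release a write lock, without changing any data. Note that `sqlite3.connect` creates an empty file if the path does not exist, so it should only be pointed at an existing database.

```python
import os
import sqlite3
import tempfile


def is_sqlite_locked(db_path, timeout=0.1):
    """Return True if a write lock on db_path cannot be taken within timeout seconds."""
    # isolation_level=None puts the connection in autocommit mode,
    # so the explicit BEGIN/ROLLBACK below are the only transaction statements.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # asks for the write (RESERVED) lock
        conn.execute("ROLLBACK")         # releases it again, nothing is written
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc).lower():
            return True
        raise
    finally:
        conn.close()


# Demo on a throwaway database: a second connection holds an exclusive lock.
demo_db = os.path.join(tempfile.mkdtemp(), "demo.db")
holder = sqlite3.connect(demo_db, isolation_level=None)
holder.execute("CREATE TABLE t (x INTEGER)")
holder.execute("BEGIN EXCLUSIVE")
locked_while_held = is_sqlite_locked(demo_db)
holder.execute("ROLLBACK")
locked_after_release = is_sqlite_locked(demo_db)
holder.close()
```

If such a probe returned True with no Autosubmit instance running, that would point to a stale lock or another process holding the file rather than to the experiment itself.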

As I said before, I cannot properly test this on the hub, because AS seems to run out of memory with this number of jobs (sometimes it is killed by the OS). In contrast, the AS machines seem to allow experiments of up to about 11000 jobs. If you think the experiment is too large: I cannot split the workflow any further, because it becomes unmanageable in practice (I'm running bi-monthly now, and I need 5 years... and I have to repeat all runs 2 more times). In any case, the size limits of AS experiments could deserve a dedicated issue, separate from this one.

fyi @cmouchel @gmontane @avilamir
