
v3 MPI desynchronization

We are having a desynchronization problem in HERMES_GR v3, and I don't know whether it comes from HERMES directly or from NES.

What do I mean by desynchronization?

At a certain point I do a broadcast (comm.bcast(data, root=0)) and I receive the wrong message.

I created a sync_check function that creates a custom object, broadcasts it, and aborts if the received object is not the expected one:

from mpi4py import MPI
from mpi4py.MPI import Comm, COMM_WORLD


class MyObject:
    def __init__(self):
        self.name = "TestObject"


def sync_check(comm: Comm = COMM_WORLD, msg: str = "", abort: bool = False) -> None:
    """
    Prints a synchronization message with the MPI rank, flushes stdout,
    applies MPI barriers before and after, and then verifies that all
    ranks are executing the same broadcast.

    Parameters
    ----------
    comm : MPI.Comm
        The MPI communicator to synchronize across.
    msg : str
        The message to print between the barriers.
    abort : bool
        If True, stops the execution after the check.
    """
    comm.Barrier()
    print(f"[Rank {comm.Get_rank()}] {msg}", flush=True)
    comm.Barrier()

    check_bcast(comm, msg=msg)

    if abort:
        print(f"[Rank {comm.Get_rank()}] ABORTING {msg}", flush=True)
        comm.Abort(1)


def check_bcast(comm: Comm = COMM_WORLD, msg: str = "") -> bool:
    # Broadcast a sentinel object; every rank must receive a MyObject,
    # otherwise the ranks are not all at the same broadcast call.
    if comm.Get_rank() == 0:
        a = MyObject()
    else:
        a = None
    a = comm.bcast(a, root=0)
    if not isinstance(a, MyObject):
        print(f"FAIL {msg}: Received object {a}", flush=True)
        comm.Abort(1)
    return True
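
The idea is to sprinkle these calls between program phases so that the first failing check brackets the step where the desynchronization starts. For example (the step name below is just a placeholder, not a real HERMES/NES function):

sync_check(msg="before creating the weight matrix")
create_weight_matrix()  # placeholder for the real HERMES/NES step
sync_check(msg="after creating the weight matrix")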

So, at a certain point, when I run that check I receive a dictionary with the latitude information instead of MyObject.

The really weird thing is that I'm only using blocking communication in the whole HERMES/NES environment. If I'm not wrong, with blocking communication a message cannot sit in limbo waiting to be received, so I don't know where this latitude message comes from...
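
That said, bcast is a collective, and collectives on a communicator match purely by call order, not by tags: if one rank skips (or adds) a collective, every subsequent collective on that rank matches the "wrong" one on the root. Here is a minimal standalone sketch (the payloads are made up, this is not HERMES code) that reproduces exactly this symptom:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 still thinks it has to broadcast the latitudes...
    comm.bcast({"lat": [10.0, 20.0, 30.0]}, root=0)
    # ...and only then runs the sentinel broadcast.
    comm.bcast("SENTINEL", root=0)
else:
    # This rank skipped the latitude broadcast, so its first bcast
    # matches rank 0's latitude broadcast: it receives the dictionary
    # where it expected the sentinel.
    a = comm.bcast(None, root=0)
    print(f"[Rank {rank}] expected 'SENTINEL', got {a!r}", flush=True)
    # The sentinel finally arrives one call too late.
    comm.bcast(None, root=0)

In other words, nothing has to sit "in limbo" for this to happen: it is enough that one rank is one collective ahead of (or behind) the others.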

A brief explanation of how it works (or how it can work)

I create a NES (gridded/NetCDF) file using domain decomposition with MPI.COMM_WORLD. (In the future I may use another, smaller communicator, but for now we will keep it simple with COMM_WORLD.)

At a certain point the master (rank == 0) creates a copy of this object and its metadata, and I substitute the internal communicator with COMM_SELF to be able to write that file serially. (At that point I have both objects: the parallel one with the data (and metadata) and the serial one with only the metadata. I gather the parallel data and write it using the serial one.)
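
As a sketch of that pattern (FakeNes is a made-up stand-in, not the real NES class):

import copy
import numpy as np
from mpi4py import MPI


class FakeNes:
    """Made-up stand-in for the NES object, for illustration only."""
    def __init__(self, comm):
        self.comm = comm
        # Each rank owns one row of the decomposed field.
        self.data = np.full((1, 4), comm.Get_rank(), dtype="f8")


comm = MPI.COMM_WORLD
nessy = FakeNes(comm)

# Every rank participates in the gather (a collective on COMM_WORLD)...
pieces = comm.gather(nessy.data, root=0)

# ...but only rank 0 builds the serial twin and writes.
if comm.Get_rank() == 0:
    serial = copy.copy(nessy)        # keeps the metadata
    serial.comm = MPI.COMM_SELF      # swap in the serial communicator
    serial.data = np.vstack(pieces)  # the full field now lives on rank 0
    print(serial.data, flush=True)   # stand-in for the serial NetCDF write

One thing worth checking in the real code: if the copy function itself performs any communication on the parallel communicator but is only called on rank 0, that alone would shift the collective call order on rank 0 relative to the other ranks.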

At other steps of the program I also need to apply this methodology to create an auxiliary serial file to work with, e.g. while calculating the weight matrix for a horizontal interpolation using the nearest-neighbour methodology (I need all the input and output points in the same process to calculate the smallest distances between points).
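
For reference, once all the points are on one process the nearest-neighbour lookup itself is simple; a sketch with scipy (random points and plain Euclidean distance rather than great-circle distance, so purely illustrative):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
src = rng.random((1000, 2))  # (lon, lat) of the input points
dst = rng.random((200, 2))   # (lon, lat) of the output points

tree = cKDTree(src)
# idx[i] is the index of the input point closest to output point i,
# which is all the "weight matrix" needs for nearest neighbour.
dist, idx = tree.query(dst, k=1)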


If I run HERMES step by step in different jobs (creating the weight matrix in one job and using it in another) it works, but I also want to be able to run everything from scratch in a single run.

I don't really know why this error is happening, because with the blocking communication semantics I'm relying on it should not be possible.

I have a hypothesis about where (not why) it is happening: it may come from the copy function of the NES object, because the error is sensitive to changes in that function.

So different approaches can be applied:

  1. Reproduce the error in a snippet and investigate it further with prints and so on.
  2. Use Extrae to trace all the MPI messages that are occurring.
  3. Change all broadcast messages to send/recv with tags (see the sketch after this list).
  4. ...
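
For option 3, a possible drop-in replacement (a hypothetical helper, not existing HERMES/NES code) could look like this:

from mpi4py import MPI


def tagged_bcast(data, comm=MPI.COMM_WORLD, root=0, tag=0):
    """Hypothetical bcast replacement built on tagged point-to-point
    messages: a tag mismatch makes the recv hang instead of silently
    delivering the wrong payload, which is much easier to diagnose."""
    rank = comm.Get_rank()
    if rank == root:
        for dest in range(comm.Get_size()):
            if dest != root:
                comm.send(data, dest=dest, tag=tag)
        return data
    return comm.recv(source=root, tag=tag)

Giving every broadcast site its own tag would then turn the silent payload mix-up into a reproducible hang at the first mismatched call.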

I talked to @fmacchia and @gmontane and they said I should use "the Phone a Friend lifeline" — that’s you, @hross ! Hoping you’ve got some useful tips for me.

FYI: @cpinero @avinas @lrizza @jgehlen