Fix field output issues on ARCHER/shared filesystem

Dave Moxey requested to merge fix/fld-parallel into master

This MR fixes parallel .fld output on some shared filesystems. Currently, every process attempts to remove the entire output directory, or to create it if it does not exist. On ARCHER, this causes filesystem errors to be thrown once the run exceeds a certain number of cores (around 2,000). This MR fixes the problem by adopting a different strategy:

Removing existing files:

  • If --shared-filesystem is not enabled, proceed as normal, so as not to break distributed output (e.g. on cx1)
  • Otherwise, each process will try to delete a file named output.fld/P<rank>.fld.
  • The root process will then clean up any other files and remove the directory (or serial .fld file).

Creating a directory:

  • If --shared-filesystem is not enabled, proceed as normal, so as not to break distributed output (e.g. on cx1)
  • Otherwise, only the root (rank 0) process will create the directory.

Additional m_comm->Block() calls have been added to synchronise the processes where necessary, although the synchronisation could probably be optimised further.
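
To make the shared-filesystem path concrete, here is a minimal sketch of the three steps using std::filesystem. It is not the code in this MR: the helper names (RemoveOwnPartition, RootCleanup, RootCreateDirectory, PrepareSharedOutput) are hypothetical, the P<rank>.fld naming is taken literally from the description above, and the barrier callback is only a stand-in for a collective call such as m_comm->Block(). The placement of the two barriers is likewise an assumption about where synchronisation is needed.

```cpp
#include <cassert>
#include <filesystem>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helpers for illustration only; none of these names are
// taken from the actual code base.

// Step 1: every rank deletes only its own partition file,
// <outDir>/P<rank>.fld. A missing file is not treated as an error.
void RemoveOwnPartition(const fs::path &outDir, int rank)
{
    assert(rank >= 0);
    std::error_code ec; // non-throwing overload: failures are ignored
    fs::remove(outDir / ("P" + std::to_string(rank) + ".fld"), ec);
}

// Step 2: the root rank alone removes whatever is left, whether outDir
// is a directory from a parallel run or a single serial .fld file.
void RootCleanup(const fs::path &outDir, int rank)
{
    if (rank != 0)
    {
        return;
    }
    std::error_code ec;
    fs::remove_all(outDir, ec);
}

// Step 3: the root rank alone creates the fresh output directory.
void RootCreateDirectory(const fs::path &outDir, int rank)
{
    if (rank != 0)
    {
        return;
    }
    std::error_code ec;
    fs::create_directories(outDir, ec);
}

// All three steps in order. `barrier` stands in for a collective
// synchronisation point such as m_comm->Block().
void PrepareSharedOutput(const fs::path &outDir, int rank,
                         const std::function<void()> &barrier)
{
    RemoveOwnPartition(outDir, rank);
    barrier(); // all per-rank files are gone before root removes the rest
    RootCleanup(outDir, rank);
    RootCreateDirectory(outDir, rank);
    barrier(); // the directory exists before any rank writes into it
}
```

The point of the split is that each rank touches at most one file, and only one process ever issues directory-level operations, rather than thousands of processes racing to remove or create the same directory.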
