Discussion:
Problem with 4.6.x MPI, thread affinity, slurm and node-uneven task spread
Åke Sandgren
2014-10-02 13:57:32 UTC
Hi!

Just managed to pin down a weird problem: an uneven spread of tasks over
nodes, combined with thread affinity, causes jobs to hang in
gmx_set_thread_affinity.

This happens on our 48-core nodes with a 100-task job that, when
submitted through Slurm (without specifying the distribution manually), gets
spread over 3 nodes as 6+47+47 tasks.
We are also using cgroups to allow multiple jobs per node, so the
node with 6 tasks has an affinity mask covering only the 6 cores on a
single NUMA node. The nodes with 47 tasks have the whole node allocated and
thus get a full 48-core affinity mask.

(Actually, due to a bug(/feature?) in Slurm, the tasks on the node with
only 6 cores allocated get a single-core-per-task affinity, but
that's not relevant here.)
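
For illustration only, here is a minimal standalone sketch (not the actual
GROMACS code) of the kind of per-process check this boils down to: compare
the affinity mask of the process against the number of cores on the node.
On the 6-task node the mask covers only 6 of the 48 cores, so the check
described below comes out FALSE there.

    /* Sketch only: roughly what a "is the full affinity mask set?" check
     * looks like.  On the restricted node CPU_COUNT() returns 6, not 48. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        {
            perror("sched_getaffinity");
            return 1;
        }
        long ncores   = sysconf(_SC_NPROCESSORS_ONLN); /* 48 on these nodes */
        int  nallowed = CPU_COUNT(&mask);              /* 6 vs. 48 here     */
        int  all_set  = (nallowed >= ncores);          /* per-rank result   */
        printf("cores online: %ld, allowed by mask: %d, all set: %d\n",
               ncores, nallowed, all_set);
        return 0;
    }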

Anyway, the problems start when the code reaches line 1629 in runner.c
(this is 4.6.7) and the call to gmx_check_thread_affinity_set.

The loop that sets bAllSet ends up with TRUE for the tasks on
the two fully allocated nodes and FALSE for the tasks on the third node.
This in turn changes hw_opt->thread_affinity to threadaffOFF on those 6
tasks, but leaves it at threadaffAUTO for the other 2x47 tasks.

gmx_set_thread_affinity then promptly returns for those poor 6 tasks,
while the remaining tasks try in vain to do an MPI_Comm_split with 6 tasks
missing from the equation...
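
To see why that deadlocks, here is a minimal standalone MPI sketch of the
same pattern (not GROMACS code; the flag name is made up): a locally
computed decision differs between ranks, so some ranks skip a collective
call that the rest then block in forever.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Pretend the first 6 ranks saw an external affinity mask and
         * therefore decided not to pin (threadaffOFF). */
        int locally_decided_to_pin = (rank >= 6);

        if (locally_decided_to_pin)
        {
            /* MPI_Comm_split is collective over MPI_COMM_WORLD; with 6
             * ranks skipping it, the participating ranks never return.
             * Run with e.g. mpirun -np 100 to mimic the 6+47+47 job. */
            MPI_Comm split_comm;
            MPI_Comm_split(MPI_COMM_WORLD, rank / 48, rank, &split_comm);
            MPI_Comm_free(&split_comm);
        }

        MPI_Finalize();
        return 0;
    }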

I suggest gathering the bAllSet result from all ranks in
gmx_check_thread_affinity_set and making sure all tasks have the same view
of the world...
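
A rough sketch of what I mean, assuming an MPI build (the helper name is
made up, and the real change would of course live inside
gmx_check_thread_affinity_set): reduce the per-rank bAllSet with a logical
AND so every rank ends up with the same value and makes the same
threadaffOFF/threadaffAUTO decision.

    #include <mpi.h>
    #include <stdbool.h>

    /* Hypothetical helper, not an existing GROMACS function: combine the
     * per-rank bAllSet with a logical AND across all ranks.  If any rank
     * has an externally restricted affinity mask, every rank sees FALSE
     * and all of them switch to threadaffOFF together, so nobody is left
     * waiting in the MPI_Comm_split inside gmx_set_thread_affinity. */
    static bool gather_all_set(bool ballset_local_in)
    {
        int ballset_local  = ballset_local_in ? 1 : 0;
        int ballset_global = 0;

        MPI_Allreduce(&ballset_local, &ballset_global, 1, MPI_INT, MPI_LAND,
                      MPI_COMM_WORLD);

        return ballset_global != 0;
    }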
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Szilárd Páll
2014-10-02 14:32:00 UTC
Thanks for the detailed report! Could you please file a redmine issue?
redmine.gromacs.org
--
Szilárd
Åke Sandgren
2014-10-02 14:46:16 UTC
Post by Szilárd Páll
Thanks for the detailed report! Could you please file a redmine issue?
Done, issue 1613.
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se