Maximum matrix sizes #4268
Comments
Are you using the "performance" examples or is this the regular |
I have tried with both, and I'm having issues in either case. Edit: But just to clarify, the particular example I gave with the 17 million unknowns was on the regular ex1p.
Can you step through on a debugger to see what is going on? My suspicion, if I understand correctly that the 17 million dof case is run on a single MPI rank, is that there is a quadrature data array being allocated (for example, for the element Jacobians) which might be larger than 2B entries and you are overflowing on the index. This happens for 3D Jacobians when
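For concreteness, a rough back-of-the-envelope check of the kind of index overflow described above might look like the following; the element count, rule size, and Jacobian layout are illustrative assumptions, not values taken from the reported run:

```cpp
#include <cstdint>
#include <cstdio>

// Estimate whether a per-quadrature-point Jacobian array could exceed the
// range of a 32-bit index. All values are hypothetical, for illustration only.
int main()
{
   const std::int64_t num_elements      = 70000000; // assumed element count
   const std::int64_t quad_pts_per_elem = 9;        // e.g. a 3x3 rule on quads
   const std::int64_t jac_components    = 2 * 2;    // dim x dim Jacobian in 2D

   const std::int64_t entries = num_elements * quad_pts_per_elem * jac_components;
   std::printf("entries = %lld, INT32_MAX = %d\n", (long long)entries, INT32_MAX);
   std::printf("overflows a 32-bit index: %s\n", entries > INT32_MAX ? "yes" : "no");
   return 0;
}
```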
Can you provide some more information, e.g. which initial mesh you are using, and how many MPI ranks? If I understand your configuration correctly, the serial mesh is refined until it has 10,000 elements, and then it is partitioned, and 7 parallel refinements are performed. For a 2D quad mesh, this will result in

Also, for better parallel load balancing, it is probably better to refine the mesh as much as possible in serial (before running out of memory on a single rank), and only then partition and switch to parallel refinements. The parallel refinements do not do any repartitioning or load balancing.
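For reference, each uniform refinement of a quad mesh multiplies the element count by 4, so starting from $N_0$ elements, $L$ parallel refinements give (using the 10,000-element starting point assumed in the comment above):

$$N_L = 4^L N_0, \qquad \text{e.g. } 4^7 \times 10{,}000 = 163{,}840{,}000 \text{ elements.}$$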
@Heinrich-BR, can you post the exact command line you use (and any modifications you made to ex1p)?
Hi everyone! Thank you for your support! Let me try to answer everything.
That's very interesting @sebastiangrimberg, and it does sound very likely. I've stepped through with a debugger before, which is how I found where the issue was happening in the first place, but I'll try again and keep an eye out for the quadrature data array.
I am using the
Not exactly: I set the number of serial refinements to -1 (i.e. no serial refinement at all), so all of the refinement is parallel. I don't think it makes any difference whether the refinement is serial or parallel in the case with 1 MPI rank anyway. The important part is that, starting from the original mesh, there were 7 refinements in total.
Of course! I'm using the latest version of MFEM (as of this week), so commit
With this, I ran the example with the command
Hopefully this helps you reproduce the error! Thank you everyone for your support and have a great weekend!
Just as an update regarding this, I've looked into it with a debugger again and retrieved some numbers:
Given this is almost twice as much as

But of course, I'm just speculating here, in case this is indeed the issue. It's quite possible I've missed something. Let me know what you find!
Hi @Heinrich-BR,

Sorry it took so long to get back to you. It is important to note that you are using the

Because of the way MFEM's full and partial assembly work, the geometric factors (including the

Note, however, that for higher-order problems, the ratio between the number of quadrature points and the number of degrees of freedom approaches 1. For
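To make the quadrature-points-to-dofs ratio concrete, here is a rough count for 2D tensor-product (quad) elements; the rule size of roughly p + 2 points per dimension and the asymptotic per-element dof count are assumptions for illustration, not MFEM's exact defaults:

```cpp
#include <cstdio>

// Rough ratio of stored quadrature points to unique degrees of freedom per
// element on a large 2D quad mesh. Asymptotically, each quad contributes
// about p*p unique dofs once shared vertices and edges are counted only once.
int main()
{
   for (int p = 1; p <= 8; p *= 2)
   {
      const double q        = p + 2;   // assumed 1D quadrature points
      const double quad_pts = q * q;   // quadrature points per element
      const double dofs     = p * p;   // unique dofs per element (asymptotic)
      std::printf("p = %d: quadrature points / dofs ~ %.2f\n", p, quad_pts / dofs);
   }
   return 0;
}
```

The ratio drops from about 9 at p = 1 toward 1 as the order grows, which is the effect described in the comment above.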
Hi @pazner, thank you for the response! Indeed it is the case that I'm using the

If I put together everything suggested here (and thank you everyone for the very helpful comments), by splitting the problem into many MPI tasks, increasing the order of the polynomials, and using partial assembly, I can run the problem on 4 GPUs with 80 GB of memory each, and get to about 57 million global unknowns on
Note that the default quadrature rule in example 1 uses

To change the default quadrature, you can replace the line

```cpp
a.AddDomainIntegrator(new DiffusionIntegrator(one));
```

with something like this:

```cpp
auto diff_integ = new DiffusionIntegrator(one);
const int integ_order = 2*fec->GetOrder()+1;
const Geometry::Type geom = pmesh.GetElementGeometry(0);
diff_integ->SetIntegrationRule(IntRules.Get(geom, integ_order));
a.AddDomainIntegrator(diff_integ);
```

There are ways to get around this limitation, e.g. by assembling the element matrices in batches; however, we have not had practical interest from users in running problems of such size per MPI rank. In many cases, the practical problem sizes per GPU are around a few million unknowns. Bigger problems are just solved on more MPI ranks, i.e. on more GPUs. If you are interested in pushing the limit, we can discuss ways to support bigger problems per GPU in MFEM.
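As a quick sanity check (not part of the original suggestion), one could also print the size of the rule selected above; this reuses the geom and integ_order variables from the snippet together with the standard IntegrationRule::GetNPoints() accessor:

```cpp
// Hypothetical follow-up: report how many quadrature points per element the
// chosen rule uses, e.g. to compare against the default rule's size.
const IntegrationRule &ir = IntRules.Get(geom, integ_order);
std::cout << "Quadrature points per element: " << ir.GetNPoints() << std::endl;
```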
By the way, computing in batches is something we want to incorporate in MFEM, since it will allow us to work with mixed meshes (meshes with different types of elements, e.g. triangles + quads) as well as FE spaces with variable polynomial orders.
Hello MFEM developers,
I'm testing MFEM's scaling on large clusters, and to that end I'm pushing some of the examples to see how big they can be made before they break, simply by parallel refining the mesh further and further. However, I'm noticing that, in general, they stop working about one or two orders of magnitude before I would expect memory issues or integer overflow to cause problems.
For instance, take the example ex1p. If you set the serial refinement level to -1 and the parallel refinement level to 7, the example ends up with about 17 million unknowns and segfaults in the mfem::internal::quadrature_interpolator::TensorDerivatives function. With a parallel refinement level of 6, the example runs normally to the end. I know that it is not a matter of running out of memory, since the cluster I am using has more than enough for problems much larger than this. I have tried building MFEM and its dependencies with 64-bit integers and with mixed integers, but nothing seems to allow me to go further than this. Splitting the problem into many MPI ranks does help me go one parallel refinement level further before breaking again, but it would require a very large number of ranks to reach the problem sizes I am interested in, and in principle it should be doable with only a few.

I would like to ask, then: is this a known limit for MFEM, or is there perhaps some build configuration that I'm missing?
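For readers trying to reproduce this: the refinement levels in ex1p are hard-coded in the source rather than exposed as command-line options, so the configuration above implies edits along these lines. This is a sketch assuming the stock refinement blocks in ex1p.cpp; the exact surrounding code may differ between MFEM versions.

```cpp
// Serial refinement block: setting the level to -1 means the loop never
// executes, so the initial mesh is partitioned without serial refinement.
{
   int ref_levels = -1;   // reporter's setting: no serial refinement
   for (int l = 0; l < ref_levels; l++)
   {
      mesh.UniformRefinement();
   }
}

// Parallel refinement block: 7 uniform refinements after partitioning, which
// multiplies the element count of a 2D quad mesh by 4^7 = 16384.
{
   int par_ref_levels = 7;   // reporter's setting
   for (int l = 0; l < par_ref_levels; l++)
   {
      pmesh.UniformRefinement();
   }
}
```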