
Matrix-free slower than classical matrix-assembled method #4212

Open · Wayne901 opened this issue Mar 26, 2024 · 7 comments

Wayne901 commented Mar 26, 2024

Hi guys,
I'm interested in the matrix-free (partial assembly) feature of MFEM. With partial assembly enabled, I tried a Jacobi smoother preconditioner, an LOR preconditioner, and preconditioning with the assembled high-order matrix.
However, each CG iteration costs more time than in the classical method with a 2nd-order basis. I understand it's normal for a matrix-free solver to require more iterations, but each matrix-vector multiplication is supposed to be faster.
For example, I tested Poisson's equation on a 3D Fichera mesh with 4th-order basis functions. The classical FEM solver with GSSmoother needs 80 iterations and 36 seconds. The matrix-free CG solver needs 80 iterations and 47 seconds preconditioned by the assembled matrix, 138 iterations and 45 seconds preconditioned by LOR, and 195 iterations and 59 seconds preconditioned by OperatorJacobiSmoother.
It seems the operator evaluation of the matrix-free method is not (much) faster than sparse matrix-vector multiplication.

Is the basis order too low to observe the improvement from the matrix-free method? Or is it hard to observe without AVX vectorization?
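
For reference, the matrix-free path I'm describing looks roughly like this (a sketch based on MFEM's ex1, shown for the OperatorJacobiSmoother case, using ex1's fespace, one, x, b, and ess_tdof_list; the exact options in my runs may differ):

       BilinearForm a(&fespace);
       a.SetAssemblyLevel(AssemblyLevel::PARTIAL); // matrix-free operator evaluation
       a.AddDomainIntegrator(new DiffusionIntegrator(one));
       a.Assemble();

       OperatorPtr A;
       Vector B, X;
       a.FormLinearSystem(ess_tdof_list, x, b, A, X, B);

       // Diagonal (Jacobi) preconditioner that only needs the operator diagonal
       OperatorJacobiSmoother M(a, ess_tdof_list);
       PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);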

v-dobrev (Member) commented

Was this on one of MFEM's examples? How big was the problem you tested in terms of DOFs? I assume you are running on CPU?

Wayne901 (Author) commented Mar 27, 2024

I was running ex1 and added different preconditioners to it. The DoFs are about 24 million. I'm running on CPU inside a VMware virtual machine, but I hope to eventually run it on our server with AVX-512 architecture.
I noticed that in ex26 the multigrid preconditioner needs far less runtime than the Jacobi smoother and the LOR preconditioner. Do Nedelec and RT elements support multigrid methods in MFEM?

v-dobrev (Member) commented

It looks like you are not using hypre preconditioners. For good performance you definitely want to use HYPRE's BoomerAMG (or AMS for H(curl) or ADS for H(div)) even if you run just on 1 processor.

For partial assembly (AssemblyLevel::PARTIAL) the best preconditioners should be (1) p-multigrid (illustrated in ex26p) using BoomerAMG as the coarse solver and (2) LOR + BoomerAMG solver (illustrated in miniapps/solvers/plor-solvers.cpp)

For H(div) and H(curl) problems, you can use LOR + ADS and LOR + AMS, respectively (illustrated in miniapps/solvers/plor-solvers.cpp). If the degree is not very high, for H(div) you can also try the hybridization solver (using BoomerAMG as the solver for the hybridized system) illustrated in ex4p.
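
Roughly, option (2) with a partially assembled ParBilinearForm `a` looks like the sketch below (following plor-solvers.cpp; here `A`, `B`, `X`, and `ess_tdof_list` are assumed to be the usual outputs of FormLinearSystem):

       // LOR + BoomerAMG preconditioner for an H1 problem; for H(curl)/H(div)
       // use LORSolver<HypreAMS> / LORSolver<HypreADS> instead.
       LORSolver<HypreBoomerAMG> lor_amg(a, ess_tdof_list);

       CGSolver cg(MPI_COMM_WORLD);
       cg.SetRelTol(1e-12);
       cg.SetMaxIter(400);
       cg.SetPrintLevel(1);
       cg.SetOperator(*A);
       cg.SetPreconditioner(lor_amg);
       cg.Mult(B, X);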

v-dobrev (Member) commented

If you want the best performance with partial assembly on CPU and you are solving H1 diffusion, you can try the template code, illustrated in miniapps/performance/ex1p.cpp -- you will need to instantiate the specific order you want in the source file -- it is a template parameter. This code is vectorized too, as long as you enable a compiler option like -march=native and use the config option MFEM_USE_SIMD=YES.

Wayne901 (Author) commented

Thank you, Dobrev. Eventually I hope to solve problems in H(curl) space with partial assembly methods. I'll look at miniapps/solvers/plor-solvers.cpp first.

Wayne901 (Author) commented Apr 2, 2024

I compared the serial LOR preconditioner from performance/ex1.cpp and miniapps/solvers/lor_solvers.cpp for the H1 space. With the same refinement levels and basis function order, the LORSolver<GSSmoother> in lor_solvers.cpp is significantly faster than a GSSmoother preconditioner built on the LOR BilinearForm in performance/ex1.cpp.
option 1:

       // Build the low-order-refined (LOR) mesh and space: split each element
       // `order` times per direction and use order-1 H1 elements on the result.
       BilinearForm *a_pc = NULL;
       SparseMatrix A_pc;

       Mesh mesh_lor;
       FiniteElementCollection *fec_lor = NULL;
       FiniteElementSpace *fespace_lor = NULL;
       int basis_lor = BasisType::GetType('G'); // Gauss-Lobatto refinement points
       mesh_lor = Mesh::MakeRefined(mesh, order, basis_lor);
       fec_lor = new H1_FECollection(1, dim);
       fespace_lor = new FiniteElementSpace(&mesh_lor, fec_lor);

       // Assemble the LOR diffusion matrix (with its default quadrature rule)
       // and use Gauss-Seidel on it to precondition the high-order operator A.
       a_pc = new BilinearForm(fespace_lor);
       a_pc->AddDomainIntegrator(new DiffusionIntegrator(one));
       a_pc->UsePrecomputedSparsity();
       a_pc->Assemble();
       a_pc->FormSystemMatrix(ess_tdof_list, A_pc);

       GSSmoother M(A_pc);
       PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);

and option 2:

       LORSolver<GSSmoother> solv_lor(a, ess_tdof_list);
       PCG(*A, solv_lor, B, X, 1, 400, 1e-12, 0.0);

The latter needs 46 iterations, while the former needs 69. What's special about LORSolver that makes it more efficient than a GSSmoother with the LOR BilinearForm as the operator?

pazner (Member) commented Apr 10, 2024

The latter needs 46 iterations, while the former needs 69. What's special about LORSolver that makes it more efficient than a GSSmoother with the LOR BilinearForm as the operator?

I haven't looked closely at the code that you posted, but my guess would be that this is because LORSolver uses collocated quadrature to assemble the low-order-refined system. It is counterintuitive (but true) that using a reduced quadrature for the low-order system gives you a better preconditioner for the high-order system with full quadrature (see e.g. this paper).
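
To make that concrete: "collocated quadrature" here means assembling the LOR operator with a Gauss-Lobatto rule whose points coincide with the nodes of the order-1 LOR elements, rather than the default (full) Gauss-Legendre rule. A rough way to mimic this in your option 1 snippet (reusing its fespace_lor and one; LORSolver itself uses dedicated assembly kernels, so this is only an approximation of its behavior) would be:

       // Collocated (reduced) Gauss-Lobatto quadrature: 2 points per direction
       // on the order-1 LOR elements, located at the element vertices.
       IntegrationRules gll_rules(0, Quadrature1D::GaussLobatto);
       const FiniteElement &fe = *fespace_lor->GetFE(0);
       const IntegrationRule &ir = gll_rules.Get(fe.GetGeomType(), 2*fe.GetOrder() - 1);

       DiffusionIntegrator *integ = new DiffusionIntegrator(one);
       integ->SetIntRule(&ir); // use the collocated rule instead of the default

       BilinearForm a_lor(fespace_lor);
       a_lor.AddDomainIntegrator(integ);
       a_lor.Assemble();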

pazner self-assigned this Apr 10, 2024