
Matrix-free slower than classical matrix-assembled method #4212

Open · Wayne901 opened this issue Mar 26, 2024 · 7 comments

Wayne901 commented Mar 26, 2024

Hi guys,
I'm interested in the matrix-free (partial assembly) feature of MFEM. With partial assembly enabled, I tried a Jacobi smoother preconditioner, an LOR preconditioner, and preconditioning with the assembled high-order matrix.
However, each CG iteration costs more time than in the classical method with a 2nd-order basis. I understand it's normal for a matrix-free solver to require more iterations, but each matrix-vector multiplication is supposed to be faster.
For example, I tested Poisson's equation on a 3D Fichera mesh with 4th-order basis functions. The classical FEM solver with GSSmoother needs 80 iterations and 36 seconds. The matrix-free CG solver needs 80 iterations and 47 seconds preconditioned by the assembled matrix, 138 iterations and 45 seconds preconditioned by LOR, and 195 iterations and 59 seconds preconditioned by OperatorJacobiSmoother.
It seems the operator evaluation of the matrix-free method is not (much) faster than sparse matrix-vector multiplication.

Is the basis order too low to observe the improvement from the matrix-free method? Or is it hard to observe without AVX vectorization?
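
For reference, the matrix-free path I'm describing looks roughly like this (a sketch based on MFEM's ex1, shown for the OperatorJacobiSmoother case, using ex1's fespace, one, x, b, and ess_tdof_list; the exact options in my runs may differ):

       BilinearForm a(&fespace);
       a.SetAssemblyLevel(AssemblyLevel::PARTIAL); // matrix-free operator evaluation
       a.AddDomainIntegrator(new DiffusionIntegrator(one));
       a.Assemble();

       OperatorPtr A;
       Vector B, X;
       a.FormLinearSystem(ess_tdof_list, x, b, A, X, B);

       // Diagonal (Jacobi) preconditioner that only needs the operator diagonal
       OperatorJacobiSmoother M(a, ess_tdof_list);
       PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);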

v-dobrev (Member) commented

Was this on one of MFEM's examples? How big was the problem you tested in terms of DOFs? I assume you are running on CPU?

Wayne901 (Author) commented Mar 27, 2024

I was running ex1 and added different preconditioners to it. The DoFs are about 24 million. I'm running on CPU inside a VMware virtual machine, but I hope to eventually run it on our server with AVX-512 architecture.
I noticed that in ex26 the multigrid preconditioner needs far less runtime than the Jacobi smoother and the LOR preconditioner. Do Nedelec and RT elements support multigrid methods in MFEM?

v-dobrev (Member) commented

It looks like you are not using hypre preconditioners. For good performance you definitely want to use HYPRE's BoomerAMG (or AMS for H(curl) or ADS for H(div)) even if you run just on 1 processor.

For partial assembly (AssemblyLevel::PARTIAL) the best preconditioners should be (1) p-multigrid (illustrated in ex26p) using BoomerAMG as the coarse solver and (2) LOR + BoomerAMG solver (illustrated in miniapps/solvers/plor-solvers.cpp)

For H(div) and H(curl) problems, you can use LOR + ADS and LOR + AMS, respectively (illustrated in miniapps/solvers/plor-solvers.cpp). If the degree is not very high, for H(div) you can also try the hybridization solver (using BoomerAMG as the solver for the hybridized system) illustrated in ex4p.
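
Roughly, option (2) with a partially assembled ParBilinearForm `a` looks like the sketch below (following plor-solvers.cpp; here `A`, `B`, `X`, and `ess_tdof_list` are assumed to be the usual outputs of FormLinearSystem):

       // LOR + BoomerAMG preconditioner for an H1 problem; for H(curl)/H(div)
       // use LORSolver<HypreAMS> / LORSolver<HypreADS> instead.
       LORSolver<HypreBoomerAMG> lor_amg(a, ess_tdof_list);

       CGSolver cg(MPI_COMM_WORLD);
       cg.SetRelTol(1e-12);
       cg.SetMaxIter(400);
       cg.SetPrintLevel(1);
       cg.SetOperator(*A);
       cg.SetPreconditioner(lor_amg);
       cg.Mult(B, X);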

v-dobrev (Member) commented

If you want the best performance with partial assembly on CPU and you are solving H1 diffusion, you can try the template code, illustrated in miniapps/performance/ex1p.cpp -- you will need to instantiate the specific order you want in the source file -- it is a template parameter. This code is vectorized too, as long as you enable a compiler option like -march=native and use the config option MFEM_USE_SIMD=YES.

Wayne901 (Author) commented

Thank you, Dobrev. Eventually I hope to solve problems in H(curl) space with partial assembly methods. I'll look at miniapps/solvers/plor-solvers.cpp first.

Wayne901 (Author) commented Apr 2, 2024

I compared the serial LOR preconditioner from performance/ex1.cpp and miniapps/solvers/lor_solvers.cpp for the H1 space. With the same refinement levels and basis function order, the LORSolver<GSSmoother> in lor_solvers.cpp is significantly faster than a GSSmoother preconditioner built on the LOR BilinearForm in performance/ex1.cpp.
option 1:

       // Build the low-order-refined (LOR) mesh and space: split each element
       // `order` times per direction and use order-1 H1 elements on the result.
       BilinearForm *a_pc = NULL;
       SparseMatrix A_pc;

       Mesh mesh_lor;
       FiniteElementCollection *fec_lor = NULL;
       FiniteElementSpace *fespace_lor = NULL;
       int basis_lor = BasisType::GetType('G'); // Gauss-Lobatto refinement points
       mesh_lor = Mesh::MakeRefined(mesh, order, basis_lor);
       fec_lor = new H1_FECollection(1, dim);
       fespace_lor = new FiniteElementSpace(&mesh_lor, fec_lor);

       // Assemble the LOR diffusion matrix (with its default quadrature rule)
       // and use Gauss-Seidel on it to precondition the high-order operator A.
       a_pc = new BilinearForm(fespace_lor);
       a_pc->AddDomainIntegrator(new DiffusionIntegrator(one));
       a_pc->UsePrecomputedSparsity();
       a_pc->Assemble();
       a_pc->FormSystemMatrix(ess_tdof_list, A_pc);

       GSSmoother M(A_pc);
       PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);

and option 2:

       LORSolver<GSSmoother> solv_lor(a, ess_tdof_list);
       PCG(*A, solv_lor, B, X, 1, 400, 1e-12, 0.0);

The latter needs 46 iterations, while the former needs 69. What's special about LORSolver that makes it more efficient than a GSSmoother with the LOR BilinearForm as the operator?

pazner (Member) commented Apr 10, 2024

The latter needs 46 iterations, while the former needs 69. What's special about LORSolver that makes it more efficient than a GSSmoother with the LOR BilinearForm as the operator?

I haven't looked closely at the code that you posted, but my guess would be that this is because LORSolver uses collocated quadrature to assemble the low-order-refined system. It is counterintuitive (but true) that using a reduced quadrature for the low-order system gives you a better preconditioner for the high-order system with full quadrature (see e.g. this paper).
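
To make that concrete: "collocated quadrature" here means assembling the LOR operator with a Gauss-Lobatto rule whose points coincide with the nodes of the order-1 LOR elements, rather than the default (full) Gauss-Legendre rule. A rough way to mimic this in your option 1 snippet (reusing its fespace_lor and one; LORSolver itself uses dedicated assembly kernels, so this is only an approximation of its behavior) would be:

       // Collocated (reduced) Gauss-Lobatto quadrature: 2 points per direction
       // on the order-1 LOR elements, located at the element vertices.
       IntegrationRules gll_rules(0, Quadrature1D::GaussLobatto);
       const FiniteElement &fe = *fespace_lor->GetFE(0);
       const IntegrationRule &ir = gll_rules.Get(fe.GetGeomType(), 2*fe.GetOrder() - 1);

       DiffusionIntegrator *integ = new DiffusionIntegrator(one);
       integ->SetIntRule(&ir); // use the collocated rule instead of the default

       BilinearForm a_lor(fespace_lor);
       a_lor.AddDomainIntegrator(integ);
       a_lor.Assemble();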

pazner self-assigned this Apr 10, 2024