Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SymmetricMatrixCoefficient::ProjectSymmetric bug #4271

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

j-signorelli
Copy link
Contributor

@j-signorelli j-signorelli commented Apr 26, 2024

The matrix defined for projection of SymmetricMatrixCoefficient onto a QuadratureFunction appears to be of incorrect size.

The current code creates a vdim x vdim matrix, where vdim is the dimension of the QF = height*(height+1)/2.

For wanting to output a symmetric 3x3 tensor (6 values), this then creates a 6x6 matrix instead of a 3x3 one. This PR fixes that.

PR Author Editor Reviewers Assignment Approval Merge
#4271 @j-signorelli @tzanio @najlkin + @paul-hilscher 5/11/24 5/16/24 ⌛due 6/1/24
PR Checklist
  • Code builds.
  • Code passes make style.
  • Update CHANGELOG:
    • Is this a new feature users need to be aware of? New or updated example or miniapp?
    • Does it make sense to create a new section in the CHANGELOG to group with other related features?
  • Update INSTALL:
    • Had a new optional library been added? If so, what range of versions of this library are required? (Make sure the external library is compatible with our BSD license, e.g. it is not licensed under GPL!)
    • Have the version ranges for any required or optional libraries changed?
    • Does make or cmake have a new target?
    • Did the requirements or the installation process change? (rare)
  • Update continuous integration server configurations if necessary (e.g. with new version requirements for each of MFEM's dependencies)
    • .github
    • .appveyor.yml
  • Update .gitignore:
    • Check if make distclean; git status shows any files that were generated from the source by the project (not an IDE) but we don't want to track in the repository.
    • Add new patterns (just for the new files above) and re-run the above test.
  • New examples:
    • All sample runs at the top of the example source file work.
    • Update examples/makefile:
      • Add the example code to the appropriate SEQ_EXAMPLES and PAR_EXAMPLES variables.
      • Add any files generated by it to the clean target.
      • Add the example binary and any files generated by it to the top-level .gitignore file.
    • Update examples/CMakeLists.txt:
      • Add the example code to the ALL_EXE_SRCS variable.
      • Make sure THIS_TEST_OPTIONS is set correctly for the new example.
    • List the new example in doc/CodeDocumentation.dox.
    • If new examples directory (e.g.examples/pumi), list it in doc/CodeDocumentation.conf.in
    • Companion pull request for documentation in mfem/web repo:
      • Update or add example-specific documentation, see e.g. the src/examples.md.
      • Add the description, labels and screenshots in src/examples.md and src/img.
      • In examples.md, list the example under the appropriate categories, add new categories if necessary.
      • Add a short description of the example in the "Extensive Examples" section of features.md.
  • New miniapps:
    • All sample runs at the top of the miniapp source file work.
    • Update top-level makefile and makefile in corresponding miniapp directory.
    • Add the miniapp binary and any files generated by it to the top-level .gitignore file.
    • Update CMake build system:
      • Update the CMakeLists.txt file in the miniapps directory, if the new miniapp is in a new directory.
      • Add/update the CMakeLists.txt file in the new miniapp directory.
      • Consider adding a new test for the new miniapp.
    • List the new miniapp in doc/CodeDocumentation.dox
    • If new miniapps directory (e.g.miniapps/nurbs), add it to MINIAPP_SUBDIRS in the makefile.
    • If new miniapps directory (e.g.miniapps/nurbs), list it in doc/CodeDocumentation.conf.in
    • Companion pull request for documentation in mfem/web repo:
      • Update or add miniapp-specific documentation, see e.g. the src/meshing.md and src/electromagnetics.md files.
      • Add the description, labels and screenshots in src/examples.md and src/img.
      • The miniapps go at the end of the page, and are usually listed only under a specific "Application (PDE)" category.
      • Add a short description of the miniapp in the "Extensive Examples" section of features.md.
  • New capability:
    • All new public, protected, and private classes, methods, data members, and functions have full Doxygen-style documentation in source comments. Documentation should include descriptions of member data, function arguments and return values, template parameters, and prerequisites for calling new functions.
    • Pointer arguments and return values must specify whether ownership is being transferred or lent with the call.
    • Any new functions should include descriptions of their intended use e.g. for internal use only, user-facing, etc., along with references to example code whenever possible/appropriate.
    • Consider adding new sample runs in existing examples to highlight the new capability.
    • Consider saving cool simulation pictures with the new capability in the Confluence gallery (LLNL only) or submitting them, via pull request, to the gallery section of the mfem/web repo.
    • If this is a major new feature, consider mentioning it in the short summary inside README (rare).
    • List major new classes in doc/CodeDocumentation.dox (rare).
  • Update this checklist, if the new pull request affects it.
  • Run make unittest to make sure all unit tests pass.
  • Run the tests in tests/scripts.
  • (LLNL only) After merging:
    • Update internal tests to include the new features.

@j-signorelli
Copy link
Contributor Author

I am not sure why the linker was not able to find symbols for arm64 for the checks -- if anyone has any suggestions, please let me know!

@sebastiangrimberg
Copy link
Contributor

sebastiangrimberg commented Apr 26, 2024

I am not sure why the linker was not able to find symbols for arm64 for the checks -- if anyone has any suggestions, please let me know!

GitHub actions recently started rolling out new runners for macOS jobs which use macOS 14 and Arm64 instances instead of x84_64 for macos-latest. Looks like this is what's causing the problem here -> The cached builds are being restored are from an older run on an x86_64 architecture, while the checks for this PR are running on arm64. I could be wrong, but think we need a PR to update the cache keys here which will force rebuilding of the Hypre and METIS dependencies in the cache for PR workflows.

@tzanio
Copy link
Member

tzanio commented May 11, 2024

This PR is now under review (see the table in the PR description). To help with the review process, please do not force push to the branch.

@tzanio tzanio added this to Review Now in Pull Requests via automation May 11, 2024
@tzanio tzanio added this to the mfem-4.8 milestone May 11, 2024
Copy link
Contributor

@najlkin najlkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a bug for sure, there seems to be confusion between plain Vector or Memory wrappers and DenseSymmetricMatrix, which takes the size of the matrix as an argument of UseExternalData and not size of the data.

@paul-hilscher
Copy link
Member

paul-hilscher commented May 13, 2024

@j-signorelli ,

would it make sense to add a small unit test ? test_coefficient.cpp does cover MatrixCoefficient but it seems we do not have tests for SymmetricMatrixCoefficient / SymmetricMatrixConstantCoefficient (ideally including a test for ProjectSymmetry).

@j-signorelli
Copy link
Contributor Author

@paul-hilscher:
Done, but it required additional changes + bug fixes that probably should be re-reviewed.

  • SymmetricMatrixConstantCoefficient was broken due to there not being any DenseSymmetricMatrix &DenseSymmetricMatrix::operator=(const DenseSymmetricMatrix &m) -- I defined this
  • mat was being used as an auxiliary variable in SymmetricMatrixCoefficient while derived classes SymmetricMatrixConstantCoefficient and SymmetricMatrixFunctionCoefficient each had their own mat variables. I renamed the auxiliary one mat_aux for clarification, and moved GetMatrix() to only be a member function of SymmetricMatrixConstantCoefficient, as it seems to have intended to be based on the documentation for it.
  • I removed the override of Print() for DenseSymmetricMatrix (did this as I was debugging, but it's more convenient to just use Matrix::Print() than not have anything implemented anyways)

@j-signorelli j-signorelli requested a review from najlkin May 14, 2024 19:57
fem/coefficient.hpp Show resolved Hide resolved
tests/unit/fem/test_coefficient.cpp Outdated Show resolved Hide resolved
Copy link
Member

@paul-hilscher paul-hilscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and the tests! Great stuff.

@j-signorelli
Copy link
Contributor Author

@paul-hilscher @najlkin Thank you for your feedback -- if there is anything else to adjust please let me know, otherwise this should be ready to merge into next @tzanio

Copy link
Member

@tzanio tzanio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @j-signorelli !

@tzanio
Copy link
Member

tzanio commented May 17, 2024

Merged in next for testing...

@tzanio
Copy link
Member

tzanio commented May 17, 2024

There is a failure with this PR in our regression testing on Lassen (V100 machine) when building optimized + CUDA. The error is below, can you double-check the code?

INFO: Test filter: ~[Parallel] 
Filters: ~[Parallel]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unit_tests is a Catch v2.13.9 host application.
Run with -? for options
-------------------------------------------------------------------------------
Symmetric Matrix Coefficient
-------------------------------------------------------------------------------
/usr/WS2/mfem/gitlab-runner/kolev/builds/zkQzXxmM/001/gitlab/mfem/mfem-autotest/tests/unit/fem/test_coefficient.cpp:307
...............................................................................
/usr/WS2/mfem/gitlab-runner/kolev/builds/zkQzXxmM/001/gitlab/mfem/mfem-autotest/tests/unit/fem/test_coefficient.cpp:332: FAILED:
  REQUIRE( qf.DistanceTo(values) == MFEM_Approx(0.0) )
with expansion:
  0.5017307833147915 == Approx( 0 )
===============================================================================
test cases:     297 |     296 passed | 1 failed
assertions: 2828545 | 2828544 passed | 1 failed
make[1]: *** [makefile:240: unit_tests-test-seq] Error 1

@najlkin
Copy link
Contributor

najlkin commented May 17, 2024

I think the problem might be that DistanceTo sums totally different numbers, it would be better to sum the relative errors, i.e., instead of (x_i-y_i)^2 sum (x_i-y_i)^2/y_i^2 😉

@sebastiangrimberg
Copy link
Contributor

sebastiangrimberg commented May 17, 2024

There is a failure with this PR in our regression testing on Lassen (V100 machine) when building optimized + CUDA. The error is below, can you double-check the code?

I'm pretty sure this is because Vector::DistanceTo is assumes the vector is not using the device while the added test uses a QuadratureFunction which does use the device by default. The following patch to linalg/vector.hpp would hopefully fix this:

diff --git a/linalg/vector.hpp b/linalg/vector.hpp
index 00bb625f9..292ac6fec 100644
--- a/linalg/vector.hpp
+++ b/linalg/vector.hpp
@@ -678,24 +678,24 @@ inline real_t Distance(const Vector &x, const Vector &y)

 inline real_t Vector::DistanceSquaredTo(const real_t *p) const
 {
-   return DistanceSquared(data, p, size);
+   return DistanceSquared(HostRead(), p, size);
 }

 inline real_t Vector::DistanceSquaredTo(const Vector &p) const
 {
    MFEM_ASSERT(p.Size() == Size(), "Incompatible vector sizes.");
-   return DistanceSquared(data, p.data, size);
+   return DistanceSquared(HostRead(), p.HostRead(), size);
 }

 inline real_t Vector::DistanceTo(const real_t *p) const
 {
-   return Distance(data, p, size);
+   return Distance(HostRead(), p, size);
 }

 inline real_t Vector::DistanceTo(const Vector &p) const
 {
    MFEM_ASSERT(p.Size() == Size(), "Incompatible vector sizes.");
-   return Distance(data, p.data, size);
+   return Distance(HostRead(), p.HostRead(), size);
 }

 /// Returns the inner product of x and y

Probably for a debug GPU build one of the asserts in the type-casts in mem_manager.hpp would have failed to let us know of this.

@najlkin
Copy link
Contributor

najlkin commented May 17, 2024

Yes, that would explain why it happens only in this context. I presumed these basic methods are safe, but MFEM can always surprise 😄 . It goes quite far beyond the scope of this PR, so it would be perhaps cleaner to just turn off the device processing and open a separate PR for this.
Besides that, I would still recommend the relative errors, but it is not critical and only makes the test slightly insensitive.

@sebastiangrimberg
Copy link
Contributor

It's a (minor) bug fix so I would encourage adding the patch as part of this PR. But I will let the editor @tzanio decide.

@v-dobrev
Copy link
Member

v-dobrev commented May 17, 2024

These "distance" methods were meant for small vectors on CPU where the overhead of HostRead (in terms of code and execution time) is not desirable. We can add an MFEM_ASSERT to check they are not used on device. For big vectors that can be on device, one can use subtract and Normlinf() or if one really wants $\ell_2$ (i.e. Eucledian) norm, Norml2().

@v-dobrev
Copy link
Member

Hmm, I'm still not sure how but in my tests with a CUDA build, the unit tests pass. I'm using a debug build but that should not change the result unless it produces an error from a failed check which it does not.

@v-dobrev
Copy link
Member

It turns out the unit test failure is due to a compiler issue: when using -O3, IBM XL C++ enables optimizations that can cause issues in some cases and it looks like this is one of those cases. Switching to -O2 optimization with the same compiler fixes the problem.

@v-dobrev
Copy link
Member

I noticed a potential issue when SymmetricMatrixCoefficient::ProjectSymmetric is called with QuadratureFunction &qf that is valid on device. Since we are accessing its entries using aliases, and all of that is done on host, it is best to call

   qf.HostWrite();

sometime before the start of the loop over elements, here:

for (int iel = 0; iel < ne; ++iel)

@tzanio
Copy link
Member

tzanio commented May 18, 2024

@j-signorelli what do you think about the discussion and suggestions above ☝️ ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Pull Requests
  
Review Now
Development

Successfully merging this pull request may close these issues.

None yet

6 participants