
Difference between atomic self energies from regression and from literature #72

Open
wiederm opened this issue Feb 28, 2024 · 4 comments


@wiederm
Member

wiederm commented Feb 28, 2024

We need to double-check the source of the relatively significant difference in the atomic self-energies of the QM9 dataset.

@chrisiacovella
Member

chrisiacovella commented Feb 28, 2024

I'm (mostly) copying my comment from the PR:

I wouldn't say these are really all that far off. All but hydrogen have less than a 1% difference (note I'm regressing the entire QM9 dataset). Hydrogen actually has the smallest numerical difference (i.e., delta), but because its self-energy is nearly 2 orders of magnitude smaller than the other elements', the difference just looks more significant as a percentage.

All values in kJ/mol

element |  DFT              | regressed         | delta         | % diff
H       |   -1313.466862    |   -1585.320503    |  271.8536413  |  20.69741151
C       |  -99366.70746     |  -99963.18962     |  596.4821665  |   0.600283719
N       | -143309.938       | -143744.2422      |  434.3042769  |   0.303052449
O       | -197082.0672      | -197504.4753      |  422.4081034  |   0.21433107
F       | -261811.5456      | -262199.961       |  388.415491   |   0.148356899

But since the regressed self-energies are all consistently lower (more negative) than the DFT values, using them shifts the formation energies upward: we end up with positive values for small molecules and much smaller magnitudes for large molecules. That is, the differences are in fact large relative to the scale of the formation energy (which, for a random sampling of molecules from QM9, seems to be closer to the scale of the hydrogen self-energy). But that might just be as accurate as one can get, given that the input energies are on the order of 10^6 kJ/mol.

A few quick calculations of the formation energy to demonstrate differences.

Methane formation energy:
using DFT self-energies: -1656.8639 kJ/mol
using regressed self-energies: 27.0328 kJ/mol

Ammonia formation energy:
using DFT self-energies: -1158 kJ/mol
using regressed self-energies: 91 kJ/mol

Water formation energy:
using DFT self-energies: -891 kJ/mol
using regressed self-energies: 74.55 kJ/mol

Acetylene formation energy:
using DFT self-energies: -1612 kJ/mol
using regressed self-energies: 123 kJ/mol

CCC(=O)C=O formation energy:
using DFT self-energies: -4922 kJ/mol
using regressed self-energies: -60 kJ/mol
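
For clarity, this is roughly how those numbers are computed; a minimal sketch, not code from this repo. The self-energy dictionary just reuses the DFT column from the table above, and the total molecular energy would come from the dataset.

```python
# Minimal sketch (not code from this repo): the formation-like energy is the
# molecular total energy minus the sum of the per-atom self-energies.
# Self-energy values below reuse the DFT column from the table above (kJ/mol).

DFT_SELF_ENERGIES = {
    "H": -1313.466862,
    "C": -99366.70746,
    "N": -143309.938,
    "O": -197082.0672,
    "F": -261811.5456,
}


def formation_energy(total_energy, elements, self_energies):
    """Total molecular energy (kJ/mol) minus the summed atomic self-energies."""
    return total_energy - sum(self_energies[element] for element in elements)


# Hypothetical usage for methane; `total_energy` would be read from QM9.
# e_form = formation_energy(total_energy, ["C", "H", "H", "H", "H"], DFT_SELF_ENERGIES)
```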

Two questions:

  • Does this actually matter if we can simply add the same constant self-energy values back on later?
  • Would it be better to just use DFT values computed at the appropriate level of theory? These calculations would not be very difficult or expensive if the values were not provided by a dataset.

@chrisiacovella
Member

chrisiacovella commented Mar 1, 2024

Following up on our discussion today: we should test this!

Does it have any impact on the accuracy of the trained model? On the training time? On the stability?

We could also take the DFT values and make uniform perturbations: increase/decrease by 5%, 10%, 20%, etc. What if all these values were off by an order of magnitude (changing the order of the "computed" energy)? If model training is fast, this would be an easy set of tests to determine how sensitive training is to the values used and whether this is something to be very concerned about.
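
A rough sketch of how those perturbed self-energy sets could be generated (the dictionary and function names here are illustrative, not part of the codebase):

```python
# Sketch only: scale each DFT self-energy by a fixed fraction to build the
# perturbed sets (±5%, ±10%, ±20%) for the sensitivity test described above.
DFT_SELF_ENERGIES = {"H": -1313.47, "C": -99366.71, "N": -143309.94,
                     "O": -197082.07, "F": -261811.55}  # kJ/mol


def perturb(self_energies, fraction):
    """Return a copy of the self-energies scaled by (1 + fraction)."""
    return {element: value * (1.0 + fraction)
            for element, value in self_energies.items()}


perturbed_sets = {f"{fraction:+.0%}": perturb(DFT_SELF_ENERGIES, fraction)
                  for fraction in (-0.20, -0.10, -0.05, 0.05, 0.10, 0.20)}
```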

Ultimately, using DFT-calculated values is probably the most rigorous and straightforward option, and computing them should not in any way be expensive; but if these values are not available, that could be a substantial roadblock. I think if we can determine how much these values influence the results, we can make some very clear suggestions as to what to do (e.g., if they do not make much of a difference, one can likely use any reasonable literature values, even if calculated with a different engine or even a different level of theory).

@chrisiacovella
Member

Additional comment regarding the linear regression. @wiederm noted that for a larger dataset, like ANI2x, we will likely run into issues with being able to fully load it into memory, or with having too many datapoints to fit efficiently.

Based on the preliminary fitting with QM9 using 100 data points vs. the entire dataset, we likely will not need to use the entire dataset to get reasonable estimates, but at the same time, how we choose this subset may be very important.
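
For reference, the fit itself is just an ordinary least-squares problem over the per-molecule element counts; a minimal sketch (assumed array shapes, not the project's actual implementation):

```python
import numpy as np

# Sketch of the linear regression being discussed (not the actual implementation):
# solve E_total(molecule) ≈ sum_over_elements n_element * E_self(element)
# by ordinary least squares.


def regress_self_energies(element_counts, total_energies):
    """element_counts: (n_molecules, n_elements) array of per-molecule atom counts.
    total_energies: (n_molecules,) array of total energies in kJ/mol.
    Returns the least-squares self-energy estimate per element (kJ/mol)."""
    self_energies, residuals, rank, _ = np.linalg.lstsq(
        element_counts, total_energies, rcond=None)
    return self_energies


# Hypothetical usage, with columns ordered [H, C, N, O, F]:
# counts = np.array([[4, 1, 0, 0, 0], [3, 0, 1, 0, 0]])   # methane, ammonia
# energies = np.array([...])                               # matching totals from QM9
# e_self = regress_self_energies(counts, energies)
```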

I think there are two considerations when picking out a smaller subset to regress:

  • Make sure that we are including sufficient data for all atomic species
    • E.g., in the 100-data-point example, the first 100 molecules in the dataset were chosen; none of these contained F, so we were unable to get a self-energy value for that species.
  • Make sure that we are sufficiently sampling the entire distribution of molecule sizes.
    • Some datasets appear to place the smaller, simpler molecules at smaller indices and the larger, more complex ones at larger indices (probably due to the combinatorial process used to generate the SMILES), while others appear more randomly ordered. This could have a large impact on the fitting of species such as oxygen or nitrogen, which tend to appear at lower frequency within a molecule than hydrogen or carbon.

To satisfy these two requirements I think we can do the following (a code sketch follows at the end of this comment):

  • Read in a batch of datapoints.
  • Sort this batch by molecule size.
    • We'll want to create a separate sorted list/array/dictionary (whatever data structure we use) for each atomic species found in the dataset. We might also want to sort by the frequency of the given atomic species within the molecule.
  • Divide the list up into N chunks, randomly picking M molecules from each chunk.
    • We'd need to put in some logic for cases where a chunk contains fewer molecules than expected, but that should be easy (that logic could also be to simply keep appending until we have at least N*M points).

Repeat for each batch, generating a more manageable array for fitting that ensures we are reasonably sampling the dataset.

Rather than sorting a list, we could also generate a histogram and then do the sampling based upon stdevs from the mean, but that is probably not necessary.
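
Here is the promised sketch of the chunked subsampling described above; the record layout and names are assumptions for illustration, not the dataset's actual interface:

```python
import random

# Sketch of the subsampling scheme described above (record layout is assumed):
# sort a batch by molecule size, split it into n_chunks, and randomly draw
# m_per_chunk molecules from each chunk.


def subsample_batch(batch, n_chunks, m_per_chunk, seed=0):
    """batch: list of records, each assumed to have 'n_atoms' and 'elements' keys."""
    rng = random.Random(seed)
    ordered = sorted(batch, key=lambda record: record["n_atoms"])
    chunk_size = max(1, len(ordered) // n_chunks)
    chunks = [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
    selected = []
    for chunk in chunks:
        # Guard against chunks with fewer than m_per_chunk molecules.
        selected.extend(rng.sample(chunk, min(m_per_chunk, len(chunk))))
    return selected


def covered_species(selected):
    """Sanity check: which atomic species actually appear in the subsample."""
    return set().union(*(set(record["elements"]) for record in selected))
```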

@chrisiacovella
Member

Back to the point as to whether the values are "right", these are the ANI2x regressed values:

https://github.com/isayev/ASE_ANI/blob/master/ani_models/ani-2x_8x/sae_linfit.dat

To directly compare, I converted the values to kJ/mol and truncated to 2 decimal places, appending them to the table I shared above. ANI2x and QM9 use the same level of theory, so this comparison makes sense.

element |  DFT          | qm9_lin       | delta (DFT-qm9_lin) | % diff | ani2x_lin     | delta (DFT-ani2x_lin) | % diff | qm9_lin-ani2x_lin
H       |  -1313.47     |  -1585.32     |  271.85             |  20.70 |  -1569.68     |  256.21               |  19.51 |   -15.64
C       |  -99366.71    |  -99963.19    |  596.48             |   0.60 |  -100003.57   |  636.86               |   0.64 |    40.38
N       |  -143309.94   |  -143744.24   |  434.30             |   0.30 |  -143646.28   |  336.35               |   0.23 |   -97.96
O       |  -197082.07   |  -197504.48   |  422.41             |   0.21 |  -197414.16   |  332.10               |   0.17 |   -90.31
F       |  -261811.55   |  -262199.96   |  388.42             |   0.15 |  -262034.07   |  222.52               |   0.08 |  -165.89

Condensed to just the energies:

element |  DFT          | qm9 lin       | ani2x lin
H       |  -1313.47     |  -1585.32     |  -1569.68
C       |  -99366.71    |  -99963.19    |  -100003.57
N       |  -143309.94   |  -143744.24   |  -143646.28
O       |  -197082.07   |  -197504.48   |  -197414.16
F       |  -261811.55   |  -262199.96   |  -262034.07
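
For reference, the unit conversion used for that comparison is just a multiplication by the Hartree-to-kJ/mol factor; a small sketch, with the caveat that the sae file values are, as far as I can tell, in Hartree, and the hydrogen value below is approximate:

```python
# Sketch of the conversion used above: the ANI sae file stores self-energies in
# Hartree (atomic units), so converting to kJ/mol is a single multiplication.
HARTREE_TO_KJ_PER_MOL = 2625.499639


def hartree_to_kjmol(value_in_hartree):
    return value_in_hartree * HARTREE_TO_KJ_PER_MOL


# e.g. hartree_to_kjmol(-0.5978) ≈ -1569.5 kJ/mol, matching the H entry above.
```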

I'd say that overall the results are consistent between the regressions of the two datasets and, again, not too far off from the DFT values. This might just be the limit of accuracy of doing a regression like this.

It might be good to actually do this in batches, doing the fitting for each batch, and getting a mean value for the self energy (to be able to also assess variability in the fitting itself).
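
A quick sketch of that batch-wise idea (assumed array shapes, building on the least-squares sketch earlier in the thread):

```python
import numpy as np

# Sketch: fit the self-energies independently for each batch, then report the mean
# and standard deviation across batches as a measure of the variability of the fit.


def batched_self_energies(batches):
    """batches: iterable of (element_counts, total_energies) pairs, one per batch."""
    fits = []
    for counts, energies in batches:
        fit, *_ = np.linalg.lstsq(counts, energies, rcond=None)
        fits.append(fit)
    fits = np.stack(fits)              # shape: (n_batches, n_elements)
    return fits.mean(axis=0), fits.std(axis=0)
```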
