Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SQLAlchemy to work with the database #38

Open
adalke opened this issue Oct 13, 2021 · 1 comment
Open

Use SQLAlchemy to work with the database #38

adalke opened this issue Oct 13, 2021 · 1 comment

Comments

@adalke
Copy link
Contributor

adalke commented Oct 13, 2021

In 2019 kzfm gave an example of using SQLAlchemy to work with the database.

In my work to replace the JSON-Lines fragment format with a SQLite-based format I quickly discovered that I am hand-writing an ORM. Poorly.

I propose switching using SQLAlchemy in this updated version I am working on.

At this point I don't know how much work that requires. kzfm showed that defining the structure and querying the generated mmpdb database was simple.

@adalke
Copy link
Contributor Author

adalke commented Oct 14, 2021

I developed a prototype using SQLAlchemy to generate the new fragment database instead of using hand-built SQL queries

My test set was 100 SMILES. With low-level SQL queries and 4 processors it takes ~28.6 seconds. Going through SQLAlchemy's ORM took ~37.5 seconds -- 30% slower.

Profiling shows that most of the additional time is in session.commit(). However, even if I comment out the session.add() code (which tells SQLAlchemy to add the new objects to the database when appropriate), the overall time was still slower than the hand-written code.

Further profiling shows about 15% of the additional time was in _initialize_instance. This is part of SQLAlchemy's ORM. I replaced mmpdb's existing fragment types with slightly modified versions that inherit from registry.generate_base() base class. This has its own __init__, with a higher overhead than the basic Python-class-with-slot-definition I used.

There were some improvements to the API. I got to strip out some boilerplate code to return a FragmentRecord from the cache file. On the other hand, I had to lean how to "detach" objects to move them from one database to the other, and I had to learn the special query syntax.

When we get to things like merging multiple datasets into one, I think it would be easier to work directly to the SQLite connection object, attach databases, in the instance, and do cross-data INSERT with SELECT. I don't want to have to figure it out in SQLAlchemy.

Long-term, I think there's still a place for SQLAlchemy, as with kzfm's example.

The most likely is to define the database schema(s) in a better way than my shaky template system, and use it to manage schema creation, and to replace the vendored peewee code we use now, and use the more low-level SQLAlchemy calls to write the tables, rather than SQLAlchemy's ORM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant