[Discussion] Pinned vs loose requirements, application vs library use cases #204

Open
ashleysommer opened this issue Sep 5, 2023 · 5 comments

@ashleysommer
Collaborator

ashleysommer commented Sep 5, 2023

In the Python world, packages (installable Python modules) usually take the form of either Applications or Libraries.

Libraries are building-block packages that developers use to build their own applications. Applications are Python programs and utilities installed by end users.

Mechanically there is not much difference between a Python library and a Python application, and some packages have aspects of both. PySHACL was originally designed and developed as a Python Application: it has a command-line interface to execute the validation functionality. But PySHACL also has aspects of a Python library: its modules can be imported into other Python code and called programmatically. The README file includes examples of both use cases. (Note: you can also call the module from the command line. That is a third mechanism, but it is equal in functionality to the Application mode.)

From my experience interacting with PySHACL users on GitHub threads, on the SHACL Discord server, and in person, I estimate around 50% of users use PySHACL as an Application, installing it only to execute it from the command-line interface. Around 40% of users treat it as a library, incorporating PySHACL into their broader codebase. The remaining 10% have a hybrid use case, utilising both the command-line interface and the module imports.

The tricky part comes when defining the runtime requirements of the package. Python Applications usually maintain a tightly controlled (or Pinned) list of required library versions to ensure best operation, and they ship with a package lockfile that tells the package manager which versions of the requirements to install. Libraries, on the other hand, tend to keep their requirements as loose as possible to ensure maximum compatibility with other codebases. They often do not ship a lockfile in the package, leaving it up to the developer to choose exactly which library versions to use in their application.
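
As a rough illustration of the difference between a pinned and a loose requirement, here is a simplified sketch. It assumes plain dotted version numbers and supports only the `==`, `>=` and `<` operators; real tools implement the full PEP 440 rules. The requirement strings are hypothetical and not taken from any real package.

```python
# Simplified illustration only: real installers implement the full PEP 440 rules.

def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release such as '6.3.2' into (6, 3, 2)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against comma-separated clauses like '>=6.1.0,<8'."""
    operators = {
        "==": lambda a, b: a == b,
        ">=": lambda a, b: a >= b,
        "<": lambda a, b: a < b,
    }
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in operators.items():
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(candidate, bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Hypothetical requirement strings, for illustration only.
PINNED = "==7.0.0"     # application style: exactly one acceptable version
LOOSE = ">=6.1.0,<8"   # library style: a floor and a ceiling

installed = ["6.1.0", "6.3.2", "7.0.0"]
pinned_ok = [v for v in installed if satisfies(v, PINNED)]
loose_ok = [v for v in installed if satisfies(v, LOOSE)]
print("pinned accepts:", pinned_ok)
print("loose accepts:", loose_ok)
```

The pinned form accepts a single version, so any other codebase that needs a different version of the same dependency conflicts with it. The loose form accepts every version inside the range.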

PySHACL has always needed to walk this line: keeping requirements loose for the library use case, while shipping a lockfile so that the application use case works out of the box. For example, we have always tried to maintain backward compatibility with the last three RDFLib releases, because it is surprisingly common for developers to need to use PySHACL in a codebase with an older version of RDFLib, so we leave the RDFLib requirement loose.

This thread is a discussion about the direction PySHACL should take here. Would it be a good idea to split the PySHACL codebase into two packages? One is the PySHACL library, and the other is the PySHACL CLI Tool?

Related: #203, #197

@ajnelson-nist
Contributor

My own opinion, not on behalf of anyone else:

One of the projects I work with is in that 10% of using pySHACL like a library and application. That community has some applications that benefit from accessing pyshacl.validate.validate() as a library call. They also provide a "wrapped" version of the pyshacl command that includes functionality specific to their ontology + shape needs, and my current understanding is the extra functionality is just not compatible with re-exporting upstream to pySHACL.
(This includes shipping pre-built ontology files and performing concept "typo checks" in a way that does not require the end user to be developing a Python application, so RDFLib's DefinedNamespace mechanism isn't exactly an option. It is currently convenient for this community to tailor this logic in a wrapper, but I'm not sure there is a design that lets it cleanly go upstream and become generic.)

I think it is to that community's benefit that pySHACL favor looser requirements, for both the "Library" mode and the "Application" mode you're suggesting. If requirements are too tightly specified, a worst-but-not-unworkable case can occur where pySHACL's CLI becomes incompatible with another application an end user in this "10%" group needs. The workaround of making a Python virtual environment per application is (probably) always technically available, but it would be a step of documentation I would not enjoy drafting. I prefer nudging a dependency's floor, or placing a (hopefully temporary) ceiling, in the downstream adopter.

Another argument for the looser requirement specification is that CI running on a schedule can catch when a dependency introduces an incompatibility. I don't pin dependencies often enough to have an opinion on whether that style of issue detection is also practical to do with pinning.

@aucampia
Member

aucampia commented Sep 6, 2023

> Usually python Applications maintain a tightly controlled (or Pinned) list of required library versions to ensure best operation, and they ship with a package lockfile to inform the package manager what versions of requirements to install.

This is not exactly the case, at least not for most Python applications I use. Generally, Python applications are installed from wheels, and in most cases locked versions don't affect wheels. I have extracted the dependency specs from the following wheels: black, codespell, cookiecutter, copier, cruft, csvkit, databricks-cli, doit, flake8, httpie, ipython, jc, jello, jrnl, mypy, pgcli, pip, pip-tools, pipenv, pipx, poetry, pycln, pylint, pyshacl, pytest, pyupgrade, rdflib, sherlock, tox, virtualenv, xonsh, youtube-dl, yq. The results can be seen here: https://github.com/aucampia/examples/blob/4da918e062d5591fd59ec25a408df7eff9e3772f/202309-python_apps/output/combined.csv

Data was extracted with this script.

Most version restrictions from wheels are not that strict; they are rather broad. The strictest ones are filtered here: https://github.com/aucampia/examples/blob/4da918e062d5591fd59ec25a408df7eff9e3772f/202309-python_apps/output/restricted.csv

In there, the exact version is selected in only 4 cases. Other cases follow semantic versioning restrictions for the most part, and in some cases this even spans multiple major versions, like this.
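
The linked script is not reproduced here. As a rough sketch of the general idea only: a wheel is a zip archive, and its dependency specs are the `Requires-Dist` headers in the `.dist-info/METADATA` file. The example below uses only the standard library and builds a tiny in-memory stand-in for a wheel; the package name `demo-app` and its requirements are hypothetical.

```python
import io
import zipfile
from email.parser import Parser


def requires_dist(wheel_file) -> list[str]:
    """Return the Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_file) as archive:
        metadata_name = next(
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = archive.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header style format, so the email parser can read it.
    return Parser().parsestr(text).get_all("Requires-Dist") or []


def build_demo_wheel() -> io.BytesIO:
    """Build a minimal in-memory stand-in for a wheel (hypothetical package)."""
    metadata = (
        "Metadata-Version: 2.1\n"
        "Name: demo-app\n"
        "Version: 1.0.0\n"
        "Requires-Dist: rdflib (<8.0,>=6.3.2)\n"
        "Requires-Dist: packaging (>=21.3)\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_app-1.0.0.dist-info/METADATA", metadata)
    buffer.seek(0)
    return buffer


specs = requires_dist(build_demo_wheel())
for spec in specs:
    print(spec)
```

The same function should work on the path of a real downloaded `.whl` file. For a package that is already installed, `importlib.metadata.requires("<name>")` returns the same kind of entries. Nothing from a lockfile appears in this metadata unless the build tool writes it there.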

> PySHACL has always needed to balance on this line of loose requirements for library use cases, but shipping a lockfile to ensure the application use case works out of the box.

While the pySHACL project has a lockfile (i.e. poetry.lock), this does not really affect users in most cases because it does not affect the wheel. It will really only be effective if users use pySHACL by cloning the repo and running poetry install, neither of which is part of the documented installation instructions [ref].

There is, of course, the problem of validating that the version ranges in the wheel you distribute are correct. For RDFLib I validate this by testing RDFLib with the minimum versions of the dependencies in addition to testing it with the latest versions of the dependencies: https://github.com/RDFLib/rdflib/blob/16047eb2f70d061dc7bee564a05e6ba880c7f0e2/devtools/constraints.min
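
To make the "test with minimum versions" idea concrete, here is a minimal sketch that derives pinned-to-the-floor constraints from declared version ranges. It is not RDFLib's actual tooling (the linked constraints.min file is maintained in that repository); the helper name `minimum_constraints` and the requirement strings are hypothetical, and the regular expression only handles simple `>=` lower bounds.

```python
import re

# Captures the package name and the first ">=" lower bound, if there is one.
FLOOR = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*.*?>=\s*([0-9][0-9A-Za-z.]*)"
)


def minimum_constraints(requirements: list[str]) -> list[str]:
    """Pin each requirement that declares a floor to exactly that floor."""
    pins = []
    for requirement in requirements:
        match = FLOOR.match(requirement)
        if match:
            name, floor = match.groups()
            pins.append(f"{name}=={floor}")
    return pins


# Hypothetical declared requirements, for illustration only.
declared = ["rdflib>=6.3.2,<8.0", "packaging<24,>=21.3", "prettytable"]
pins = minimum_constraints(declared)
print("\n".join(pins))
```

The resulting lines could be written to a constraints file and passed to `pip install -c <file>` in a separate CI job, so the test suite runs once against the floors and once against the latest releases. Requirements without a floor, such as `prettytable` here, are skipped.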

> This thread is a discussion about the direction PySHACL should take here. Would it be a good idea to split the PySHACL codebase into two packages? One is the PySHACL library, and the other is the PySHACL CLI Tool?

I don't think the situation is that unique. Most other tools I know of that are both a library and a CLI tool work fairly well by distributing one wheel with fairly broad version restrictions. I think the best option is to just validate your version ranges as I do for RDFLib.

There are some other things to note though:

  • Currently your dev tools (ruff, black, mypy) end up in your wheel. This is not the norm (you can check this again). It does happen, but I am not aware of any case where it is actually useful for people who use the wheel. Users of pySHACL should not be running mypy, black or ruff against pySHACL, and contributors should use the versions locked in poetry.lock, which should be the same versions that are used in your CI.
  • Currently you use mypy in CI without locking the version
    poetry run pip3 install "mypy>=0.800" "types-setuptools"

    This is likely to keep breaking. A better option would be to use the locked version from poetry.lock and have Dependabot update it. That way, if a new version is broken, it won't break your CI, and people can still contribute without first having to fix the CI. You can also clearly see which updates are breaking, because the Dependabot PR that makes the breaking update will fail CI, like here: build(deps): bump networkx from 2.6.3 to 3.1 rdflib#2458
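
One way to check the first point is to scan a wheel's `Requires-Dist` entries for the names of dev tools. The sketch below is an assumption-laden illustration: the entries are hypothetical and are not pySHACL's real metadata, and the helper name `dev_tools_in` is made up for this example.

```python
import re

# The tools named in the list above.
DEV_TOOLS = {"ruff", "black", "mypy"}


def dev_tools_in(requirements: list[str]) -> list[str]:
    """Return the dev tools that appear as requirement names."""
    found = []
    for requirement in requirements:
        # The requirement name is the leading run of name characters.
        match = re.match(r"\s*([A-Za-z0-9._-]+)", requirement)
        if match and match.group(1).lower() in DEV_TOOLS:
            found.append(match.group(1).lower())
    return found


# Hypothetical Requires-Dist entries, not pySHACL's real metadata.
declared = [
    "rdflib (>=6.3.2,<8.0)",
    'black (==23.3.0) ; extra == "dev-lint"',
    'mypy (>=0.812) ; extra == "dev-type-checking"',
]
leaked = dev_tools_in(declared)
print(leaked)
```

A check like this could run in CI against the built wheel and fail when the list is not empty.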

@aucampia
Member

aucampia commented Sep 6, 2023

> There is, of course, the problem of validating that the version ranges in wheel you distribute are correct. For RDFLib I validate this by testing RDFLib with the minimum versions of the dependencies in addition to testing it with the latest versions of the dependencies: https://github.com/RDFLib/rdflib/blob/16047eb2f70d061dc7bee564a05e6ba880c7f0e2/devtools/constraints.min

This stopped working after the move to poetry, but I will fix it tonight or tomorrow.

@aucampia
Member

aucampia commented Sep 6, 2023

> > There is, of course, the problem of validating that the version ranges in wheel you distribute are correct. For RDFLib I validate this by testing RDFLib with the minimum versions of the dependencies in addition to testing it with the latest versions of the dependencies: https://github.com/RDFLib/rdflib/blob/16047eb2f70d061dc7bee564a05e6ba880c7f0e2/devtools/constraints.min
>
> This stopped working after the move to poetry, but I will fix it tonight or tomorrow.

Fix here:

@ashleysommer
Collaborator Author

Thanks @aucampia for your comments.
You're right, I completely forgot about the published wheels, which of course do fix their own dependencies.


3 participants