Indeterministic output for blog pages related to tags #274

Open
kayhayen opened this issue Jan 28, 2024 · 6 comments

@kayhayen

Describe the bug

I am trying to return to an approach where I diff the output Sphinx generates, to catch changes I do not want. Unnoticed bad changes have already caused disasters a couple of times. I used to diff the output when I was using Nikola, but I gave up because of many issues.

With Sphinx, my current configuration builds stably, except for ablog, where I see a lot of changes related to tags. Below is a diff I get when I rebuild from scratch twice in a row; it never builds the same. I did not try disabling Python hash randomization. It might work as a band-aid, but it only goes so far.

diff -ru output.1/blog/2010.html output/blog/2010.html
--- output.1/blog/2010.html     2024-01-27 18:12:49.779910129 +0100
+++ output/blog/2010.html       2024-01-27 18:14:41.720750591 +0100
@@ -257,13 +257,13 @@
   
   
   
-  <a href="tag/python.html">Python</a>
+  <a href="tag/git.html">git</a>
   
   
   
   
   
-  <a href="tag/git.html">git</a>
+  <a href="tag/python.html">Python</a>

I get many of these changes. The blog pages and their RSS feed are of course very important to my site.

I am willing to hunt this down on my own. I am using sphinx-build 7.2.6 and ablog==0.11.6, which I believe to be fairly recent.

There is a devcontainer that automatically builds on install for my web site: https://github.com/Nuitka/Nuitka-website

I suspect that a set object is being used. Since I am on Python 3.10 there, dictionaries preserve insertion order, but this could also be unordered use of a file system result; I couldn't tell yet.

I am using pipenv to install. I am sure I have seen it on 3.9 in my WSL too, during a migration from my old Debian WSL2 pipenv config with 3.9 to the new 3.10-based one for use in devcontainers.

I will look at your templates and at the data used to produce the archive (I think other ablog pages are affected too) to see how unsorted it is.
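A quick way to check the set suspicion, as a minimal sketch: the hypothetical helper `tag_order` below prints a small set of tag strings in a child interpreter with a fixed `PYTHONHASHSEED`. The tag names are only illustrative. With the same seed the order should repeat; with different seeds it may or may not change, which would match builds that differ from run to run.

```python
import os
import subprocess
import sys


def tag_order(seed):
    """Return the iteration order of a small string set under the given hash seed."""
    # Copy the current environment and pin the hash seed for the child process only.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(','.join({'Python', 'git', 'compiler'}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Different seeds can yield different orders; the same seed repeats its order.
    for seed in (0, 1, 2, 3):
        print(seed, tag_order(seed))
```

Pinning `PYTHONHASHSEED` for the whole build would hide the symptom in the same way, without removing the dependence on set ordering.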

To Reproduce

No response

Screenshots

No response

System Details

==============================
sunpy Installation Information
==============================

General
#######
OS: Ubuntu (22.04, Linux 5.15.90.1-microsoft-standard-WSL2)
Arch: 64bit, (x86_64)
sunpy: 5.1.1
Installation path: /home/vscode/.local/share/virtualenvs/Nuitka-website-rRTh7jbj/lib/python3.10/site-packages/sunpy-5.1.1.dist-info

Required Dependencies
#####################
astropy: 6.0.0
numpy: 1.26.3
packaging: 23.2
parfive: 2.0.2

Optional Dependencies
#####################
asdf: Missing asdf>=2.8.0; extra == "asdf" or "docs" or "tests"
asdf-astropy: Missing asdf-astropy>=0.1.1; extra == "asdf" or "docs" or "tests"
beautifulsoup4: Missing beautifulsoup4>=4.8.0; extra == "docs" or "net" or "tests"
cdflib: Missing cdflib!=0.4.0,!=1.0.0,>=0.3.20; extra == "docs" or "tests" or "timeseries"
dask: Missing dask[array]>=2021.4.0; extra == "dask" or "docs" or "tests"
drms: Missing drms<0.7.0,>=0.6.1; extra == "docs" or "net" or "tests"
glymur: Missing glymur!=0.9.5,>=0.9.1; extra == "docs" or "jpeg2000" or "tests"
h5netcdf: Missing h5netcdf>=0.11; extra == "docs" or "tests" or "timeseries"
h5py: Missing h5py>=3.1.0; extra == "docs" or "tests" or "timeseries"
lxml: 5.1.0
matplotlib: Missing matplotlib>=3.5.0; extra == "docs" or "map" or "tests" or "timeseries" or "visualization"
mpl-animators: Missing mpl-animators>=1.0.0; extra == "docs" or "map" or "tests" or "visualization"
pandas: Missing pandas>=1.2.0; extra == "docs" or "tests" or "timeseries"
python-dateutil: 2.8.2
reproject: Missing reproject; extra == "docs" or "docs-gallery" or "map" or "tests"
scikit-image: Missing scikit-image>=0.18.0; extra == "docs" or "image" or "tests"
scipy: Missing scipy!=1.10.0,>=1.7.0; extra == "docs" or "image" or "map" or "tests"
sqlalchemy: Missing sqlalchemy>=1.3.4; extra == "database" or "docs" or "tests"
tqdm: 4.66.1
zeep: Missing zeep>=3.4.0; extra == "docs" or "net" or "tests"

Installation method

pip

@kayhayen
Author

So, I found the culprit: post tags are indeed losing their ordering, so the postcard2 template produces a different ordering of the tags on each rendering of the page. In my case, I have 3 tags specified, and 2 of them easily swap places, it seems. Using ordered-set would resolve that, but I can see how you would hate adding a dependency. I have a fallback in Nuitka that also implements an ordered set, which I used to test this.

from collections.abc import MutableSet

class OrderedSet(MutableSet):
    is_fallback = True

    def __init__(self, iterable=()):
        self.end = end = []
        end += (None, end, end)  # sentinel node for doubly linked list
        self.map = {}  # key --> [key, prev, next]
        if iterable:
            self |= iterable

    def __len__(self):
        return len(self.map)

    def __contains__(self, key):
        return key in self.map

    def add(self, key):
        if key not in self.map:
            end = self.end
            curr = end[1]
            curr[2] = end[1] = self.map[key] = [key, curr, end]

    def update(self, keys):
        for key in keys:
            self.add(key)

    def discard(self, key):
        if key in self.map:
            key, prev, next = self.map.pop(key)
            prev[2] = next
            next[1] = prev

    def __iter__(self):
        end = self.end
        curr = end[2]
        while curr is not end:
            yield curr[0]
            curr = curr[2]

    def __reversed__(self):
        end = self.end
        curr = end[1]
        while curr is not end:
            yield curr[0]
            curr = curr[1]

    def pop(self, last=True):
        if not self:
            raise KeyError("set is empty")
        key = self.end[1][0] if last else self.end[2][0]
        self.discard(key)
        return key

    def __repr__(self):
        if not self:
            return "%s()" % (self.__class__.__name__,)
        return "%s(%r)" % (self.__class__.__name__, list(self))

    def __eq__(self, other):
        if isinstance(other, OrderedSet):
            return len(self) == len(other) and list(self) == list(other)
        return set(self) == set(other)

    def union(self, iterable):
        result = OrderedSet(self)

        for key in iterable:
            result.add(key)

        return result

    def index(self, key):
        if key in self.map:
            end = self.end
            curr = self.map[key]

            count = 0
            while curr is not end:
                curr = curr[1]
                count += 1

            return count - 1

        return None


def _split(a):
    return OrderedSet(s.strip() for s in (a or "").split(","))

@kayhayen
Author

Obviously, with ordered-set from PyPI, this becomes from ordered_set import OrderedSet. Let me know if I should make a PR out of it. It would be nice if it were accepted, at least as an optional dependency. As for the fallback, I am not 100% sure it is really perfect for everything; in my Python compiler Nuitka it is not causing issues, but you may not want to take the risk.

I cannot tell what other consequences there are of having _split in post produce ordered sets; for my blog, there are no measurable ones.

On my side, until this is released, I think I can monkey patch _split with the improved version.
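A rough sketch of what such a monkey patch could look like. The names `ordered_split` and `patch_split` are hypothetical, and `fake_post` is a stand-in so the snippet runs without ablog installed. In a real conf.py the module passed in would presumably be `ablog.post`, assuming that is where `_split` lives. Returning a list instead of a set is also an assumption: code that relies on set operations would need an ordered set like the one above instead.

```python
import types


def ordered_split(a):
    """Split a comma-separated option, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys(s.strip() for s in (a or "").split(",")))


def patch_split(module):
    """Replace module._split with ordered_split and return the original function."""
    original = module._split
    module._split = ordered_split
    return original


# Stand-in for the real module (assumption), mimicking a set-based _split.
fake_post = types.SimpleNamespace(
    _split=lambda a: {s.strip() for s in (a or "").split(",")}
)

original_split = patch_split(fake_post)
print(fake_post._split("nuitka, python,compiler,python"))
```

Keeping the returned original around makes it easy to undo the patch once a fixed release is available.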

@nabobalis
Contributor

nabobalis commented Jan 28, 2024

Thanks for the report!

How would you want it ordered or would you just want the tags to not change order for each build?

We should be able to order the output (hopefully sorting would be enough) before it's passed to the templates. That would avoid the need to add ordered-set?
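As a minimal sketch of the sorting idea (the helper name `sorted_tags` is hypothetical and not part of ablog): sorting case-insensitively, with the original string as a tie-breaker, gives the same order no matter how the set happens to iterate.

```python
def sorted_tags(tags):
    """Return tags in a stable, case-insensitive order, independent of set iteration."""
    # The second key element breaks ties between names that differ only in case.
    return sorted(tags, key=lambda tag: (tag.casefold(), tag))


print(sorted_tags({"Python", "git", "compiler"}))
```

This makes builds reproducible, but it replaces whatever order was written in the post with alphabetical order.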

@kayhayen
Author

I think it would be natural to expose the order of the tags as provided by the user. That is what OrderedSet gives me now; it only removes duplicates. That of course exposes that, for similar kinds of posts, I didn't pick the same ordering: "nuitka,python,compiler" is accompanied by many uses of "compiler,nuitka,python", etc., with many permutations. I obviously didn't consider their ordering until now.

My complaint is mainly that the HTML output differs for each build and that, in a sense, an uncontrollable ordering is happening; anything that removes that is an improvement. Asking the user to configure how to sort the different attributes might be too much effort and not worth it. Following the order in the page source seems natural.

@nabobalis
Contributor

In that case, maybe if I can add something like [a for a in list if a in set(list)] in the code base in the right location, I can avoid adding an optional dependency.
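For reference, a small sketch of order-preserving de-duplication without a dependency (the helper name `unique_in_order` is hypothetical). Note that the comprehension as literally written above keeps duplicates, because every element of a list is also in the set built from it; `dict.fromkeys` is one way to drop repeats while keeping first-seen order.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


tags = ["nuitka", "python", "compiler", "python"]

# The literal comprehension filters nothing: each item is in set(tags).
filtered = [a for a in tags if a in set(tags)]

print(filtered)
print(unique_in_order(tags))
```

The result is a plain list, so it suits iteration and membership tests but not set operations such as union or intersection.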

I will look into this hopefully soon(TM).

@kayhayen
Author

kayhayen commented Feb 1, 2024

I didn't dare change the type away from set, but making things unique the way you describe is of course far easier. If you only need iteration and `in` tests, that is no issue.
