Using transform on floats breaks sorting #60

Open
slochower opened this issue Oct 30, 2023 · 5 comments

@slochower

slochower commented Oct 30, 2023

If I use a transform dictionary to change the formatting of certain properties (say, according to other local variables) like so:

raw_html = mols2grid.display(
    _df,
    mol_col="Mol",
    subset=[
        "Name",
        "img",
    ],
    transform={
        "MMP std. dev. difference": lambda x: (
            f"{x:.2f} {' (Log units)' if log else ''}"
        ),
    },
    tooltip=[
        "MMP std. dev. difference",
    ],
)._repr_html_()  # type: ignore

...then it appears that all sorting is based on the str representation of "MMP std. dev. difference" instead of the float representation, even though in the actual data frame the column "MMP std. dev. difference" has a dtype of float64. In this example, I could just change the column title before I pass the data to mols2grid, but in other cases I want to use transform to change the string shown to the user yet still retain sorting by the original dtype. Is this possible?

Edit: I think the issue arises from

for col, func in transform.items():
    df[col] = df[col].apply(func)
where the transform is applied directly to the data, changing the column to a str in the above code snippet.

A simpler example that would show the same behavior is this:

raw_html = mols2grid.display(
    _df,
    mol_col="Mol",
    subset=[
        "Name",
        "img",
        "x",
    ],
    transform={
        "x": lambda x: f"x = {np.round(x, 2)}",
    },
)._repr_html_()  # type: ignore

where just changing the display from 3.14 to x = 3.14 will break sorting.
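To make the failure mode concrete, here is a minimal sketch in plain Python, with no mols2grid involved. The values are made up for illustration, and the built-in round stands in for np.round. Once each float has been turned into a string, ordering compares characters rather than magnitudes:

```python
# Hypothetical values standing in for the float64 "x" column.
values = [10.5, 9.25, 100.0, 3.14]

# Same kind of formatting as the transform in the example above.
labels = [f"x = {round(v, 2)}" for v in values]

numeric_order = sorted(values)  # compares magnitudes
string_order = sorted(labels)   # compares character by character

print(numeric_order)  # [3.14, 9.25, 10.5, 100.0]
print(string_order)   # ['x = 10.5', 'x = 100.0', 'x = 3.14', 'x = 9.25']
```

The string ordering puts 100.0 before 3.14 because "1" sorts before "3", which is the behavior described above.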

@cbouy
Owner

cbouy commented Nov 6, 2023

Can't really think of an easy solution for this one apart from keeping 2 distinct columns, one for displaying and one for sorting (which you could hide with style={"x": lambda x: "display: none"}) but that's not great...

A solution could be to do a regex search for numeric values and strip out everything else before sorting, but this operation would have to be done on each pair of values being compared, which (in addition to being tricky to do correctly) would slow things down quite a bit, or it would need a significant rewrite, which will not happen if I'm honest.
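As a rough illustration of the regex idea, here is a sketch in Python rather than in the JavaScript that the grid actually runs. The pattern and the numeric_key helper are hypothetical and not part of mols2grid, and they assume plain decimal numbers with no exponents or thousands separators:

```python
import re

# Optional sign, digits, optional decimal part.
# Assumption: no exponents, thousands separators, or NaN handling.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def numeric_key(text):
    """Return the first number found in text as a float, or inf if none."""
    match = NUMBER.search(text)
    return float(match.group()) if match else float("inf")


labels = ["x = 10.5", "x = 100.0", "x = 3.14", "x = 9.25"]
print(sorted(labels, key=numeric_key))
# ['x = 3.14', 'x = 9.25', 'x = 10.5', 'x = 100.0']
```

Python's sorted calls the key function once per element rather than once per comparison. Whether the JavaScript sorter behind the grid can do the same is not established here, so the per-comparison cost mentioned above may still apply on that side.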

@slochower
Author

Yes, I see what you mean. I had two thoughts:

  1. Apply the transform only for display, so that the transform string really only gets written to the div and not stored in the data frame, like this:
    s = f'<div class="data data-{slugify(col)}" style="display: none;"></div>'

    ...and then still use the data frame itself for sorting.
  2. Allow a custom key for sorting that would get passed directly to DataFrame.sort_values(key=...) (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html). I think this one is the simplest. Then, in my case, I could use a lambda to strip the first few characters, which are fixed, and cast the rest of the string to a float.
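For the second idea, a minimal pandas-only sketch of what such a key could look like. The PREFIX constant and the sample values are illustrative, and nothing here implies that mols2grid exposes a key parameter; the sketch only shows the pandas call the suggestion would forward to:

```python
import pandas as pd

PREFIX = "x = "  # fixed text added by the transform (illustrative)

# The column already holds the transformed strings.
df = pd.DataFrame({"x": ["x = 10.5", "x = 100.0", "x = 3.14", "x = 9.25"]})

# `key` receives the whole column as a Series and must return a Series of
# the same length; here it drops the fixed prefix and casts to float.
sorted_df = df.sort_values("x", key=lambda s: s.str[len(PREFIX):].astype(float))

print(sorted_df["x"].tolist())
# ['x = 3.14', 'x = 9.25', 'x = 10.5', 'x = 100.0']
```

This only affects sorting performed in pandas, before the data reaches the rendered grid.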

@cbouy
Owner

cbouy commented Nov 6, 2023

Ah wait are you talking about the sort_by parameter in mols2grid.display or the sort button on the grid?

My "solution" above referred to the latter and would operate on the JavaScript side, through the library that handles the grid display; yours seems to suggest the former (i.e. the Python side).
If you're only interested in sorting once at the beginning, I guess I could just do the sorting before applying the data transforms (but you could also do that directly on your input dataframe tbh).

Regarding your thoughts:

  1. the div seen here is basically just a template string used on the JavaScript side later on. The values from the dataframe are directly injected between the <div ...></div> in that template by the JavaScript library that I use, so at that point I can only pass the already-transformed values.
  2. this would work, but as said in the message above, I could just do the sorting before the data transforms and avoid users having to provide yet another lambda function to handle the data.

@slochower
Author

Sorry for the confusion -- I am actually referring to the button on the rendered grid. I didn't look at when/how the data is shipped to JS, so I forgot to think about sorting in JS rather than pandas. I'm not interested in sorting once -- as you say, that's not too tricky -- just hoping that if someone sorts on a field that's x = 123, that we can do the "right thing" numerically. I'm using a transform to do x = 123 simply because the molecules have lots of data associated with them and without the string, I don't think the users will know which number is displayed.

@cbouy
Owner

cbouy commented Nov 6, 2023

Ok in that case I guess a reasonable feature would be to add a new regex_sort parameter that toggles a slower but more powerful sorting function on the JS side. Not sure when I can add that in but will definitely consider it

@cbouy cbouy added the enhancement New feature or request label Nov 6, 2023
@cbouy cbouy self-assigned this Nov 6, 2023