Stream support for exporting pdbs #108

Open
djberenberg opened this issue Aug 9, 2022 · 6 comments

@djberenberg

djberenberg commented Aug 9, 2022

Describe the workflow you want to enable

I'd like to be able to export a pdb to a stream instead of to disk. In particular the reason why I'd like to do so is so that I can pass the stream directly to wandb.Molecule

Describe your proposed solution

The PandasPdb.to_pdb method could accept a path_or_stream: typing.Union[io.StringIO, str] argument instead of just path: str. Internally, if path_or_stream is an io.StringIO object, we don't need an openf function and can instead just execute the internal loops seen here, where f is now the io.StringIO object.

Making this change would allow the stream to be filled in place with the pdb text.
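
The dispatch described above could look roughly like the sketch below. This is a minimal illustration and not biopandas code: `write_lines` and `_write` are hypothetical names, and the list of strings stands in for the formatted PDB records that to_pdb builds internally.

```python
import io
from typing import Iterable, Union


def _write(lines: Iterable[str], f: io.TextIOBase) -> None:
    # Shared writing loop: identical for a file handle and an in-memory stream.
    for line in lines:
        f.write(line + "\n")


def write_lines(lines: Iterable[str],
                path_or_stream: Union[str, io.TextIOBase]) -> None:
    """Write lines to a file path or to an already-open text stream."""
    if isinstance(path_or_stream, str):
        # A path was given: open the file ourselves and close it afterwards.
        with open(path_or_stream, "w") as f:
            _write(lines, f)
    else:
        # A stream was given: write into it and leave it open for the caller.
        _write(lines, path_or_stream)


buf = io.StringIO()
write_lines(["ATOM", "HETATM"], buf)
```

With this shape the caller owns the stream, so nothing touches the disk and the buffer can be handed on once it has been filled.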

Describe alternatives you've considered, if relevant

Currently I am needlessly writing to disk temporarily, reopening the file, and passing its contents to the wandb.Molecule object.

Additional context

@a-r-j
Contributor

a-r-j commented Aug 9, 2022

Hey @djberenberg I've actually done this already. Code to follow once I find it :) I agree this would be a nice feature for biopandas

@a-r-j
Contributor

a-r-j commented Aug 9, 2022

Here you go:

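from io import StringIO

import pandas as pd
from biopandas.pdb.engines import pdb_records  # per-record column formatting specs


def to_pdb_stream(df: pd.DataFrame) -> StringIO: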
    """Writes a PDB dataframe to a stream.

    :param df: PDB dataframe
    :type df: pandas.DataFrame
    :return: StringIO Buffer
    :rtype: StringIO
    """

    df = df.copy().drop(columns=["model_id"])
    df.residue_number = df.residue_number.astype(int)
    records = [r.strip() for r in list(set(df.record_name))]
    dfs = {r: df.loc[df.record_name == r] for r in records}

    for r in dfs:
        for col in pdb_records[r]:
            dfs[r][col["id"]] = dfs[r][col["id"]].apply(col["strf"])
            dfs[r]["OUT"] = pd.Series("", index=dfs[r].index)

        for c in dfs[r].columns:
            # fix issue where coordinates with four or more digits would
            # cause issues because the columns become too wide
            if c in {"x_coord", "y_coord", "z_coord"}:
                for idx in range(dfs[r][c].values.shape[0]):
                    if len(dfs[r][c].values[idx]) > 8:
                        dfs[r][c].values[idx] = str(
                            dfs[r][c].values[idx]).strip()

            if c not in {"line_idx", "OUT"}:
                dfs[r]["OUT"] = dfs[r]["OUT"] + dfs[r][c]

    df = pd.concat(dfs, sort=False)
    df.sort_values(by="line_idx", inplace=True)

    output = StringIO()
    s = df["OUT"].tolist()
    for idx in range(len(s)):
        if len(s[idx]) < 80:
            s[idx] = f"{s[idx]}{' ' * (80 - len(s[idx]))}"
    to_write = "\n".join(s)
    output.write(to_write)
    output.write("\n")
    return output

@djberenberg
Author

Thank you @a-r-j !!!

@rasbt
Member

rasbt commented Aug 10, 2022

Wow, thanks @a-r-j . If this works for you @djberenberg , it'd be great to add this to biopandas as a PR :)

@a-r-j
Contributor

a-r-j commented Aug 10, 2022

Sure @rasbt , I'll add it to the open PR once I've got a moment.

@djberenberg
Author

@rasbt @a-r-j Works for me. The only changes I made were to drop "model_id" conditionally, since that column might not be present, and to add output.seek(0) before returning.
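
To illustrate why the seek(0) matters, here is a small standard-library sketch (the ATOM line is made-up sample text, not real output from the function above). After writing, a StringIO's position sits at the end of the buffer, so a consumer that calls read() gets an empty string unless the stream is rewound first.

```python
from io import StringIO

buf = StringIO()
buf.write("ATOM      1  N   MET A   1\n")

# The position is at the end after writing, so read() returns nothing.
at_end = buf.read()

# Rewind to the start so the next reader sees the whole contents.
buf.seek(0)
from_start = buf.read()
```

For the other change, one option (an assumption about how it could be done, not necessarily what was used here) is df.drop(columns=["model_id"], errors="ignore"), which leaves the dataframe unchanged when the column is absent.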
