Replies: 1 comment 4 replies
This is the intended default behavior. You can change it by passing an […]. You can archive the items that you wish to remove by upserting them on their id.
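The upsert-on-id semantics described above can be sketched with an in-memory stand-in. Note that `DatasetStore`, `upsert_item`, and the `status` field are hypothetical names for illustration, not the library's actual API:

```python
# Sketch of upsert-on-id semantics (all names here are hypothetical).
class DatasetStore:
    def __init__(self):
        self.items = {}  # maps item id -> item record

    def upsert_item(self, item_id, content, status="ACTIVE"):
        # Upserting on an existing id replaces the stored record rather
        # than creating a duplicate, so the same call can also flip an
        # item's status to ARCHIVED (a soft delete).
        self.items[item_id] = {"content": content, "status": status}

    def active_count(self):
        return sum(1 for it in self.items.values() if it["status"] == "ACTIVE")


store = DatasetStore()
store.upsert_item("item-1", "question A")
store.upsert_item("item-1", "question A")                      # same id: no duplicate
store.upsert_item("item-1", "question A", status="ARCHIVED")   # soft-remove it
```

The key point is that duplicates only arise when each run of a loading script generates fresh ids; upserting on a stable id is idempotent.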
Hello,
Recently, by accident, we ended up loading one of our datasets with duplicates.
This happened because we ran a script that updates the retrieval filter 4-5 times, and it contained the following lines:
Once we realized what happened, we deleted those lines from our script.
Our original dataset used to contain 111 items, but now it contains 579 items because of this.
Three questions:
How would you recommend reverting our dataset to its original state? We want to avoid deleting and recreating it, because we would lose the info on our previous runs. We also don't want to archive it and create a new one under a different name, because that would break dependencies (Hugging Face, Pinecone).
Is it a bug or intended behavior that an existing dataset can be filled with additional data? I would rather get an error when running the lines of code above than have my dataset silently grow.
Is it intended that a dataset can contain duplicate items, or is that a bug?
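One way to revert without deleting the dataset, assuming the duplicates have identical content but distinct ids, is to first compute which ids are redundant and then archive those via the upsert mechanism. Below is a sketch of only the selection step; the archiving call itself depends on the library's API, and the `(id, content)` pair shape is an assumption:

```python
from collections import defaultdict

def duplicate_ids_to_archive(items):
    """Group items by content, keep the first id seen for each distinct
    content, and return the remaining ids as candidates for archiving.
    `items` is a list of (item_id, content) pairs."""
    by_content = defaultdict(list)
    for item_id, content in items:
        by_content[content].append(item_id)
    # Everything after the first id in each group is a duplicate.
    return [item_id for ids in by_content.values() for item_id in ids[1:]]
```

Applied to the situation above, this would return 468 ids (579 total minus the 111 originals), which could then be archived one by one.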