
input folder deduplication #3465

Open
oO0 opened this issue May 13, 2024 · 2 comments

Comments

@oO0

oO0 commented May 13, 2024

Sometimes it is faster to drop the file again than to find it in the web UI. So it would be better to check whether a file with the same hash already exists in the folder and, if so, not copy it again.

Currently, dropping a file that came from the input folder copies it again. This is undesirable.

@shawnington
Contributor

You want us to de-duplicate via md5 checksums, so that loading a file that already exists in the input folder opens the existing copy instead of saving a new one?

@oO0
Author

oO0 commented May 13, 2024

Yes. Because searching for files in an external viewer is faster than in the web UI.

shawnington added a commit to shawnington/ComfyUI that referenced this issue May 13, 2024
… if overwrite not specified

This is a fix to comfyanonymous#3465 

Adds a function compare_image_hash that does a sha256 hash comparison between an uploaded image and existing images with matching file names.

This changes the behavior so that only images with the same filename that are actually different are saved to input; existing identical images are opened instead of being re-saved with an incremented name.

Currently, exact duplicates with the same filename are re-saved with an incremented filename in the format:

<filename> (n).ext 

with the code: 

```
while os.path.exists(filepath):
    filename = f"{split[0]} ({i}){split[1]}"
    filepath = os.path.join(full_output_folder, filename)
    i += 1
```

This commit changes this to: 

```
while os.path.exists(filepath):
    if compare_image_hash(filepath, image):
        image_is_duplicate = True
        break
    filename = f"{split[0]} ({i}){split[1]}"
    filepath = os.path.join(full_output_folder, filename)
    i += 1
```

A check that image_is_duplicate is False is done before saving the file.

Currently, if you load the same image of a cat named cat.jpg into the LoadImage node 3 times, you will get 3 new files in your input folder with incremented file names.

With this change, you will now only have the single copy of cat.jpg, that will be re-opened instead of re-saved. 

However if you load 3 different images of cats named cat.jpg, you will get the expected behavior of having:
cat.jpg
cat (1).jpg
cat (2).jpg

This saves space and clutter. After checking my own input folder, I have 800+ images that are duplicates that were resaved with incremented file names amounting to more than 5GB of duplicated data.