YOLOWorld: use images as classes #12793

Open · wants to merge 34 commits into main

Conversation

@hoeflechner commented May 18, 2024

Use images as classes in YOLOWorld

An extension to the model.set_classes() method so that it optionally accepts images:

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8x-world.pt')

# provide a generic image of a tire
model.set_classes(["bus"], images=["tire.jpg"])
results = model.predict('ultralytics/assets/bus.jpg', conf=0.7)

The CLIP model provides an encoder for text as well as for images. The image encoder is used here so that occurrences of one image can be searched for in another image.
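
For illustration, a minimal sketch of that idea using the OpenAI clip package; the model variant and file names are placeholders, not the PR's actual choices:

import clip
import torch
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)  # assumption: CLIP variant

with torch.no_grad():
    # both encoders project into the same embedding space
    text_feat = model.encode_text(clip.tokenize(['a tire']).to(device))
    image_feat = model.encode_image(preprocess(Image.open('tire.jpg')).unsqueeze(0).to(device))

# after L2-normalization, an image embedding can stand in for a text embedding
text_feat /= text_feat.norm(dim=-1, keepdim=True)
image_feat /= image_feat.norm(dim=-1, keepdim=True)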

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced the set_classes method in the YOLO model to accept images for class specification.

📊 Key Changes

  • Method signature changed to allow passing both class labels and images to specify what the model should recognize.
  • Integration of PIL (Python Imaging Library) for image processing.
  • Ability to use both text labels and images to define classes, enriching the model’s understanding and recognition capability.
  • Internal handling in set_classes for converting image paths to PIL images and generating image features using the CLIP model (a simplified sketch follows this list).
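
To make the key changes concrete, here is a heavily simplified sketch of what the extended method might do. The CLIP variant, the feature-mixing behavior, and the attribute name are assumptions for illustration, not the PR's actual code:

import clip
import torch
from PIL import Image

def set_classes(self, classes, images=None):
    """Sketch only: define classes from text labels and/or example images."""
    model, preprocess = clip.load('ViT-B/32')  # assumption: CLIP variant
    with torch.no_grad():
        feats = model.encode_text(clip.tokenize(classes))
        if images:
            pil = [Image.open(im) if isinstance(im, str) else im for im in images]
            # assumption: image features stand in for the matching text features
            feats = model.encode_image(torch.stack([preprocess(im) for im in pil]))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    self.model.txt_feats = feats  # hypothetical attribute name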

🎯 Purpose & Impact

  • Flexibility: Users can now specify classes not only through text but also by providing example images. This enhances the model's adaptability to various contexts and increases ease of use for non-expert users. 🔄
  • Enhanced Recognition: By using images as part of class specification, the model can potentially improve its accuracy for those specific classes. This can be particularly beneficial in scenarios where certain objects or subjects are best defined visually. 🎯
  • Inclusivity in Input: This update makes the model more versatile by accepting input in various formats (text and image), catering to a broader range of user needs and use cases. 🌐

This update is a step toward making AI models more interactive and user-friendly while potentially improving performance through richer input methods.

github-actions bot commented May 18, 2024

All Contributors have signed the CLA. ✅
Posted by the CLA Assistant Lite bot.

@github-actions bot left a comment

👋 Hello @hoeflechner, thank you for submitting an Ultralytics YOLOv8 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with ultralytics/ultralytics main branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge main locally.
  • ✅ Verify all YOLOv8 Continuous Integration (CI) checks are passing.
  • ✅ Update YOLOv8 Docs for any new or updated features.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

See our Contributing Guide for details and let us know if you have any questions!

codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.03%. Comparing base (11a2ed1) to head (e607968).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12793      +/-   ##
==========================================
- Coverage   70.03%   68.03%   -2.01%     
==========================================
  Files         124      124              
  Lines       15723    15721       -2     
==========================================
- Hits        11012    10695     -317     
- Misses       4711     5026     +315     
Flag         Coverage Δ
Benchmarks   35.20% <8.33%> (-0.05%) ⬇️
GPU          ?
Tests        66.25% <100.00%> (+0.04%) ⬆️

Flags with carried-forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@hoeflechner (Author)

I have read the CLA Document and I sign the CLA

@Burhan-Q Burhan-Q added the enhancement New feature or request label May 21, 2024
@Burhan-Q Burhan-Q requested a review from Laughing-q May 21, 2024 22:26
@Burhan-Q (Member) left a comment

This is a very cool idea! I personally can't give you a full review of the proposed changes, but I've added a few notes to consider. Hopefully @Laughing-q will have a chance to take a look at these changes soon 🚀

Two review threads on ultralytics/models/yolo/model.py (outdated, resolved).
@Laughing-q (Member)

@hoeflechner Thanks for the PR! This feature seems really awesome!
For loading the images, can we just use our internal source loader?

from ultralytics.data import load_inference_source
self.dataset = load_inference_source(
    source=source,
    batch=self.args.batch,
    vid_stride=self.args.vid_stride,
    buffer=self.args.stream_buffer,
)

Then we're able to support all the formats just like our predictor, no matter whether it's a file, a directory, or ndarrays.
The problem with using our internal loader, I think, is that the output image would be an ndarray with BGR channel order from OpenCV, while the CLIP model expects PIL format with RGB order, so I guess we'll need to add a new preprocess here for CLIP that handles the OpenCV format, with resize and normalization.

For reference, here's the preprocess from the CLIP repo. Note that the CenterCrop is effectively a no-op, since it receives the same n_px arg as the Resize operation.

from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize, InterpolationMode

BICUBIC = InterpolationMode.BICUBIC

def _convert_image_to_rgb(image):
    return image.convert("RGB")

def _transform(n_px):
    return Compose([
        Resize(n_px, interpolation=BICUBIC),
        CenterCrop(n_px),
        _convert_image_to_rgb,
        ToTensor(),
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])
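
A minimal sketch of such an OpenCV-friendly preprocess, reusing the _transform above; preprocess_bgr_for_clip is a hypothetical helper name, not an existing API:

import cv2
from PIL import Image

def preprocess_bgr_for_clip(im_bgr, n_px=224):
    """Hypothetical helper: OpenCV BGR ndarray -> 1x3xn_pxxn_px float tensor for CLIP."""
    im_rgb = cv2.cvtColor(im_bgr, cv2.COLOR_BGR2RGB)  # BGR -> RGB
    pil_im = Image.fromarray(im_rgb)                  # ndarray -> PIL image
    return _transform(n_px)(pil_im).unsqueeze(0)      # resize, crop, normalize, add batch dim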

@hoeflechner (Author)

@Laughing-q thanks for your feedback. I will try to implement it as proposed.

@hoeflechner (Author)

I switched to load_inference_source(), but for the conversion to PIL I simply used Image.fromarray(); not sure what the advantages and disadvantages are.

@Burhan-Q (Member)

> I switched to load_inference_source(), but for the conversion to PIL I simply used Image.fromarray(); not sure what the advantages and disadvantages are.

The load_inference_source() function can load numerous data types:

# Dataloader
if tensor:
    dataset = LoadTensor(source)
elif in_memory:
    dataset = source
elif stream:
    dataset = LoadStreams(source, vid_stride=vid_stride, buffer=buffer)
elif screenshot:
    dataset = LoadScreenshots(source)
elif from_img:
    dataset = LoadPilAndNumpy(source)
else:
    dataset = LoadImagesAndVideos(source, batch=batch, vid_stride=vid_stride)
So you can remove `from PIL import Image` and all related code using `Image` (conversion to RGB is carried out by the LoadPilAndNumpy class).
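
For illustration, a minimal usage sketch of that loader; the layout of the yielded batch tuple is an assumption and may differ between versions:

import cv2
from PIL import Image
from ultralytics.data import load_inference_source

dataset = load_inference_source('tire.jpg')  # also handles dirs, URLs, PIL images, ndarrays
for batch in dataset:
    paths, im0s, s = batch  # assumption: (paths, images, info string); check your version
    pil_im = Image.fromarray(cv2.cvtColor(im0s[0], cv2.COLOR_BGR2RGB))  # BGR ndarray -> RGB PIL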

@hoeflechner (Author)

@Burhan-Q I used this approach to load the images, but as @Laughing-q pointed out, it returns an ndarray. CLIP requires a PIL image, so I used PIL's function to convert it.

@Burhan-Q (Member)

@hoeflechner I'm not familiar with the inputs for CLIP, but I now see the comment from @Laughing-q above with regard to needing some sort of preprocessing. I'll let him weigh in from here then. Thank you!

@hoeflechner (Author)

@Burhan-Q CLIP returns a function that transforms a PIL image into its own torch tensor. @Laughing-q was suggesting writing a function that does the same with an ndarray. For me this is probably more work than the whole patch I proposed, as I have very little understanding of the internal mechanisms of CLIP and Ultralytics... I also think it could lead to problems if CLIP changes its internal tensor format.

@Laughing-q Laughing-q added the TODO Items that needs completing label May 27, 2024
@Laughing-q Laughing-q self-assigned this May 27, 2024
@glenn-jocher glenn-jocher removed the TODO Items that needs completing label Jun 1, 2024
@Laughing-q (Member) commented Jun 6, 2024

@hoeflechner @Burhan-Q Guys, I polished this PR a little bit, and currently it supports all the source formats that we support for the predictor (except the torch.Tensor type), and I found we actually have an internal classify_transforms we can reuse for CLIP preprocessing. Everything looks good to me now!

@glenn-jocher This PR adds support for images as input to YOLOWorld.set_classes, so we can now use images to set categories for our YOLOWorld model, which I feel is a cool feature.
The PR is ready from my side; if you also find it interesting, please take a look when you have time. :)
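
For illustration, a rough sketch of reusing classify_transforms for CLIP preprocessing; the import path, signature, and normalization defaults are assumptions and may differ between versions:

import clip
import torch
from PIL import Image
from ultralytics.data.augment import classify_transforms  # assumption: import path

model, _ = clip.load('ViT-B/32')
# assumption: size kwarg; CLIP's own mean/std may need to be passed explicitly
transform = classify_transforms(size=224)
with torch.no_grad():
    feats = model.encode_image(transform(Image.open('tire.jpg')).unsqueeze(0))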

Labels: enhancement (New feature or request) · 5 participants