New alchemy forms - clip image feature extraction, clip text encode #356

Open
tazlin opened this issue Jan 17, 2024 · 2 comments
Labels: alchemy (Improvements for the alchemy pipelines), enhancement (New feature or request)

Comments


tazlin commented Jan 17, 2024

There are use cases for being able to do client-side manipulation of the various intermediate results of the CLIP interrogation process.

To compare an image to text via CLIP, the following happens:

  1. The text is encoded into features. open_clip uses clip_model.encode_text(text_tokens). This returns a tensor.
  2. The image features are extracted using the CLIP model. open_clip uses clip_model.encode_image(...). This returns a tensor.
  3. The tensors are normalized.
  4. The image features and the text features are compared.
  5. A similarity score is assigned and returned.
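
A minimal sketch of that flow with open_clip (the model name, weights, and inputs below are illustrative only):

```python
# Minimal sketch of the current interrogation flow, assuming open_clip;
# the model name, weights, and inputs are illustrative only.
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("example.png")).unsqueeze(0)
text_tokens = tokenizer(["a photograph of a cat"])

with torch.no_grad():
    text_features = model.encode_text(text_tokens)  # step 1: text tensor
    image_features = model.encode_image(image)      # step 2: image tensor

    # step 3: normalize both tensors
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # steps 4 + 5: compare and return a similarity score
    similarity = (image_features @ text_features.T).item()
```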

This feature request would allow the results of steps 1 and 2 to be returned independently, either as part of a regular interrogate request or on their own. Clients could then perform the math relevant to their use case in slow or RAM-limited environments without needing to load a CLIP model locally. Certain kinds of image-search and database schemes could benefit from this.

I propose the following forms be added:

  1. encode_text

    • Accepts a list of strings and the name of a supported CLIP model.
    • For each string, returns a .safetensors file containing the encoded text tensor and the name of the model used to encode it.
  2. encode_image

    • Accepts a source_image and the name of a supported CLIP model.
    • Returns a .safetensors file containing the encoded image features and the name of the model used to encode it.
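
A minimal sketch of what an encode_text form might produce, assuming the safetensors library; the tensor key and metadata key names are illustrative, not a fixed schema:

```python
# Sketch of encode_text output, assuming open_clip + safetensors;
# the key names ("text_features", "clip_model") are illustrative only.
import open_clip
import torch
from safetensors.torch import save_file

model_name = "ViT-L-14"
model, _, _ = open_clip.create_model_and_transforms(model_name, pretrained="openai")
tokenizer = open_clip.get_tokenizer(model_name)

prompts = ["a photograph of a cat", "a watercolor landscape"]

with torch.no_grad():
    features = model.encode_text(tokenizer(prompts))

# One .safetensors file per input string, carrying the tensor plus the
# model identifier so the client knows which encoder produced it.
for i, prompt in enumerate(prompts):
    save_file(
        {"text_features": features[i].contiguous()},
        f"encoded_text_{i}.safetensors",
        metadata={"clip_model": model_name},
    )
```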

This proposal has the obvious wrinkle of needing to support the upload of .safetensors files. These files are on the order of single-digit kilobytes each.
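
A minimal sketch of the client side of that exchange, assuming the returned files use the (illustrative) key names above; nothing here needs a GPU or CLIP weights:

```python
# Client-side consumption sketch: compare pre-computed features with
# numpy only, no CLIP model loaded. Filenames and keys match the sketch above.
import numpy as np
from safetensors.numpy import load_file

text = load_file("encoded_text_0.safetensors")["text_features"]
image = load_file("encoded_image.safetensors")["image_features"]

# Steps 3-5 from the list above: normalize and take the dot product.
text = text / np.linalg.norm(text)
image = image / np.linalg.norm(image)
similarity = float(image @ text)
```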

Related to Haidra-Org/horde-worker-reGen#9.

tazlin added the alchemy (Improvements for the alchemy pipelines) and enhancement (New feature or request) labels on Jan 17, 2024
rbrtcs1 commented Jan 17, 2024

A useful feature might be to opt into including the resulting image embeddings with an image generation request.

That is, in the /generate/status/ endpoint, each generation result would include an R2 URL pointing to that image's calculated embedding safetensors file.
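
Purely as an illustration of the shape this could take (the embedding field and URLs below are invented, not part of the current API):

```python
# Hypothetical excerpt of a single generation result in a /generate/status/
# response; the "embedding" field and URLs are invented for illustration.
generation_result = {
    "img": "https://r2.example/generated-image.webp",               # existing image URL
    "embedding": "https://r2.example/image-embedding.safetensors",  # proposed opt-in field
}
```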

That said, this is easy to avoid by simply doing the alchemy request separately, and I imagine this request would be more difficult to set up.


db0 commented Jan 17, 2024

I think we can avoid using R2 here and just base64 the safetensors in the DB. A couple of KB per file shouldn't be a terrible amount, and if bandwidth starts getting choked because of these, I can always switch to R2 later.
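
A minimal sketch of that approach, assuming the safetensors library (the stand-in feature vector is illustrative only):

```python
# Sketch of the suggestion above: base64 the serialized safetensors bytes
# for DB storage, then decode on the client. The feature vector is a stand-in.
import base64
import numpy as np
from safetensors.numpy import load, save

features = np.random.rand(768).astype(np.float32)  # stand-in CLIP features

# Server side: serialize to bytes and base64-encode (a few KB per file).
payload = base64.b64encode(save({"text_features": features})).decode("ascii")

# Client side: decode and deserialize without touching disk.
tensors = load(base64.b64decode(payload))
assert np.allclose(tensors["text_features"], features)
```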
