New alchemy forms - clip image feature extraction, clip text encode #356

Open
tazlin opened this issue Jan 17, 2024 · 2 comments
Labels: alchemy (Improvements for the alchemy pipelines), enhancement (New feature or request)

Comments


tazlin commented Jan 17, 2024

There are use cases for being able to do client-side manipulation of the various intermediate results of the CLIP interrogation process.

To compare an image to text via CLIP, the following happens:

  1. The text is encoded into features. open_clip uses clip_model.encode_text(text_tokens). This returns a tensor.
  2. The image features are extracted using the CLIP model. open_clip uses clip_model.encode_image(...). This returns a tensor.
  3. The tensors are normalized.
  4. The image features and the text features are compared.
  5. A similarity score is assigned and returned.
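
A minimal sketch of that flow with open_clip (the model name, weights, and inputs below are illustrative only):

```python
# Minimal sketch of the current interrogation flow, assuming open_clip;
# the model name, weights, and inputs are illustrative only.
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("example.png")).unsqueeze(0)
text_tokens = tokenizer(["a photograph of a cat"])

with torch.no_grad():
    text_features = model.encode_text(text_tokens)  # step 1: text tensor
    image_features = model.encode_image(image)      # step 2: image tensor

    # step 3: normalize both tensors
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # steps 4 + 5: compare and return a similarity score
    similarity = (image_features @ text_features.T).item()
```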

This feature request would allow the results of steps 1 and 2 to be returned independently, either as part of a regular interrogate request or on their own. Clients could then perform the math relevant to their use case in slow or RAM-limited environments without needing to load a CLIP model locally. Certain kinds of image-search and database schemes could benefit from this.

I propose the following forms be added:

  1. encode_text

    • Accepts a list of strings and the name of a supported CLIP model.
    • For each string, returns a .safetensors file containing the encoded text tensor and the name of the model used to encode it.
  2. encode_image

    • Accepts a source_image and the name of a supported CLIP model.
    • Returns a .safetensors file containing the encoded image features and the name of the model used to encode it.
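
A minimal sketch of what an encode_text form might produce, assuming the safetensors library; the tensor key and metadata key names are illustrative, not a fixed schema:

```python
# Sketch of encode_text output, assuming open_clip + safetensors;
# the key names ("text_features", "clip_model") are illustrative only.
import open_clip
import torch
from safetensors.torch import save_file

model_name = "ViT-L-14"
model, _, _ = open_clip.create_model_and_transforms(model_name, pretrained="openai")
tokenizer = open_clip.get_tokenizer(model_name)

prompts = ["a photograph of a cat", "a watercolor landscape"]

with torch.no_grad():
    features = model.encode_text(tokenizer(prompts))

# One .safetensors file per input string, carrying the tensor plus the
# model identifier so the client knows which encoder produced it.
for i, prompt in enumerate(prompts):
    save_file(
        {"text_features": features[i].contiguous()},
        f"encoded_text_{i}.safetensors",
        metadata={"clip_model": model_name},
    )
```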

This proposal has the obvious wrinkle of needing to support the upload of .safetensors files. These files are on the order of single-digit kilobytes each.
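
A minimal sketch of the client side of that exchange, assuming the returned files use the (illustrative) key names above; nothing here needs a GPU or CLIP weights:

```python
# Client-side consumption sketch: compare pre-computed features with
# numpy only, no CLIP model loaded. Filenames and keys match the sketch above.
import numpy as np
from safetensors.numpy import load_file

text = load_file("encoded_text_0.safetensors")["text_features"]
image = load_file("encoded_image.safetensors")["image_features"]

# Steps 3-5 from the list above: normalize and take the dot product.
text = text / np.linalg.norm(text)
image = image / np.linalg.norm(image)
similarity = float(image @ text)
```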

Related to Haidra-Org/horde-worker-reGen#9.

tazlin added the alchemy (Improvements for the alchemy pipelines) and enhancement (New feature or request) labels on Jan 17, 2024
rbrtcs1 commented Jan 17, 2024

A useful feature might be to opt into including the resulting image embeddings with an image generation request.

That is, in the /generate/status/ endpoint, each generation result would include an R2 URL pointing to that image's calculated embedding safetensors file.
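
Purely as an illustration of the shape this could take (the embedding field and URLs below are invented, not part of the current API):

```python
# Hypothetical excerpt of a single generation result in a /generate/status/
# response; the "embedding" field and URLs are invented for illustration.
generation_result = {
    "img": "https://r2.example/generated-image.webp",               # existing image URL
    "embedding": "https://r2.example/image-embedding.safetensors",  # proposed opt-in field
}
```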

That said, this is easy to avoid by simply doing the alchemy request separately, and I imagine this request would be more difficult to set up.


db0 commented Jan 17, 2024

I think we can avoid using R2 here and just base64 the safetensors in the DB. A couple of KB per file shouldn't be a terrible amount, and if bandwidth starts getting choked because of these, I can always switch to R2 later.
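
A minimal sketch of that approach, assuming the safetensors library (the stand-in feature vector is illustrative only):

```python
# Sketch of the suggestion above: base64 the serialized safetensors bytes
# for DB storage, then decode on the client. The feature vector is a stand-in.
import base64
import numpy as np
from safetensors.numpy import load, save

features = np.random.rand(768).astype(np.float32)  # stand-in CLIP features

# Server side: serialize to bytes and base64-encode (a few KB per file).
payload = base64.b64encode(save({"text_features": features})).decode("ascii")

# Client side: decode and deserialize without touching disk.
tensors = load(base64.b64decode(payload))
assert np.allclose(tensors["text_features"], features)
```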
