Meilisearch across the Semantic Verse #3838

Open
Kerollmops opened this issue Jun 15, 2023 · 5 comments

@Kerollmops
Member

Kerollmops commented Jun 15, 2023

[Image: Spider-Man: Across the Spider-Verse, Spider-People]

Meilisearch is currently exploring the semantic search universe. A promising epoch of search is unfolding right before us. Semantic search unlocks new paradigms, and understanding both the documents and the user query is a game changer. While it seems easy to aggregate vectors in a store and retrieve the nearest neighbors on a query, much more can be done. This is only the first step of the wonderful journey that awaits us.

Where are we?

We just released Meilisearch v1.3-rc.0 with the vector store experimental feature. It can do what most of the other vector stores on the market do: store documents associated with a vector and return the nearest ones based on another vector. We did that in a week or so. We will polish it and release it in the final v1.3 version as an experimental feature. We would like to provide this experimental feature on the cloud and make it easier to compute your vectors by plugging Meilisearch into OpenAI and Hugging Face.
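To make "storing documents associated with a vector and returning the nearest ones" concrete, here is a minimal brute-force sketch. It is not how Meilisearch implements its vector store; the in-memory `store`, the `vector` field name, and the choice of cosine similarity are assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_documents(store, query_vector, limit=2):
    """Return the ids of the `limit` documents whose vectors are closest to the query."""
    scored = [(cosine_similarity(doc["vector"], query_vector), doc["id"]) for doc in store]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:limit]]

# Hypothetical toy store: three documents with made-up 3-dimensional vectors.
store = [
    {"id": "espresso-machine", "vector": [0.9, 0.1, 0.0]},
    {"id": "coffee-grinder", "vector": [0.7, 0.3, 0.1]},
    {"id": "blue-shirt", "vector": [0.0, 0.2, 0.9]},
]

print(nearest_documents(store, [1.0, 0.0, 0.0]))
```

A real vector store would replace the linear scan with an approximate nearest-neighbor index, but the contract stays the same: vectors in, closest documents out.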

Exposing vector database capabilities unlocks a lot of potential use cases. People will be able to use Meilisearch to create conversational chatbots. We did a fun experiment: feeding Meilisearch our private Notion pages and asking the bot about internal processes using LangChain. The results were stunning!

There are a lot of other interesting use cases, specifically around raw search. As Meilisearch can return the nearest documents based on the query's vector, one can mix the similarity score with the soon-to-be-released ranking score of the keyword search ranking rules. It is one of the paths that we can take to walk toward hybrid search. It is not necessarily the best, but it works for many simple use cases. We did some experiments with an e-commerce dataset, and the results have been pretty good so far!
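One simple way to mix the two scores is a weighted sum. The sketch below assumes both scores are already normalized to the range 0 to 1; the `semantic_weight` parameter and the candidate layout are hypothetical and do not describe an actual Meilisearch setting.

```python
def hybrid_score(keyword_score, semantic_score, semantic_weight=0.5):
    """Blend a keyword ranking score and a semantic similarity score.

    Both inputs are assumed to be normalized to [0, 1].
    semantic_weight=0.0 means pure keyword, 1.0 means pure semantic.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    return (1.0 - semantic_weight) * keyword_score + semantic_weight * semantic_score

def rank(candidates, semantic_weight=0.5):
    """Return candidate ids ordered by descending blended score."""
    ordered = sorted(
        candidates,
        key=lambda c: hybrid_score(c["keyword"], c["semantic"], semantic_weight),
        reverse=True,
    )
    return [c["id"] for c in ordered]

# Made-up scores for three documents.
candidates = [
    {"id": "A", "keyword": 0.9, "semantic": 0.2},
    {"id": "B", "keyword": 0.4, "semantic": 0.8},
    {"id": "C", "keyword": 0.1, "semantic": 0.3},
]

print(rank(candidates, semantic_weight=0.5))
print(rank(candidates, semantic_weight=0.0))
```

With an even weight, document B overtakes A because its semantic score compensates for a weaker keyword match; with the weight at zero, the ordering falls back to the keyword scores alone.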

Where do we want to go?

There are plenty of roads we can take. Meilisearch is an excellent keyword search engine, one of the best on the market. We created tools to measure our engine's relevancy: precision and recall. This ensures we continue to improve our results and avoid regressions throughout our releases. We plan to release a blog post explaining everything and why we have better relevancy than the competition when using keyword-based queries, e.g., blue shirt, delonghi coffee machine.

However, we must and plan to extend our TREC-COVID test suite with question-based queries, not only keyword ones. This is where Meilisearch lacks semantic understanding, e.g., What's the capital of France?. By mixing semantic search with keyword search, we can ensure that asking the engine a question works and that it does not try to search for useless words, e.g., What, the. However, mixing the different scores doesn't work well with the ranking rules. Those rules let the user decide how relevancy and sort order take priority over one another. When you sort products by ascending price, you still want the more relevant ones first, even if they are more expensive.

Furthermore, semantic search handles neither typos nor prefix search (also known as search-as-you-type queries) well. When an e-commerce user is typing their query in the search bar, the last word is only partially written. The keyword-based Meilisearch version handles that very well, but the semantic version does not. In our relevancy benchmark, a partial query like machine del gives bad results with the semantic search version. It interprets del as the computer brand, not as the prefix of delonghi in coffee machine delonghi. On the other hand, the keyword version finds occurrences of machine near DeLonghi, which shows in the relevancy scores.
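The search-as-you-type behavior described above can be approximated by treating the last query word as a prefix and every other word as complete. This is only a sketch of the idea: it ignores typo tolerance and ranking, and the function name is hypothetical rather than part of Meilisearch.

```python
def matches_as_you_type(query, document_text):
    """Return True if every complete query word appears in the document
    and the last (possibly unfinished) word is a prefix of some document word."""
    query_words = query.lower().split()
    doc_words = document_text.lower().split()
    if not query_words:
        return True
    *complete_words, last_word = query_words
    all_complete_found = all(word in doc_words for word in complete_words)
    prefix_found = any(word.startswith(last_word) for word in doc_words)
    return all_complete_found and prefix_found

print(matches_as_you_type("machine del", "DeLonghi coffee machine"))
print(matches_as_you_type("machine del", "Dell laptop"))
```

Here machine del matches the DeLonghi coffee machine because del is a prefix of delonghi, while the Dell laptop is rejected because it never mentions machine.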

[Image: Meilisearch's Where to Watch demo]

We will explore different ways to mix them. We can run the semantic search when the query is lengthy and the keyword-based search on short queries. We could also increase the recall with query rewriting by fetching the nearest document, doing keyword extraction, and using those keywords as an alternative query. Extracting the most important terms of the document lets us fetch more documents on the same subject: not necessarily better ones, but more of them. In the same vein, we would like to explore automatic synonyms.
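The first idea, routing by query length, can be sketched in a few lines. The word-count threshold below is an arbitrary assumption, not a value taken from our experiments.

```python
def choose_search_mode(query, min_words_for_semantic=4):
    """Pick a search mode from the query length.

    The threshold is a hypothetical default: short queries go to keyword
    search, longer (often question-like) queries go to semantic search.
    """
    word_count = len(query.split())
    return "semantic" if word_count >= min_words_for_semantic else "keyword"

for query in ["blue shirt", "machine del", "What's the capital of France?"]:
    print(query, "->", choose_search_mode(query))
```

A length heuristic like this is crude; it would likely need tuning per dataset, and it does nothing for short queries that still carry semantic intent.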

Nonetheless, we can mix the Ranking Rules system with semantic search. We will experiment with exposing a new semanticBoosting ranking rule that will decide if a document semantically matches the query and move it up or down, depending on the score. This ranking rule increases the precision by increasing the number of interesting documents on the first page and moving the others down.
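As a rough illustration of what such a rule could do, the sketch below splits a bucket of already-ranked documents into those that semantically match the query and those that do not, keeping the incoming order inside each group. The function name, the similarity lookup, and the 0.75 threshold are all assumptions; nothing here reflects a finalized design.

```python
def semantic_boosting(bucket, similarity, threshold=0.75):
    """Move semantically matching documents ahead of the others.

    `bucket` is a list of document ids in the order produced by the
    previous ranking rules; `similarity` maps each id to a score in [0, 1].
    The relative order inside each group is preserved.
    """
    boosted = [doc_id for doc_id in bucket if similarity[doc_id] >= threshold]
    demoted = [doc_id for doc_id in bucket if similarity[doc_id] < threshold]
    return boosted + demoted

# Made-up similarity scores for four documents.
bucket = ["a", "b", "c", "d"]
similarity = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}

print(semantic_boosting(bucket, similarity))
```

Because the split is stable, the rule only breaks ties left by earlier rules instead of overriding them, which is what lets it coexist with the rest of the ranking rules.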

Another interesting way to use semantic understanding is to refine filters. By understanding negation, we could move unrelated documents down by relying on the semanticBoosting ranking rule, by adding new filters, or at least be able to propose them to the user via a nice UI. Unfortunately, we had bad results with the OpenAI ada-002 text embedding model when negatively searching an e-commerce dataset, e.g., non-ASUS computer.

Technical Terms

In the following examples, I want you to imagine, very deeply, that you have an e-commerce dataset with 100 DeLonghi machines and 200 Nespresso coffee machines.

The recall is the number of interesting documents the engine found in the whole dataset based on the query. If you search for coffee with the keyword search API of Meilisearch, you will likely find the 200 Nespresso coffee machines and not a single DeLonghi machine. The reason is that the keyword coffee appears only in the Nespresso documents. However, if you use the semantic search API of Meilisearch, you will probably get all of the 300 coffee machines and probably even more stuff. In this example, semantic search greatly increases the recall.

The precision is the number of interesting documents the engine can move up on the first pages. If you search for delonghi this time with the keyword search API of Meilisearch, you'll find exactly what you want: the 100 DeLonghi machines and nothing more. However, if you use the semantic search API of Meilisearch, you will get the 300 coffee machines in your dataset. There is a high chance that the engine will mix the Nespresso and the DeLonghi ones. The distance at which it finds those documents will be quite useful for ordering them, but the result will hardly be as good as a keyword-based search here.
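The numbers from this example can be worked through in code. The sketch uses the standard set-based ratios for recall and precision, which is a slightly stricter reading than the informal definitions above, and it simplifies the semantic case by assuming the engine returns exactly the 300 coffee machines. The document ids are made up.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# The imaginary dataset: 100 DeLonghi machines and 200 Nespresso coffee machines.
delonghi = {f"delonghi-{i}" for i in range(100)}
nespresso = {f"nespresso-{i}" for i in range(200)}
all_coffee_machines = delonghi | nespresso

# Query "coffee": every coffee machine is relevant.
keyword_coffee_recall = recall(nespresso, all_coffee_machines)
semantic_coffee_recall = recall(all_coffee_machines, all_coffee_machines)

# Query "delonghi": only the DeLonghi machines are relevant.
keyword_delonghi_precision = precision(delonghi, delonghi)
semantic_delonghi_precision = precision(all_coffee_machines, delonghi)

print(keyword_coffee_recall, semantic_coffee_recall)
print(keyword_delonghi_precision, semantic_delonghi_precision)
```

Under these assumptions, keyword search reaches a recall of 200/300 on coffee while semantic search reaches 300/300, and on delonghi the keyword precision is 100/100 against 100/300 for the semantic results.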

To be continued...

[could have been written with the help of an LLM]

@Phrrancis

really enjoyed reading your raw thoughts on semantic search, looking beyond the hype to express the challenges and opportunities in the universe of search!
Godspeed to your team!

@happysalada
Contributor

Hey interesting article!

This hybrid search has been around for a bit; I'm using Meilisearch in combination with Qdrant for that purpose.

The main reason I'm using it is that searching through noisy data with semantic search is really hard. Let's take chat data for example. You ask "can you tell me about company x". The semantic search will return all the chats where people ask that question. You however want to have the answer. I haven't found the answer for search in chat data personally.

The other big factor is embeddings. Depending on which embeddings you use, you get different results. This might not be too much of a problem for english, but for other languages where embedding training data is still lacking, this makes quite a difference.

Looking forward to this though, I couldn't agree more that this is the future!

@Kerollmops
Member Author

Kerollmops commented Jun 24, 2023

Hey @happysalada!

Thank you for your comment. Very interesting to see your point.

You ask "can you tell me about company x". The semantic search will return all the chats where people ask that question. You however want to have the answer.

What you probably need here is more comprehension than retrieval. You want to set up something like LangChain with the new Meilisearch Vector Store experimental API. Once you connect it to ChatGPT, for example, it will be able to understand the content of your dataset and answer questions.

However, your store will try to return the most interesting document for the query and will therefore return the message asking the question. You should probably aggregate the whole conversation into one single document (and compute multiple _vectors, one per message, to associate them with this big document). We support that in this PR ☝️
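Here is a minimal sketch of that aggregation: one document per conversation, carrying one vector per message. The `fake_embed` function is a hypothetical stand-in for a real embedding model, and the exact shape of the `_vectors` field (a list of vectors) is an assumption to check against the experimental feature's documentation.

```python
def fake_embed(text, dimensions=4):
    """Hypothetical stand-in for a real embedding model.

    It only produces a deterministic vector of the right length;
    the values carry no semantic meaning.
    """
    vector = [0.0] * dimensions
    for index, char in enumerate(text):
        vector[index % dimensions] += ord(char) / 1000.0
    return vector

def conversation_to_document(conversation_id, messages, embed=fake_embed):
    """Aggregate a whole conversation into one document with one vector per message."""
    return {
        "id": conversation_id,
        "text": "\n".join(messages),
        # Assumed shape: a list of vectors, one per message.
        "_vectors": [embed(message) for message in messages],
    }

document = conversation_to_document(
    "chat-1",
    [
        "Can you tell me about company X?",
        "Company X builds search engines.",
    ],
)
print(len(document["_vectors"]), "vectors for document", document["id"])
```

With this layout, a query that lands close to the question's vector still returns the whole conversation, so the answer comes back along with it.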

The other big factor is embeddings. Depending on which embeddings you use, you get different results. This might not be too much of a problem for english, but for other languages where embedding training data is still lacking, this makes quite a difference.

Fortunately, Meilisearch accepts any kind of vector. You just have to send them along with the documents. But if you don't want to compute them on your side, we plan to call the APIs for you. There should be embedding computation for languages other than English somewhere?!

Let's talk more technically in the product discussion if you want. Meilisearch is moving forward toward this area and we hope to drastically improve the quality of the results 🚀

@KShivendu

KShivendu commented Jun 25, 2023

languages where embedding training data is still lacking, this makes quite a difference.

Try using Cohere multilingual embeddings. https://txt.cohere.com/multilingual/

You can also use https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings if you want a dataset to benchmark things :)

You ask "can you tell me about company x"

Try using langchain as @Kerollmops suggested or maybe try OpenAI functions for the same if you know the type of questions that will be asked (self plug: I built this a while back with a friend https://github.com/NirantK/agentai)

@curquiza
Member

Hello!
v1.6.0 has just been released today. It includes hybrid search and simplifies the process of generating embeddings for semantic search 🎉

More information in the changelogs and in our documentation
