What are text embeddings?

Text embeddings are a key reason why applications built on LLMs like ChatGPT (GPT-4) can contextualize information quickly and effectively. They are the standard way to incorporate external data and play a pivotal role in enhancing a model’s performance and capabilities.

Running open-source LLMs gives you full control over your data and costs, and the same goes for using open-source text embedding engines.

Some even claim open source can be better and cheaper (see the reference article below).


Why Consider Open Source Embeddings?

Open-source text embeddings offer several advantages over commercial solutions:

  1. Cost-Effectiveness: Open-source models are generally free to use, reducing operational costs.
  2. Customization: They can be fine-tuned to meet specific project requirements.
  3. Community Support: A large community of developers often supports open-source models, providing a wealth of resources and updates.
  4. Transparency: Open-source models offer more transparency, allowing you to understand and modify the model’s internals.

Importance of Text Embedding in LLMs

  • Text embedding models are crucial for LLMs in various applications like chatbots.
  • They convert text into vector representations that capture the text’s meaning, useful for tasks like classification, clustering, and information retrieval.
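To make the second point concrete, here is a minimal sketch of similarity-based retrieval using plain NumPy. The 4-dimensional vectors are made up for illustration; real embedding models emit hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 = similar meaning, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document embeddings (hypothetical values, for illustration only).
docs = {
    "refund policy": np.array([0.9, 0.1, 0.0, 0.2]),
    "shipping times": np.array([0.1, 0.8, 0.3, 0.0]),
}
# Hypothetical embedding of a query like "how do I get my money back?"
query = np.array([0.8, 0.2, 0.1, 0.1])

# Rank documents by similarity to the query; the top hit is the retrieval result.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])
```

The same ranking step powers classification (nearest labeled example), clustering (grouping nearby vectors), and retrieval for chatbots.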

Evaluating Text Embedding Models

  • Different embedding models yield different results, affecting the performance of AI applications.
  • Generic benchmark suites like BEIR and MTEB provide a common ground for evaluating embedding models.
  • Hugging Face hosts an MTEB leaderboard.
  • OpenAI’s text-embedding-ada-002 ranks 7th overall on the MTEB benchmark; it performs best in clustering but is unimpressive in other tasks.

Hugging Face’s Text Embeddings Inference (TEI)
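TEI is Hugging Face’s open-source toolkit for serving embedding models behind a fast HTTP API, which makes self-hosting a model like E5 straightforward. A minimal deployment sketch (the image tag and model id are assumptions; check the TEI documentation for current versions and hardware-specific images):

```shell
# Run TEI in Docker, serving an E5 checkpoint (model id is an assumption;
# swap in whichever embedding model you want to host).
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id intfloat/e5-small-v2

# Embed a sentence via the HTTP API; the response is a JSON array of float vectors.
curl 127.0.0.1:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What are text embeddings?"}'
```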

Introducing E5 Model by Microsoft

  • E5 (EmbEddings from bidirEctional Encoder rEpresentations; the five E’s give the model its name) is a text embedding model by Microsoft.
  • It surpasses the BM25 baseline on the BEIR retrieval benchmark in a zero-shot setting.
  • E5 is trained on a large corpus of text and code, capturing more nuanced semantic relationships.
  • It’s a small model, easy to host even on local machines, and performs faster than OpenAI’s model in certain benchmarks.
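One practical detail when hosting E5 yourself: the model family is trained with `query:` and `passage:` input prefixes, and skipping them degrades retrieval quality. A sketch using the `sentence-transformers` library (the `intfloat/e5-small-v2` checkpoint is an assumption; pick whichever E5 size fits your hardware):

```python
def with_prefix(texts: list[str], kind: str) -> list[str]:
    """Prepend the 'query: ' or 'passage: ' prefix that E5 models expect."""
    assert kind in ("query", "passage")
    return [f"{kind}: {t}" for t in texts]

if __name__ == "__main__":
    # Heavy part: downloads the model weights on first run.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-small-v2")
    passages = with_prefix(["E5 is a text embedding model by Microsoft."], "passage")
    queries = with_prefix(["what is E5?"], "query")
    p_emb = model.encode(passages, normalize_embeddings=True)
    q_emb = model.encode(queries, normalize_embeddings=True)
    # With normalized vectors, the dot product is the cosine similarity.
    print(q_emb @ p_emb.T)
```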

Cost and Customization

  • OpenAI’s model is not fine-tunable, limiting customization.
  • E5 offers better control and can be fine-tuned to specific project needs.
  • Hosting E5 is cheaper than using OpenAI’s API. For example, processing 100 million tokens would cost $2.47 for E5 compared to $10 for OpenAI.
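The figures above work out to a roughly 4x price gap. A quick back-of-the-envelope check (both prices are taken from the quote above and will drift over time):

```python
TOKENS = 100_000_000      # 100 million tokens, as in the example above
COST_E5 = 2.47            # self-hosted E5, compute cost quoted above (USD)
COST_OPENAI = 10.00       # OpenAI embeddings API at the quoted rate (USD)

per_million_e5 = COST_E5 / (TOKENS / 1_000_000)          # cost per 1M tokens
per_million_openai = COST_OPENAI / (TOKENS / 1_000_000)  # cost per 1M tokens
ratio = COST_OPENAI / COST_E5

print(f"${per_million_e5:.4f}/M vs ${per_million_openai:.2f}/M "
      f"-> roughly {ratio:.1f}x cheaper")
```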

Conclusions

  • Choosing the right embedding model is crucial for the success of LLM applications.
  • Open-source solutions, such as an E5 model served with TEI, can outperform commercial models in both speed and cost.
  • Benchmark systems are useful but should be used as a guide, not an absolute measure.
  • E5’s versatility and performance make it a strong contender for various NLP tasks, offering both cost-effectiveness and the ability for fine-tuning.

Reference article: