A mapping from a high-dimensional space (such as the space of one-hot encoded vectors) to a lower-dimensional space (such as dense vectors with a fixed number of dimensions). Embeddings are commonly used to represent categorical or discrete variables (such as words, users, or products) as continuous vectors that can be used as input to a neural network.
OpenAI’s text-embedding-ada-002 is commonly used as the default embedding model (see OpenAI’s pricing). At $0.0004 per 1K tokens, it looks very cheap at first glance. In practice, however, it can quickly become expensive.
An example to illustrate: suppose you want to build a chatbot that answers questions over your company’s documents, and you have 10,000,000 files (not unusual for legal documents or patient records) with an average length of 20,000 tokens. In this scenario, you would spend (10,000,000 × 20,000 × $0.0004) / 1,000 = $80,000 on embeddings alone.
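To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The figures are the assumed ones from the example above, not measured data, and the price is the ada-002 rate quoted earlier.

```python
# Back-of-the-envelope embedding cost estimate (assumed figures from the example above).
NUM_FILES = 10_000_000          # number of documents to embed
AVG_TOKENS_PER_FILE = 20_000    # average document length in tokens
PRICE_PER_1K_TOKENS = 0.0004    # text-embedding-ada-002 price in USD per 1K tokens

total_tokens = NUM_FILES * AVG_TOKENS_PER_FILE
cost_usd = total_tokens / 1_000 * PRICE_PER_1K_TOKENS

print(f"Total tokens to embed: {total_tokens:,}")
print(f"Estimated embedding cost: ${cost_usd:,.0f}")  # -> $80,000
```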
While OpenAI’s embedding model is widely known, it’s essential to recognize that there are alternatives. Hugging Face, a renowned platform in the NLP community, hosts the Massive Text Embedding Benchmark (MTEB) Leaderboard, a valuable resource for comparing the performance of text embedding models across a diverse set of embedding tasks.
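As an illustration, here is a minimal sketch of embedding text locally with an open-source model via the sentence-transformers library. The model name all-MiniLM-L6-v2 is just one example of the many models ranked on the MTEB Leaderboard; which model fits best depends on your task and languages.

```python
# Minimal sketch: embedding text with a self-hosted open-source model
# instead of a paid per-token API. "all-MiniLM-L6-v2" is one example model
# from the MTEB Leaderboard, not a specific recommendation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads the model on first use

sentences = [
    "Embeddings map discrete items to dense vectors.",
    "Open-source models can avoid per-token API costs.",
]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384) for this particular model
```

Running a model like this on your own hardware trades API fees for compute and maintenance costs, so the cheaper option depends on your document volume and infrastructure.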