This blog walks through the practical process of creating embeddings and storing them efficiently in the Hopsworks Feature Store, and explains why embeddings matter and where they are applied in data-driven decision-making.
In the rapidly expanding landscape of data-driven decision-making, embeddings have emerged as a formidable tool. By representing data as vectors in a high-dimensional space, they capture semantic relationships that can be exploited in many machine learning applications.
We begin this blog by discussing the implications of embeddings in machine learning, their diverse applications, and their role in reshaping the way we interact with data. Once the embeddings are created, the next step is to store them in a vector database for efficient retrieval, so we discuss what a vector database is and how it compares with a traditional one. Lastly, we walk through the process of using LangChain for data loading and chunking, using OpenAI models to create embeddings, and storing them in the Hopsworks Feature Store.
The machine learning models we use in everyday life operate on numerical data; they cannot work with raw text directly. To perform an NLP task, we therefore first convert the text into a vector of numbers that can be given to the model as input. These numerical vectors capture the semantic and contextual information in the text, enabling models to learn meaningful relationships and patterns. Embeddings are not limited to text: they can also encode other data such as images, audio, categorical variables and numerical values.
Before the advent of the latest embedding models, methods such as One-Hot encoding, Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Semantic Analysis (LSA) were used to convert text into numerical vectors. Researchers also built custom embeddings for domain-specific data, but these turned out to be too complex to create. These earlier methods were limited in how well they could capture the nuanced semantics and contextual information present in natural language.
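To make the limitation of these earlier methods concrete, here is a minimal, dependency-free sketch of Bag of Words and one common TF-IDF variant. The function names and the toy sentences are made up for illustration; real libraries use more careful tokenization and smoothing.

```python
import math
from collections import Counter


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


def tf_idf(docs):
    """Return (vocab, vectors) using tf = count / doc length, idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {word: sum(word in doc for doc in tokenized) for word in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


docs = ["hydrogen is a gas", "helium is a gas", "lithium is a metal"]
vocab, vectors = tf_idf(docs)
```

Words that occur in every document ("is", "a") get a weight of zero, while rarer words are weighted up. Note that each word still occupies its own independent dimension, so nothing in these vectors says that "hydrogen" and "helium" are related; that is the gap learned embeddings fill.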
Then came the era of neural-network-based embeddings, where techniques such as Word2Vec, GloVe, FastText, ELMo and BERT gained prominence. These embeddings played a significant role in advancing NLP tasks by capturing semantic information and context, enabling models to understand and generate human-like text.
Embeddings are used across a wide range of machine learning tasks. In this section, we explore some of the applications where they play a pivotal role.
A vector database is a specialized type of database designed to efficiently store, manage, and query high-dimensional vector data, such as embeddings, feature vectors, and other numerical representations. Unlike traditional relational databases that primarily handle structured data, vector databases focus on unstructured or semi-structured data represented as vectors in multi-dimensional spaces.
Querying a vector database is very different from querying a traditional database. In a traditional database, a query is typically matched exactly against stored values. In a vector database, a similarity metric such as cosine similarity is applied to find the vectors that are most similar to the query. To do this efficiently, vector databases rely on Approximate Nearest Neighbour (ANN) search algorithms such as Random Projection, Product Quantization and Locality-Sensitive Hashing.
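As a point of reference, the exact (brute-force) version of similarity search can be sketched in a few lines. The two-dimensional vectors below are invented toy values, not real embeddings, and the function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Return the keys of the k stored vectors most similar to `query`."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


# Toy "embeddings" (made-up values).
store = {
    "hydrogen": [1.0, 0.0],
    "helium": [0.6, 0.8],
    "lithium": [0.0, 1.0],
}
top_two = nearest([1.0, 0.05], store, k=2)
```

This scan compares the query against every stored vector, which is fine for a handful of items but scales linearly with the size of the collection. ANN algorithms trade a little accuracy for avoiding that full scan.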
Another aspect that differentiates a vector database from a traditional database is how indexes work. In a vector database, an "index" refers to a data structure that is used to optimize the retrieval of high-dimensional vectors or embeddings.
High-dimensional vector data, such as embeddings or feature vectors, can be challenging to search through efficiently without indexing. The purpose of indexing is to reduce the search space and accelerate the process of finding the nearest neighbors of a query vector. An index structure in a vector database typically stores the vectors, or a compact representation of them, organized in a way that enables fast similarity search.
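One way to picture how an index reduces the search space is a random-hyperplane form of Locality-Sensitive Hashing. The sketch below is a simplified illustration with a hypothetical class name, not how OpenSearch or any particular vector database implements its indexes. Each vector is hashed to a bucket by recording which side of several random hyperplanes it falls on, and a query only looks at vectors in its own bucket.

```python
import random
from collections import defaultdict


class RandomProjectionIndex:
    """Toy LSH index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the example reproducible
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)
        ]
        self.buckets = defaultdict(list)

    def _signature(self, vector):
        # One boolean per hyperplane: is the vector on its positive side?
        return tuple(
            sum(p * x for p, x in zip(plane, vector)) >= 0
            for plane in self.planes
        )

    def add(self, key, vector):
        self.buckets[self._signature(vector)].append(key)

    def candidates(self, query):
        """Keys stored in the query's bucket (an approximate shortlist)."""
        return list(self.buckets.get(self._signature(query), []))


index = RandomProjectionIndex(dim=3)
index.add("a", [1.0, 0.0, 0.0])
index.add("b", [-1.0, 0.0, 0.0])
shortlist = index.candidates([2.0, 0.0, 0.0])
```

Because the hash depends only on direction, the query `[2.0, 0.0, 0.0]` lands in the same bucket as `"a"`, while the opposite-pointing `"b"` lands elsewhere. The shortlist can then be ranked exactly with a similarity metric. Near neighbors can occasionally fall into different buckets, which is why such search is approximate.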
Now we will explore the steps involved in generating embeddings and saving them as features in the Hopsworks Feature Store. This procedure covers document retrieval, chunking, embedding generation, and storage in the Feature Store. For document retrieval and chunking we rely on the LangChain library, while the OpenAI embedding model is used to generate embeddings for the resulting chunks.
LangChain:
LangChain is a framework designed to simplify the creation of applications that use large language models. Chatbots, question answering systems and summarization tools are some of the use cases it supports.
OpenAI Embeddings:
According to OpenAI, its text embedding model “text-embedding-ada-002” outperforms its older embedding models on text search, code search, and sentence similarity tasks, and achieves comparable performance on text classification.
Hopsworks:
Hopsworks includes OpenSearch as a multi-tenant service in projects. OpenSearch provides vector database capabilities through its k-NN plugin, which supports the FAISS and nmslib embedding indexes. Through Hopsworks, OpenSearch also provides enterprise capabilities, including authentication and access control to indexes (an index can be private to a Hopsworks project), filtering, scalability, high availability, and disaster recovery support.
The dataset used for this blog is the Wikipedia information on the elements Hydrogen, Helium and Lithium. These files are stored as text files in the Elements directory. You can access the code repository here, where you'll find the hands-on example to apply the concepts discussed in this blog. Dive in and start turning theory into practice!
If we want our application to answer questions or make recommendations based on custom data, or on data the LLM has not been trained on, we need to connect the external data source to the LLM. The first step is therefore to load data from external sources, which may come in different formats, into a standard format.
Document loaders load data from the source, which can be a single document or a folder containing multiple documents. LangChain’s document loaders support multiple formats such as CSV, JSON, PDF and text files. In our case we use DirectoryLoader to load the directory called “Elements”, which contains our data (text files).
Once the documents are loaded, we move on to document splitting (chunking). Chunking is required before embedding primarily because of size limitations and the need to preserve contextual information. Language models have token limits, so breaking the text into smaller chunks ensures that each piece fits within them. It is important to split the documents into semantically relevant chunks so that downstream tasks work well.
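A loading step along these lines could look like the sketch below. The function name `load_elements` and the glob pattern are assumptions for illustration, and the import path depends on the installed LangChain version (older releases expose the same classes under `langchain.document_loaders`).

```python
def load_elements(directory="Elements"):
    """Load every .txt file under `directory` as a LangChain Document."""
    # Imported lazily so this module can be imported without LangChain.
    # Assumption: a LangChain release that ships `langchain_community`.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader

    loader = DirectoryLoader(
        directory,
        glob="**/*.txt",                      # only pick up text files
        loader_cls=TextLoader,                # read each file as plain text
        loader_kwargs={"encoding": "utf-8"},
    )
    return loader.load()


# Usage (requires LangChain and an "Elements" directory of .txt files):
#   documents = load_elements("Elements")
#   print(len(documents), documents[0].metadata)
```

Each returned document carries the file's text in `page_content` and information such as the source path in `metadata`.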
The input text is split based on a defined chunk size with a defined chunk overlap. The chunk size is the maximum size of a chunk, as measured by a length function (for example, the number of characters or tokens), and the chunk overlap is the amount of text shared between consecutive chunks, which preserves continuity across chunk boundaries. The values of these parameters are determined through experiments and depend mainly on the data and the task at hand.
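The interaction between chunk size and chunk overlap is easiest to see in a plain sliding window over characters. This is a simplified sketch with a hypothetical function name; it measures length in characters and ignores sentence or paragraph boundaries.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters,
    where consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Here `chunks` is `["abcd", "cdef", "efgh", "ghij"]`: every chunk repeats the last two characters of the previous one, so text that straddles a boundary appears intact in at least one chunk. A larger overlap gives more continuity at the cost of more chunks to embed and store.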
The recommended TextSplitter is the RecursiveCharacterTextSplitter. This will split documents recursively by different characters - starting with "\n\n", then "\n", then " ". This is nice because it will try to keep all the semantically relevant content in the same place for as long as possible.
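The idea of falling back through a list of separators can be sketched as follows. This is a simplified illustration of the principle and not LangChain's implementation: it has no chunk overlap, measures length in characters, and the function name is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer
    separators only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what fits so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


parts = recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7)
```

In this example the first paragraph fits in one chunk and is kept whole, while the second paragraph is too long and is split on spaces, giving `["aaa bbb", "ccc ddd", "eee"]`. Paragraphs and lines stay together whenever they fit, which is the behavior the text above describes.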
So far we have loaded the documents and converted them into meaningful chunks. Now we have to create embeddings from those chunks so that our machine learning models can make sense of them. Embedding models are available from several providers, including OpenAI, Cohere and HuggingFace. For our use case, we have chosen OpenAI’s text-embedding-ada-002 model, which performs well on general-purpose text embedding tasks.
Once we have the embeddings, we need to store them in a vector database for fast retrieval and efficient querying. We use Hopsworks as the feature store for the embeddings. It also supports finding the k-nearest neighbors of a query point through the OpenSearch k-NN plugin.
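A minimal sketch of the embedding step is shown below, assuming the `openai` Python package (version 1.x) and an API key in the `OPENAI_API_KEY` environment variable. The helper name `embed_chunks` is hypothetical; the client is passed in as a parameter so the function can be exercised without network access.

```python
def embed_chunks(chunks, client, model="text-embedding-ada-002"):
    """Return one embedding vector (a list of floats) per chunk, in order."""
    if not chunks:
        return []
    response = client.embeddings.create(model=model, input=list(chunks))
    # Sort by index so the vectors line up with the input chunks.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (makes a paid API call):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   vectors = embed_chunks(["Hydrogen is the lightest element."], client)
#   print(len(vectors[0]))  # dimensionality of the embedding
```

For a large number of chunks it is usually worth sending them in batches rather than in a single request; the batch size is something to tune against the API's request limits.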
We store the embeddings along with the chunk in the newly created feature group.
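The storage step could be sketched as follows, assuming the `hopsworks` and `pandas` packages and a reachable Hopsworks project. The feature group name, the column names, and the helper functions are hypothetical choices for illustration, and the sketch assumes that a column of Python lists is stored as an array feature.

```python
def build_rows(chunks, vectors, source="Elements"):
    """Pair each chunk with its embedding, one dict per feature group row."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one embedding per chunk")
    return [
        {"chunk_id": i, "source": source, "text": chunk, "embedding": list(vec)}
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]


def store_in_feature_group(rows, name="element_embeddings", version=1):
    """Insert the rows into a Hopsworks feature group (created if missing)."""
    # Imported lazily so build_rows can be used without these packages.
    import hopsworks
    import pandas as pd

    project = hopsworks.login()          # prompts for or reads an API key
    fs = project.get_feature_store()
    feature_group = fs.get_or_create_feature_group(
        name=name,                       # hypothetical feature group name
        version=version,
        primary_key=["chunk_id"],
        description="Text chunks and their embeddings",
    )
    feature_group.insert(pd.DataFrame(rows))
    return feature_group


rows = build_rows(["Hydrogen ...", "Helium ..."], [[0.1, 0.2], [0.3, 0.4]])
```

Keeping the chunk text next to its embedding means that a nearest-neighbor lookup can return the original passage directly, without a second join back to the source documents.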
In the next phase of your journey, you can dive deeper into the world of recommendation systems. With Hopsworks, you can build Personalized Search Retrieval and Ranking systems. This repository contains notebooks for exploring the creation and retrieval of Embedding Features for a recommendation system use case.
In this blog we have discussed the essential steps for creating embeddings and storing them as features in the feature store, using tools like LangChain and OpenAI. We also discussed the significance of embeddings in machine learning, their applications, and the need for vector databases to store them. Looking at how vector databases differ from traditional ones makes it clearer why they are needed. Embeddings have proven highly effective at capturing semantic content, and their integration with feature stores opens new avenues for data-driven decision-making.