The two-tower (or twin-tower) embedding model is a model training method for connecting embeddings in two different modalities by placing both modalities in the same vector space. For example, a two-tower model could generate embeddings of both images and text in the same vector space. Personalized recommendation systems often use items and user-histories as the two different modalities. The modalities need to be “grounded”. For example, image and text can be “grounded” by creating training data where a caption matches an image. Two-tower models are able to map embeddings from different modalities into the same space by ensuring both modalities have the same dimension “d”. For example, if the item embedding is of length 100, then the query embedding dimension should be 100.
The two-tower model for personalized recommendations combines two items and “user history and context”. That is, given user history and context, can we generate hundreds of candidate items from a corpus of millions or billions of items? Given, those hundreds of candidates, can personalize the ranking of the candidates to the user’s history and context (e.g., items that are trending)?
We can create training data for our personalized recommender system that combines the two modalities by presenting items in response to the user query. The user may click on some item, purchase another item, place another item in a shopping cart, and not click on another item. We collect training samples as the combination of the item, a score for the user's action (0=not clicked, 5=purchased, 1=clicked), and the user’s query, user history, and context.
Most two-tower architectures are used for personalized search/recommendations, where you have queries and items as the two modalities. In this case, the two-tower embedding model architecture is a deep learning model architecture that consists of a query tower and an item tower. The query tower encodes search query and user profile to query embeddings, and the item tower encodes the item, store, and location features to item embeddings.
The probability of a user query resulting in an item being clicked or placed in a shopping cart is computed using a distance measure (such as the dot product, cosine similarity, Euclidean distance, or Hadamard product) between the embeddings from two towers. The query and item tower models are trained jointly on the history of user queries and item interactions.
You can use the Hopsworks platform to manage the collection and usage of feature data when building two-tower models. Hopsworks includes a feature store, model registry, and vector database, providing both the online services needed to collect and manage training data, and the online infrastructure for candidate retrieval (VectorDB) and personalized ranking (feature store).
Bytedance used test/images with ALBERT and Vision transformer twin-tower model architecture. Another example is the TextGNN Architecture that uses the two tower structure for decoupled generation of query/keyword embeddings
A two-tower embedding model connects vectors in only two different modalities. Research is ongoing in generalizing to 3 modalities or more. Multi-modal models trained as N-tower embedding models could potentially connect vectors from N different modalities.