
Embedding

What is an embedding in ML?

An embedding is a compressed representation of data, such as text or an image, as a continuous vector in a lower-dimensional space. A good embedding captures the semantic relationships and similarities between the original, uncompressed objects while reducing the dimensionality of the data.

The figure above shows a word embedding with a dimensionality of 3; that is, each word is encoded as a vector of length 3 by a word embedding model. Notice that “ML” and “Learning” are close together in embedding space, while “Sports” is far away; “Feature” is close to both “Learning” and “ML”. In this example, if you searched for the word closest to “ML”, you would find “Learning”.
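As a rough sketch of how this nearest-neighbor lookup works, the snippet below ranks hand-picked 3-dimensional vectors (illustrative numbers, not the output of a trained model) by cosine similarity to the vector for “ML”:

```python
import numpy as np

# Hypothetical 3-dimensional word embeddings; the values are made up for illustration.
embeddings = {
    "ML":       np.array([0.90, 0.10, 0.05]),
    "Learning": np.array([0.85, 0.15, 0.10]),
    "Feature":  np.array([0.75, 0.25, 0.05]),
    "Sports":   np.array([0.05, 0.90, 0.80]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank all other words by similarity to "ML".
query = embeddings["ML"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in embeddings.items() if word != "ML"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # "Learning" -- the nearest neighbor of "ML"
```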

What type of data can be compressed into an embedding?

The most common embedding models are for text, image, and audio data, but embeddings can also be built for many types of structured data, such as DNA sequences, your purchase history at an e-commerce store, or any historical sequence of events with identifiable patterns.

Similarity search for embeddings in vector databases

A vector database can be used to store embeddings and to search for the stored embeddings that are most similar to a query embedding (its nearest neighbors in embedding space).
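As a minimal sketch of the idea, the snippet below uses the FAISS library as a stand-in for the index at the core of a vector database; a production vector database adds persistence, metadata filtering, and approximate indexes on top of an index like this. The vectors are the same illustrative ones as above.

```python
import numpy as np
import faiss  # assumes faiss-cpu is installed

d = 3  # embedding dimensionality
vectors = np.array([
    [0.90, 0.10, 0.05],  # "ML"
    [0.85, 0.15, 0.10],  # "Learning"
    [0.75, 0.25, 0.05],  # "Feature"
    [0.05, 0.90, 0.80],  # "Sports"
], dtype=np.float32)     # FAISS expects float32

index = faiss.IndexFlatL2(d)  # exact L2 search; ANN indexes trade accuracy for speed
index.add(vectors)            # store the embeddings in the index

query = vectors[0:1]                     # search with the "ML" vector
distances, ids = index.search(query, 2)  # top-2 nearest neighbors
print(ids[0])  # [0 1]: the query itself, then "Learning"
```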

How are embeddings created?

Embeddings are created by training a neural network to generate continuous vector representations of complex objects, such as words or images. For example, Word2Vec is an unsupervised learning method that uses a shallow neural network to learn word embeddings.
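For illustration, here is a minimal Word2Vec training sketch using the gensim library; the corpus and hyperparameters are toy values, and useful embeddings require training on far larger corpora.

```python
from gensim.models import Word2Vec  # assumes gensim >= 4.0 is installed

# A toy corpus of tokenized sentences; real training uses millions of sentences.
sentences = [
    ["machine", "learning", "models", "learn", "features"],
    ["feature", "engineering", "improves", "machine", "learning"],
    ["sports", "teams", "play", "games"],
]

# Train a skip-gram (sg=1) Word2Vec model that produces 3-dimensional embeddings.
model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["learning"])              # the learned 3-d vector for "learning"
print(model.wv.most_similar("machine"))  # nearest neighbors in embedding space
```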

How are embeddings related to latent space?

Embeddings are similar to latent space in that both are compressed representations of higher-dimensional data that preserve the relationships in the original data. Embeddings are essentially points, or coordinates, in a latent space.
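One way to see this is an autoencoder sketch (in PyTorch, with illustrative dimensions): the encoder maps each input to a point in a low-dimensional latent space, and that point is the input’s embedding.

```python
import torch
import torch.nn as nn

# The encoder maps 784-d inputs (e.g. flattened 28x28 images) into a 3-d
# latent space; each 3-d code is an embedding, i.e. a point in that space.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 3))
decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(16, 784)  # a batch of 16 fake inputs
z = encoder(x)           # embeddings: 16 points in the 3-d latent space
x_hat = decoder(z)       # reconstruction from the latent points

loss = nn.functional.mse_loss(x_hat, x)  # training would minimize this loss
print(z.shape)  # torch.Size([16, 3])
```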
