The big dictionary of MLOps 

Comprehensive Terminology Guide for Building and Managing ML Solutions.

This dictionary/glossary covers terms from MLOps, data engineering, and feature stores, but does not cover terms from the broader ML (Machine Learning) algorithms and frameworks space. MLOps is the roadmap you follow to go from training models in notebooks to building production ML systems. MLOps is a set of principles and practices that encompass the entire ML System lifecycle, from ideation to data management, feature creation, model training, inference, observability, and operations. MLOps is based on three principles: observability, automated testing, and versioning of ML artifacts.

Observability for ML systems refers to the ability to gain insights into the behavior and performance of production machine learning models. Automated testing will enable you to build ML systems with confidence that tests will catch any potential bugs in your data or code. Versioning will enable you to safely operate ML systems by supporting upgrades and rollback without affecting system operations. MLOps should help tighten your ML development iteration loop by enabling you to roll out fixes and improvements to ML systems faster. Finally, the feature store is often called the data layer for MLOps. It acts as a data platform that enables ML pipelines to be decomposed into smaller, more manageable pipelines for feature engineering, model training, and model inference.

A

AutoML

AutoML stands for Automated Machine Learning and it describes the process of automating various tasks in model training pipelines.

B

Backfilling

Backfilling is the process of recomputing datasets from raw, historical data.

Backfilling Training Data

Backfilling training data from a feature store means creating a point-in-time consistent snapshot of feature data that will be used to train one or more models.

Backpressure

The backpressure pattern consists of a feedback mechanism that allows consumers to inform upstream components when they are ready to handle new messages.

Batch Inference Pipeline

A batch inference pipeline is a program that takes as input a batch of data and a model, and outputs predictions that are typically written to some sink.
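
As a minimal sketch of the idea (the file paths, the model artifact model.pkl, and the column names f1, f2, f3, and id are hypothetical), a batch inference pipeline loads a model, reads a batch of features, predicts, and writes the predictions to a sink:

```python
import joblib
import pandas as pd

# Hypothetical artifacts: in a real system the model would be downloaded
# from a model registry and the feature batch read from the offline API
# of a feature store.
model = joblib.load("model.pkl")                    # trained scikit-learn model
batch = pd.read_parquet("features/latest.parquet")  # batch of input feature vectors

# Make predictions and write them to a sink (here, a Parquet file).
batch["prediction"] = model.predict(batch[["f1", "f2", "f3"]])
batch[["id", "prediction"]].to_parquet("predictions/latest.parquet")
```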

C

Continuous Integration (CI)

Continuous Integration (CI) is the practice of continuously merging code changes from multiple developers into a shared repository.

D

DAG Processing Model

A DAG (directed acyclic graph) processing model is a method of representing the dependencies between tasks in a workflow or pipeline.
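
As an illustration of the idea (the task names are hypothetical), the sketch below represents a small pipeline as a mapping from each task to its dependencies and executes the tasks in a dependency-respecting order, which is essentially what a workflow orchestrator does:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each key depends on the tasks in its value set.
dag = {
    "validate": {"extract"},
    "transform": {"validate"},
    "write_features": {"transform"},
}

def run(task: str) -> None:
    print(f"running task: {task}")

# Execute tasks in an order that respects the dependencies.
for task in TopologicalSorter(dag).static_order():
    run(task)
# -> extract, validate, transform, write_features
```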

Data Compatibility

Data compatibility often refers to feature consistency, where the schema of features used in feature pipelines, training pipelines, and inference pipelines is compatible.

Data Contract

A data contract provides schema-level guarantees for a feature group or feature view and includes metadata, such as how/where a feature may be used.

Data Lakehouse

A Data Lakehouse is a modern data architecture that combines the benefits of both data lakes and data warehouses.

Data Leakage

Data leakage occurs when data that should be outside of the training dataset is explicitly or implicitly used to train a model. It can result in incorrect estimation of a trained model’s performance.

Data Modeling

Data modeling describes how tables in a data warehouse are structured to create a simplified and easy-to-understand layout that enables efficient, ad-hoc querying and analysis of large datasets.

Data Partitioning

When you create a feature group, you can select one or more features (columns) as the partition key, storing data with the same partition key values in the same directory.
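
For example, a minimal sketch using pandas with pyarrow (the DataFrame and column names are hypothetical): partitioning on event_date writes rows with the same date into the same directory, so queries that filter on that column can skip the other directories:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "event_date": ["2023-05-01", "2023-05-01", "2023-05-02"],
    "amount": [12.5, 3.0, 99.9],
})

# Requires pyarrow; writes e.g. features/event_date=2023-05-01/<part>.parquet
df.to_parquet("features/", partition_cols=["event_date"])
```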

Data Pipelines

Data pipelines are orchestrated programs that move data from one system to another while also performing transformations on the data.

Data Quality

High data quality for ML refers to data that can be used to train high performance models.

Data Transformation

A data transformation is a function that is applied to some input data that changes the data in such a way that the data is easier to consume by downstream applications or users.

Data Type (for features)

A feature value is a data value. In programming languages, a feature is represented as a primitive data type, such as an int, string, array, or boolean.

Data Validation (for features)

Data validation for features is the process of checking that feature data meets expectations (for example, on data types, value ranges, and completeness) before it is used. ML model training or inference can crash if there are problems with input data, and incorrect or out-of-distribution data can introduce skew into the training or inference data.
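
A minimal sketch of such validation using plain pandas checks (the column names and rules are hypothetical; in practice a data validation library would express these as declarative expectations):

```python
import pandas as pd

def validate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Reject feature data that would break training or inference.

    The columns and rules below are hypothetical examples.
    """
    errors = []
    if df["age"].isnull().any():
        errors.append("age contains nulls")
    if not df["age"].between(0, 120).all():
        errors.append("age outside expected range [0, 120]")
    if not df["country"].isin(["SE", "US", "DE"]).all():
        errors.append("unexpected country codes")
    if errors:
        raise ValueError("feature validation failed: " + "; ".join(errors))
    return df
```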

Data-Centric ML

Data-centric ML describes a set of practices for iteratively improving both the quality and the set of feature data available to models.

Dimensional Modeling and Feature Stores

In data warehousing, dimensional modeling is a data modeling technique that identifies entities and then decomposes your data into “facts” and “dimensions” related to those entities.

Downstream

Downstream indicates that the user/client/application is a consumer of the dataset or feature group.

E

ELT

ELT stands for Extract, Load, and Transform of data.

ETL

ETL stands for Extract, Transform, and Load of data.

Embedding

An embedding is a compressed representation of data such as text or images as continuous vectors in a lower-dimensional space.

Encoding (for Features)

Feature values can be encoded for data compatibility or to improve model performance.

Entity

In a feature store, an entity is represented as rows in a feature group, where each row corresponds to a single instance of the object or concept.

F

Feature

A feature is a measurable property of a data sample that is used as input to an ML model for training and serving.

Feature Engineering

Feature engineering is the process of selecting, creating, and transforming raw data into features that can be used as input to machine learning algorithms.

Feature Freshness

Feature freshness refers to the time lag between when the data required to compute a feature becomes available and when the feature is available for use in an inference pipeline.

Feature Function

A feature function is a function that computes one or more feature values from input data.
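
As a small, hypothetical example (the column names clicks and impressions are made up), a feature function that computes a click-through-rate feature from raw counts:

```python
import pandas as pd

def click_through_rate(df: pd.DataFrame) -> pd.Series:
    """Hypothetical feature function: click-through rate from raw counts."""
    # Division by zero yields inf/NaN in pandas; mask those rows to 0.0.
    ctr = df["clicks"] / df["impressions"]
    return ctr.where(df["impressions"] > 0, 0.0).rename("click_through_rate")
```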

Feature Groups

A feature group is a logical table of features that provides a single API for updating feature values and two different APIs, an online API and an offline API, for reading feature values.

Feature Logic

Feature logic is the series of steps that transform input data into the unencoded data value that represents the feature in the feature store.

Feature Monitoring

Feature monitoring involves continuously monitoring the performance of the features used as model inputs in inference pipelines to identify potential problems in input feature values.

Feature Pipeline

A feature pipeline is a program that orchestrates the execution of a dataflow graph of feature functions where the computed features are written to one or more feature groups.

Feature Platform

A feature platform is a feature store that also provides support for a domain-specific language (DSL) to define feature logic and feature pipelines.

Feature Reuse

Features are computed in a feature pipeline and stored in the feature store. Features are reused if the same feature is used in more than one model.

Feature Selection

Feature selection is the process of finding existing features, in potentially different feature groups, and joining them together along with the label(s) to define a set of features.

Feature Service

A feature service is a feature view that is implemented as a network service that provides both an online and offline API for retrieving feature vectors and batches of feature values, respectively.

Feature Store

A feature store is a data platform that provides APIs for ingesting features, an Offline API for reading historical feature data, and an Online API for reading the latest feature data at low latency.

Feature Type

A feature type defines the set of valid encodings (model-dependent transformations) that can be performed on a feature value.

Feature Value

A feature value is a measurement (or value) of a feature at a given point in time.

Feature Vector

A feature vector is a row of feature values. A training sample for a model includes a feature vector and the label(s).

Feature View

A feature view is a selection of features (and labels) from one or more feature groups.

Filtering

Data filtering is an operation on a dataset (such as a DataFrame) that defines which data to extract or remove from the dataset.

G

Generative AI

Generative AI generally refers to models and techniques that generate new data samples by learning the underlying data distribution.

H

Hyperparameter

In training pipelines, a hyperparameter is a parameter that influences the performance of model training but the hyperparameter itself is not updated during model training.

Hyperparameter Tuning

Hyperparameter tuning involves training multiple models each with different hyperparameter values to find good values for hyperparameters that optimize model performance.
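
A minimal sketch using scikit-learn's grid search (the search space is illustrative): each hyperparameter combination trains and cross-validates a separate model, and the best-performing combination is kept:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical search space; each combination trains a separate model.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```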

I

Idempotent ML Pipelines

An idempotent operation produces the same result no matter how many times you execute it. An idempotent ML pipeline can therefore be safely re-run, for example after a failure or when backfilling, without producing duplicate or inconsistent output.

Inference Data

Inference data is the input feature values that are the input to a trained model that outputs a prediction.

Inference Logs

Inference logs are the input and output of inference pipelines.

Inference Pipeline

An inference pipeline is a program that takes input data, optionally transforms that data, then makes predictions on that input data using a model.

L

LLMs - Large Language Models

LLMs stands for Large Language Models.

Lagged features

Lagged features are a feature engineering technique used to capture the temporal dependencies and patterns in time series data.
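
As a small sketch with hypothetical data, lagged features can be computed per entity with a group-wise shift, so that each row carries the previous days' values as model inputs:

```python
import pandas as pd

sales = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2, 2],
    "day": pd.date_range("2023-01-01", periods=3).tolist() * 2,
    "units_sold": [10, 12, 9, 5, 7, 6],
}).sort_values(["store_id", "day"])

# Lagged features: sales from one and two days earlier, per store.
sales["units_sold_lag_1"] = sales.groupby("store_id")["units_sold"].shift(1)
sales["units_sold_lag_2"] = sales.groupby("store_id")["units_sold"].shift(2)
```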

Latent Space

Latent space is the representation of compressed data, where compressed data is data encoded using fewer bits than the original representation.

M

ML

ML stands for Machine Learning, which is a subfield of artificial intelligence (AI) that focuses on the development of algorithms and statistical models.

ML Artifacts (ML Assets)

ML artifacts are outputs of ML pipelines that are needed for execution of subsequent pipelines or ML applications.

ML Pipeline

A ML pipeline is a program that takes input and produces one or more ML artifacts as output.

MLOps

Machine learning operations (MLOps) describes processes for automated testing of ML pipelines and versioning of ML artifacts that help improve both developer productivity and the operational reliability of ML systems.

MVPS

An MVPS is a Minimal Viable Prediction Service.

Model Architecture

A model architecture is the choice of a machine learning algorithm along with the underlying structure or design of the machine learning model.

Model Bias

Model bias refers to the presence of systematic errors in a model that can cause it to consistently make incorrect predictions.

Model Deployment

A model deployment enables clients to perform inference requests on the model over a network.

Model Development

Model development is the process of building and training a machine learning model using training data.

Model Evaluation (Model Validation)

Model evaluation (or model validation) is the process of assessing the performance of a trained ML model on a (holdout) dataset.

Model Governance

Model governance is the process for managing ML models to ensure they are secure, ethical, trustworthy, explainable, and comply with relevant regulations

Model Interpretability

Model interpretability (also known as explainable AI) is the process by which a ML model's predictions can be explained and understood by humans.

Model Monitoring

Model monitoring involves continuously monitoring the performance of predictions made by models to identify potential problems.

Model Performance

Model performance in machine learning (ML) is a measurement of how accurately a model makes predictions or classifications on new, unseen data.

Model Quantization

Model quantization can reduce the memory footprint and computation requirements of deep neural network models.

Model Registry

A model registry is a version control system for models that provides APIs to store and retrieve models and model-related artifacts.

Model Training

Model training in MLOps happens as part of a model training pipeline.

Model-Centric ML

Model-centric ML is an approach to machine learning that focuses on iteratively improving model architecture and hyperparameters to enhance model performance.

Model-Dependent Transformations

A model-dependent transformation is a transformation of a feature that is specific to one model, and is consistently applied in training and inference pipelines.
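
A minimal sketch with scikit-learn (the data is made up): the transformation's parameters are fit on one model's training data only, and the same fitted transformation is reused at inference time to avoid training-inference skew:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training split read from a feature store.
X_train = np.array([[10.0], [20.0], [30.0]])

# Model-dependent transformation: fit on this model's training data only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# At inference time, reuse the *same* fitted parameters.
X_infer = np.array([[25.0]])
X_infer_scaled = scaler.transform(X_infer)
```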

Model-Independent Transformations

Model-independent data transformations produce features that can potentially be reused in training or inference by one or more models.

Monolithic ML Pipeline

A monolithic ML pipeline is a single program that can be run as either a feature pipeline followed by a training pipeline or a feature pipeline followed by a batch inference pipeline.

N

Natural Language Processing (NLP)

NLP stands for Natural Language Processing.

O

Offline Store

The offline store in a feature store stores the historical values of features, enabling efficient and scalable access to large volumes of historical feature data.

On-Demand Features

If a feature is used in an online inference pipeline and it is created using data only available at request-time, then it is an on-demand feature.

On-Demand Transformation

An on-demand transformation is a feature function that is used to compute an on-demand feature.
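
As a hypothetical sketch, an on-demand transformation is just a function whose inputs include request-time data, so it can only be evaluated when the request arrives (and is recomputed from historical event logs to create training data):

```python
from datetime import datetime, timezone
from typing import Optional

def minutes_since_last_login(
    last_login: datetime, request_time: Optional[datetime] = None
) -> float:
    """Hypothetical on-demand feature: depends on the request time, so it
    can only be computed inside the online inference pipeline."""
    request_time = request_time or datetime.now(timezone.utc)
    return (request_time - last_login).total_seconds() / 60.0

# Called at request time with request-supplied data (timezone-aware here).
print(minutes_since_last_login(datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)))
```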

Online Inference Pipeline

An online inference pipeline is a program that runs in a model deployment and returns predictions to a client using a model, downloaded and cached from a model registry.

Online Store

The online store is a row-oriented database or key-value store that provides low-latency lookups for precomputed feature values using one or more entity IDs (or primary keys).

Online-Offline Feature Skew

Feature skew is when there are significant differences between the feature logic executed in an offline ML pipeline and the feature logic executed in the corresponding online inference pipeline.

Online-Offline Feature Store Consistency

Features that are stored in both the online and offline stores should be consistent. A replication protocol with consistency guarantees ensures that the feature data is kept in sync.

Orchestration

The orchestration of ML pipelines is crucial to making ML pipelines run without human intervention, and run reliably, even in the event of hardware or software errors.

P

Pandas UDF

Pandas UDFs (User-Defined Functions) are functions that allow users to perform feature engineering (or any custom transformations) on a Pandas DataFrame using PySpark.
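
A minimal sketch, assuming a PySpark 3.x environment (the column name temp_f and the unit conversion are illustrative): the decorated function receives batches of rows as pandas Series, so the transformation is vectorized:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    # Executed on batches of rows as pandas Series (vectorized).
    return (temp_f - 32.0) * 5.0 / 9.0

df = spark.createDataFrame([(98.6,), (32.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```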

Point-in-Time Correct Joins

A point-in-time correct join is a database operation that performs a join between two tables in a way that ensures the results reflect the state of the tables at a specific point in time.
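
A minimal sketch of the idea using pandas merge_asof (the tables and column names are hypothetical): for each labeled event, the join picks the latest feature value known before the event, so no future information leaks into the training data:

```python
import pandas as pd

# Label events (e.g., "did the customer churn?") with their event times.
labels = pd.DataFrame({
    "customer_id": [1, 2],
    "event_time": pd.to_datetime(["2023-03-10", "2023-03-12"]),
    "churned": [0, 1],
})

# Historical feature values with the time each value became known.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2023-03-01", "2023-03-09", "2023-03-05"]),
    "avg_spend": [42.0, 55.0, 13.0],
})

# For each label, take the latest feature value known *before* the event.
training_df = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="customer_id",
    direction="backward",
)
```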

Precomputed Features

A precomputed feature is a feature that has been created by a feature pipeline and is stored in a feature store.

Python UDF

A Python UDF (user-defined function) in ML is a function written by a user, typically to implement a feature function.

R

Real-Time Machine Learning

Real-time ML refers to ML systems where decisions or predictions must be produced with minimal, predictable latency.

Representation Learning

Representation learning is a set of techniques that allow a system to discover the representations needed for feature detection or classification from raw data.

S

SQL UDF in Python

A SQL UDF (User-Defined Function) is a custom function that extends the capabilities of SQL by allowing users to implement complex logic and transformations that are not available with built-in SQL.

Schema

A schema defines the shape, order, and type of data stored in ML artifacts, including feature groups, feature views, training datasets, and models.

Similarity Search

Vector similarity search (or similarity search for embeddings) finds the “top K” most similar vectors to a query vector in a vector database.
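
As a brute-force sketch with NumPy (the embeddings are random placeholders), similarity search scores every stored vector against the query and returns the indices of the top-K matches; a vector database computes the same result with an approximate index instead of a full scan:

```python
import numpy as np

# Hypothetical embeddings (one row per item) and a query embedding.
embeddings = np.random.rand(1000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)

# Cosine similarity between the query and every stored vector.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
scores = embeddings @ query / norms

# Indices of the top-K most similar vectors.
k = 5
top_k = np.argsort(-scores)[:k]
print(top_k, scores[top_k])
```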

Skew

In machine learning, skew refers to an imbalance in the distribution of the label (target variable) in a training dataset.

Splitting Training Data

When you train a model, you would like your model to generalize and perform well on new, unseen data, and you do not want the model to overfit to the training data. To estimate generalization performance, the available training data is therefore split into separate train, validation, and test sets.
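
A minimal sketch with scikit-learn (the data is a placeholder): the data is first split into a held-out test set, and the remainder is split again into train and validation sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Hold out a test set first, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test.
```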

Streaming Feature Pipeline

A streaming feature pipeline is a program that continuously processes incoming data in real-time, extracting and computing features and writing those features to a feature store.

Streaming Inference Pipeline

A streaming inference pipeline is a streaming application that makes real-time, non-interactive predictions triggered by the arrival of an event and outputs predictions to some sink.

T

Test Set

The test set is a portion (or partition) of the available training data that is “held back” and not used during model training.

Time travel (for features)

Time travel for features refers to the ability to access historical versions of feature values at previous points in time.

Train (Training) Set

The train (or training) set is the portion of the training data that is used to train a machine learning model.

Training Data

Training data refers to the data set that is used to train and evaluate a ML model.

Training Pipeline

A training pipeline is a series of steps or processes that takes input features and labels (for supervised ML algorithms), and produces a model as output.
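
As a minimal sketch (the file training_data.parquet, the label column, and the choice of model are hypothetical), a training pipeline reads features and labels, trains and evaluates a model, and outputs the model as its ML artifact:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical training data, e.g. created from a feature view.
df = pd.read_parquet("training_data.parquet")
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# The trained model is the pipeline's ML artifact; in practice it would be
# registered in a model registry rather than written to local disk.
joblib.dump(model, "model.pkl")
```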

Training-Inference Skew

Training-inference skew is when there are (even slightly) different implementations of a transformation between the training and inference pipelines.

Transformation

A transformation is a function that is applied to some input data and produces processed data as output.

U

Upstream

Upstream indicates that the user/client/application is a producer of data to a given dataset or feature group (that itself is downstream).

V

Validation Set

The validation set is a subset of the training data used to evaluate the performance of a machine learning model during hyperparameter tuning and model selection.

Vector Database

A vector database for machine learning (ML) is a database that stores, manages, and provides semantic query support for embeddings (high-dimensional vectors).

Versioning (of ML Artifacts)

Versioning of models, features, feature groups, feature views, and training datasets enables the management of dependencies between ML artifacts.
