What are model-independent transformations?
Model-independent data transformations produce features that can potentially be reused in training or inference by one or more models. More specifically, a model-independent transformation is a feature engineering step that transforms some input data (raw data or existing features) into one or more unencoded features that can potentially be used for training or inference in one or more models. Data transformations in a feature pipeline should be model-independent transformations, so that the features stored in the feature store can be reused by multiple models.
Why are model-independent transformations important?
By refactoring your feature engineering steps into model-independent and model-dependent transformations, you can write feature pipelines that produce features that can be reused across many models. This enables the feature store to become a centralized data warehouse of reusable features that can be used across different models and projects. Every time a feature is reused, it reduces the number of data pipelines that an organization has to develop and operate, saving time and costs. The more features are reused, the higher quality the features will become, as reuse promotes consistency, team collaboration, and the testing of features.
What are examples of model-independent transformations?
Model-independent data transformations include:
- Filtering: Selecting only a subset of columns from a table (a projection), and/or only the rows that match a certain condition (such as with a date range);
- Grouped Aggregations: Descriptive statistics, such as min, max, average, standard deviation;
- Windowed Statistics: compute statistics over a rolling/sliding/session window (e.g., count/mean/max seen in a 1 hr window of time);
- Lag features: the N most recent seen values of some feature (e.g., yesterday’s sales and the previous day’s sales);
- Joining: Combining data from multiple tables using a common join key;
- Mapping: Transforming values from one or more columns to new column values using a user-defined function;
- Stream mapping: transforming values from one or more columns to new column values using both a user-defined function and accumulated state;
- Time-series analysis: analyzing or aggregating data over time, such as identifying trends, patterns, or anomalies.
Unstructured data can also be transformed into structured data that become reusable features in models. These are often on-demand features, computed using unstructured data that is only available at request time. Examples include:
- Natural language processing (NLP): features can be mined from text data, such as sentiment or entity/community detection;
- Computer vision: features can be extracted from images or video using deep-learning techniques such as object detection or image classification;
- Audio: Deep learning (transformer) models can transcribe audio (e.g., Whisper), summarize text, translate between languages, and identify speakers from audio data. The extracted data can be used as features in models;
- HTTP/SaaS/REST APIs: data can be pulled from external services using APIs and used as features;
- Website scraping: data can be scraped from websites and used as features in models. Tools such as ChatGPT and GPT-4 are useful in generating the source code used to scrape websites.
Various machine-learning methods can also be used to transform raw data into more meaningful data, for example:
- Clustering: grouping similar data points based on certain characteristics, such as customers with similar purchasing habits.
- Dimensionality reduction: reducing the number of features in a dataset, while retaining as much semantic information as possible, to make it easier to analyze or visualize.
Model-independent Example - Grouped Aggregation
Aggregations, filtering, binning, embeddings, and feature crosses are popular examples of transformations.
Here is a short Python snippet showcasing an aggregation for four pre-defined age groups:
This example demonstrates how to aggregate the age data into age groups using the Pandas library in Python. The result is a DataFrame that includes an additional column with the age group assignments for each individual.