We are proud to introduce the AI Lakehouse, the first unified tool specifically designed for building AI systems.
The third edition of the LLM Makerspace dived into an example of an LLM system for detecting check fraud.
So, what exactly are checks? They're essentially paper money transfers: you write an amount on a piece of paper, sign it, and hand it over to someone, who can deposit it as payment for goods or services. Coming from Ireland myself, I'm familiar with checks. Countries like the United States still use them heavily, but much of the world has moved on. Even so, there are still plenty of people processing checks, especially in banks and the wider financial industry. They receive these physical pieces of paper daily and need to validate that they're not fraudulent. That's where LLMs come in handy, and we'll explore how to use them for this purpose.
If you want to start from the beginning, you can recap the previous editions of the LLM Makerspace on the Hopsworks YouTube channel. You can also read about the other AI systems we’ve built on our blog page:
Banks and financial institutions receive a large volume of physical checks daily. Each check needs to be validated to ensure it's not fraudulent. This is where an LLM system can assist. When a check is marked as fraudulent, a human employee needs to write an explanation of why it's considered fraudulent. This is a time-consuming task that many financial institutions undertake. They hire people to evaluate checks for fraud and write descriptions for fraudulent ones.
The goal is to use an LLM to generate these explanations, freeing up employees to be more productive elsewhere. In many jurisdictions, AI may not be allowed to write these descriptions outright, but it can still suggest a description that a human reviews and accepts if they're satisfied with it, improving their productivity.
To build a check fraud detection system, we'll follow a specific architecture called the feature-training-inference (FTI) pipeline architecture. The idea is to break down the AI system into smaller, easily composable modules. Here's an overview:
We start with images of checks and some labeled data for supervised machine learning. The checks go through an optical character recognition (OCR) system that extracts the text from the images.
The OCR system identifies bounding boxes on the check and extracts the written text from each one. This is a traditional deep-learning problem, typically solved with CNN-style networks, and many existing frameworks can handle it. We'll assume such an OCR system is already in place at your financial institution.
Using the extracted text from the checks, we create a feature pipeline. This pipeline reads the data, parses it, and creates features that we store in a feature table called a feature group.
Some of the features we extract include:
We also have a label indicating whether the check is valid or not. Here's a preview of the data:
With our features ready, we move on to training a simple model to predict whether a check is fraudulent. We use an XGBoost classifier for this purpose. In Hopsworks, we create a feature view that selects the relevant features from the feature group(s). For this model, we don't need a huge number of features.
We use:
We train the model and store it in the Hopsworks model registry. We also compute some model metrics and feature importance scores.
The inference pipeline is a batch process that runs daily. It takes the new checks that arrive each day and predicts whether they are fraudulent. If a check is predicted as fraudulent, the LLM generates a description explaining why.
Here's how it works:
Here's an example of the output:
Let's walk through the code to see how everything comes together. We'll use Python notebooks in Hopsworks for this example, but you can also run the code locally or in Colab with some modifications.
Clone the Hopsworks tutorials repository:
Navigate to the fraud-check-detection directory:
Install the required libraries:
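The three setup steps above look roughly like this (the subdirectory name and the presence of a `requirements.txt` are assumptions about the current repo layout, so check the repository if the paths differ):

```shell
# Clone the Hopsworks tutorials repository
git clone https://github.com/logicalclocks/hopsworks-tutorials.git

# Move into the checked-out repository (the fraud-check example
# lives in a subdirectory whose exact name may differ)
cd hopsworks-tutorials

# Install the libraries the notebooks depend on
pip install -r requirements.txt
```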
Connect to Hopsworks and read the CSV file containing the check data:
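A minimal sketch of this step. The `hopsworks.login()` call needs an account and API key, so it is shown as a comment; the inline CSV sample is hypothetical stand-in data, not the tutorial's real dataset:

```python
import io

import pandas as pd

# Connecting requires a Hopsworks API key, so the login is commented out:
# import hopsworks
# project = hopsworks.login()          # prompts for your API key
# fs = project.get_feature_store()

# Hypothetical sample standing in for the real check-data CSV
csv_data = """check_id,amount_in_words,amount_in_numbers,payee,valid
1,one hundred dollars,100.00,Acme Corp,1
2,five thousand dollars,500.00,J. Doe,0
"""
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # (2, 5)
```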
Explore the data and create features:
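As a hypothetical illustration of the kind of feature engineering involved (the actual feature list is in the notebook, and these column names and features are made up for the sketch):

```python
import io

import pandas as pd

csv_data = """check_id,amount_in_words,amount_in_numbers,payee,valid
1,one hundred dollars,100.00,Acme Corp,1
2,five thousand dollars,500.00,J. Doe,0
"""
df = pd.read_csv(io.StringIO(csv_data))

# Hypothetical engineered features -- the real notebook derives its own set
df["amount"] = df["amount_in_numbers"].astype(float)      # numeric amount
df["payee_length"] = df["payee"].str.len()                # length of payee name
df["words_in_amount"] = df["amount_in_words"].str.split().str.len()
```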
Create a feature group and insert the data:
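This step needs a live Hopsworks cluster, so the sketch below wraps it in a function with a lazy import; the feature group name and primary key column are illustrative, not the tutorial's exact values:

```python
import pandas as pd


def insert_check_features(df: pd.DataFrame):
    """Create (or fetch) a feature group and insert the check features."""
    import hopsworks  # imported lazily: requires a cluster connection

    project = hopsworks.login()
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="check_fraud_data",          # illustrative name
        version=1,
        primary_key=["check_id"],
        description="Features extracted from scanned checks",
    )
    fg.insert(df)
    return fg
```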
Create a feature view:
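A sketch of the feature view step, again assuming a live cluster and using the illustrative feature names from the sketches above:

```python
def create_check_feature_view():
    """Select the model's input features and label into a feature view."""
    import hopsworks  # imported lazily: requires a cluster connection

    project = hopsworks.login()
    fs = project.get_feature_store()
    fg = fs.get_feature_group("check_fraud_data", version=1)
    # Select the columns the model trains on, plus the label
    query = fg.select(["amount", "payee_length", "words_in_amount", "valid"])
    fv = fs.get_or_create_feature_view(
        name="check_fraud",               # illustrative name
        version=1,
        query=query,
        labels=["valid"],
    )
    return fv
```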
Create the training dataset, with a random train/test split of 80/20:
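The 80/20 split can be materialized directly from the feature view; a sketch, assuming the illustrative feature view name above:

```python
def make_training_data():
    """Materialize an 80/20 random train/test split from the feature view."""
    import hopsworks  # imported lazily: requires a cluster connection

    project = hopsworks.login()
    fs = project.get_feature_store()
    fv = fs.get_feature_view("check_fraud", version=1)
    # Returns train/test features and labels as pandas objects
    X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
    return X_train, X_test, y_train, y_test
```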
Train the model:
Evaluate the model:
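Evaluation uses standard classification metrics; a sketch with hypothetical labels and predictions standing in for the held-out test split:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical ground truth and predictions for six test checks
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Confusion matrix plus per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```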
Save the model in the Hopsworks model registry:
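Registering the model again needs a cluster, so this is a lazy-import sketch; the model name, metric keys, and artifact directory are illustrative:

```python
def register_model(model_dir: str, metrics: dict):
    """Register the trained classifier in the Hopsworks model registry."""
    import hopsworks  # imported lazily: requires a cluster connection

    project = hopsworks.login()
    mr = project.get_model_registry()
    model = mr.python.create_model(
        name="check_fraud_xgboost",   # illustrative name
        metrics=metrics,              # e.g. the scores from the evaluation step
        description="XGBoost check-fraud classifier",
    )
    model.save(model_dir)             # upload the serialized model artifacts
    return model
```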
Load the trained model and OCR processor:
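A sketch of the loading step. The registry calls need a cluster, the artifact path is illustrative, and the OCR model shown (TrOCR from Hugging Face) is an assumption -- substitute whatever OCR processor your institution actually uses:

```python
def load_model_and_ocr():
    """Fetch the registered classifier and an OCR model for inference."""
    import joblib
    import hopsworks
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    project = hopsworks.login()
    mr = project.get_model_registry()
    registered = mr.get_model("check_fraud_xgboost", version=1)
    model_dir = registered.download()
    clf = joblib.load(f"{model_dir}/model.joblib")  # illustrative artifact path

    # Assumed OCR stand-in: a pretrained handwriting-recognition model
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    ocr_model = VisionEncoderDecoderModel.from_pretrained(
        "microsoft/trocr-base-handwritten"
    )
    return clf, processor, ocr_model
```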
Define a function to generate explanations using the LLM:
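A sketch of the explanation function, assuming an OpenAI-compatible client; the original demo may use a different LLM, and the prompt wording and model name are illustrative:

```python
def explain_fraud(check_features: dict) -> str:
    """Draft a fraud explanation for a flagged check using an LLM."""
    from openai import OpenAI  # assumed client; swap in your own LLM

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "The following check was flagged as potentially fraudulent. "
        "Write a short explanation a bank employee can review:\n"
        f"{check_features}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```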
Process the new checks daily:
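The daily batch logic can be sketched with the model and LLM injected as functions, so the flow is testable without either service. Everything here (column names, the stub rules) is illustrative:

```python
import pandas as pd


def process_daily_checks(batch: pd.DataFrame, predict, explain) -> pd.DataFrame:
    """Score a day's checks and draft explanations for the flagged ones."""
    batch = batch.copy()
    batch["fraud_predicted"] = predict(batch[["amount", "payee_length"]])
    # Only flagged checks get an LLM-drafted explanation
    batch["explanation"] = [
        explain(row) if flagged else ""
        for flagged, row in zip(batch["fraud_predicted"], batch.to_dict("records"))
    ]
    return batch


# Stub usage: hypothetical rules stand in for the real model and LLM
sample = pd.DataFrame(
    {"check_id": [1, 2], "amount": [100.0, 50000.0], "payee_length": [9, 6]}
)
result = process_daily_checks(
    sample,
    predict=lambda X: (X["amount"] > 10000).astype(int).values,
    explain=lambda row: f"Unusually large amount: {row['amount']}",
)
print(result[["check_id", "fraud_predicted"]])
```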
Save the results in a feature group:
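Persisting the results mirrors the earlier feature-group step; a lazy-import sketch with illustrative names:

```python
import pandas as pd


def save_check_predictions(results: pd.DataFrame):
    """Persist the day's predictions and explanations to a feature group."""
    import hopsworks  # imported lazily: requires a cluster connection

    project = hopsworks.login()
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="check_fraud_predictions",   # illustrative name
        version=1,
        primary_key=["check_id"],
        description="Daily fraud predictions with LLM-drafted explanations",
    )
    fg.insert(results)
    return fg
```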
In this blog post, we've seen how to build a check fraud detection system using LLMs. We broke down the process into three main parts:
By automating the explanation generation process, we can save financial institutions a significant amount of time and resources. The LLM-generated explanations can serve as a starting point for human employees, who can then review and modify them as needed. The aim of this example was to give a good understanding of how to use LLMs in a practical application like check fraud detection.
Watch the full video here: