
Building a Cheque Fraud Detection and Explanation AI System using a fine-tuned LLM

May 13, 2024

10 min

Read

Jim Dowling

CEO and Co-Founder

Hopsworks

Data Engineering

TL;DR

The third edition of the LLM Makerspace dived into an example of an LLM system for detecting check fraud.

So, what exactly are checks? They're essentially money transfers where you write the amount on a piece of paper and hand it over to someone. A check is accepted as payment for goods and services. Being from Ireland, I'm familiar with checks. Other countries, such as the United States, still use them too, but much of the world has moved on. However, there are still plenty of people processing checks, especially in banks and the wider financial industry. They receive these physical pieces of paper daily and need to validate that they're not fraudulent. That's where LLMs come in handy, and we'll explore how to use them for this purpose.

If you want to start from the beginning, you can recap the previous editions of the LLM Makerspace on the Hopsworks YouTube channel. You can also read about the other AI systems we’ve built on our blog page.

The Problem: Check Fraud Detection

Banks and financial institutions receive a large volume of physical checks daily. Each check needs to be validated to ensure it's not fraudulent. This is where an LLM system can assist. When a check is marked as fraudulent, a human employee needs to write an explanation of why it's considered fraudulent. This is a time-consuming task that many financial institutions undertake. They hire people to evaluate checks for fraud and write descriptions for fraudulent ones.

The goal is to use an LLM to generate these explanations, freeing up employees to be more productive in other areas. While AI may not be allowed to write these descriptions in many jurisdictions, it can still suggest a description that the human can accept if they're satisfied with it, thereby improving their productivity.

The Solution: A Feature Pipeline, Training Pipeline and Inference Pipeline Architecture

To build a check fraud detection system, we'll follow a specific architecture called the feature-training-inference (FTI) pipeline architecture. The idea is to break down the AI system into smaller, easily composable modules. Here's an overview:

Feature Pipeline: This module takes in the data, parses it, and creates features. The resulting feature table is called a feature group.
Training Pipeline: Here, we train a simple model for fraud detection in checks using the features from the feature pipeline.
Inference Pipeline: Finally, we perform inference for fraud detection using the trained model.
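
To make the separation concrete, here is a minimal sketch of the three pipelines as independent functions that only communicate through shared storage. The in-memory dictionaries standing in for the feature store and model registry, the field names, and the toy rule-based "model" are all illustrative assumptions, not Hopsworks APIs.

```python
# Hypothetical stand-ins for the feature store and the model registry.
feature_store = {}
model_registry = {}


def feature_pipeline(raw_checks):
    """Turn parsed check records into feature rows and store them."""
    rows = []
    for check in raw_checks:
        rows.append({
            "check_id": check["check_id"],
            "amount_matches": check["amount_words_value"] == check["amount_digits"],
            "valid": check["valid"],
        })
    feature_store["check_fg"] = rows


def training_pipeline():
    """Read features, build a trivial rule-based model, and register it."""
    rows = feature_store["check_fg"]

    # Toy model: flag a check as fraudulent when the amounts disagree.
    def model(features):
        return not features["amount_matches"]

    correct = sum(model(r) == (not r["valid"]) for r in rows)
    model_registry["check_fraud_model"] = {
        "model": model,
        "accuracy": correct / len(rows),
    }


def inference_pipeline(new_rows):
    """Load the registered model and score new feature rows."""
    model = model_registry["check_fraud_model"]["model"]
    return [{"check_id": r["check_id"], "is_fraud": model(r)} for r in new_rows]
```

Because each function reads from and writes to shared storage rather than calling the others directly, each pipeline can be developed, scheduled, and tested on its own.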

The Process

Data Preparation

We start with images of checks and some labeled data for supervised machine learning. The checks go through an optical character recognition (OCR) system that extracts the text from the images.

The OCR system identifies bounding boxes on the check and extracts the written text inside them. This is a traditional deep learning problem, typically solved with CNN-style networks, and many existing frameworks can handle it. We'll assume this OCR system is already in place at your financial institution.

Feature Pipeline

Using the extracted text from the checks, we create a feature pipeline. This pipeline reads the data, parses it, and creates features that we store in a feature table called a feature group.

Some of the features we extract include:

Check number
User ID
Amount of money in text
Amount written in numbers
File path
Whether the amount in text matches the amount in numbers
Bank name
Spelling correctness
Username

We also have a label indicating whether the check is valid or not.
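
As a rough illustration of how two of these features could be derived, the sketch below converts an amount written in words into a number and compares it with the amount in digits. The source does not show the tutorial's feature logic, so everything here is an assumption: the function names are hypothetical, the vocabulary covers only English amounts below one million, and "spelling correctness" is approximated as "every word is in the vocabulary".

```python
import re

# Small vocabulary for amounts written in words (assumption: English only).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text):
    """Return the integer for an amount in words, or None if a word is unknown."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w != "and"]
    if not words:
        return None
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
        else:
            return None  # unknown or misspelled word
    return total + current


def make_features(amount_in_words, amount_in_digits):
    """Compute the two model features for one check."""
    value = words_to_number(amount_in_words)
    return {
        "is_spelling_correct": value is not None,
        "amount_matches": value is not None and value == amount_in_digits,
    }
```

A check whose amount in words is missing or misspelled yields False for both features, which matches the kind of case the explanation step later describes.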

Training Pipeline

With our features ready, we move on to training a simple model to predict whether a check is fraudulent. We use an XGBoost classifier for this purpose. In Hopsworks, we create a feature view that selects the relevant features from the feature group(s). For this model, we don't need a huge number of features.

We use:

Is the spelling correct?
Does the amount of the check in letters and numbers match?
The valid label

We train the model and store it in the Hopsworks model registry. We also compute some model metrics and feature importance scores.

Inference Pipeline

The inference pipeline is a batch process that runs daily. It takes the new checks that arrive each day and predicts whether they are fraudulent. If a check is predicted as fraudulent, the LLM generates a description explaining why.

Here's how it works:

The new check images are uploaded to a designated directory.
The batch inference program runs daily, processing the images in that directory.
For each check, the program:
- Extracts the text using OCR
- Predicts whether the check is fraudulent using the trained model
- If the check is predicted as fraudulent, uses the LLM to generate an explanation
The predictions and explanations are stored in a feature group (which is also a MySQL table).
A decision support system can connect to this table, read the output, and use the generated explanations in reports.
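
The steps above can be sketched as a single loop. To keep the example self-contained, the OCR step, the fraud model, and the LLM are passed in as plain callables; the function name and parameters are hypothetical, and a real pipeline would plug in the actual components.

```python
def run_batch_inference(check_paths, extract_text, predict_fraud, explain):
    """Score each check and call the LLM only for checks predicted as fraud."""
    results = []
    for path in check_paths:
        text = extract_text(path)               # OCR step
        is_fraud = bool(predict_fraud(text))    # trained model
        if is_fraud:
            explanation = explain(text)         # LLM call, only when needed
        else:
            explanation = "The check is considered valid."
        results.append({
            "path": path,
            "is_fraud": is_fraud,
            "explanation": explanation,
        })
    return results
```

Calling the LLM only for checks flagged as fraudulent keeps the daily batch cheap, since most checks are expected to need no explanation.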

Here's an example of the output:

Check ID: 6
Status: Fraudulent
Description: The check is considered fraudulent because the amount in words is missing, which is a crucial detail that should be included in a valid check.

Putting It All Together

Let's walk through the code to see how everything comes together. We'll use Python notebooks in Hopsworks for this example, but you can also run the code locally or in Colab with some modifications.

Prerequisites

Clone the Hopsworks tutorials repository:

git clone https://github.com/logicalclocks/hopsworks-tutorials.git

Navigate to the fraud-check-detection directory:

cd hopsworks-tutorials/advanced_tutorials/fraud_check_detection

Install the required libraries:

pip install -r requirements.txt

Feature Pipeline

Connect to Hopsworks and read the CSV file containing the check data:

import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()  # feature store handle used below
df = pd.read_csv("data/check_data.csv")

Explore the data and create features:

df.head()
df["is_spelling_correct"] = ...
df["amount_matches"] = ...

Create a feature group and insert the data:

fg = fs.get_or_create_feature_group(
	name="check_fg",
	version=1,
	primary_key=["check_id"],
	description="Check details"
)
fg.insert(df)

Training Pipeline

Create a feature view:

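fv = fs.create_feature_view(
	name="check_fv",
	version=1,
	# Select only the features the model uses; "valid" is the assumed
	# name of the label column
	query=fg.select(["is_spelling_correct", "amount_matches", "valid"]),
	labels=["valid"]
)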

Create the training dataset, with a random train/test split of 80/20:

X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
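
For intuition, a random 80/20 split amounts to shuffling the rows and cutting the list. The sketch below does this with the standard library only; the function name and the fixed seed are illustrative assumptions and not how Hopsworks implements the split.

```python
import random


def random_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]
```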

Train the model:

from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

Evaluate the model:

from sklearn.metrics import accuracy_score, f1_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
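
To show what these two numbers measure, here is a small hand-rolled version for binary labels, written without scikit-learn. The function name is hypothetical, and the sketch assumes labels are encoded as 0 and 1 with 1 as the positive class.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1
```

For binary labels, scikit-learn's f1_score also treats 1 as the positive class by default, so which class the score describes depends on how the label is encoded. If 1 means "valid", the F1 score above describes how well valid checks are recognized, not fraudulent ones.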

Save the model in the Hopsworks model registry:

import os

mr = project.get_model_registry()
model_dir = "check_fraud_detection_model"
os.makedirs(model_dir, exist_ok=True)  # save_model does not create the directory
model.save_model(model_dir + "/model.json")
model_evaluation = {"accuracy": accuracy, "f1_score": f1}
check_fraud_model = mr.python.create_model(
	name="check_fraud_detection_model",
	feature_view=fv,
	metrics=model_evaluation,
	description="Check fraud detection model"
)
check_fraud_model.save(model_dir)

Inference Pipeline

Load the trained model and OCR processor:

from xgboost import XGBClassifier
mr = project.get_model_registry()
retrieved_model = mr.get_model(
	name="check_fraud_detection_model",
	version=1
)
saved_model_dir = retrieved_model.download()
model_fraud_detection = XGBClassifier()
model_fraud_detection.load_model(saved_model_dir + "/model.json")
ocr_processor, ocr_model = load_check_parser()

Define a function to generate explanations using the LLM:

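from PIL import Image

# get_features() and llm() are helpers assumed to be defined elsewhere
def generate_explanation(check_image_path):
	# Run OCR on the check image to get the parsed text
	parsed_text = ocr_model(ocr_processor(
		Image.open(check_image_path),
		return_tensors="pt"))["parsed_text"][0]
	# The model was trained on the "valid" label (assumed encoding: 1 = valid),
	# so a check is treated as fraudulent when the prediction is 0
	is_valid = model_fraud_detection.predict([get_features(parsed_text)])[0]
	is_fraud = not bool(is_valid)
	if is_fraud:
		prompt = (
			"The check with the following parsed text is "
			f"considered fraudulent:\n\n{parsed_text}"
			"\n\nExplain why this check is considered fraudulent."
		)
		explanation = llm(prompt)
	else:
		explanation = "The check is considered valid."
	return is_fraud, explanation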
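
The prompt above passes only the parsed text to the LLM. One possible refinement, offered here as an assumption rather than something taken from the tutorial, is to also pass the feature values that drove the prediction, so the explanation can point to concrete findings. The function name and the wording of the findings are hypothetical.

```python
def build_prompt(parsed_text, features):
    """Build an explanation prompt that lists the rule checks that failed."""
    reasons = []
    if not features.get("amount_matches", True):
        reasons.append("the amount in words does not match the amount in numbers")
    if not features.get("is_spelling_correct", True):
        reasons.append("the check contains spelling errors")
    findings = "; ".join(reasons) if reasons else "no specific rule violations were recorded"
    return (
        "The check with the following parsed text is considered fraudulent:\n\n"
        f"{parsed_text}\n\n"
        f"Automated findings: {findings}.\n\n"
        "Explain why this check is considered fraudulent."
    )
```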

Process the new checks daily:

import os

check_dir = "path/to/daily/check/images"
results = []
for check_image in os.listdir(check_dir):
	check_path = os.path.join(check_dir, check_image)
	is_fraud, explanation = generate_explanation(check_path)
	results.append({
		"check_id": check_image.split(".")[0],
		"is_fraud": is_fraud,
		"explanation": explanation
	})
result_df = pd.DataFrame(results)
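
A directory listing can contain files that are not check images, and splitting a file name on the first dot truncates names such as check.v2.png. The sketch below is one way to make the listing step more defensive using pathlib; the function name and the set of accepted image suffixes are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to what the OCR step accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_check_images(check_dir):
    """Return (check_id, path) pairs for image files, sorted by file name."""
    pairs = []
    for path in sorted(Path(check_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # path.stem drops only the final extension: "check.v2.png" -> "check.v2"
            pairs.append((path.stem, str(path)))
    return pairs
```

The returned pairs can then be fed to the loop above in place of os.listdir.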

Save the results in a feature group:

result_fg = fs.get_or_create_feature_group(
	name="check_validation_fg",
	version=1,
	primary_key=["check_id"],
	description="Check validation results"
)
result_fg.insert(result_df)

Conclusion

In this blog post, we've seen how to build a check fraud detection system using LLMs. We broke down the process into three main parts:

A feature pipeline to extract features from check images
A training pipeline to train a fraud detection model
An inference pipeline to predict fraud and generate explanations using an LLM

By automating the explanation generation process, we can save financial institutions a significant amount of time and resources. The LLM-generated explanations can serve as a starting point for human employees, who can then review and modify them as needed. The aim of this example was to give a good understanding of how to use LLMs in a practical application like check fraud detection.
