Data Science
Duration:
06.2023 - 12.2023
Client:
Anonymous client in public administration
Technologies:
Python, PySpark (Cloudera & Stackable), Data Lake, SQL, Scala, Java, spark.ml, XGBoost, MLflow, Hadoop FS, Amazon S3, Docker, Kubernetes, Apache Airflow, ArgoCD, GitLab CI
Situation
A client in public administration was collecting over one petabyte of data on companies subject to legal obligations. To prosecute fraudulent activity, the client needed to re-identify these companies in heterogeneous external data sources.
Task
The client required a system that automatically and efficiently matches its internal records against diverse external data sources.
Action
I developed an end-to-end machine learning pipeline for record linkage in PySpark, which included:
Data cleaning of the heterogeneous source records (see the preparation sketch below).
Feature engineering to extract matching-relevant attributes.
Clustering to group similar records and limit pairwise comparisons (see the LSH sketch below).
Distributed XGBoost model training (see the training sketch below).
Comprehensive model evaluation tracked with MLflow (see the evaluation sketch below).
Model inference with MLflow serving.
Deployment in the client's Kubernetes environment as a Spark application for efficient scaling.
Job scheduling via Apache Airflow (see the DAG sketch below).
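The client's concrete cleaning rules and features are not reproduced here; the following is a minimal PySpark sketch of the kind of normalization and feature extraction such a record-linkage pipeline performs. All column names, paths, and regex rules are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("record-linkage-prep").getOrCreate()

# Illustrative input: external company records with free-text attributes.
raw = spark.read.parquet("s3a://example-bucket/external/companies/")

# Data cleaning: trim, lowercase, strip legal-form suffixes and punctuation.
cleaned = (
    raw
    .withColumn("name_norm", F.lower(F.trim(F.col("company_name"))))
    .withColumn("name_norm", F.regexp_replace("name_norm", r"\b(gmbh|ag|kg|ug)\b", ""))
    .withColumn("name_norm", F.regexp_replace("name_norm", r"[^a-z0-9 ]", " "))
    .withColumn("zip_norm", F.lpad(F.col("zip_code").cast("string"), 5, "0"))
)

# Feature engineering: token sets and a phonetic key as matching attributes.
features = (
    cleaned
    .withColumn("name_tokens", F.split("name_norm", r"\s+"))
    .withColumn("name_soundex", F.soundex("name_norm"))
)
```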
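Grouping similar records before pairwise comparison keeps the otherwise quadratic matching cost in check. Whether the project used spark.ml's MinHashLSH specifically is my assumption; the sketch shows the general pattern on the token features from the previous sketch.

```python
from pyspark.ml.feature import HashingTF, MinHashLSH

# internal_features / external_features (hypothetical names): DataFrames
# prepared as above, one for each side of the linkage.
tf = HashingTF(inputCol="name_tokens", outputCol="vec", numFeatures=1 << 18)
internal_vec = tf.transform(internal_features)
external_vec = tf.transform(external_features)

# MinHashLSH buckets records with similar token sets, so only candidate
# pairs below a Jaccard-distance threshold are compared downstream.
lsh = MinHashLSH(inputCol="vec", outputCol="hashes", numHashTables=5)
lsh_model = lsh.fit(internal_vec)
candidates = lsh_model.approxSimilarityJoin(
    internal_vec, external_vec, threshold=0.4, distCol="jaccard_dist"
)
```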
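For the distributed training step, several integrations exist (xgboost4j-spark on the JVM side, or the Python xgboost.spark estimator); the sketch below assumes the latter, with illustrative pairwise comparison features and a hypothetical labeled dataset.

```python
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# labeled_pairs (hypothetical): candidate pairs with similarity features
# and a manually labeled is_match column.
assembler = VectorAssembler(
    inputCols=["name_sim", "zip_match", "address_sim"],
    outputCol="features",
)
train_df = assembler.transform(labeled_pairs)

# Each Spark worker trains on its partition of the data; XGBoost
# aggregates gradients across workers to fit a single model.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="is_match",
    num_workers=8,
    max_depth=6,
)
match_model = clf.fit(train_df)
```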
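Evaluation metrics and the fitted model were tracked with MLflow; a minimal sketch of that pattern, with hypothetical test data and an illustrative metric choice, could look like this.

```python
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import BinaryClassificationEvaluator

with mlflow.start_run(run_name="record-linkage-xgb"):
    # test_df (hypothetical): held-out labeled pairs, assembled as above.
    predictions = match_model.transform(test_df)

    # Area under the precision-recall curve suits the imbalanced
    # match/non-match distribution typical of record linkage.
    evaluator = BinaryClassificationEvaluator(
        labelCol="is_match", metricName="areaUnderPR"
    )
    mlflow.log_metric("auprc", evaluator.evaluate(predictions))
    mlflow.log_param("num_workers", 8)

    # Log the fitted Spark model so it can be loaded or served later.
    mlflow.spark.log_model(match_model, artifact_path="model")
```

A model logged this way can then be exposed for inference via MLflow serving, e.g. `mlflow models serve -m runs:/<run_id>/model`.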
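For scheduling, the pipeline ran as a Spark application on the client's Kubernetes cluster under Airflow control. The sketch assumes the SparkKubernetesOperator from the cncf.kubernetes provider and an existing SparkApplication manifest; DAG id, schedule, namespace, and paths are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="record_linkage_pipeline",
    start_date=datetime(2023, 6, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submits the SparkApplication manifest to the cluster; the Spark
    # operator then scales executors inside Kubernetes.
    run_linkage = SparkKubernetesOperator(
        task_id="run_linkage",
        namespace="spark-jobs",
        application_file="manifests/record-linkage-sparkapp.yaml",
        kubernetes_conn_id="kubernetes_default",
    )
```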
Result
The client could automatically and efficiently identify matching companies across the various data sources, enabling it to initiate manual prosecution proceedings against fraudulent companies.