Daniel Manns

Data Scientist

ML Engineer

AI Consultant

Available for Amazing Projects

End-to-End Record Linkage Pipeline

An End-to-End Machine Learning Pipeline for Linking Internal Records with External Data Sources

Data Science

Duration:

06.2023 - 12.2023

Client:

Anonymous client in public administration

Technologies:

Python, PySpark (Cloudera & Stackable), Data Lake, SQL, Scala, Java, spark.ml, XGBoost, MLflow, Hadoop FS, Amazon S3, Docker, Kubernetes, Apache Airflow, Argo CD, GitLab CI

Situation

A client in public administration was collecting over 1 petabyte of data on companies subject to legal obligations. To prosecute fraudulent activities, the client needed to identify these companies in heterogeneous external data sources.

Task

The client required a system to automatically and efficiently identify internal records within diverse external data sources.

Action

I developed an end-to-end machine learning pipeline for record linkage in PySpark, which included:

  • Data cleaning.

  • Feature engineering to extract relevant attributes.

  • Clustering to group similar records.

  • Distributed XGBoost model training.

  • Comprehensive model evaluation with MLflow.

  • Model inference with MLflow serving.

  • Deployment in the client's Kubernetes environment as a Spark application for efficient scaling.

  • Job scheduling via Apache Airflow.
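
The first steps in the list above (cleaning, feature engineering, grouping similar records) can be illustrated with a minimal sketch. It uses only the Python standard library rather than PySpark, and it is not the client's implementation: the record fields `name`, `zip` and `city`, the `LEGAL_FORMS` list, and the blocking rule are all assumptions made for illustration. Simple key-based blocking stands in for the clustering step here, and the resulting feature rows are the kind of input a classifier such as XGBoost could be trained on.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product

# Hypothetical legal-form tokens to drop during cleaning (illustrative only).
LEGAL_FORMS = {"gmbh", "ag", "kg", "ug", "ltd", "inc", "co"}


def clean_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and drop legal-form tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    tokens = re.sub(r"[^a-z0-9 ]", " ", ascii_name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def blocking_key(record: dict) -> str:
    """Assumed blocking rule: postal code plus first letter of the cleaned name."""
    return f'{record["zip"]}|{clean_name(record["name"])[:1]}'


def candidate_pairs(internal: list, external: list):
    """Yield (internal, external) pairs that share a blocking key.

    This avoids comparing every internal record with every external one.
    """
    blocks = defaultdict(lambda: ([], []))
    for rec in internal:
        blocks[blocking_key(rec)][0].append(rec)
    for rec in external:
        blocks[blocking_key(rec)][1].append(rec)
    for left, right in blocks.values():
        yield from product(left, right)


def pair_features(a: dict, b: dict) -> dict:
    """Compute simple similarity features for one candidate pair."""
    name_a, name_b = clean_name(a["name"]), clean_name(b["name"])
    tokens_a, tokens_b = set(name_a.split()), set(name_b.split())
    union = tokens_a | tokens_b
    return {
        "name_ratio": SequenceMatcher(None, name_a, name_b).ratio(),
        "token_jaccard": len(tokens_a & tokens_b) / len(union) if union else 0.0,
        "same_city": float(a["city"].lower() == b["city"].lower()),
    }


# Made-up example records, not client data.
internal = [{"id": "i1", "name": "Müller & Söhne GmbH", "zip": "10115", "city": "Berlin"}]
external = [
    {"id": "e1", "name": "Muller Sohne", "zip": "10115", "city": "berlin"},
    {"id": "e2", "name": "Schmidt AG", "zip": "10115", "city": "Berlin"},
]

for a, b in candidate_pairs(internal, external):
    print(a["id"], b["id"], pair_features(a, b))
```

Blocking keeps the number of comparisons manageable: only records that share a key are paired, which matters at the data volumes described above. A stricter key produces fewer pairs but risks missing true matches, so the choice of key is a trade-off.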

Result

The client could automatically and efficiently identify matching companies across various data sources, which allowed it to start manual processes for prosecuting fraudulent companies.

More Projects

AI-Driven Hospital Invoice Verification utilizing the Microservices Architecture

Machine Learning-Driven Estimation of Customer Value using XGBoost

A GenAI Application for Creating High Quality Summaries of German Newspaper Articles

Say hello 👋

Let's Connect!

Let's create something unique together! Here's how you can reach out to me!
