Data Science
Duration:
06.2023 - 12.2023
Client:
Anonymous client in public administration
Technologies:
Python, PySpark (Cloudera & Stackable), Data Lake, SQL, Scala, Java, spark.ml, XGBoost, MLflow, Hadoop FS, Amazon S3, Docker, Kubernetes, Apache Airflow, ArgoCD, GitLab CI
Situation
A client in public administration was collecting over one petabyte of data on companies subject to legal obligations. To prosecute fraudulent activity, the client needed to re-identify these companies in heterogeneous external data sources.
Task
The client required a system that automatically and efficiently matches its internal records against diverse external data sources.
Action
I developed an end-to-end machine learning pipeline for record linkage in PySpark, which included:
Data cleaning of the heterogeneous source records (see the preparation sketch below).
Feature engineering to extract matching-relevant attributes.
Clustering to group similar records and limit pairwise comparisons (see the LSH sketch below).
Distributed XGBoost model training (see the training sketch below).
Comprehensive model evaluation tracked with MLflow (see the evaluation sketch below).
Model inference with MLflow serving.
Deployment in the client's Kubernetes environment as a Spark application for efficient scaling.
Job scheduling via Apache Airflow (see the DAG sketch below).
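The client's concrete cleaning rules and features are not reproduced here; the following is a minimal PySpark sketch of the kind of normalization and feature extraction such a record-linkage pipeline performs. All column names, paths, and regex rules are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("record-linkage-prep").getOrCreate()

# Illustrative input: external company records with free-text attributes.
raw = spark.read.parquet("s3a://example-bucket/external/companies/")

# Data cleaning: trim, lowercase, strip legal-form suffixes and punctuation.
cleaned = (
    raw
    .withColumn("name_norm", F.lower(F.trim(F.col("company_name"))))
    .withColumn("name_norm", F.regexp_replace("name_norm", r"\b(gmbh|ag|kg|ug)\b", ""))
    .withColumn("name_norm", F.regexp_replace("name_norm", r"[^a-z0-9 ]", " "))
    .withColumn("zip_norm", F.lpad(F.col("zip_code").cast("string"), 5, "0"))
)

# Feature engineering: token sets and a phonetic key as matching attributes.
features = (
    cleaned
    .withColumn("name_tokens", F.split("name_norm", r"\s+"))
    .withColumn("name_soundex", F.soundex("name_norm"))
)
```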
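Grouping similar records before pairwise comparison keeps the otherwise quadratic matching cost in check. Whether the project used spark.ml's MinHashLSH specifically is my assumption; the sketch shows the general pattern on the token features from the previous sketch.

```python
from pyspark.ml.feature import HashingTF, MinHashLSH

# internal_features / external_features (hypothetical names): DataFrames
# prepared as above, one for each side of the linkage.
tf = HashingTF(inputCol="name_tokens", outputCol="vec", numFeatures=1 << 18)
internal_vec = tf.transform(internal_features)
external_vec = tf.transform(external_features)

# MinHashLSH buckets records with similar token sets, so only candidate
# pairs below a Jaccard-distance threshold are compared downstream.
lsh = MinHashLSH(inputCol="vec", outputCol="hashes", numHashTables=5)
lsh_model = lsh.fit(internal_vec)
candidates = lsh_model.approxSimilarityJoin(
    internal_vec, external_vec, threshold=0.4, distCol="jaccard_dist"
)
```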
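For the distributed training step, several integrations exist (xgboost4j-spark on the JVM side, or the Python xgboost.spark estimator); the sketch below assumes the latter, with illustrative pairwise comparison features and a hypothetical labeled dataset.

```python
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# labeled_pairs (hypothetical): candidate pairs with similarity features
# and a manually labeled is_match column.
assembler = VectorAssembler(
    inputCols=["name_sim", "zip_match", "address_sim"],
    outputCol="features",
)
train_df = assembler.transform(labeled_pairs)

# Each Spark worker trains on its partition of the data; XGBoost
# aggregates gradients across workers to fit a single model.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="is_match",
    num_workers=8,
    max_depth=6,
)
match_model = clf.fit(train_df)
```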
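Evaluation metrics and the fitted model were tracked with MLflow; a minimal sketch of that pattern, with hypothetical test data and an illustrative metric choice, could look like this.

```python
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import BinaryClassificationEvaluator

with mlflow.start_run(run_name="record-linkage-xgb"):
    # test_df (hypothetical): held-out labeled pairs, assembled as above.
    predictions = match_model.transform(test_df)

    # Area under the precision-recall curve suits the imbalanced
    # match/non-match distribution typical of record linkage.
    evaluator = BinaryClassificationEvaluator(
        labelCol="is_match", metricName="areaUnderPR"
    )
    mlflow.log_metric("auprc", evaluator.evaluate(predictions))
    mlflow.log_param("num_workers", 8)

    # Log the fitted Spark model so it can be loaded or served later.
    mlflow.spark.log_model(match_model, artifact_path="model")
```

A model logged this way can then be exposed for inference via MLflow serving, e.g. `mlflow models serve -m runs:/<run_id>/model`.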
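For scheduling, the pipeline ran as a Spark application on the client's Kubernetes cluster under Airflow control. The sketch assumes the SparkKubernetesOperator from the cncf.kubernetes provider and an existing SparkApplication manifest; DAG id, schedule, namespace, and paths are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="record_linkage_pipeline",
    start_date=datetime(2023, 6, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submits the SparkApplication manifest to the cluster; the Spark
    # operator then scales executors inside Kubernetes.
    run_linkage = SparkKubernetesOperator(
        task_id="run_linkage",
        namespace="spark-jobs",
        application_file="manifests/record-linkage-sparkapp.yaml",
        kubernetes_conn_id="kubernetes_default",
    )
```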
Result
The client could automatically and efficiently identify matching companies across the various data sources, enabling it to initiate manual prosecution proceedings against fraudulent companies.