Data platform development, productionizing personalized recommendations
eBay Classifieds Group / Adevinta (as of June 2021)
Data Engineering and MLOps in the Benelux team, known in NL for Marktplaats.nl and in BE for 2dehands.be/2ememain.be.
My work revolves around building, running and optimising data and machine learning pipelines in production, impacting millions of people through recommendations, search, fraud detection, and more.
Tech: Python, Scala, Spark, Hadoop, SQL, Hive, BigQuery, Cassandra, ElasticSearch, Airflow, Dagster, MLflow, Nomad, Kubernetes, Docker, Linux, CI/CD (Jenkins), Google Cloud Platform
Detecting and localizing infected tulips
H2L Robotics (part time @ 0.2 FTE)
H2L Robotics builds robots that drive autonomously through tulip fields, detecting and localizing sick tulips and applying a treatment.
I worked on the neural network that detects the tulips using cameras. This involved building tools around data processing (e.g. for annotations), writing Keras/TensorFlow code, and running many experiments to figure out what works and what doesn't.
Tech: Python, Keras, TensorFlow, TensorRT, CNNs / ConvNets, Object Detection, Keypoint Detection, MLflow, Linux, Docker, Amazon Web Services
Financial asset management data hub
Helped a financial asset management firm transition to a cloud-native data hub on AWS.
Tech: Python, Airflow, Spark, Linux, Docker, Amazon Web Services
Deep Turnaround was an initiative to detect events during the aircraft handling process with Deep Learning-based Computer Vision. I built the initial prototype and evangelized it internally. When we got funding to proceed, my main responsibility was detection with high accuracy: initially by implementing Object Detection models using TensorFlow Object Detection API, Kalman-filter based tracking and rule-based event detection. Later on, this transitioned into end-to-end learning on small video clips, using a custom Action Recognition approach based on a DeepMind paper.
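The Kalman-filter based tracking mentioned above can be sketched in miniature. This is a hypothetical one-dimensional constant-velocity filter (the real system tracked bounding boxes across video frames); the matrices and noise values here are illustrative assumptions:

```python
import numpy as np

# Minimal constant-velocity Kalman filter tracking a detection's
# x-position across frames. A hedged sketch, not the production tracker.
class KalmanTracker1D:
    def __init__(self, x0, dt=1.0):
        self.x = np.array([x0, 0.0])                 # state: [position, velocity]
        self.P = np.eye(2)                           # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
        self.H = np.array([[1.0, 0.0]])              # we observe position only
        self.Q = np.eye(2) * 0.01                    # process noise (assumed)
        self.R = np.array([[0.5]])                   # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, z):
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P

tracker = KalmanTracker1D(x0=0.0)
for z in [1.0, 2.1, 2.9, 4.2]:                       # noisy per-frame detections
    tracker.predict()
    tracker.update(np.array([z]))
```

The predict/update loop smooths noisy per-frame detections and keeps a velocity estimate, which is what lets a tracker bridge frames where the detector misses.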
Alongside that, I took on a significant chunk of engineering: ETL pipelines (from videos to de-duplicated, pre-processed images), low-level TensorFlow code, Airflow DAGs for hyperparameter searches and experimentation, annotation tooling, and maintaining Linux systems (on-prem and on Azure VMs).
In addition, I played an important role in getting the project off the ground: convincing legal to grant permission to use the data, winning over other stakeholders, determining and validating anonymization strategies, figuring out optimal camera positions, performing site surveys, building a strong team, and defining the roadmap and team priorities.
When I last checked in early 2022, the project was still going strong and had survived the COVID pandemic!
Tech: Python (pandas, numpy, keras, matplotlib, seaborn, click, OpenCV), TensorFlow, TensorFlow Object Detection API, TensorBoard, CNNs / ConvNets, Object Detection, YOLO, SSD, Faster R-CNN, ResNets, video activity recognition, Inception-based architectures, multi-task learning, Locality Similarity Hashing, Kalman filter tracking, Airflow, MLflow, Spark, Databricks, PostgreSQL, Linux, Azure
Predicted Off-Block Time (Departure Delay Prediction)
Schiphol has experienced tremendous growth over the last couple of years, and infrastructure has struggled to keep up. The inevitable result is increased delays. I set out to develop a model to predict delays and, in the process, to better understand the factors that drive them.
The final model that maximized predictive accuracy was a boosted tree (XGBoost) model with extensive feature engineering. The model improved existing estimates of departure time by 15% to 50%. I built an async flight API client that refreshes on a timer and shows predictions in a simple Flask UI. Later on, I built the first version of a low-latency streaming implementation using Spark Structured Streaming on Databricks.
Tech: Python (pandas, numpy, scikit-learn, matplotlib, Flask), Jupyter, SQL, Hive, Spark, Spark Structured Streaming, Databricks, random forest regressor, boosted trees (XGBoost), LIME and SHAP
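A minimal sketch of the boosted-tree approach, using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost; the features and delay values below are entirely synthetic, and the real feature engineering is not shown:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical engineered features: scheduled hour, day of week,
# number of aircraft already queued, plus a pure-noise column.
hour = rng.integers(5, 23, n)
dow = rng.integers(0, 7, n)
queue = rng.poisson(3, n)
noise = rng.normal(size=n)
X = np.column_stack([hour, dow, queue, noise])

# Synthetic delay in minutes: queue effect plus an evening rush-hour bump.
y = 2.0 * queue + 5.0 * ((hour >= 17) & (hour <= 19)) + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print(f"R^2 on held-out data: {model.score(X_te, y_te):.2f}")
```

Tree ensembles like this handle mixed categorical-ish and count features without scaling, which is part of why they suit delay prediction on tabular flight data.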
Wi-Fi-Sensor Based Location Analytics
Part of a team that developed a system to measure the presence of Wi-Fi radios using custom Wi-Fi sensors. In essence, this enables insight into the approximate number of people in an area, how long they remain there, which route they take, how often they come back, and so on. This system was deployed at various clients in retail, public transport, facility management and a football stadium.
I started out on the Data Engineering side and gradually transitioned into Data Science.
My engineering contributions:
- Designed, implemented and maintained a lambda-architecture big data platform.
- Developed streaming data processing code and real-time summary statistics using Apache Storm (Java).
- Built a framework to greatly simplify PySpark-based batch jobs, along with scheduling and monitoring for those jobs.
- Privacy by design: developed the anonymization pipeline involving a Trusted Third Party and a physical opt-out facility.
- Built a Python/Flask/MongoDB/jQuery/Bootstrap based configuration management tool to simplify the administration of sensor locations, regions, maps, geometry, etc. This saved many hours of menial work.
- Monitoring with Prometheus.
- Supervised junior Data Engineers.
- Developed a real-time Crowd Monitor for the KPMG Restaurant, with a short-term prediction (30 mins ahead). This was beneficial for internal marketing and helped our colleagues avoid crowds and queues.
- Analyzed data from 120 sensors in a large furniture store, working together with stakeholders to extract useful insights into shoppers' behavior.
- Pivoted the product into a version for workplace utilization and occupancy monitoring, allowing teams to be allocated to areas more efficiently and, potentially, entire sections of the building to be closed down (saving substantially on operating costs).
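The pseudonymization step from the privacy-by-design bullet could look roughly like this. A minimal sketch assuming keyed hashing (HMAC-SHA256) of MAC addresses; the key, opt-out list and function names are hypothetical, and the real design kept the secret key with a Trusted Third Party:

```python
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"rotated-secret-held-by-ttp"    # placeholder; real key held by the TTP
OPT_OUT = {"aa:bb:cc:dd:ee:ff"}               # MACs registered via the opt-out facility

def pseudonymize(mac: str) -> Optional[str]:
    """Return a keyed-hash pseudonym for a MAC address, or None if opted out."""
    mac = mac.lower()
    if mac in OPT_OUT:
        return None                           # honour the opt-out before any storage
    return hmac.new(SECRET_KEY, mac.encode(), hashlib.sha256).hexdigest()
```

A keyed hash (rather than a plain hash) matters here: the MAC address space is small enough to brute-force, so without a secret key the "anonymization" would be reversible.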
Fun side project:
- Prototyped an indoor navigation app for Android using iBeacons. Dijkstra-based routing, sensor fusion, proximity-triggered messaging managed by a Python backend.
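The Dijkstra-based routing from the side project, in miniature. The beacon graph and walking distances (in metres) are invented for illustration:

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over graph = {node: [(neighbour, distance), ...]}.

    Assumes goal is reachable from start (fine for this sketch).
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue                              # stale heap entry
        for nb, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    # Walk the predecessor chain back to the start.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

beacons = {
    "entrance": [("hall", 10), ("stairs", 25)],
    "hall": [("stairs", 5), ("room_a", 12)],
    "stairs": [("room_a", 8)],
    "room_a": [],
}
path, metres = shortest_path(beacons, "entrance", "room_a")
# path == ["entrance", "hall", "room_a"], metres == 22
```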
Tram & Metro Vehicle Maintenance Analysis
Public transport provider (via KPMG)
Public transport providers maintain expensive assets and malfunctions on the track can be quite disruptive to their travelers and society as a whole. If maintenance can be done earlier (preventing breakdowns) or more efficiently, this can translate into many euros saved.
I investigated how patterns in vehicle (sensor) data relate to vehicle maintenance records. This uncovered interesting insights, such as specific sections of track that cause significantly more wheel damage.
Tech: Python (pandas, numpy, scikit-learn, matplotlib), Jupyter notebook, Java, SQL, Hive, Hadoop, Spark, Hortonworks big data cluster, Linux, association rule mining, random forest classifier
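A toy sketch of the association-rule idea applied to maintenance records: count how often pairs of events co-occur and keep the pairs above a support threshold. The records, event names and threshold below are all invented:

```python
from collections import Counter
from itertools import combinations

# Each record is the set of events logged for one maintenance case.
records = [
    {"track_section_7", "wheel_damage"},
    {"track_section_7", "wheel_damage", "brake_check"},
    {"track_section_2", "brake_check"},
    {"track_section_7", "wheel_damage"},
]

def frequent_pairs(records, min_support=0.5):
    """Return pairs of events whose co-occurrence rate meets min_support."""
    n = len(records)
    counts = Counter()
    for rec in records:
        for pair in combinations(sorted(rec), 2):
            counts[pair] += 1
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(records)
# ("track_section_7", "wheel_damage") co-occurs in 3 of 4 records → support 0.75
```

This is the support-counting core of Apriori-style rule mining; a real implementation would also compute confidence and lift, and prune candidate itemsets.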
Public Transport Traveler Clustering
Public transport provider (via KPMG)
Since switching to electronic payment cards for public transport, a lot of data has been collected on behavior of travelers. This raises the question: can this data be utilized to create better products, more in line with travelers’ wishes?
I investigated how (anonymized) travelers can be assigned to clusters based on their behavior. I used Hive on a Hadoop cluster to calculate various normalized behavior indicators, applied the k-means clustering algorithm, visualized the results with matplotlib and Gephi, and facilitated the interpretation and validation of the results with business stakeholders.
Tech: Python (pandas, numpy, scikit-learn, matplotlib), Jupyter notebook, SQL, Hive, Hadoop, Spark, Hortonworks big data cluster, Linux, Gephi, k-means clustering, dimensionality reduction (PCA)
Highway Vehicle Intensity Prediction
Ministry of Infrastructure and Environment (Rijkswaterstaat; via KPMG)
The Dutch road administration has many terabytes of data from measurements of vehicles on the highways, made using induction loops embedded in the road. This results in noisy measurements of the number of vehicles, their length and velocity. The goal of this project was to investigate the possibilities of applying big data techniques to induction loop sensor data.
I developed a predictive model for the intensity on the road at any given time, based on historic intensity and weather data. The model predicted the standard weekly pattern quite accurately, including holiday effects and rush-hour traffic. Adding precipitation data reduced the error by 3%. Predicting traffic jams due to collisions and rare "black swan" events remained elusive, though. This was just a proof of concept, but the results were featured in a newspaper article in the NRC (Dutch).
Tech: Python (pandas, numpy, matplotlib, scikit-learn), Jupyter notebook, random forest regressor, gradient descent, time series prediction, autoregressive feature extraction
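A sketch of the autoregressive feature extraction: predict each hour's intensity from the same hour one day and one week earlier. The synthetic series below stands in for real induction-loop counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly traffic intensity over 8 weeks: a daily cycle,
# a weekly cycle, and measurement noise.
hours = np.arange(24 * 7 * 8)
daily = 100 + 50 * np.sin(2 * np.pi * hours / 24)
weekly = 20 * np.sin(2 * np.pi * hours / (24 * 7))
intensity = daily + weekly + np.random.default_rng(1).normal(0, 5, hours.size)

lag_day, lag_week = 24, 24 * 7
# Align the series so row i pairs intensity[t] with its lagged values.
X = np.column_stack([
    intensity[lag_week - lag_day:-lag_day],   # same hour, one day ago
    intensity[:-lag_week],                    # same hour, one week ago
])
y = intensity[lag_week:]

split = len(y) - 24 * 7                       # hold out the final week
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:split], y[:split])
print(f"held-out R^2: {model.score(X[split:], y[split:]):.2f}")
```

Lag features like these let a plain regressor capture the weekly pattern well, which matches the project's finding; one-off events leave no trace in the lags, which is why collisions stayed unpredictable.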