🔍 data science projects
⁍ Reducing high fatality accidents in the UK
Problem statement: What characterises high fatal accidents? Can we find areas of focus to reduce such accidents?

In this project, I have identified 9 improvement opportunities to reduce high fatal accidents and raised key questions to guide the exploration of such opportunities.

The project involves a lot of fun viz techniques such as plotting a dendrogram based on Ward's clustering method to visualise correlated features, and plotting a choropleth map of the UK which get your hands dirty with GeoJSON files.

To identify 11 key features of high fatal accidents, I applied feature selection using Chisquared test, mutual information gain, and treebased MDI. I then take a deeper look at these selected features to see which criteria that separates high fatal accidents and verify these criteria using a 2 proportion Ztest.

Since the dataset was extremely imbalanced with 200 high fatal accidents in a pool of 91000+ accidents (roughly 0.002 ratio), I implemented SMOTE to oversample the minority label since uniform downsampling is not suitable since location is a key feature.
This project is hosted on DataCamp Workspace.
⁍ Measuring the impact of a website redesign the Bayesian way
Problem statement: How can we determine if the change in observed conversion rates after website redesign is not due to random chance? Moreover, can we get an exact probability of how likely a test group is to perform better/worse than the control group?

In this mini project, I analysed the conversion rates of four different user groups (one of which is the control group) using pivot tables and bar plots to get a first big picture.

Then, using PyMC3, I applied Bayesian A/B testing based on a Metropolis sampler to measure which group is most likely to have a higher conversion rate than the control group. The Bayesian way yields a solid probability unlike the frequentist approach. For example, in this project, I was able to deduce that "there is a 99% probability that user group B has a higher conversion rate than the control group".

This exact probability gives us the confidence required to make important business decisions. In this case, we can be really confident in our decision of switching to the design version exposed to group B since there is only a 1% chance of B being worse than the control. This is one of the many awesome reasons why I favor the Bayesian approach for data tasks; this also helps us answer the second problem.
This project is hosted on DataCamp Workspace.
⁍ Regression analysis on TfL Santander cycles
Problem statement: To what extent can we use a lightweight, explainable model to predict cycle hires with lowest RMSE?
The idea of the project is to understand trends in the usage of TfL Santander Cycles (aka Boris bikes) and then predict cycle hires using linear models which are scalable and lightweight. In this project, I have

Investigated suitability of fitting a linear model by checking existence of trends in the residuals vs fitted plot.

Verified normality of cycle hires after YeoJohnson normalisation using a density, histogram and QQ plot.

Applied feature engineering methods such as forward search model selection, onehot encoding, moving average, and YeoJohnson normalisation.

Implemented a linear model with 99% lower RMSE (935.5 to 3.5) and 155% higher R 2 score (0.27 to 0.69) relative to the baseline naive linear model.
This project was hosted on Deepnote and then hosted on DataCamp Workspace.
⁍ Is handwashing really effective?
The Covid19 pandemic made me want to look back at the history of handwashing when it was originally introduced. In this data exploration, we are able to answer questions like: (i) "How significant was the change between prehandwashing era and posthandwashing era?" and (ii) "What is the 90% confidence interval of the mean of difference between these eras?".
This project was hosted on Deepnote and then hosted on DataCamp Workspace.
💻 web app projects
⁍ TradeLogger
This project is written in Python using the Flask framework. One of the problems with stock traders is that they do not keep the big picture of where they are in their trading journey. For example. if they are losing, do they realise that they are losing 5 trades consecutively? This tool is designed for traders to trade better. A machine learning pipeline is also added to learn the time series pattern of whether the trader has a good chance on the next trade given the outcomes of the previous trades.
⁍ allpow
This project is written in Python using Streamlit. It is a simple calculator to estimate one's monthly budget (in London by default). I originally made it to help my fellow Malaysian students who are coming to London for the first time organise their finances better. The first problem of coming to a new country far away from home (and also most probably first time renting a property in their life) is to estimate the prices of groceries, transport, rent and the like. This calculator solves that problem.
Link to the deployed calculator: click here
🔧 utilities
⁍ EasyPS
A simple, out of the box personal statement LaTeX framework. It helps handling duplicated tex files when writing personal statements for multiple universities, scholarships or jobs.