If you have ever bought insurance for a vehicle, you know the price varies a lot depending on your personal information and vehicle details that determine how likely you are to have an accident. Since we have access to a substantial amount of data, we decided to create our own tool that estimates, based on past accidents, which risk group you belong to.
Our estimator is based on the age and gender of the driver, but also on vehicle details such as its age and engine capacity.
Since our dataset did not contain any variable that could serve directly as a risk profile, we created a new variable calculated from the input variables, information about past accidents (including their severity and the number of people injured), and data about the entire population. These values are then used to create 100 risk groups.
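The write-up does not give the exact formula for the new variable, so the following is only a sketch of the binning step: a placeholder risk score (the real one also weighs accident severity, injuries and population data) is split into 100 equal-frequency risk groups with `pd.qcut`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and the score formula are
# assumptions for illustration, not the project's actual computation.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "driver_age": rng.integers(18, 80, 1000),
    "vehicle_age": rng.integers(0, 25, 1000),
    "engine_cc": rng.integers(800, 4000, 1000),
})

# Placeholder continuous risk score.
df["risk_score"] = (
    1.0 / df["driver_age"] + 0.01 * df["vehicle_age"] + df["engine_cc"] / 1e5
)

# Rank first so ties cannot collapse bins, then cut into 100
# equal-frequency groups labelled 0..99.
df["risk_group"] = pd.qcut(df["risk_score"].rank(method="first"),
                           q=100, labels=False)
print(df["risk_group"].nunique())
```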
Once our dataset had been augmented with this extra variable, we built a classifier that predicts the risk group.
For this purpose we used a simple decision tree, which was first tuned with the aid of cross-validation.
During this process we searched for the best values of `max_depth`, `min_samples_split` and `min_samples_leaf` using 5-fold CV. Although the accuracy of the classifier may seem rather low (50%), the actual performance for our purpose is much higher, because the predicted group is often off by only 1 or 2, which can still be considered a good approximation.
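The grid search described above can be sketched with scikit-learn's `GridSearchCV`; the synthetic features and the candidate values in the grid are assumptions, since the write-up does not state them.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for driver age, gender, vehicle age, engine capacity.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 100, size=500)  # 100 risk groups

# Candidate values are illustrative, not the project's exact grid.
param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```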
This could be improved by using a smaller number of risk groups, but for the sake of a fine-grained visualization we decided to go with a high number of groups.
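The "off by 1 or 2" idea can be captured with a small tolerance metric; this helper is a sketch (the write-up does not define such a metric explicitly), counting a prediction as correct if it lands within `tol` risk groups of the true one.

```python
import numpy as np

def tolerance_accuracy(y_true, y_pred, tol=2):
    """Fraction of predictions within `tol` groups of the true group."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= tol))

# 2 of 3 predictions fall within 2 groups of the truth.
print(tolerance_accuracy([10, 20, 30], [11, 23, 30], tol=2))
```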
So why is this useful and who can make use of it?
It's basically for everyone who is interested in estimating their risk profile and getting to know how an insurance company perceives them.
This visual is a map of road accidents clustered with the K-means algorithm, which takes n observations and partitions them into k clusters. You can change the number of clusters in the visual using the buttons.
Each small circle represents an accident, and its color indicates which cluster it belongs to.
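The clustering behind the map can be sketched with scikit-learn's `KMeans`; the coordinates below are synthetic stand-ins for the real accident locations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic accident coordinates (latitude, longitude).
rng = np.random.default_rng(1)
coords = rng.uniform([50.0, -5.0], [58.0, 1.0], size=(300, 2))

k = 5  # the visual lets you change this with the buttons
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)

labels = km.labels_            # cluster id per accident -> circle color
centers = km.cluster_centers_  # one center per cluster
print(centers.shape)
```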
Why is this interesting:
You could use this as part of a risk profile: if a person lives close to a cluster center, that person might be at greater risk of an accident than otherwise.
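One hypothetical way to turn this into a feature (not described in the write-up) is the distance from a person's home to the nearest cluster center; plain Euclidean distance is used here for simplicity, though haversine distance would be more appropriate for real latitude/longitude.

```python
import numpy as np

def nearest_cluster_distance(point, centers):
    """Distance from `point` to the closest cluster center (Euclidean)."""
    centers = np.asarray(centers, dtype=float)
    point = np.asarray(point, dtype=float)
    return float(np.min(np.linalg.norm(centers - point, axis=1)))

# Two illustrative cluster centers (lat, lon).
centers = [[51.5, -0.1], [53.5, -2.2]]
print(nearest_cluster_distance([51.6, -0.2], centers))
```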
Another interesting thing to look at is the severity of the accident. Unfortunately, we only have 3 values describing severity: slight injury, serious injury and fatal.
Nevertheless, we thought it was worth trying to predict the outcome of the accidents.
Since the data was highly imbalanced in terms of how many accidents of each kind it contained, we decided to try the k-nearest-neighbour algorithm with a small value of k.
It turned out to be quite a good choice: a small k was capable of dealing with the imbalance.
In fact, our dataset contained only 1.3% fatal accidents, 13.8% serious-injury accidents and 84.9% slight-injury accidents.
When tested on the complete dataset, our KNN algorithm produced almost the same proportions, with a deviation of ±1%.
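The proportion check can be sketched as follows; the features are synthetic and the class probabilities are taken from the figures above (1 = slight, 2 = serious, 3 = fatal), so the printed numbers are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features with the severity imbalance described above.
rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 5))
y = rng.choice([1, 2, 3], size=2000, p=[0.849, 0.138, 0.013])

# Small k, as in the write-up; predict on the complete dataset and
# compare predicted class proportions against the true ones.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict(X)

for cls in (1, 2, 3):
    print(cls, round(float(np.mean(y == cls)), 3),
          round(float(np.mean(pred == cls)), 3))
```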
Because this algorithm does not require model training and its output is deterministic, it does not need to be validated.
On the other hand, we can still use a validation technique such as CV to find the optimal set of parameters, which in our case are `n`= and `p`=, resulting in a high accuracy of 95%.
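Such a parameter search can be sketched with `GridSearchCV` over the number of neighbours and the Minkowski power parameter `p`; the candidate values and synthetic data here are assumptions, as the write-up does not state the exact grid.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the accident features and severity label.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = rng.integers(1, 4, size=400)

# p=1 is Manhattan distance, p=2 is Euclidean.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7], "p": [1, 2]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```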
Unfortunately, due to the large number of features and their possible values, we could not use the same approach as before of performing an exhaustive search and saving all possible combinations to a file so they can be loaded on the website.
More details on this algorithm and its implementation can be found in the following notebook.