Predicting Apartment Prices in London with Machine Learning

This article is a continuation of an earlier post in which I used Machine Learning to predict the rates of apartments in London. The post can be found here: https://www.linkedin.com/feed/update/urn:li:activity:6318031037922361344

As highlighted in the post, using Airbnb data out-of-the-box produced dreadful predictions. Indeed, not even an expert could tell whether a given rate was “normal” without knowing more about the apartment in question.

I spent the past couple of days playing around with the data. Summarisation techniques revealed that the relationship between the target (price) and the predictors (location, size, number of bedrooms, etc.) was pretty much linear. This suggested that simpler algorithms such as GLM (a linear model) or GBM (a tree-based ensemble) were likely to outperform more complex non-linear models such as Deep Learning.
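One common way to check how linear such a relationship is, is the Pearson correlation coefficient between a predictor and the price. The sketch below is only an illustration: the post does not say which summarisation techniques were used, and the `bedrooms` and `prices` values are made-up toy numbers, not taken from the Airbnb data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy values for illustration only (NOT from the Airbnb data).
bedrooms = [1, 2, 3, 4]
prices = [75, 125, 175, 225]
r = pearson(bedrooms, prices)
```

A value of `r` close to 1 or -1 points to a strongly linear relationship, while a value near 0 suggests that a straight line explains little of the variation.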

To improve model performance (i.e. generalisation), some data manipulation was required:

  1. A formula that scores each apartment's “condition” was developed. I had to work it out with an expert in the field.
  2. The zipcode field (aka Postcode) in the UK contains two parts (e.g. EC3A 1BX). This makes life much easier for postmen but makes generalisation more difficult. I simply stripped off the second part to widen the area covered. In this example, all apartments within EC3A (an area just large enough for generalisation) are treated alike.
  3. The prices were rounded to the nearest multiple of 25 (25, 50, 75 and so on up to 500); anything above 500 was treated as an outlier. This technique is called “binning” and it is very useful for improving generalisation.
  4. Finally, the price column was converted into a categorical variable. This effectively turned the problem from a “prediction” (regression) problem into a “classification” problem. In this case, the models tended to perform better when the problem was framed as classification.
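Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration rather than the code actually used: the function names, the row layout with `postcode` and `price` keys, and the choice to lift very low prices into the 25 bin are all assumptions. The condition formula from step 1 is not shown because the post does not describe it.

```python
def outward_code(postcode):
    """Keep only the first part of a UK postcode, e.g. 'EC3A 1BX' -> 'EC3A'.

    The second (inward) part of a UK postcode is always three characters,
    so it can be dropped even when the space is missing.
    """
    compact = "".join(postcode.split()).upper()
    return compact[:-3]


def bin_price(price, step=25, cap=500):
    """Round a price to the nearest multiple of `step`.

    Returns None for prices above `cap`, which are treated as outliers.
    Assumption: prices that would round down to 0 go into the lowest bin.
    """
    if price > cap:
        return None
    nearest = int(price / step + 0.5) * step  # round half up
    return max(step, nearest)


def prepare_row(row):
    """Apply steps 2-4 to one row (a dict); returns None for outliers."""
    price_bin = bin_price(row["price"])
    if price_bin is None:
        return None
    # Carry over every other field unchanged.
    out = {k: v for k, v in row.items() if k not in ("postcode", "price")}
    out["area"] = outward_code(row["postcode"])
    # Store the bin as a string label so it is handled as a category.
    out["price_class"] = str(price_bin)
    return out


# Hypothetical example row, for illustration only.
example = prepare_row({"postcode": "EC3A 1BX", "price": 113, "bedrooms": 2})
```

Storing the bin as a string label is one simple way to make sure a modelling library treats the target as a class rather than as a number.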

For this problem, H2O’s GBM (Gradient Boosting Machine) was used as it performed better than GLM. All data manipulation happened in Python.

A sample of the output is presented below:

I compared the top and bottom 50 apartments (price-wise) and I would say that the model performed reasonably well, considering that it is not fine-tuned. Unsurprisingly, though, the feature-engineered column turned out to be, by far, the most important predictor.

Route to Market

The generated model can be embedded into an app (web, mobile or both) that accepts an apartment's postcode, number of bedrooms, number of bathrooms, and overall condition (which the user can infer from the advertised images). The app can then use the ML model to provide a reliable price estimate, helping users decide on the apartment that best suits their budget and needs.