The real estate company 'ImmoEliza' was very happy with our previously built regression model. They would like you to create an API so that their web developers can build a website around it.
Housing price prediction is an essential element of an economy: the analysis of price ranges influences both sellers and buyers.
The API is hosted on Heroku and is documented here.
Our previous project created a Linear Regression model to accurately estimate the price of a house from the given features. Within this project, an online API to run the price prediction was built.
An overview of the project is available as a Google presentation.
The goal: let 'ImmoEliza' web developers receive price predictions based on input values provided by a user.
- Be able to deploy a machine learning model.
- Be able to create a Flask API that can handle a machine learning model.
- Deploy an API to Heroku with Docker.
Project evaluation is based on compliance with the following criteria:
| Criteria | Indicator |
|---|---|
| Is complete | Your API works. |
| | Your API is wrapped in a Docker image. |
| | Pimp up the README (what, why, how, who). |
| | Your model predicts. |
| | Your API is deployed on Heroku. |
| Is good | The repo doesn't contain unnecessary files. |
| | You used typing. |
| | The presentation is clean. |
| | The web-dev group understood well how your API works. |
Input requirements for the web developers were first agreed with the other Becode teams. The team then agreed on how to fulfil the required development steps and how to split them among its members. Step 0 (team management) was added.
- Team organization
- Project preparation
- Scraping the immoweb website for updates
- Cleaning the dataset for database storage
- Set up of an online database
- Querying the online database
- Pre-processing pipeline
- Fit your data!
- Create your API
- Create a Dockerfile to wrap your API
- Deploy your Docker image in Heroku
- Document your API
For steps 2 and 3, source files from the previous project were used as a basis after adapting them in line with the JSON input requirements.
The agreed requirements for the JSON to be provided by the web developers are as follows:
```
{
  "data": {
    "area": int,
    "property-type": "APARTMENT" | "HOUSE" | "OTHERS",
    "rooms-number": int,
    "zip-code": int,
    "land-area": Optional[int],
    "garden": Optional[bool],
    "garden-area": Optional[int],
    "equipped-kitchen": Optional[bool],
    "full-address": Optional[str],
    "swimmingpool": Optional[bool],
    "furnished": Optional[bool],
    "open-fire": Optional[bool],
    "terrace": Optional[bool],
    "terrace-area": Optional[int],
    "facades-number": Optional[int],
    "building-state": Optional["NEW" | "GOOD" | "TO RENOVATE" | "JUST RENOVATED" | "TO REBUILD"]
  }
}
```
Output requirements were not strictly fixed; instead, each team received a reference template and was free to adapt it:
```
response = {
  prediction: {
    price: int,
    test_size: int,
    median_absolute_error: float,
    max_error: float,
    percentile025: float,
    percentile975: float
  },
  error: str
}
```
An HTTP status code is also provided in case of error.
During the introduction meeting it emerged that team members wanted an organisation that avoids overlaps and meets the project requirements without rushing.
A Trello board (Kanban template) was organised and team members agreed on how to split the required development steps. For each step a responsible person was chosen; support could be provided on request. A Trello list with general useful links, guidelines and tips was provided.
Morning meetings were scheduled to check the status of the project and plan the day's activities. Afternoon meetings were scheduled at lunch, if the morning tasks were completed, to set new goals.
An interim review was set up on 4/12/20 to upload an already working version before refining it.
Organisation and coordination after the first deployment on Heroku on 4/12/2020 posed a challenge, since many small, overlapping improvements had to be tested.
A repository was prepared to fulfil the requirements:
- Create a folder to handle your project.
- Create a file app.py that will contain the code for your API.
- Create a folder preprocessing that will contain all the code to preprocess your data.
- Create a folder model that will contain your model.
- Create a folder predict that will contain all the code to predict a price.
All these main folders, exclusively dedicated to the API, were created in a source folder. Additional folders (assets, data, docs and outputs) were created for the project.
Even with git connection problems it was still possible to push orderly changes to the repository through GitHub. It wasn't initially clear how to split the code between the model creation/evaluation and the API service, so some additional time was spent restructuring the files afterwards.
The code from our immoweb scraping challenge at Becode was integrated to run inside a Docker container. It is based on Selenium WebDriver. We had to switch from Chromium to Firefox, because the headless mode of Chromium returned anti-bot captcha pages while Firefox did not.
The DataFeatures class in validation.py validates whether the client stuck to the required JSON schema and stayed within realistic value ranges. It can return human-readable errors as well as a corrected JSON in which, for example, "0" is coerced to the integer 0. The class is based on the marshmallow library.
The file cleaning_data.py contains all the code used to preprocess the validated request before predicting a new price. Optional values not provided by the client but needed by our model are imputed to 0 (e.g. does not have a swimming pool) or 1 (e.g. garden-area is 0). We use 1 to represent a zero area value because the model feature is actually log(area).
The pre-processing was split into two distinct steps: first the validation of the request, then the formatting of the values to comply with the model requirements.
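The imputation of missing optional values can be sketched as follows. Field names follow the request schema and the defaults are the ones described above, but the mapping itself is a hypothetical illustration:

```python
import math

# Defaults for optional fields the model needs: booleans default to 0
# (feature absent), areas default to 1 because the model consumes
# log(area) and log(1) == 0.
DEFAULTS = {
    "swimmingpool": 0,
    "garden": 0,
    "garden-area": 1,
    "terrace": 0,
    "terrace-area": 1,
}

def impute(request_data: dict) -> dict:
    """Fill in missing optional values with model-friendly defaults."""
    filled = dict(request_data)
    for key, default in DEFAULTS.items():
        if filled.get(key) is None:
            filled[key] = default
    return filled

features = impute({"area": 120, "garden": True})
assert math.log(features["garden-area"]) == 0.0  # imputed area is neutral in log space
```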
In the predict folder a file prediction.py contains all the code used to predict a new house's price. The file contains a function predict() that takes the preprocessed data as an input and returns a price as output.
Instead of providing only a single model, one model per property-type was provided. Model performance was tested only once and then stored as CSV in the model folder for later retrieval.
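A minimal sketch of such a predict() function, assuming one linear model on log(area) per property type. The coefficients below are invented placeholders, not the trained values, which would be loaded from the model folder:

```python
import math

# Hypothetical per-property-type coefficients: (intercept, slope on log(area)).
MODELS = {
    "HOUSE": (10.5, 0.90),
    "APARTMENT": (10.2, 0.85),
    "OTHERS": (10.0, 0.80),
}

def predict(preprocessed: dict) -> int:
    """Return a price estimate (int) from preprocessed features."""
    intercept, slope = MODELS[preprocessed["property-type"]]
    # The model works in log space, so exponentiate to get a price.
    log_price = intercept + slope * math.log(preprocessed["area"])
    return int(round(math.exp(log_price)))
```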
In the app.py file, the Flask API contains:
- A route at / that accepts:
  - GET requests, returning "alive" if the server is alive.
- A route at /predict that accepts:
  - POST requests receiving the data of a house in JSON format.
  - GET requests returning a string explaining what the POST expects (data and format).
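These routes can be sketched with Flask as follows. This is a minimal illustration: the prediction is stubbed out and error handling is reduced to the basics:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/", methods=["GET"])
def alive():
    # Health check used by the web developers.
    return "alive"

@app.route("/predict", methods=["GET", "POST"])
def predict_route():
    if request.method == "GET":
        # Explain what the POST expects.
        return 'POST a JSON body: {"data": {"area": int, ...}}'
    payload = request.get_json(silent=True)
    if payload is None or "data" not in payload:
        # Error string plus an HTTP status code, as agreed.
        return jsonify({"error": "invalid JSON body"}), 400
    # Placeholder prediction; the real app validates, preprocesses
    # and calls the trained model here.
    return jsonify({"prediction": {"price": 0}, "error": None}), 200
```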
The complete documentation about the API is available here.
To deploy the API, Docker was used. The Dockerfile creates an image with Ubuntu and Python 3.8 plus all the required dependencies for the code:

| library | version |
|---|---|
| click | 7.1.2 |
| Flask | 1.1.2 |
| gunicorn | 20.0.4 |
| itsdangerous | 1.1.0 |
| Jinja2 | 2.11.2 |
| MarkupSafe | 1.1.1 |
| marshmallow | 3.9.1 |
| numpy | 1.19.4 |
| pandas | 1.1.4 |
| python-dateutil | 2.8.1 |
| pytz | 2020.4 |
| six | 1.15.0 |
| Werkzeug | 1.0.1 |
First we had to find a base Docker image. We found an existing image with Ubuntu and Python 3.8 already installed, but it was too big (1.2 GB), so in the end we started from the latest official Ubuntu image, which already shipped Python 3.8, and installed python3-pip ourselves.
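Such a Dockerfile could look roughly like this. It is a sketch under the assumptions above; the file names and the gunicorn entry point are illustrative:

```dockerfile
# Official Ubuntu image; 20.04 ships Python 3.8.
FROM ubuntu:20.04

# Install pip ourselves instead of pulling a bloated Python base image.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

# Heroku injects the port to bind to via $PORT.
CMD gunicorn --bind 0.0.0.0:$PORT app:app
```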
Heroku allowed us to push the Docker container to their server and start it (more information here).
Making log messages available in the Heroku web UI required special attention. It turned out that Python print statements by themselves were not reflected and stdout additionally needed flushing.
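A quick illustration of the fix in plain Python; the `log` helper is hypothetical, the point is the explicit flush:

```python
import sys

def log(message: str) -> str:
    """Print a log line and flush so Heroku's log router sees it immediately."""
    print(message, flush=True)
    # Equivalent alternative: print(message); sys.stdout.flush()
    return message

log("prediction served")
```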
The API is documented here.
The answers to the main questions about the API follow.
What routes are available? With which methods?
Two routes are available; you'll find more information on them here.
What kind of data is expected (and how should it be formatted)?
Here's a link to the expected data entity.
Some fields are mandatory and some must satisfy validation conditions.
What is the output of each route in case of success?
Here's a link to more detail about the return entity.
What is the output in case of error?
Here's a link to all the possible errors
The first release of the API was made easier by using information from the previous project and by splitting the work among all the team members. Additional small improvements posed a challenge since their impact had to be verified at every step, involving different platforms/tools (compared to a simpler code integration). The extension for database storing and updating was hampered by reduced working time due to unplanned Becode events and team members' personal reasons.
The project is considered concluded and no additional work is foreseen. However, a few possible improvements on the modelling part are suggested here to increase its accuracy:
- scraping more data online, including other key parameters (e.g. building construction year)
- make full use of other available reliable datasets (e.g. official statistics) to improve the model
- replace linear regression (degree equal to one) with a polynomial when relevant
- use logarithmic transformations only when relevant
- 2/12/2020 (start)
- 9/12/2020 (code deliverable)
- 11/12/2020 (extension for database storing and update). Only 5 person-days available from 10/12/20 to 11/12/20 due to unplanned Becode events and team members' personal reasons.
- 14/12/2020 (presentation deliverable)