Using BeautifulSoup to web-scrape rent listings and predict asking prices based on their attributes

Rodrigo Dutcosky
5 min read · Apr 20, 2021

Hey there,

Bad jokes from the cover picture aside: the decision-making process really did go through a transformation this decade, with the growth in data volume as its main driver.

Whatever methods an analysis relies on, however complex they get, they all share data as a common ingredient. In this post I will introduce a simple approach to web-scraping tasks: BeautifulSoup.

Steps in this post

  1. Find a house rental listings website that allows direct requests.
  2. Scrape and pre-process the data from these listings to find attributes common to all rentals.
  3. Train a regression model to predict the rent price using those common attributes as input.

BeautifulSoup

Before anything else, it's important to mention that you might run into websites that won't allow you to request their URL directly. I bumped into one or two while searching for a rental listings webpage.

In this example I am using this website, but I added extra parameters to the URL string to get to the specific page I wanted.

The soup.prettify() method is a convenient way to inspect the HTML tags that make up the website's front-end. Most of the effort in scraping a webpage goes into finding the right HTML tags.
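Here is a minimal sketch of that inspection step (the URL below is a placeholder; the actual listings page and its extra parameters are only linked in the post):

import requests
from bs4 import BeautifulSoup

# Placeholder URL: swap in the real listings page plus query parameters.
URL = 'https://www.example-rentals.com/search?city=curitiba'

soup = BeautifulSoup(requests.get(URL).content, 'html5lib')

# prettify() renders the parsed HTML with indentation, which makes it
# much easier to hunt for the right tags. Print just the first chunk.
print(soup.prettify()[:1000])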

There are plenty of libraries that can be used for web-scraping, each with its own pros and cons. Although they help a lot, you will still need to analyze the page content yourself.

The webpage front-end is filled with tags. In my case, I found a pattern across the listings:

It seems the listings keep their info inside <marker>, one of the HTML tags used in the page's code.

BeautifulSoup represents these tags as Tag objects. Once I assign an instance of the class to a variable named soup, I can track down every element matching a given tag name with the find_all() method.

soup = BeautifulSoup(requests.get(URL).content, 'html5lib')

# find_all() returns a list of every element matching the parsed tag name.
html_tags = soup.find_all(name='marker')

# Print the attributes of the first element in the list.
first_tag = html_tags[0]
print('Price: {}'.format(first_tag['preco']))
print('Area: {}'.format(first_tag['area_total']))
print('Rooms: {}'.format(first_tag['qtd_quartos']))
[PRINT OUTPUT]
Price: 1.150.000,00
Area: 259.37
Rooms: 3

Seems easy, right? Yeah, OK... I confess I picked an easy one to write this post. But once you practice these tag searches on random HTML code, it shouldn't get much harder than that.

Now, instead of printing the values of the first tag element in the list, I will loop over all of them and transform them into a pandas DataFrame.

Code
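A minimal sketch of what that loop might look like. The bathroom and parking attribute names (qtd_banheiros, qtd_vagas) are assumptions on my part; only preco, area_total and qtd_quartos appear in the snippet above:

import pandas as pd

records = []
for tag in html_tags:
    records.append({
        'total_area': tag.get('area_total'),
        'bathrooms': tag.get('qtd_banheiros'),  # assumed attribute name
        'rooms': tag.get('qtd_quartos'),
        'parking': tag.get('qtd_vagas'),        # assumed attribute name
        'price': tag.get('preco'),
    })

df = pd.DataFrame(records)
print(df.head())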

Output

Remember, this data is entered by humans, so the chances of errors are higher. Also, all features come in as string type (of course! We scraped the actual text from the webpage). Still, it's real data, from actual rent listings!

Now I can pre-process all this data and use it as input for a regression model.

Pre-processing
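A minimal sketch of this step, assuming the DataFrame built above. Prices arrive as Brazilian-formatted strings like '1.150.000,00', with '.' as the thousands separator and ',' as the decimal separator:

# Swap the Brazilian number formatting before casting price to float.
df['price'] = (
    df['price']
    .str.replace('.', '', regex=False)
    .str.replace(',', '.', regex=False)
    .astype(float)
)

# Cast the remaining scraped strings to numbers and drop broken rows.
for col in ['total_area', 'bathrooms', 'rooms', 'parking']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df = df.dropna()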

Output

This leaves us with a bit more than 1,000 listings of monthly apartment rents located in my hometown, Curitiba.

The feature we want to predict is price, so we will train a model to do exactly that, using the rental's other attributes as input: total area and the number of bathrooms, rooms, and parking spots.

I will use scikit-learn's pre-built DecisionTreeRegressor class. My input features are not exactly on the same scale, and using StandardScaler would be one option to solve this "issue".

I've read a lot of discussion on this topic, since I had the same doubt myself. It seems that tree-based models are not as impacted by differing scales among feature values as other machine learning models are. It makes sense if you think about how tree models operate: they find the best feature/value pair to split the data into nodes based on some criterion. There are a few well-known options for that criterion, but they all come down to selecting a single feature and its threshold at a time, so rescaling a feature doesn't change which splits win.
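A quick toy check of that intuition: standardizing is a monotonic transformation of each feature, so a decision tree finds the same splits and returns the same predictions either way.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with features on wildly different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 10_000, 200)])
y = 3 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

scaler = StandardScaler().fit(X)
raw = DecisionTreeRegressor(random_state=0).fit(X, y)
scaled = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X), y)

# Same predictions with or without scaling (up to floating-point effects).
print(np.allclose(raw.predict(X), scaled.predict(scaler.transform(X))))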

I won't be scaling the data this time. The purpose of this final section is not to chase a good metric for the model's predictions... but I would like to hear your thoughts on this topic if you have time. Hit me up!

Code
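A minimal sketch of the training step, assuming the column names from the pre-processing sketch above:

from sklearn.tree import DecisionTreeRegressor

# Feature order matters: it must match the array we predict on below.
X = df[['total_area', 'bathrooms', 'rooms', 'parking']].values
y = df['price'].values

tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X, y)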

I ended up not even splitting my data into train/test sets. Not a practice to be followed either... I was just curious to see what prediction I would get for the attributes I'm personally looking for in a rental. By any chance, do you know someone looking to rent the following?

import numpy as np

# Attributes of my ideal place to rent
total_area = 75
bath_qnt = 2
room_qnty = 2
park_qnty = 1

desired_attributes = np.array([
    [total_area, bath_qnt, room_qnty, park_qnty]
])

print('Predicted Rent Price: R$ {:0.0f}'.format(
    tree_model.predict(desired_attributes)[0]
))
[PRINT OUTPUT]
Predicted Rent Price: R$ 1343

That seems like a fair price to charge! I don't know why I can't find anything in this range... Maybe it's a shitty model? Nah, the economy is killing us all.

Before you Go

Thanks for reading! I hope you enjoyed it. If you didn't notice before, the picture at the top of this post is a Figure object built with plotly, my favorite visualization library.

I will share the script below. See you next time!
