From Web Scraping to Useful Data Frames — How to Scrape a Website

May 20, 2020

This blog post was written by WeCloudData's Big Data course student Laurent Risser.

Toronto is known for its crazy housing market. It’s getting harder and harder to find an affordable and convenient place. Searching for “How to find an apartment in Toronto” on Google leads to dozens of pages of advice, which is a pretty good indicator that apartment hunting is a painful process.

As a Data Scientist trainee, I was sure that I could alleviate this pain a bit and simplify the way people search for a place to live. The project I came up with aims to find out the relationships between the price of an apartment in the Greater Toronto Area (GTA), its location, surface, and the number of bedrooms. The business idea of this project is to help apartment seekers find the best deal across different neighbourhoods in the GTA.

To conduct this project, I decided to use the popular website Craigslist. My idea was to extract the data from the website using Beautiful Soup, a Python (version 3.7.4) web scraping library.

To keep everyone here awake, I have divided this project into two parts: the first part covers the web scraping and data frame generation, and the second part focuses on the analysis and predictions [coming sooner than you expect…].

So, what can I extract from Craigslist?

Craigslist apartment listings for Toronto are located at https://toronto.craigslist.org/d/apts-housing-for-rent/search/apa

To begin, I needed to get the website's URL. To make it cleaner, I filtered out the posts without pictures to narrow down the search a little. Even though the result is not a 'true' base URL, it is still good for our purpose here.

Then I created my attack plan in four steps:

  • Understand the data
  • Collect the data
  • Create a dataset
  • Clean the dataset

Before digging into each step: I used several Python packages in this project, but will only touch upon the most relevant ones. Chief among them is Beautiful Soup from bs4, the module that parses the HTML of the web page retrieved from the server. After parsing a page, I quickly checked the type and length of the result to make sure it matched the number of posts on the page (the default is 120 posts per page).

In case you are interested in details, here is a list of the packages needed for this project:
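The package list itself appears to have been lost in formatting. Based on the packages mentioned throughout the post, a plausible reconstruction (the exact list is an assumption) would be:

```python
# Reconstructed package list (assumed from the steps described in the post)
from requests import get           # fetch each Craigslist results page
from bs4 import BeautifulSoup      # parse the returned HTML
import numpy as np                 # represent missing values as NaN
import pandas as pd                # build and clean the data frame
from time import sleep             # pause between requests
from random import randint         # randomize the pause length
from warnings import warn          # warn on non-200 status codes
from datetime import datetime      # work with post date-times
```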

Understand the Data (the website)

I used the get function from the requests package in Python. I defined a variable response and assigned it the result of calling get on the base URL. By 'base URL' I mean the URL of the first page you want to pull data from.
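A minimal sketch of that request, wrapped in a small helper; the hasPic=1 query parameter (listings with pictures only) is an assumption about what the filtered URL looks like:

```python
from warnings import warn
from requests import get

# The 'base URL': the first results page, filtered to posts with pictures.
# hasPic=1 is an assumed query parameter for that filter.
base_url = ("https://toronto.craigslist.org/d/apts-housing-for-rent/"
            "search/apa?hasPic=1")

def fetch_page(url):
    """GET one results page, warning if the server does not return HTTP 200."""
    response = get(url)
    if response.status_code != 200:
        warn(f"Request returned status code {response.status_code}")
    return response

# response = fetch_page(base_url)  # uncomment to fetch the live page
```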

A typical post on Craigslist, useful to confirm the four fields of data: price, surface, location, number of bedrooms. Source: Author

Then, to do the scraping correctly, I needed to understand how the website was organized. To do that, I performed a basic search on Craigslist and inspected the page's HTML in the browser's developer tools. Looking at the screenshot below, you can see <li class="result-row"> on the right side. This is the tag you want to find for a single post; it is the box that contains all the elements I needed!

Craigslist, source: Author
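As a sketch of what finding those boxes looks like in code, here is the same lookup run against a tiny stand-in snippet instead of the live page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a results page: each listing sits in its own
# <li class="result-row"> element, the tag identified above.
sample_html = """
<ul class="rows">
  <li class="result-row"><a href="#">Post 1</a></li>
  <li class="result-row"><a href="#">Post 2</a></li>
  <li class="result-row"><a href="#">Post 3</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
posts = soup.find_all("li", class_="result-row")
print(len(posts))  # one entry per listing; a real page holds up to 120
```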

Collect the data

To make an initial quick test, I worked in the following way: I grabbed the first post and all the variables I wanted from it, and made sure I knew how to access each of them before looping the whole page. In the same manner, I then made sure I could successfully scrape one page before adding the loop that goes through all pages.
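A sketch of that quick test on a single post; the class names here mirror Craigslist's markup at the time and are assumptions, not the author's exact code:

```python
from bs4 import BeautifulSoup

# One simplified result row containing the four fields of interest
post_html = """
<li class="result-row">
  <time class="result-date" datetime="2020-05-01 10:15">May  1</time>
  <a class="result-title hdrlnk" href="https://example.org/post">2BR apartment</a>
  <span class="result-price">$2,100</span>
  <span class="housing"> 2br - 900ft2 - </span>
  <span class="result-hood"> (Downtown)</span>
</li>
"""

post = BeautifulSoup(post_html, "html.parser").find("li", class_="result-row")
title = post.find("a", class_="result-title").text
price = post.find("span", class_="result-price").text
hood = post.find("span", class_="result-hood").text.strip()
posted = post.find("time")["datetime"]
print(title, price, hood, posted)
```

Once each field is reachable like this on one post, the same lookups can be looped over every result row on the page.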

So what does the loop that I designed to extract the data look like? These are the details of the ‘for’ loop I used in my project:

  • For each page in pages:
    – If the page returns a status code other than 200, issue a warning
    – For each post in the page's posts:
      – If the post is not missing neighbourhood information:
        – Add the post's date-time to the list of date-times
        – Add the post's neighbourhood to the list of neighbourhoods
        – Add the post's title to the list of titles
        – Add the post's link to the list of links
        – Add the cleaned price to the list of prices
        – Add the surface to the list of surfaces
        – Add the number of bedrooms to the list of bedrooms
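The steps above might translate into Python roughly as follows; the class names and in-loop cleaning are assumptions based on the description, and the author's real code is in the linked GitHub repo:

```python
from warnings import warn
from random import randint
from time import sleep
from bs4 import BeautifulSoup
from requests import get

def scrape_pages(page_urls):
    """Loop over result pages and collect one record per post that has
    neighbourhood information, mirroring the pseudo-code above."""
    datetimes, hoods, titles, links, prices, surfaces, bedrooms = (
        [] for _ in range(7))

    for url in page_urls:
        response = get(url)
        if response.status_code != 200:          # warn on a bad page
            warn(f"Status code {response.status_code} for {url}")
        soup = BeautifulSoup(response.text, "html.parser")

        for post in soup.find_all("li", class_="result-row"):
            hood = post.find("span", class_="result-hood")
            if hood is None:                     # skip posts missing it
                continue
            datetimes.append(post.find("time")["datetime"])
            hoods.append(hood.text.strip())
            title = post.find("a", class_="result-title")
            titles.append(title.text)
            links.append(title["href"])
            # strip '$' and ',' so the price can become an integer later
            prices.append(post.find("span", class_="result-price").text
                          .replace("$", "").replace(",", ""))
            housing = post.find("span", class_="housing")
            text = housing.text if housing else ""
            # 'ft2' marks the surface token, 'br' the bedroom count
            surfaces.append(next((t.replace("ft2", "")
                                  for t in text.split() if "ft2" in t), None))
            bedrooms.append(next((t.replace("br", "")
                                  for t in text.split() if "br" in t), None))
        sleep(randint(1, 3))                     # be polite to the server
    return datetimes, hoods, titles, links, prices, surfaces, bedrooms
```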

Feel free to access the full code on GitHub.

I also included some data-cleaning steps in the loop, like pulling the 'datetime' attribute, removing the 'ft2' suffix from the square-footage variable and converting that value to an integer, and removing the 'br' suffix from the number of bedrooms, which was scraped along with the count.

With these additional steps, I started the data cleaning with some work already done, which is always good, right?

Create a dataset

After I extracted the data with the loop above, I saved it into a data frame and kept the following columns: Date Posted, Neighbourhood, Post Title, URL, Price, Surface, Number of Bedrooms.
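A sketch of that step, with the collected lists abbreviated to a single illustrative row (not real scraped data):

```python
import pandas as pd

# Assemble the scraped lists into a data frame with the columns above;
# in the real project each list holds one value per scraped post.
df = pd.DataFrame({
    "Date Posted": ["2020-05-01 10:15"],
    "Neighbourhood": ["(Downtown)"],
    "Post Title": ["2BR apartment"],
    "URL": ["https://example.org/post"],
    "Price": ["$2,100"],
    "Surface": ["900"],
    "Number of Bedrooms": ["2"],
})
print(df.shape)  # (rows, 7 columns)
```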

Source: Author

Clean the dataset

Next, I needed to further clean the data set by modifying the class of some objects and removing others. Here are the additional steps required:

  • Turned DateTime string into a DateTime object.
  • Removed $ and converted Price to an integer.
  • Converted Bedrooms to float.
  • Removed () from the Neighborhood column.
  • Changed missing values in Price and Sqft to NaN type and removed them.
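The five steps above could look roughly like this in pandas, shown here on a toy two-row frame (the column names and raw values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the raw scrape; the second row has a missing price
df = pd.DataFrame({
    "Date Posted": ["2020-05-01 10:15", "2020-05-02 14:30"],
    "Neighbourhood": ["(Downtown)", "(Scarborough)"],
    "Price": ["$2,100", ""],
    "Sqft": ["900", "750"],
    "Bedrooms": ["2", "1"],
})

df["Date Posted"] = pd.to_datetime(df["Date Posted"])       # string -> datetime
df["Price"] = (df["Price"].str.replace("$", "", regex=False)
                          .str.replace(",", "", regex=False))
df["Bedrooms"] = df["Bedrooms"].astype(float)               # bedrooms as float
df["Neighbourhood"] = df["Neighbourhood"].str.strip("()")   # drop parentheses
df[["Price", "Sqft"]] = df[["Price", "Sqft"]].replace("", np.nan)
df = df.dropna(subset=["Price", "Sqft"])                    # drop missing rows
df["Price"] = df["Price"].astype(int)                       # price as integer
print(len(df))  # rows remaining after cleaning
```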

Surprisingly, after I did all this cleaning, I ended up with 101 rows, and only 53 rows with values for the surface. This is not the ideal sample size we would like, but let’s see what we can get from it.

The resulting data frame after cleaning, source: Author

Now that the dataset is ready to go, I can analyze it.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see the learning path. To read more posts from Laurent, check out his Medium posts here.
