
Kijiji House Price Analysis using Python

October 28, 2019

This is the first project that I have done for WeCloudData. Its purpose is to examine how housing prices in Toronto (GTA) relate to location, house size, number of bedrooms, and number of bathrooms.

We start by scraping data from Kijiji through URL requests.

Then we parse the page source using Beautiful Soup.
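The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes the standard-library `urllib` for the request, and the `listing`, `title`, and `price` class names and the sample HTML are hypothetical placeholders. Kijiji's real markup will differ.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download one page and return its HTML (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_listings(html):
    """Extract the title and price text from every listing block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "listing", "title" and "price" are hypothetical class names.
    for card in soup.find_all("div", class_="listing"):
        title = card.find(class_="title")
        price = card.find(class_="price")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return rows


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="listing"><span class="title">Detached house</span><span class="price">$750,000</span></div>
<div class="listing"><span class="title">Condo</span></div>
"""

listings = parse_listings(SAMPLE_HTML)
print(listings)
```

Keeping the download and the parsing in separate functions makes the parser easy to check against saved pages without sending requests to the site.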

Screen Shot 2017-10-22 at 9.21.23 AM.png

Next, we turn the raw data into a clean dataset. The first five rows of our data are shown below:

Screen Shot 2017-10-21 at 8.59.37 PM.png

Price is the housing price in Toronto; postcode is the corresponding postcode; FSA is the first three characters of the postcode; bedroom is the number of bedrooms in the house; bath is the number of bathrooms in the house; sqrt is the size of the house in square feet; city is the location; and latitude and longitude give the coordinates of the house.
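As a rough illustration of how two of these columns could be derived from raw scraped strings, here is a minimal sketch. The helper names `extract_fsa` and `parse_price` are hypothetical, and the assumption that prices arrive as strings such as "$750,000" is ours, not something the post states.

```python
import re


def extract_fsa(postcode):
    """Return the first three characters of a postcode, or None if too short."""
    if not postcode:
        return None
    compact = re.sub(r"\s+", "", postcode).upper()
    return compact[:3] if len(compact) >= 3 else None


def parse_price(text):
    """Turn a price string such as '$750,000' into a float, or None if not numeric."""
    digits = re.sub(r"[^\d.]", "", text or "")
    try:
        return float(digits)
    except ValueError:
        # Covers empty strings and text such as "Please Contact".
        return None


print(extract_fsa("m5v 2t6"))
print(parse_price("$750,000"))
print(parse_price("Please Contact"))
```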

As you can see, there is missing data, so the next step is to handle the missing values. We first find the percentage of missing data in each feature. Apart from sqrt, the features have only a few missing values. However, sqrt is missing in more than 60 percent of the rows, so we consider this variable unusable. We therefore drop the sqrt column and then delete the rows that are missing any of the other features. The first five rows of the new dataset are shown below:

Screen Shot 2017-10-21 at 9.12.40 PM.png
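The missing-data handling described above might look like the following in pandas. This is a sketch on made-up rows: the column names follow the post, while the values and the use of 60 percent as a hard cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the post.
df = pd.DataFrame({
    "price": [750000, 620000, np.nan, 910000, 480000],
    "bedroom": [3, 2, 4, np.nan, 1],
    "bath": [2, 1, 3, 2, 1],
    "sqrt": [np.nan, 1200, np.nan, np.nan, np.nan],
})

# Percentage of missing values in each column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Drop columns that are mostly empty (the 60 percent cutoff is an assumption).
too_sparse = missing_pct[missing_pct > 60].index
cleaned = df.drop(columns=too_sparse)

# Then drop the rows that are still missing any remaining feature.
cleaned = cleaned.dropna().reset_index(drop=True)
print(cleaned)
```

Dropping the sparse column first matters: calling `dropna()` while sqrt is still present would discard most of the rows.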

The next step is to examine the relationships between the features. The city cannot be used directly because it is a categorical variable, so we replace each city with the mean housing price in that city, which turns the categorical variable into a quantitative one. With this done, we can draw a box plot and delete the outliers. We first use SciPy to draw a box plot of the housing prices, which reveals a huge outlier.

Screen Shot 2017-10-21 at 9.38.18 PM.png
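Replacing a city with its mean price can be sketched in pandas as below. The data and the new column name `city_mean_price` are made up for illustration.

```python
import pandas as pd

# Made-up listings; the column names follow the post.
df = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Mississauga", "Mississauga", "Markham"],
    "price": [800000, 600000, 500000, 700000, 900000],
})

# Mean price per city, then mapped back onto every row.
city_mean = df.groupby("city")["price"].mean()
df["city_mean_price"] = df["city"].map(city_mean)
print(df)
```

One caveat with this encoding: if the means are computed on the full dataset, each row's feature already contains information about its own price. Computing the means on the training split only avoids that leakage when the models are evaluated later.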

Then, we delete the outliers and use Plotly to draw a nicer box plot.

newplot (1).png
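The post does not say which rule was used to delete the outliers. One common choice, shown here purely as an assumption, is the 1.5 × IQR rule that also defines the whiskers of a box plot. The helper name `remove_outliers_iqr` and the prices are made up.

```python
import pandas as pd


def remove_outliers_iqr(frame, column, k=1.5):
    """Keep rows whose value lies within k * IQR of the first and third quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]


# Made-up prices with one extreme value.
prices = pd.DataFrame(
    {"price": [450000, 520000, 610000, 700000, 780000, 850000, 25000000]}
)
trimmed = remove_outliers_iqr(prices, "price")
print(trimmed)
```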

This is the box plot of the housing prices. The minimum price is around 0, the maximum is approximately 1.6M, and the median is approximately 0.7M. No outliers remain because we have already deleted them. We also drew a scatter plot of the housing prices.

Scatter Plot for Housing Prices

newplot.png

In the scatter plot above, the x-axis is the index of the house and the y-axis is its price.

Screen Shot 2017-10-21 at 9.32.29 PM.png

We also drew a histogram of the housing prices; as you can see, the shape is almost normal. We then produced other descriptive charts, such as pie charts.

Screen Shot 2017-10-21 at 9.32.42 PM.png

This is the pie chart of the number of bedrooms in the advertisements. The pie chart of the number of bathrooms is below:

Screen Shot 2017-10-21 at 9.32.54 PM.png

We also drew bar charts for the number of bathrooms and the number of bedrooms. The x-axis is the count and the y-axis is the housing price in Toronto.

Bar Chart: Number of Bathrooms

Screen Shot 2017-10-21 at 9.33.01 PM.png

Bar Chart: Number of Bedrooms

Screen Shot 2017-10-21 at 9.33.10 PM.png

Next, we plot the location of each advertisement, since our project focuses on Toronto. As you can see, most of the advertisements are located in the larger areas of Toronto.

Screen Shot 2017-10-21 at 9.34.08 PM.png

Screen Shot 2017-10-21 at 9.33.54 PM.png

After that, we drew a QQ plot of the housing prices to check whether they follow a normal distribution. They do, approximately, because most of the points lie along the red line.

Screen Shot 2017-10-21 at 9.34.59 PM.png
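A QQ check like this one can be reproduced numerically with `scipy.stats.probplot`, which returns the plotted points together with the least-squares fit of the reference line. The sketch below uses synthetic prices rather than the project data. A correlation coefficient `r` close to 1 corresponds to points lying along the line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the price column: one roughly normal, one skewed.
normal_prices = rng.normal(loc=700_000, scale=200_000, size=500)
skewed_prices = rng.exponential(scale=700_000, size=500)

# probplot returns (theoretical quantiles, ordered values) and the fitted line.
(theoretical_q, ordered_prices), (slope, intercept, r) = stats.probplot(
    normal_prices, dist="norm"
)
_, (_, _, r_skewed) = stats.probplot(skewed_prices, dist="norm")

print(f"normal sample: r = {r:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the points and the fitted line, much like the figure above.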

The last step is to model the relationship between the housing prices and the other features, excluding sqrt. We pick four models to compare: Linear Regression, Lasso Regression, SVM, and Decision Tree Regression. After using cross-validation to test the four models on our data, we get an unfortunate answer: the accuracy of every model is terrible, which means none of them fits the data well. We also drew scatter plots comparing the test data with the predicted data for the four methods. The x-axis is the test housing price in Toronto and the y-axis is the predicted housing price in Toronto. As you can see, the models do not perform very well.

Screen Shot 2017-10-21 at 9.35.14 PM.png

Scatter Plot of Quantitative Comparison

Screen Shot 2017-10-21 at 9.36.08 PM.png
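A comparison of this kind could be set up with scikit-learn roughly as follows. Everything here is an assumption made for illustration: the data is synthetic, the feature names are borrowed from the post, the scoring metric is R² (the post only says "accuracy"), the SVM is taken to be a support vector regressor, and all hyperparameters are library defaults.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 200

# Synthetic features named after the columns in the post.
bedroom = rng.integers(1, 6, size=n)
bath = rng.integers(1, 4, size=n)
city_mean_price = rng.normal(700_000, 100_000, size=n)
X = np.column_stack([bedroom, bath, city_mean_price])

# Synthetic target: a linear signal plus noise.
y = (50_000 * bedroom + 30_000 * bath + 0.5 * city_mean_price
     + rng.normal(0, 50_000, size=n))

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "SVM": make_pipeline(StandardScaler(), SVR()),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
}

# Mean R^2 over 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On this toy data the SVM with default settings scores poorly simply because its regularization is far too strong for a target in the hundreds of thousands. A low score can reflect the model configuration as well as the features.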

The likely reason for this result is that we did not take sqrt into account, even though the size of a house should be one of the main factors determining its price. To improve our results and reach a reasonable conclusion next time, we need to choose a more valuable website and scrape more features along with more data.

To see Manqiong’s original blog post please click here.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.
