Introduction

This is (hopefully) the first of many blog posts I will be writing on Medium!

For the last few weeks, I have been working on a machine learning project as part of IBM’s Professional Data Science Certificate on Coursera. In this post, I will summarize my report on how I used the K-Means clustering method to understand the city of Minneapolis in terms of its neighborhoods, the venues in each neighborhood, Zillow’s Home Value Index (HVI), data on the homes (number of bedrooms, bathrooms, year built, sale price, etc.), and the venues near each home.

This post is for:

Part A

Here are the first five rows of the neighborhood data for Minneapolis, giving us a general sense of the homes in each neighborhood and an estimate of future home values.

Minneapolis Real Estate Market Overview by neighborhood

Using the Foursquare API, I got a total of 220 unique venue categories across 63 neighborhoods, including school, museum, bar, restaurant, shopping mall, park, and many more.

Top 10 common venues for each neighborhood
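For anyone who wants to try the "top common venues" step, here is a minimal sketch using only the Python standard library. The `venues` list and the `top_venues` helper are made up for illustration; they are not the code or data from my notebook, and real API results would need to be flattened into (neighborhood, category) pairs first.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs standing in for API results.
venues = [
    ("Lowry Hill", "Park"),
    ("Lowry Hill", "Coffee Shop"),
    ("Lowry Hill", "Park"),
    ("East Isles", "Restaurant"),
    ("East Isles", "Bar"),
    ("East Isles", "Restaurant"),
    ("East Isles", "Park"),
]


def top_venues(pairs, n=10):
    """Return {neighborhood: [up to n most common venue categories]}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        hood: [category for category, _ in counter.most_common(n)]
        for hood, counter in counts.items()
    }


top = top_venues(venues)
for hood, categories in top.items():
    print(hood, categories)
```

Neighborhoods with fewer than `n` distinct categories simply get a shorter list.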

I merged the venue data for each neighborhood with the real estate data to cluster the neighborhoods. Using the Elbow method, I found that k = 5 is the optimal number of clusters for the K-Means clustering algorithm.

Elbow Method to find the optimal k to cluster neighborhoods
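To make the Elbow method concrete, here is a small sketch that runs a plain K-Means (Lloyd's algorithm) for several values of k and records the inertia, meaning the sum of squared distances from each point to its nearest centroid. This is a stdlib-only toy, not the code from my project: the `kmeans` and `sq_dist` helpers, the random seed, and the three synthetic blobs are all assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Hypothetical data: three noisy blobs in 2D.
data_rng = random.Random(42)
centers = [(0, 0), (10, 10), (0, 10)]
points = [
    (cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(30)
]

inertias = {k: kmeans(points, k)[1] for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertia against k and looking for the point where the curve stops dropping sharply is the "elbow". Picking it is a judgment call, which is part of why a different dataset can give a different k.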

The table below shows the average values for each cluster group, ordered by the HVI column.

The final clusters of the neighborhoods are shown in the picture below. Due to the unfortunate default colors, here are the clusters with their colors (order matched with the table above):
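A table of per-cluster averages like this can be built by grouping rows by their cluster label, averaging each column, and sorting by the HVI average. The sketch below is only an illustration: the `rows` values are invented, the `cluster_means` helper is hypothetical, and the ascending sort order is my assumption.

```python
from collections import defaultdict

# Hypothetical rows: (cluster label, HVI, median year built).
rows = [
    (0, 250_000, 1950),
    (0, 270_000, 1940),
    (1, 410_000, 1925),
    (1, 390_000, 1935),
    (2, 180_000, 1960),
]


def cluster_means(rows):
    """Average every column per cluster, sorted by the first column (HVI)."""
    groups = defaultdict(list)
    for label, *values in rows:
        groups[label].append(values)
    means = {
        label: [sum(col) / len(col) for col in zip(*vals)]
        for label, vals in groups.items()
    }
    return sorted(means.items(), key=lambda item: item[1][0])


summary = cluster_means(rows)
for label, (hvi, year_built) in summary:
    print(label, hvi, year_built)
```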

Clusters of neighborhoods in Minneapolis

Part B

Using Apify, I retrieved data on over 800 homes across all the neighborhoods in Minneapolis.

I repeated the steps from Part A for the homes: finding the nearby venues for each home, finding the optimal k value (k = 6), and clustering the homes based on bedrooms, bathrooms, sqft (living area), price (asking price of the real estate), year built, and the nearby venues of each home. The table below shows how the 6 clusters differ by the number of bathrooms, bedrooms, living space, house sale price, and year built. Note that living area (sqft) increases with price. Generally, it seems like the newer homes (by year built) are more expensive than the older homes. Then again, we will have to run more analysis to decide if these observations are noteworthy.

The seemingly positive correlation between the sale price and living area is confirmed in the picture below (correlation: 0.81).
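The number quoted here is a Pearson correlation coefficient. For reference, this is a stdlib-only sketch of how that coefficient is computed; the `sqft` and `price` lists are invented sample values, not my dataset, so the result will not be 0.81.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical living areas (sqft) and asking prices (USD).
sqft = [900, 1400, 1850, 2300, 3100]
price = [180_000, 265_000, 310_000, 420_000, 590_000]

r = pearson(sqft, price)
print(round(r, 2))
```

A value close to 1 means price tends to rise with living area. It says nothing about why, and it only captures the linear part of the relationship.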

The map below displays the clusters of homes against a choropleth visualization, where the darker the shade of red of a neighborhood, the higher the number of venues in it.

Again, for clarification, the clusters are colored as (order matched with the table above):

It is important to note that the data used for clustering the homes is not standardized. As seen on the map, there are many homes belonging to Cluster 0 (red, 501 homes). Cluster 1 has 181 homes, Cluster 2 has 81 homes, Cluster 4 has 24 homes, Cluster 5 has 14 homes, and Cluster 3 has 2 homes. Coincidentally, the order of the number of homes in each cluster matches the order of the clusters’ average home sale prices. We definitely want more data (and standardized data) to have a better understanding of whether the clusters of homes are grouped optimally.
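To show what standardizing would involve, here is a minimal z-score sketch: each column is shifted to mean 0 and scaled to standard deviation 1 before clustering. Without this step, a column measured in hundreds of thousands (price) dominates the Euclidean distance over a column measured in single digits (bedrooms). The `homes` rows and the `standardize` helper are hypothetical, not taken from my notebook.

```python
from statistics import mean, pstdev

# Hypothetical homes: (bedrooms, bathrooms, sqft, price, year built).
homes = [
    (2, 1, 900, 180_000, 1925),
    (3, 2, 1400, 265_000, 1950),
    (4, 3, 2300, 420_000, 1990),
    (5, 4, 3100, 590_000, 2015),
]


def standardize(rows):
    """Z-score every column: subtract its mean, divide by its std deviation."""
    cols = list(zip(*rows))
    mus = [mean(col) for col in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in cols]
    return [
        tuple((value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas))
        for row in rows
    ]


scaled = standardize(homes)
for row in scaled:
    print([round(value, 2) for value in row])
```

After scaling, every feature contributes on a comparable scale, so the clusters are less likely to be driven by price alone.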

Since Cluster 3 is hard to notice because of the (again) unfortunate colors, here’s a table showing the two homes in the neighborhoods East Isles and Lowry Hill.

Cluster 3 (highest average home sale priced cluster)

This project can be taken even further by finding more data on homes. Some parameters that many people consider when buying a home (that are found to affect the property value significantly) are usable space, upgrades, and local market among others. The program I wrote to scrape data such as commute and walk scores from the Zillow website had some web crawling issues. There might be APIs that also provide the year that a home was renovated, condition of the home, view/commute/walk scores — all of which are important factors to consider for buyers and agents alike.

It would also be a convenient next step to predict home prices using regression analysis. I would try to find more data for each cluster. Currently, the number of homes varies widely between clusters. It might help to standardize the dataset used for clustering.
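As a rough idea of what that regression step could look like, here is a one-variable least-squares sketch that fits price against living area. I have not done this analysis; the `fit_line` helper and the sample numbers are assumptions for illustration, and a real model would use more features than sqft.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical living areas (sqft) and asking prices (USD).
sqft = [900, 1400, 1850, 2300, 3100]
price = [180_000, 265_000, 310_000, 420_000, 590_000]

slope, intercept = fit_line(sqft, price)
predicted = slope * 2000 + intercept  # estimated price of a 2,000 sqft home
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```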

It might have been easier to do this analysis on bigger cities with more available data. However, Minnesota has now become a second home for me and Minneapolis is the biggest city I could get. Moreover, Minneapolis is the second most densely populated city in the Midwest region behind Chicago. The city, along with St. Paul, makes up the ‘Twin Cities.’

It would have been interesting to look into the Twin Cities as a whole as well :)

Another thought I had for this project was to use Natural Language Processing tools to look at the descriptions of each home. I like word clouds for readability. Since this step is only a few lines of code, stay tuned to see it eventually on my GitHub.

If you found this article, I probably asked you to check it out, or Medium suggested it to you (in that case, #DataScience #MachineLearning #InsertPopularHashtags), or you looked up homes to buy in Minneapolis (and Google recommended this to you! whaaaa). Either way, I hope that this article gave you an idea of what you are looking for in homes, which neighborhoods you like the most based on different factors, or what home clusters you want to belong to in the future. The code I wrote can be adapted to fit any city of your choice, so get coding!
