Scraping Craigslist

In this part, I'll be extracting information on apartments from Craigslist search results. I'll be using BeautifulSoup to extract the relevant information from the HTML text.

For reference on CSS selectors, please see the notes from Week 6.

Getting the HTML

First we need to figure out how to submit a query to Craigslist. As with many websites, one way you can do this is simply by constructing the proper URL and sending it to Craigslist. Here's a sample URL that is returned after manually typing in a search to Craigslist:

http://philadelphia.craigslist.org/search/apa?bedrooms=1&pets_cat=1&pets_dog=1&is_furnished=1

There are two components to this URL:

  1. The base URL: http://philadelphia.craigslist.org/search/apa
  2. The user's search parameters: ?bedrooms=1&pets_cat=1&pets_dog=1&is_furnished=1

We will use requests.get() function to get the search page's response. For the search parameters, we will set bedrooms=1, which will make sure the number of bedrooms is listed.

This can be done easiest by using the params keyword of the get() function. We didn't cover this in the lecture, so I've went ahead and done the necessary steps.

In [4]:
import requests
In [5]:
url_base = 'http://philadelphia.craigslist.org/search/apa'
params = {'bedrooms': 1}
rsp = requests.get(url_base, params=params)
In [3]:
# Note that requests automatically created the right URL
print(rsp.url)
https://philadelphia.craigslist.org/search/apa?bedrooms=1

1.1 Parse the HTML

  • Use BeautifulSoup to parse the HTML response.
  • Use the browser's Web Inspector to identify the HTML element that holds the information on each apartment listing.
  • Use BeautifulSoup to extract these elements from the HTML.

There should have a list of 120 elements, where each element is the listing for a specific apartment on the search page.

In [1]:
from bs4 import BeautifulSoup
In [6]:
soup = BeautifulSoup(rsp.content, 'html.parser')
In [7]:
selector = "#sortable-results > ul li"
rows = soup.select(selector)
In [7]:
len(rows)
Out[7]:
120

1.2 Find the relevant pieces of information

We will now focus on the first element in the list of 120 apartments. Use the prettify() function to print out the HTML for this first element.

From this HTML, identify the HTML elements that hold:

  • The apartment price
  • The number of bedrooms and square footage (this will be in a single element)
  • The apartment title
  • The datetime string of the posting, e.g., '2019-03-23 12:07'

For the first apartment, print out each of these pieces of information, using BeautifulSoup to select the proper elements.

Hint: Each of these can be extracted using the text attribute of the selected element object, except for the datetime string. This information is stored as an attribute of an HTML element and is not part of the displayed text on the webpage.

In [8]:
row1 = rows[0]
print(row1)
<li class="result-row" data-pid="6996641837">
<a class="result-image gallery" data-ids="1:01313_1t9aTqORgdQ,1:00k0k_bRDYZpe7dGU,1:00Y0Y_7cbgohNddMs,1:00e0e_798ZPzjNWfs,1:00Q0Q_b3mzZNQofaQ,1:00c0c_56du9BbE4Ea,1:00t0t_2TnzjnrmfSi,1:01010_iRpcSPphm8O,1:00F0F_3OWjQiYqlyf,1:00P0P_llMScLd8LiN,1:00r0r_4nFnjwWbvg1" href="https://philadelphia.craigslist.org/apa/d/west-chester-pet-friendly-two-dog-parks/6996641837.html">
<span class="result-price">$1805</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2019-10-29 18:58" title="Tue 29 Oct 06:58:45 PM">Oct 29</time>
<a class="result-title hdrlnk" data-id="6996641837" href="https://philadelphia.craigslist.org/apa/d/west-chester-pet-friendly-two-dog-parks/6996641837.html">Pet Friendly, Two Dog Parks, Nature Trails,  BBQ Stations &amp; MORE!</a>
<span class="result-meta">
<span class="result-price">$1805</span>
<span class="housing">
                    2br -
                </span>
<span class="result-hood"> (Pet Friendly! 2 Dog Parks!)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>
</li>
In [9]:
price = row1.select_one('span.result-price').text
print(price)
$1805
In [10]:
nbed_area = row1.select_one('span.housing').text
print(nbed_area)
                    2br -
                
In [11]:
title = row1.select_one('p > a').text
print(title)
Pet Friendly, Two Dog Parks, Nature Trails,  BBQ Stations & MORE!
In [94]:
datetime = row1.select_one('p > time')['datetime']
print(datetime)
2019-10-29 18:58

1.3 Functions to format the results

In this section, I'll create two functions that take the price and time results from the last section and format them properly.

In [13]:
import re
In [14]:
def format_size_and_bedrooms(size_string):
    """
    Extract size and number of bedrooms from the raw
    text, using regular expressions
    """
    split = re.findall("\n(.*?) -", size_string)
    
    # both size and bedrooms are listed
    if len(split) == 2:
        n_brs = split[0].strip().replace('br', '')
        this_size = split[1].strip().replace('ft2', '')
    # only bedrooms is listed
    elif 'br' in split[0]:
        n_brs = split[0].strip().replace('br', '')
        this_size = np.nan
    # only size is listed
    elif 'ft2' in split[0]:
        # It's the size
        this_size = split[0].strip().replace('ft2', '')
        n_brs = np.nan
    
    # return floats
    return float(this_size), float(n_brs)
In [15]:
def format_price(price_string):
    # Format the price string and return a float
    # This will involve using the string.strip() function to remove unwanted characters
    this_price = price_string.strip('$')
    return float(this_price)
In [16]:
from datetime import datetime
In [109]:
def format_time(date_string):
    # Return a Datetime object from the datetime string
    datetime_object = pd.to_datetime(date_string, format = '%Y-%m-%d %H:%M')
    #datetime_object = datetime_object.to_pydatetime()
    return datetime_object
In [110]:
print(format_time(datetime))
2019-10-29 18:58:00

1.4: Putting it all together

In this part, I'll re-request information using results from previous parts. The code will loop over 5 pages of search results and scrape data for about 600 apartments.

In the code below, the outer for loop will loop over 5 pages of search results. The inner for loop will loop over the 120 apartments listed on each search page.

After filling in the missing pieces and executing the code cell, I got a Data Frame called results that holds the data for 600 apartment listings.

Notes

Be careful if you try to scrape more listings. Craigslist will temporarily ban your IP address (for a very short time) if you scrape too much at once. I've added a sleep() function to the for loop to wait 30 seconds between scraping requests.

If the for loop gets stuck at the "Processing page X..." step for more than a minute or so, your IP address is probably banned temporarily, and you'll have to wait a few minutes before trying again.

In [23]:
from time import sleep
import numpy as np
import pandas as pd
In [111]:
results = []

# search in batches of 120 for 5 pages
# NOTE: you will get temporarily banned if running more than ~5 pages or so
# the API limits are more leninient during off-peak times, and you can try
# experimenting with more pages
max_pages = 5
results_per_page = 120
search_indices = np.arange(0, max_pages*results_per_page, results_per_page) 
url = 'http://philadelphia.craigslist.org/search/apa'

# loop over each page of search results
for i, s in enumerate(search_indices):
    print('Processing page %s...' % (i+1) )
    
    # get the response
    resp = requests.get(url, params={'bedrooms': 1, 's': s})
    
    # get the list of all aparements
    # It should be a list of 120 apartments
    apts = soup.select("#sortable-results > ul li")
    print("number of apartments = ", len(apts))

    # loop over each apartment in the list
    page_results = []
    for apt in apts:
        
        sizes_brs = apt.select_one('span.housing').text # the bedrooms/size string
        title = apt.select_one('p > a').text # the title string
        price = apt.select_one('span.result-price').text # the price string
        dtime = apt.select_one('p > time')['datetime'] # the time string
        
        # format using functions from Part 1.3
        sizes, brs = format_size_and_bedrooms(sizes_brs)
        price = format_price(price)
        dtime = format_time(dtime)
        
        # save the result
        page_results.append([dtime, price, sizes, brs, title])
    
    # create a dataframe and save
    col_names = ['time', 'price', 'size', 'brs', 'title']
    df = pd.DataFrame(page_results, columns=col_names)
    results.append(df)
    
    print("sleeping for 30 seconds between calls")
    sleep(30)
    
# Finally, concatenate all the results
results = pd.concat(results, axis=0)
Processing page 1...
number of apartments =  120
sleeping for 30 seconds between calls
Processing page 2...
number of apartments =  120
sleeping for 30 seconds between calls
Processing page 3...
number of apartments =  120
sleeping for 30 seconds between calls
Processing page 4...
number of apartments =  120
sleeping for 30 seconds between calls
Processing page 5...
number of apartments =  120
sleeping for 30 seconds between calls
In [112]:
results
Out[112]:
time price size brs title
0 2019-10-29 18:58:00 1805.0 NaN 2.0 Pet Friendly, Two Dog Parks, Nature Trails, B...
1 2019-10-29 18:55:00 3348.0 1462.0 2.0 Fantastic Location, Beautiful Amenities, Luxur...
2 2019-10-29 18:54:00 1479.0 651.0 1.0 Come Home to Casa Del Sol
3 2019-10-29 18:52:00 1100.0 1000.0 2.0 2 bedrooms apt for rent at 5015 Cedar
4 2019-10-29 18:50:00 980.0 1100.0 4.0 Housing Available for immediate rent
... ... ... ... ... ...
115 2019-10-29 17:17:00 1395.0 1000.0 1.0 ### Blue Bell - One Bedroom Brand new Bathroom...
116 2019-10-29 17:17:00 1387.0 725.0 1.0 Special Pricing On Select 1 Bed 1 Bath For Imm...
117 2019-10-29 17:16:00 1305.0 NaN 1.0 Two Pet Parks, Open Concept, BONUS ROOM!
118 2019-10-29 17:16:00 1328.0 725.0 1.0 Fall In Love With Your New Home Special Pricin...
119 2019-10-29 17:15:00 1765.0 1184.0 2.0 2 Bedroom 2 full bath with fireplace!

600 rows × 5 columns

1.5: Plotting the distribution of prices

Use matplotlib's hist() function to make two histograms for:

  • Apartment prices
  • Apartment prices per square foot (price / size)

Make sure to add labels to the respective axes and a title describing the plot.

Side note: rental prices per sq. ft. from Craigslist

The histogram of price per sq ft should be centered around ~1.5. Here is a plot of how Philadelphia's rents compare to the other most populous cities:

Source

In [113]:
from matplotlib import pyplot as plt
%matplotlib inline
In [126]:
fig, ax = plt.subplots(figsize=(8,6))

# Plot histogram 
ax.hist(results['price'], bins=30, color = "skyblue")

ax.set_xlabel("Apartment Price", fontsize=18)
ax.set_ylabel("Number of Apartments", fontsize=18);
ax.set_title("Apartments vs Their Respective Prices in Philadelphia", fontsize=18);

The apartment prices in Philly have a distribution a bit skewed to the right, with mode at around $1500.

In [124]:
fig, ax = plt.subplots(figsize=(8,6))

# Plot histogram 
pricesqft = results['price']/results['size']
ax.hist(pricesqft[~np.isnan(pricesqft)], bins=30, color = "skyblue")

ax.set_xlabel("Apartment Price per Square Foot", fontsize=18)
ax.set_ylabel("Number of Apartments", fontsize=18);
ax.set_title("Apartments vs Their Respective Prices/sqft in Philadelphia", fontsize=18);

From the scrapped data, the average rent price is centered around $1.7/sqft, however, there are quite a couple of apartments that have larger or smaller unit prices.

1.6 Comparing prices for different sizes

Use altair to explore the relationship between price, size, and number of bedrooms. Make an interactive scatter plot of price (x-axis) vs. size (y-axis), with the points colored by the number of bedrooms.

Make sure the plot is interactive (zoom-able and pan-able) and add a tooltip with all of the columns in our scraped data frame.

With this sort of plot, you can quickly see the outlier apartments in terms of size and price.

In [127]:
import altair as alt
alt.renderers.enable('notebook')
Out[127]:
RendererRegistry.enable('notebook')
In [176]:
chart = alt.Chart(results).mark_circle(size=50).encode(
    x=alt.X('price:Q', axis=alt.Axis(title='Rent Price ($)')),
    y=alt.Y('size:Q', axis=alt.Axis(title='Size of the Apartment (sqft)')),
    color=alt.Color("brs",legend=alt.Legend(title="Number of Bedrooms")),
    tooltip=[alt.Tooltip("price", title='price ($)'), alt.Tooltip("size", title='size (sqft)'), alt.Tooltip("brs", title='# of bedrooms'), "time", "title"],
).interactive().properties(
title= 'Apartment Size vs Rent Price in Philadelphia'
)

chart
Out[176]:

From the plot above, the majority of the apartments are in the range of 500-1500 sqft with rent $1000-$3000. We can see a weak trend between size of the apartments and the rent price for apartments within this range. The larger apartments (with more bedrooms) tend to have smaller unit prices.