
"How many coffee shops in Boston?"
Hmmm... well, suppose there are 1 million people in Boston. Then... NOPE. Stop it. This series is not about strategies for solving brain teasers. It is an analysis of two major coffeehouse chains in the Boston area, Starbucks vs. Dunkin' Donuts, "corporate + licensed" vs. pure franchise. I will walk you through their store distribution and their location-choosing strategies (if any are consistent). The series is roughly organized into the following parts.
- Part 0: Get the data prepared!
- Part 1: A glance at locations
- Part 2: Who are Dunkin's and Starbucks' favorite neighbors? - Empirical Bayes Analysis
- Part 3: What can we expect around a coffeehouse? - Network Visualization
- Summary
Get the data prepared!
In this post, I will show you how I collected the data sets for this series. You can also find my code here. Enjoy!
Three kinds of data sets are needed for this series. You can also find the prepared data in the folder and skip this post :)
- Zip code and name of each region in Boston and its neighborhood
- Latitude and longitude of all Starbucks' and Dunkin's stores
- Latitude and longitude of all merchants near each store
Take a breath. Here we go!
import requests
from bs4 import BeautifulSoup as Soup
import pandas as pd
import numpy as np
import json
import pickle
import settings
Zip code
Let's start with the easiest part! Click here to download the spreadsheet from Mass.gov.
I renamed the following regions to make them more specific:
| Zip Code | Original Name | New Name |
|---|---|---|
| 02446 | Brookline | Brookline North |
| 02467 | Brookline | Brookline Chestnut Hill |
| 02210 | Boston | Boston Seaport |
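If you would rather apply these renames in code than edit the file by hand, a minimal sketch could look like this (it assumes the zip code file has already been loaded into a DataFrame with 'zip_code' and 'region' columns, as in the next code block):
# Sketch: apply the region renames from the table above to the zip_code DataFrame
renames = {'02446': 'Brookline North',
           '02467': 'Brookline Chestnut Hill',
           '02210': 'Boston Seaport'}
mask = zip_code['zip_code'].isin(renames)
zip_code.loc[mask, 'region'] = zip_code.loc[mask, 'zip_code'].map(renames)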
Location of Starbucks and Dunkin'
Use the zip codes we got above to scrape the location data from the chains' official store locators (Starbucks and Dunkin').
zip_code = pd.read_table('./data/zip_boston_neighborhood.txt', header=0, names = ['zip_code','region'], dtype = 'object')
#To get longitude and latitude from web content
def store_dict(source, id, coordinates, zip_code, output):
    out = output
    for store in source:
        if store[id] in output.keys():
            continue
        else:
            out[store[id]] = {'zip_code': zip_code, 'geo': store[coordinates]}
    return out
Starbucks
sb_store = {}
for z in zip_code.zip_code.values:
    url = "https://www.starbucks.com/store-locator?place={}".format(z)
    page = requests.get(url)
    soup = Soup(page.content, 'html.parser')
    store = soup.find_all('div', id='bootstrapData')
    store = store[0]
    store = store.get_text()
    sb_dict = json.loads(store)
    sb_dict = sb_dict['storeLocator']['locationState']['locations']
    sb_store = store_dict(sb_dict, 'id', 'coordinates', z, sb_store)
Dunkin'
dd_store = {}
dd_dict = []
for z in zip_code.zip_code.values:
    url = "https://www.mapquestapi.com/search/v2/radius?callback=json111206657990027752725_1504756855257&key=Gmjtd%7Clu6t2luan5%252C72%253Do5-larsq&origin={}&units=m&maxMatches=30&radius=25&hostedData=mqap.33454_DunkinDonuts&ambiguities=ignore&_=1504756855258".format(z)
    page = requests.get(url)
    soup = Soup(page.content, 'html.parser')
    store = soup.get_text()
    # The response is wrapped in a JSONP callback, so keep only the JSON between the parentheses
    store = store[store.find("(")+1:store.rfind(")")]
    dd_dict_temp = json.loads(store)
    dd_dict_temp = dd_dict_temp['searchResults']
    for dd in dd_dict_temp:
        dd_dict.append(dd['fields'])
    dd_store = store_dict(dd_dict, 'recordid', 'mqap_geography', z, dd_store)
Integration
sb_nearby = pd.DataFrame([{'id': key,
                           'lat': sb_store[key]['geo']['latitude'],
                           'lon': sb_store[key]['geo']['longitude'],
                           'zip_code': sb_store[key]['zip_code']} for key in sb_store.keys()])
dd_nearby = pd.DataFrame([{'id': key,
                           'lat': dd_store[key]['geo']['latLng']['lat'],
                           'lon': dd_store[key]['geo']['latLng']['lng'],
                           'zip_code': dd_store[key]['zip_code']} for key in dd_store.keys()])
sb_nearby['type'] = 'starbucks'
dd_nearby['type'] = 'dunkin'
sb_nearby['id'] = sb_nearby['id'].astype('str')
dd_nearby['id'] = dd_nearby['id'].astype('str')
nearby = pd.concat([sb_nearby, dd_nearby],ignore_index=True)
nearby = pd.merge(nearby, zip_code, how = 'inner', on = 'zip_code')
Merchants Nearby
Finally, this is the last data set we need. Before collecting it, you need to get a key for the Google Places API. The Google Places API Radar Search Service can return up to 200 nearby merchants of a given type per search, so we can count each type of merchant around a given store. A list of valid types can be found here. Additional parameters, like minprice and maxprice, are useful if you are interested in diving deeper.
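Before wrapping everything into a helper function, here is roughly what a single radar search request looks like and which fields we pull out of the response (a sketch; the coordinates and key are placeholders):
# Sketch of a single radar search request (placeholder coordinates; use your own API key)
import requests

lat, lon = 42.3601, -71.0589  # hypothetical point in downtown Boston
url = ('https://maps.googleapis.com/maps/api/place/radarsearch/json'
       '?location={},{}&radius=1000&type=bakery&key={}'.format(lat, lon, 'YOUR_API_KEY'))
response = requests.get(url).json()

# Each result carries a place_id and a geometry.location with lat/lng
for merchant in response['results']:
    print(merchant['place_id'], merchant['geometry']['location'])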
#Search nearby merchants using Google Places API
def search_nearby(datasets, type_list, API_KEY, optional_para=None):
    # The keys of optional_para become the suffixes of the output columns;
    # the values (dicts of URL parameters) are appended to the request URL
    optional = {'': ""}
    if optional_para is not None:
        for key in optional_para.keys():
            optional[key] = "".join("&{}={}".format(k, v) for k, v in optional_para[key].items())
    near = {}
    for suffix in optional.keys():
        for type in type_list:
            print("Working on {}".format(type))
            new_column = []
            for i in range(len(datasets.values)):
                row = datasets.values[i]
                url = 'https://maps.googleapis.com/maps/api/place/radarsearch/json?location={},{}&radius=1000&type={}{}&key={}'.format(row[1], row[2], type, optional[suffix], API_KEY)
                page = requests.get(url)
                soup = Soup(page.content, 'html.parser')
                place = soup.get_text()
                place = json.loads(place)
                place = place['results']
                for merchant in place:
                    # Keep every merchant's location, keyed by store id and then place_id
                    near.setdefault(row[0], {})[merchant['place_id']] = merchant['geometry']['location']
                new_column.append(len(place))
                print("{}% is completed.".format(round(i / len(datasets.values) * 100, 2)))
            variable_name = "{}_{}".format(type, suffix)
            datasets[variable_name] = pd.Series(np.array(new_column), index=datasets.index)
    return (datasets, near)
Price level is available as a search filter in the Google Places API for restaurants and clothing stores, so I decided to see whether Starbucks or Dunkin' has a particular preference.
#Define cheap and expensive as price level from 0-2 and 3-4 respectively
optional_keywords_price_level = {'cheap': {'minprice':0, 'maxprice':2}, 'expensive': {'minprice':3, 'maxprice':4}}
#Search nearby cheap and expensive restaurants and clothing stores
type_list_with_price = ['restaurant','clothing_store']
nearby, merchants = search_nearby(nearby, type_list_with_price, settings.API_KEY, optional_keywords_price_level)
Search other types of merchants
#Search other nearby merchants
type_list = ['atm', 'bakery', 'bank', 'beauty_salon', 'book_store', 'cafe', 'car_repair', 'movie_theater', 'convenience_store', 'dentist', 'florist', 'gas_station', 'gym', 'home_goods_store',\
             'hospital', 'laundry', 'liquor_store', 'museum', 'park', 'pharmacy', 'police', 'real_estate_agency', 'school', 'shopping_mall', 'stadium', 'transit_station', 'university',\
             'accounting', 'art_gallery', 'bicycle_store', 'car_dealer', 'car_rental', 'church', 'city_hall', 'department_store', 'electronics_store', 'embassy', 'funeral_home',\
             'fire_station', 'hindu_temple', 'veterinary_care', 'synagogue', 'post_office', 'physiotherapist', 'parking', 'mosque', 'local_government_office', 'library']
nearby, merchants_1 = search_nearby(nearby, type_list, settings.API_KEY)
merchants.update(merchants_1)
Done! Don't forget to save it!
nearby.to_pickle('nearby.pkl')
with open('merchants.pkl', 'wb') as f:
    pickle.dump(merchants, f, pickle.HIGHEST_PROTOCOL)
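When we need the data again in the later parts, loading is just the reverse (a quick sketch):
# Reload the saved data sets in a later notebook
import pickle
import pandas as pd

nearby = pd.read_pickle('nearby.pkl')
with open('merchants.pkl', 'rb') as f:
    merchants = pickle.load(f)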
Why didn't I use the Google API to find all Starbucks' and Dunkin's stores directly?
The Google Places API does not support searching by zip code; you have to provide a latitude/longitude pair as the center of each region, which makes it more cumbersome to sweep through every region in Boston. Luckily, both chains' official store locators accept zip codes, which makes life much easier!
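If you did want to stick with the Google API anyway, one workaround would be to geocode each zip code first with the Geocoding API and feed the resulting coordinates into the place search; a rough sketch (not the route taken in this post):
# Hypothetical workaround: turn a zip code into lat/lon via the Google Geocoding API
import requests

def zip_to_latlon(zip_code, api_key):
    url = ('https://maps.googleapis.com/maps/api/geocode/json'
           '?address={}&key={}'.format(zip_code, api_key))
    result = requests.get(url).json()['results'][0]
    location = result['geometry']['location']
    return location['lat'], location['lng']

# Example: lat, lon = zip_to_latlon('02210', settings.API_KEY)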
What's next?
In the next part, you will get a first look at where those stores are located in Boston. Heatmaps will show the density of the coffeehouse distribution. From the pattern of store locations, we may be able to infer some of their business and marketing strategies, and even suggest ideal spots for new stores based on each chain's location preferences.