```{python}
# Many of these packages are not used in this qmd file itself.
# They are used in the supporting ipynb notebooks on which this report builds,
# including the cleaning/preprocessing part, the Word2Vec part, the SVM part, etc.
import os
import spacy
import pandas as pd
import numpy as np
import geopandas as gpd
import re
import math
import string
import unicodedata
import gensim
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import matplotlib.patheffects as PathEffects
import nltk
import seaborn as sns
import ast
import umap
import zipfile
import requests
from PIL import Image
import contextily as ctx
import urllib.request
from PIL import ImageDraw
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist
from scipy.ndimage import convolve
from shapely.geometry import Point
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import dendrogram, linkage
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk import ngrams, FreqDist
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.matutils import Sparse2Corpus
from gensim.matutils import corpus2dense
from gensim.models import tfidfmodel
from gensim.models import Word2Vec
from gensim.models import TfidfModel
from gensim.models import KeyedVectors
from gensim.models.ldamodel import LdaModel
from graphviz import Digraph
from IPython.display import Image
from joblib import dump
from joblib import load
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS
```
```{python}
# get the current directory
current_dir = os.getcwd()
# Set the GitHub URLs for downloading bio.bib and harvard-cite-them-right.csl
# Automatically download the BibTeX file.
bib_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/reference.bib"
# create local path for saving
local_bib_path = os.path.join(current_dir, "bio.bib")
# download and save .bib
response = requests.get(bib_url)
with open(local_bib_path, 'wb') as file:
    file.write(response.content)
csl_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/harvard-cite-them-right.csl"
# create local path for saving
local_csl_path = os.path.join(current_dir, "harvard-cite-them-right.csl")
# download and save .csl
response = requests.get(csl_url)
with open(local_csl_path, 'wb') as file:
    file.write(response.content)
```
---
bibliography: bio.bib
csl: harvard-cite-them-right.csl
title: DeskB's Group Project
execute:
  echo: false
jupyter: python3
format:
  html:
    theme:
      - minty
      - css/web.scss
    code-copy: true
    code-link: true
    toc: true
    toc-title: On this page
    toc-depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
  pdf:
    include-in-header:
      text: |
        \addtokomafont{disposition}{\rmfamily}
    mainfont: Spectral
    sansfont: Roboto
    monofont: JetBrainsMono-Regular
    papersize: a4
    geometry:
      - top=25mm
      - left=40mm
      - right=30mm
      - bottom=25mm
      - heightrounded
    toc: false
    number-sections: false
    colorlinks: true
    highlight-style: github
jupyter:
  jupytext:
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.15.2
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
## Declaration of Authorship {.unnumbered .unlisted}
We, \[**DeskB**\], confirm that the work presented in this assessment is our own. Where information has been derived from other sources, we confirm that this has been indicated in the work. Where a Large Language Model such as ChatGPT has been used we confirm that we have made its contribution to the final submission clear.
Date: 19th December 2023
Student Numbers: 20017359 23032922 23081403 23103585 23130397
## Brief Group Reflection
| What Went Well | What Was Challenging |
|------------------|----------------------|
| data description | plotting |
| data cleaning | SVM classifier model |
## Priorities for Feedback
Are there any areas on which you would appreciate more detailed feedback if we're able to offer it?
Frankly, we found the topic of this assessment confusing, especially at the topic-selection stage: among all the predictive topics on the website, we could not settle on a specific question and structure at the beginning. The key issue for us was how to build a bridge between an NLP recommendation system for branding and a valuable proposal for STL regulation.
So, if convenient, we would like to know whether we structured the whole report with a solid logical chain. Also, did we successfully propose some constructive and feasible suggestions? And what should an NLP analysis used for a proposal look like in a real company project?
```{=html}
<style type="text/css">
.duedate {
  border: dotted 2px red;
  background-color: rgb(255, 235, 235);
  height: 50px;
  line-height: 50px;
  margin-left: 40px;
  margin-right: 40px;
  margin-top: 10px;
  margin-bottom: 10px;
  color: rgb(150,100,100);
  text-align: center;
}
</style>
```
{{< pagebreak >}}
# Response to Questions
```{python}
# create the "Data" folder if it does not exist
data_dir = os.path.join(current_dir, "Data")
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
# create the "Model" folder if it does not exist
model_dir = os.path.join(current_dir, "Model")
if not os.path.exists(model_dir):
    os.makedirs(model_dir)
# create the "Images" folder if it does not exist
images_dir = os.path.join(current_dir, "Images")
if not os.path.exists(images_dir):
    os.makedirs(images_dir)
```
```{python}
# Read the listings csv from the local copy if one exists, otherwise from the url
host = 'http://data.insideairbnb.com'
path = 'united-kingdom/england/london/2023-09-06/data'
file = 'listings.csv.gz'
url = f'{host}/{path}/{file}'
# the local copy is saved as an uncompressed csv in the "Data" folder
local_listing_path = os.path.join("Data","listing.csv")
if os.path.exists(local_listing_path):
    Airbnb_Listing = pd.read_csv(local_listing_path, low_memory=False)
else:
    Airbnb_Listing = pd.read_csv(url, compression='gzip', low_memory=False)
    # save a local copy for later runs
    Airbnb_Listing.to_csv(local_listing_path, index=False)
```
```{python}
# Read the gpkg file from the local copy if one exists, otherwise from the url
host = 'https://data.london.gov.uk'
path = 'download/london_boroughs/9502cdec-5df0-46e3-8aa1-2b5c5233a31f'
file = 'London_Boroughs.gpkg'
url = f'{host}/{path}/{file}'
# the local copy is saved in the "Data" folder
local_boroughs_path = os.path.join("Data","London_Boroughs.gpkg")
if os.path.exists(local_boroughs_path):
    London_boroughs = gpd.read_file(local_boroughs_path)
else:
    London_boroughs = gpd.read_file(url)
    # save a local copy for later runs
    London_boroughs.to_file(local_boroughs_path, driver='GPKG')
```
```{python}
data_dir = os.path.join(current_dir, "Data")
zip_url = "https://data.london.gov.uk/download/statistical-gis-boundary-files-london/08d31995-dd27-423c-a987-57fe8e952990/London-wards-2018.zip"
local_zip_path = os.path.join(data_dir, "London-wards-2018.zip")
# download the zip archive and extract it into the "Data" folder
response = requests.get(zip_url)
with open(local_zip_path, 'wb') as file:
    file.write(response.content)
with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)
London_wards = gpd.read_file(os.path.join("Data","London-wards-2018_ESRI","London_Ward.shp"))
```
## 1. Who collected the data?
The dataset was collected by [Murray Cox](https://en.wikipedia.org/wiki/Inside_Airbnb) through automatic scraping from the Airbnb website, specifically for the Inside Airbnb project.
## 2. Why did they collect it?
The [Inside Airbnb](http://insideairbnb.com/about) project aims to provide an independent perspective, helping the public, researchers, and policymakers understand how Airbnb affects urban housing affordability and community dynamics. It offers insights for policy discussions and social understanding of Airbnb's role in urban environments.
## 3. How was the data collected?
[listings.csv](http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/data/listings.csv.gz) : Inside Airbnb collects its data primarily by scraping information from the Airbnb website. This process involves the following steps:
- Web Scraping: Inside Airbnb employs scripts to rapidly and extensively extract Airbnb listing data, imitating human browsing.
- Data Extraction: Information about each listing, such as location, price, availability and host details, is extracted and compiled.
- Data Aggregation: Aggregated data forms a database for analyzing Airbnb trends and insights across cities and regions.
- Regular Updates: The scraping process is repeated periodically to keep the database current, capturing new listings and updates to existing ones.
## 4. How does the method of collection impact the completeness and/or accuracy of its representation of the process it seeks to study, and what wider issues does this raise?
The dataset is mostly obtained by scraping the Airbnb website, so its breadth and depth are limited to the information publicly available on the site. For instance, detailed information about certain listings might not be fully disclosed, or website terms might restrict access to some data. Moreover, legal and ethical considerations in web scraping, such as data privacy and usage rights, may affect the integrity and accuracy of the data. The content of the website changes constantly, but scraping occurs at intervals, so the data is not updated in real time, potentially leading to information gaps [@prentice_addressing_2023].
## 5. What ethical considerations does the use of this data raise?
### 5.1 Privacy issues
The first question is whether the owners have consented to the disclosure of their information, e.g., house location and name. Geocoded data is privacy-sensitive and highly likely to expose personal information when used to study demographic patterns and behaviours [@van_den_bemt_teaching_2018]. Therefore, it is crucial to obtain the owners' consent to ensure that their privacy is not infringed.
### 5.2 Legal compliance
Use of the dataset should comply with laws and regulations such as the GDPR and the DPA, and with guidance from bodies such as the EDPS. The EDPS 2015 report states that complying with the law is not enough in today's digital environment; we must also consider the ethical dimensions of data processing [@hasselbalch_making_2019]. Legal compliance and ethical considerations should be closely combined in the digital age.
### 5.3 Social responsibility
It is critical to use the dataset correctly, as exposing certain data may result in inequity and bias. The Fairness and Openness Report [@walker_consumer_2019] emphasizes how to use information responsibly and ethically, as well as the importance of resisting the labelling of low-income communities, race, etc. For example, a significant gap in housing prices between different neighbourhoods may reflect economic differences, which may affect perceptions of the social status of those areas. To avoid unwanted consequences, it is necessary to examine how the tagged attributes of the data are disclosed.
### 5.4 Data security
Some sensitive information in the dataset must be stored securely to prevent unauthorized access and misuse. By adjusting the norms of network data use, it is possible to effectively guarantee data security and increase companies' ethical behavior level when processing data[@culnan_how_2009]. Thus, attention to data security can prevent unscrupulous individuals from collecting housing data for profit or monitoring purposes.
## 6. With reference to the data (*i.e.* using numbers, figures, maps, and descriptive statistics), what does an analysis of Hosts and Listing types suggest about the nature of Airbnb lets in London?
### 6.1 Why should we choose the textual information?
Many studies have analyzed various aspects of Airbnb listings, including price [@zhang_key_2017], spatial distribution [@la_location_2021], room type [@voltes-dorta_drivers_2020], etc. However, the textual description, which arguably has more potential than the numeric fields, also plays a crucial role in shaping renters' first impressions of listings and so helps to facilitate successful rental transactions. Therefore, we scrutinize the textual features of the data, then generalize, classify and summarize them into conclusions related to potential branding value [@ji_analysis_2021].
### 6.2 What can we dig from the textual information?
The dataset contains two textual fields written by hosts for self-promotion: 'description' and 'amenities'. The 'description' column describes the listing's advantages and characteristics; the 'amenities' column lists the facilities that come with the listing.
After some [cleaning and preprocessing](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Processing_Airbnb_listing_normalising.ipynb), we pose two sets of questions, one for each column.
#### 6.2.1 Which topics would host like to focus on when promoting their properties?
We use an LDA model to extract topics and the most frequent keywords within them. After fitting the model iteratively over a range of topic numbers, we find that the best number of topics for summarizing the 'description' column is 16 (*Figure1a*).
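As a minimal sketch of that selection step, assuming a coherence score has already been computed for each candidate number of topics, the chosen number is simply the one with the highest score. The scores below are invented toy values, not our results, and `best_topic_num` is a hypothetical helper name.

```python
# Hypothetical {number of topics: coherence score} pairs; toy values only.
toy_coherence = {2: 0.31, 3: 0.44, 4: 0.39}


def best_topic_num(scores):
    """Return the number of topics with the highest coherence score."""
    return max(scores, key=scores.get)


best_k = best_topic_num(toy_coherence)
print(f"Best number of topics: {best_k}")
```

In the report itself the scores come from `coherence_values.csv`, where the same choice can be read off the `Coherence_Score` column.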
```{python}
current_dir = os.getcwd()
coherence_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/coherence_values.csv"
# create local path for saving
local_coherence_path = os.path.join(current_dir, "Data","coherence_values.csv")
# download and save the coherence scores
response = requests.get(coherence_url)
with open(local_coherence_path, 'wb') as file:
    file.write(response.content)
# because it might take several minutes to run the LDA model,
# we directly read the model's output remotely
# the detailed code can be accessed through the project's GitHub repository
LDAtopicwords_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/lda_topics_and_words.csv"
# create local path for saving
local_LDAtopicwords_path = os.path.join(current_dir, "Data","lda_topics_and_words.csv")
# download and save the LDA topics and words
response = requests.get(LDAtopicwords_url)
with open(local_LDAtopicwords_path, 'wb') as file:
    file.write(response.content)
```
```{python}
# read coherence_values.csv
LDA_topic_coherence_frame = pd.read_csv(os.path.join("Data","coherence_values.csv"))
# read the LDA model output
LDA_topics_and_words_frame = pd.read_csv(os.path.join("Data","lda_topics_and_words.csv"))
```
```{python, fig.cap="Figure1a: best number of topics for summarizing key words", #Figure1a}
# create the line chart
fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.plot(LDA_topic_coherence_frame['Topic_Num'], LDA_topic_coherence_frame['Coherence_Score'], marker='o')
ax1.set_title('Coherence Scores across Different Numbers of Topics')
ax1.set_xlabel('Number of Topics')
ax1.set_ylabel('Coherence Score')
ax1.grid(True)
# add the label for Y value
for x, y in zip(LDA_topic_coherence_frame['Topic_Num'], LDA_topic_coherence_frame['Coherence_Score']):
    ax1.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0, 5), ha='center')
# add extra space for label
#plt.subplots_adjust(bottom=0.2)
# labels of the picture on the bottom
#fig.text(0.5, 0.08, 'Figure1a: Best number of topics for summarizing key words', ha='center', va='bottom')
#add (a) in the left top corner
fig.text(0.1, 0.9, '(a)', ha='left', va='top', fontsize=14, color='black', weight='bold')
plt.savefig(os.path.join("Images","CoherenceScoreOfLDA.png"))
plt.show()
```
```{python, fig.cap="Figure1b: Topics and key words", #Figure1b}
fig, axes = plt.subplots(4, 4, figsize=(24,24))
axes = axes.flatten()
# plot wordcloud for each topic
for i, topic in enumerate(LDA_topics_and_words_frame['Topic'].unique()):
    topic_data = LDA_topics_and_words_frame[LDA_topics_and_words_frame['Topic'] == topic]
    word_frequencies = {row['Word']: row['Weight'] for index, row in topic_data.iterrows()}
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(word_frequencies)
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].axis('off')
    axes[i].set_title(f'Topic {topic}', fontsize=15)
plt.tight_layout()
# add extra space for label
plt.subplots_adjust(bottom=0.1)
# labels of the picture on the bottom
# add (b) in the left top corner
fig.text(0, 0.97, '(b)', ha='left', va='top', fontsize=30, color='black', weight='bold')
fig.text(0.5, 0.05, 'Figure1:(a)Variation of LDA Model Coherence Scores with Topic Quantity.\n(b)Airbnb Listing Topic Analysis: LDA Modeling and Keyword Visualization', ha='center', va='bottom', fontsize=30)
plt.savefig(os.path.join("Images","LDA_topic16_wordcloud.png"))
plt.show()
```
The [LDA process](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/LDA_Modelling_by_TFIDFmatrix.ipynb) takes about 30 minutes, so we save the output and re-read it remotely from GitHub. The results show that among the 16 topics (*Figure1b*), some mainly describe location, such as *topic8* and *topic6*, while others, such as *topic13* and *topic14*, contain information about facilities and adjectives describing the surrounding environment. In short, these keywords illustrate the general features of Airbnb listings, which are essential to the recommendation algorithms in the platform's branding [@mody_airbnb_2018].
#### 6.2.2 Do the listings in the same neighbourhood, or with the same spatial location, share the similar amenities?
Amenities are highly categorizable: '500Mb-WiFi' and 'highspeed Internet access' mean basically the same thing. Thus, we should identify similarities between amenities, much like grouping synonyms in a dictionary. We use the [Word2Vec model](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Word2Vec_Modelling_by_SVM.ipynb) to classify the voluminous words and phrases, and then apply UMAP [@stalder_self-supervised_2023] for better visualization in *Figure2a*.
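The idea of turning a listing's amenity list into a single vector, and then comparing listings, can be sketched as follows. This is a toy illustration only: the three-dimensional `toy_vectors` are invented stand-ins for the trained Word2Vec embeddings, and `mean_vector` and `cosine` are hypothetical helper names.

```python
import math

# Invented 3-d embeddings standing in for trained Word2Vec vectors.
toy_vectors = {
    "wifi": [0.9, 0.1, 0.0],
    "highspeed_internet": [0.8, 0.2, 0.0],
    "hair_dryer": [0.0, 0.1, 0.9],
}


def mean_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Tokens missing from the vocabulary are skipped when averaging.
listing_a = mean_vector(["wifi", "unknown_item"], toy_vectors)
listing_b = mean_vector(["highspeed_internet"], toy_vectors)
listing_c = mean_vector(["hair_dryer"], toy_vectors)

print(cosine(listing_a, listing_b) > cosine(listing_a, listing_c))
```

Averaging discards word order, which is acceptable here because an amenity list is an unordered set of items.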
```{python}
# import Word2Vec Model remotely
word2vec_url = "https://github.com/BohaoSuCC/Groupwork_DeskB/raw/main/Model/word2vec-d500-w40.model"
# create local path for saving
local_word2vec_path = os.path.join(current_dir, "Model","word2vec.model")
# download and save the model
response = requests.get(word2vec_url)
with open(local_word2vec_path, 'wb') as file:
    file.write(response.content)
word2vec_model = Word2Vec.load(os.path.join("Model","word2vec.model"))
```
```{python}
# read csv after norm and split
amenities_norm_split = pd.read_csv("https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/amenities_norm_split.csv",low_memory=False)
```
```{python}
NormListing_url = "https://github.com/BohaoSuCC/Groupwork_DeskB/raw/main/Data/Airbnb_listing_norm_min.zip"
local_NormListing_path = os.path.join(data_dir, "Airbnb_listing_norm_min.zip")
response = requests.get(NormListing_url)
with open(local_NormListing_path, 'wb') as file:
file.write(response.content)
with zipfile.ZipFile(local_NormListing_path, 'r') as zip_ref:
zip_ref.extractall(data_dir)
Airbnb_Listing = pd.read_csv(os.path.join("Data","Airbnb_listing_norm_min.csv"))
```
```{python}
Airbnb_Listing = pd.read_csv(os.path.join("Data","Airbnb_listing_norm_min.csv"))
texts_word2vec = Airbnb_Listing['amenities_norm']
# convert every row of split amenities into a list of tokens
# drop the saved index column if present (drop returns a new frame, so assign it)
amenities_ast_literal = amenities_norm_split.drop(columns='Unnamed: 0', errors='ignore')
# empty cells are read from csv as NaN rather than None, so filter with pd.notna
list_of_lists = amenities_ast_literal.apply(lambda row: [item for item in row if pd.notna(item)], axis=1).tolist()
```
```{python}
# define the vectorizing function
def vectorize(text, model):
    # keep only the tokens that exist in the model's vocabulary
    words = [word for word in text if word in model.wv.key_to_index]
    # if no known words, return a zero vector
    if len(words) == 0:
        return np.zeros(model.vector_size)
    # get the mean of all word vectors
    word_vectors = [model.wv[word] for word in words]
    return np.mean(word_vectors, axis=0)

Airbnb_Listing['amenities_vector'] = pd.Series(list_of_lists).apply(lambda x: vectorize(x, word2vec_model))
amenities_vector = Airbnb_Listing['amenities_vector']
```
```{python}
import warnings
warnings.filterwarnings('ignore')
# decrease the dimension by UMAP
# convert pd.Series to np.array
amenities_vector_nparray = amenities_vector.to_numpy()
numpy_array = np.array([np.array(x) for x in amenities_vector_nparray])
reducer = umap.UMAP(n_components=2,n_neighbors=10,min_dist=0.9)
embedding = reducer.fit_transform(numpy_array)
```
```{python}
# calculate the centre as the average of the median and the mean
center = (np.median(embedding, axis=0)+np.mean(embedding, axis=0))*0.5
# get the moving amount
translation = -center
# transform all points
translated_embedding = embedding + translation
# verify the new centroid
new_center = translated_embedding.mean(axis=0)
#print(f"New center after translation: {new_center}")
```
```{python, fig.cap="Figure2a: Features clustering after UMAP", #Figure2a}
import warnings
warnings.filterwarnings('ignore')
mag = np.sqrt(np.power(translated_embedding[:,0],2) + np.power(translated_embedding[:,1],2)).reshape(-1,1)
angle = np.arctan2(translated_embedding[:,1], translated_embedding[:,0])
# normalization to angle and distance
angle = (angle-np.min(angle)) / (np.max(angle) - np.min(angle))
# standardize the distances (z-score)
mag = (mag - np.mean(mag)) / np.std(mag)
# sigmoid scaling to (0, 1)
mag = 1 / (1 + np.exp(-mag))
circ_colors = mpl.colors.hsv_to_rgb(np.concatenate((angle.reshape(-1,1),
np.ones_like(mag).reshape(-1,1),
mag.reshape(-1,1)),
axis=1))
color_info = np.concatenate((translated_embedding, circ_colors), axis=1)
# create the fig
fig, ax = plt.subplots(figsize=(10, 6))
# sctter plot
ax.scatter(translated_embedding[:, 0], translated_embedding[:, 1], color=circ_colors, s=0.5)
ax.axis('off')
# add (a) in the left top corner
fig.text(0.1, 0.9, '(a)', ha='left', va='top', fontsize=10, color='black', weight='bold')
plt.savefig(os.path.join("Images","Word2Vec_2D_UMAP_Projection.png"), dpi=150)
# the image rendered by quarto would be multi-layered and take ages to load in the PDF file,
# so we re-read the saved picture instead;
# it is the same version with no difference
plt.close(fig)
#plt.show()
```
![](Images/Word2Vec_2D_UMAP_Projection.png)
```{python}
# save the color info in Airbnb_Listing
Airbnb_Listing['Word2Vec_UMAP_Xcor'] = color_info[:, 0]
Airbnb_Listing['Word2Vec_UMAP_Ycor'] = color_info[:, 1]
Airbnb_Listing['Word2Vec_UMAP_colorR'] = color_info[:, 2]
Airbnb_Listing['Word2Vec_UMAP_colorG'] = color_info[:, 3]
Airbnb_Listing['Word2Vec_UMAP_colorB'] = color_info[:, 4]
Airbnb_Listing = Airbnb_Listing.drop(['amenities_vector'], axis=1)
```
```{python, fig.cap="Figure2b: spatial distribution of Listing's similarities", #Figure2b}
# Transfer pandas dataframe (Airbnb_listing.csv) to geopandas geodataframe
# By using the coordinates ()
# Converting to GeoDataframe
gdf_listing = gpd.GeoDataFrame(Airbnb_Listing, geometry=gpd.points_from_xy(Airbnb_Listing.longitude, Airbnb_Listing.latitude))
# Set the CRS
gdf_listing.set_crs("EPSG:4326", inplace=True) # (EPSG:4326)
#print("Converting successful")
# Drop NAs of columns ['amenities_norm','longitude','latitude']
gdf_listing = gdf_listing.dropna(subset=['amenities_norm','longitude','latitude'])
#print(f"Now gdf has {gdf_listing.shape[0]:,} rows and {gdf_listing.shape[1]:,} columns.")
gdf_listing = gdf_listing.to_crs(epsg=3857)
London_boroughs = London_boroughs.to_crs(epsg=3857)
London_wards = London_wards.to_crs(epsg=3857)
#print("gdf_listing CRS:", gdf_listing.crs)
#print("London_boroughs CRS:", London_boroughs.crs)
#print("London_wards CRS:", London_wards.crs)
# plot the map
fig, ax = plt.subplots(figsize=(16, 16))
London_boroughs.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.4)
London_wards.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.2)
# extract the coordinates and RGB info from gdf_listing
x = gdf_listing.geometry.x
y = gdf_listing.geometry.y
colors = gdf_listing[['Word2Vec_UMAP_colorR', 'Word2Vec_UMAP_colorG', 'Word2Vec_UMAP_colorB']].values # RGB info
brightness_factor = 1.5
colors_brightened = np.clip(colors * brightness_factor, 0, 1) # make sure the value is between [0,1]
ax.scatter(x, y, color=colors_brightened, s=40, alpha=0.1)
"""
subax = plt.axes([0.1, 0.2, 0.3, 0.4]) # bottom-left position
"""
# add the label for boroughs
for idx, row in London_boroughs.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['name'], fontsize=7, color='white',ha='center', va='center', alpha=0.7,
                   path_effects=[PathEffects.withStroke(linewidth=0.5, foreground='black')])
# add the label for wards
for idx, row in London_wards.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['NAME'], fontsize=2, color='black',ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.2, foreground='white')])
"""
x_min, x_max, y_min, y_max = -25000, 5000, 6695000, 6725000
subax.set_xlim(x_min, x_max)
subax.set_ylim(y_min, y_max)
London_boroughs.boundary.plot(ax=subax, edgecolor='black', linewidth=1, alpha=0.4)
London_wards.boundary.plot(ax=subax, edgecolor='black', linewidth=0.5, alpha=0.2)
subax.scatter(x, y,
color=colors_brightened, s=40,
vmax=0.4, vmin=-0.5, alpha=0.2)
"""
#OSM map,
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik, alpha = 0.7)
plt.subplots_adjust(bottom=0.1) # set extra space for label
# hide the axes
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
# add (b) in the left top corner
fig.text(0.1, 0.79, '(b)', ha='left', va='top', fontsize=20, color='black', weight='bold') # modified here
fig.text(0.5, 0.15, 'Figure2:(a)Spectrum of Features: A UMAP Clustering of Word Embeddings.\n(b) Geographic Distribution of Residential Similarities in London', ha='center', va='bottom', fontsize=20)
# save as PNG file, 350 dpi
fig.savefig(os.path.join("Images","Word2Vec_OSM_geospace.png"), dpi=350)
#the image rendered by quarto would be multi-layers and really slow to reload and appear in PDF file,
#so I will just read the rendered picture locally or remotely
#and definitely they are the same version with no difference
plt.close(fig)
#plt.show()
```
![](Images/Word2Vec_OSM_geospace.png)
In *Figure2b*, each colour represents the amenities features of a property, and areas with similar colours indicate highly similar amenities between properties. This allows us to determine whether the properties in a specific area or community exhibit homogeneity (highly similar colours) or heterogeneity (more varied colours) in listing features.
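The colour encoding behind this map can be sketched with the standard library alone. This is an illustrative simplification, assuming each listing already has centred 2-D UMAP coordinates: the direction from the origin sets the hue and the distance from the origin sets the brightness. The name `point_to_rgb` and the plain sigmoid on the raw distance are assumptions made for this sketch; the plotting cell above rescales angles and distances over the whole dataset first.

```python
import colorsys
import math


def point_to_rgb(x, y):
    """Map a centred 2-D point to RGB: angle -> hue, distance -> brightness."""
    hue = (math.atan2(y, x) + math.pi) / (2 * math.pi)  # angle rescaled to [0, 1]
    distance = math.hypot(x, y)
    value = 1 / (1 + math.exp(-distance))  # sigmoid: 0.5 at the origin, rising towards 1
    return colorsys.hsv_to_rgb(hue, 1.0, value)


east = point_to_rgb(1.0, 0.0)
far_east = point_to_rgb(2.0, 0.0)
west = point_to_rgb(-1.0, 0.0)
print(east, far_east, west)
```

Points in the same direction share a hue and differ only in brightness, while points on opposite sides of the origin receive clearly different hues, which is why neighbouring points in the embedding appear in similar colours on the map.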
### 6.3 Which indicator guides the branding?
Even though Airbnb, as a responsible company, should take community and regulation into consideration, the essence of branding and recommendation systems is still the pursuit of profit. This raises the question: which indicator could represent the potential economic opportunities for a listing's branding or promotion?
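One way to make such an indicator comparable across listings is sketched below, mirroring the transformation applied to `sum_income` in the cells that follow: take the log of each income, rescale it to [-1, 1], and centre it on the median. The income values are invented, and `scaled_log_income` is a hypothetical helper rather than part of our pipeline.

```python
import math
from statistics import median


def scaled_log_income(incomes):
    """Log-transform positive incomes, rescale to [-1, 1], centre on the median."""
    logs = [math.log(v) for v in incomes if v > 0]  # zero income has no finite log
    lo, hi = min(logs), max(logs)
    scaled = [2 * (v - lo) / (hi - lo) - 1 for v in logs]
    mid = median(scaled)
    return [v - mid for v in scaled]


# Invented incomes for five listings; the zero is dropped before scaling.
toy_incomes = [0, 100, 1000, 10000, 100000]
result = scaled_log_income(toy_incomes)
print([round(v, 3) for v in result])
```

The sketch needs at least two distinct positive incomes, otherwise the rescaling step divides by zero. The log transform compresses the long right tail of income, so a few very high earners do not dominate the colour scale of a map.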
```{python}
# compare the average one
average_income_forlisting = Airbnb_Listing['sum_income'].mean()
#print((Airbnb_Listing['price'] >= 2000).sum())
#print(f"Data frame is {Airbnb_Listing.shape[0]:,} x {Airbnb_Listing.shape[1]:,}")
# keep only listings with 'price' below 2000 (copy to avoid chained-assignment warnings)
Airbnb_Listing = Airbnb_Listing[Airbnb_Listing['price'] < 2000].copy()
# check dataframe's shape
#print(f"Data frame is {Airbnb_Listing.shape[0]:,} x {Airbnb_Listing.shape[1]:,}")
Airbnb_Listing['profitable'] = (Airbnb_Listing['sum_income'] >= average_income_forlisting).astype(int)
median_income_forlisting = Airbnb_Listing['sum_income'].median()
# Transfer pandas dataframe (Airbnb_listing.csv) to geopandas geodataframe
# By using the coordinates ()
# Converting to GeoDataframe
gdf_listing = gpd.GeoDataFrame(Airbnb_Listing, geometry=gpd.points_from_xy(Airbnb_Listing.longitude, Airbnb_Listing.latitude))
# Set the CRS
gdf_listing.set_crs("EPSG:4326", inplace=True) # (EPSG:4326)
#print("Converting successful")
# Drop NAs of column ['amenities_norm']
gdf_listing = gdf_listing.dropna(subset=['amenities_norm'])
#print(f"Now gdf has {gdf_listing.shape[0]:,} rows and {gdf_listing.shape[1]:,} columns.")
```
```{python}
import warnings
warnings.filterwarnings('ignore')
# Reproject all layers to Web Mercator (EPSG:3857) so they align with the basemap tiles
gdf_listing = gdf_listing.to_crs(epsg=3857)
London_boroughs = London_boroughs.to_crs(epsg=3857)
London_wards = London_wards.to_crs(epsg=3857)
#print("gdf_listing CRS:", gdf_listing.crs)
#print("London_boroughs CRS:", London_boroughs.crs)
#print("London_wards CRS:", London_wards.crs)
"""
# add borough names ('predicate' replaces the deprecated 'op' argument of gpd.sjoin)
gdf_listing_with_borough = gpd.sjoin(gdf_listing, London_boroughs, how='left', predicate='within')
gdf_listing_with_borough = gdf_listing_with_borough.rename(columns={'name': 'borough_name'})
# add ward names
gdf_listing_with_borough_wards = gpd.sjoin(gdf_listing_with_borough, London_wards, how='left', predicate='within')
gdf_listing_with_borough_wards = gdf_listing_with_borough_wards.rename(columns={'NAME': 'ward_name'})
"""
# Log-transform income: zero income gives -inf, and missing or negative income gives NaN
gdf_listing['log_sum_income'] = np.log(gdf_listing['sum_income'])
# Keep only finite values; copy so new columns can be added without a SettingWithCopyWarning
gdf_listing_dropinf = gdf_listing[np.isfinite(gdf_listing['log_sum_income'])].copy()
```
```{python, fig.cap="Figure3a: Statistical distribution of Listings' profit-cost ratio", #Figure3a}
import warnings
warnings.filterwarnings('ignore')
# normalize the data between 0 and 1 (min-max scaling)
min_val = gdf_listing_dropinf['log_sum_income'].min()
max_val = gdf_listing_dropinf['log_sum_income'].max()
gdf_listing_dropinf['log_sum_income_normalized'] = (gdf_listing_dropinf['log_sum_income'] - min_val) / (max_val - min_val)
# rescale the range to [-1, 1]
gdf_listing_dropinf['log_sum_income_normalized_scaled'] = gdf_listing_dropinf['log_sum_income_normalized'] * 2 - 1
# centre on the median so that 0 marks the median listing (values can now fall outside [-1, 1])
median_num_income = np.median(gdf_listing_dropinf['log_sum_income_normalized_scaled'],axis=0)
gdf_listing_dropinf['log_sum_income_normalized_scaled'] = gdf_listing_dropinf['log_sum_income_normalized_scaled'] - median_num_income
fig, ax = plt.subplots(figsize=(10, 6))
gdf_listing_dropinf['log_sum_income'].hist(bins=150, ax=ax, alpha=0.5, label='Original Data')
gdf_listing_dropinf['log_sum_income_normalized_scaled'].hist(bins=150, ax=ax, alpha=0.5, label='Normalized & Scaled Data')
ax.legend()
plt.subplots_adjust(bottom=0.15) # save extra space for label
# add (a) in the left top corner
fig.text(0.02, 0.95, '(a)', ha='left', va='top', fontsize=14, color='black', weight='bold')
# save to png,150 dpi
fig.savefig(os.path.join("Images","Profit-cost ratio distribution.png"), dpi=150)
plt.show()
```
```{python, fig.cap="Figure3b: spatial distribution of Listings profit-cost ratio", #Figure3b}
import warnings
warnings.filterwarnings('ignore')
fig, ax = plt.subplots(figsize=(24,24))
# Jenks breaks
#breaks = jenkspy.jenks_breaks(gdf_listing_dropinf['log_sum_income_normalized_scaled'],n_classes=15)
# set the breaks manually
breaks = [-1,-0.75,-0.5,-0.4,-0.25,-0.20,-0.10,-0.05,-0.04,-0.03,-0.02,-0.01,-0.005,0,0.005,0.01,0.02,0.03,0.04,0.05,0.10,0.20,0.25,0.4,0.5,0.75,1,2]
# add the labels for boroughs
for idx, row in London_boroughs.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['name'], fontsize=7, color='black', ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.5, foreground='white')])
# add the labels for wards
for idx, row in London_wards.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['NAME'], fontsize=2, color='black', ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.2, foreground='white')])
# classify the data with breaks
gdf_listing_dropinf['income_category'] = np.digitize(gdf_listing_dropinf['log_sum_income_normalized_scaled'], breaks)
#plot the boundary of wards and boroughs
London_boroughs.boundary.plot(ax=ax, edgecolor='black', linewidth=1, alpha=0.4)
London_wards.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.2)
# scatter plot
scatter = ax.scatter(gdf_listing_dropinf.geometry.x, gdf_listing_dropinf.geometry.y,
c=gdf_listing_dropinf['log_sum_income_normalized_scaled'], edgecolors=None, s=40, cmap='bwr_r',
vmax=0.4, vmin=-0.5, alpha=0.2)
# add the color bar
cbar = plt.colorbar(scatter, ax=ax, label='Profit-cost Ratio', shrink=0.5, pad=0.02)
cbar.ax.set_aspect(20)
# hide the axes
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik, alpha=0.7) #OSM map
plt.subplots_adjust(bottom=0.1) # save extra space for label
# add (b) in the left top corner
# reference :https://www.datascience.ch/articles/self-learning-change-urban-housing-street-level
fig.text(0.08, 0.75, '(b)', ha='left', va='top', fontsize=24, color='black', weight='bold')
fig.text(0.45, 0.2, 'Figure3:(a)Statistical Distribution of Annual Revenue for Listings in London.\n(b)Geographical Distribution of Cost-Benefit Ratio for Listings in London', ha='center', va='bottom', fontsize=24)
# save to png, 350 dpi
fig.savefig(os.path.join("Images","Listings_profit_ratio.png"), dpi=350)
# The figure rendered by Quarto would be multi-layered and very slow to reload and display in the PDF,
# so the saved image is read back in instead (locally or remotely);
# it is the same version as the figure produced here.
plt.close(fig)
#plt.show()
```
![](Images/Listings_profit_ratio.png)
We use several numeric columns to calculate the total income of every listing. Technically this is an approximate figure, which follows a roughly normal distribution on the log scale (*Figure3a*), but it aligns with the data from [Inside Airbnb](http://insideairbnb.com/london). We then compare 'sum_income' with the average in the listing's ward to indicate the listing's 'profit-cost ratio'. Finally, we standardize the data and visualize it on the map (*Figure3b*). Blue areas indicate potential for more profit and more lettings, and should be highlighted and coordinated with *Figure2b* when branding and promoting.
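
The scaling steps behind *Figure3* can be illustrated in isolation. The snippet below is a minimal sketch in plain Python rather than pandas; the function name `centred_log_scale` and the values in `sample_incomes` are hypothetical and are not taken from the analysis. It assumes that listings without a positive income are dropped before the logarithm is taken.

```python
import math
import statistics


def centred_log_scale(incomes):
    """Log-transform incomes, rescale to [-1, 1], then centre on the median."""
    # Zero or negative incomes have no finite logarithm, so they are dropped first
    logs = [math.log(value) for value in incomes if value > 0]
    lowest, highest = min(logs), max(logs)
    if highest == lowest:
        # All values identical: nothing to scale, every listing sits at the median
        return [0.0 for _ in logs]
    # Min-max scaling to [0, 1], then stretching to [-1, 1]
    scaled = [2 * (value - lowest) / (highest - lowest) - 1 for value in logs]
    # Subtracting the median puts the median listing at exactly 0
    median_value = statistics.median(scaled)
    return [value - median_value for value in scaled]


# Hypothetical annual incomes for six listings (one of them with no income)
sample_incomes = [0, 1200, 5400, 9800, 21000, 150000]
scores = centred_log_scale(sample_incomes)
print([round(score, 3) for score in scores])
```

Positive scores mark listings above the median and negative scores mark listings below it, which is what the diverging colour map relies on.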
### 6.4 How does the indicator correlate with textual information?
By using an SVM model to predict the 'profit-cost ratio' from the textual information, we get a [trained model](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Word2Vec_Modelling_by_SVM.ipynb) with an accuracy of more than **85%**, which could help the Airbnb platform or the government to evaluate listings before they are promoted and recommended to potential renters.
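
For readers unfamiliar with the approach, the general shape of such a classifier can be sketched with scikit-learn. This is an illustration only and not the linked notebook's code: the feature matrix is random synthetic data standing in for averaged word vectors, the binary label is made up, and the kernel and hyperparameters are assumptions. The accuracy it prints therefore says nothing about the real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for text features: one 50-dimensional vector per listing
rng = np.random.default_rng(42)
n_listings, n_dims = 400, 50
X = rng.normal(size=(n_listings, n_dims))
# Made-up binary label (1 = above-median indicator) driven by the first five dimensions
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Hold out a quarter of the listings to estimate out-of-sample accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# RBF-kernel support vector classifier with assumed default-style settings
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Hold-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Evaluating on a held-out split, as here, is what makes an accuracy figure meaningful for listings the model has not seen.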
### 6.5 Summary
After the analysis, we have the key topics and words for better generalization (*Figure1*), the spatial distribution of features for better classification (*Figure2*) and the spatial distribution of the 'profit-cost ratio' for better investment (*Figure3*), all of which can be used to inform strategies for Airbnb, landlords, communities and governments (*Figure4*).
```{python}
# Drawing the framework diagram with the graphviz package in Python is hard, complicated and inefficient,
# so it was drawn as XML (draw.io) and the exported image is loaded remotely from GitHub.
# The GitHub URL for this XML is: https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/mid_processing_files/outline.drawio.xml
```
![](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Images/Framework_Diagram.png?raw=true)
## 7. Drawing on your previous answers, and supporting your response with evidence (e.g. figures, maps, and statistical analysis/models), how *could* this data set be used to inform the regulation of Short-Term Lets (STL) in London?
### 7.1 Short-Term Lets (STL)
In an effort to preserve the city's current housing supply, the government legalized short-term rentals in London for a maximum of 90 days per calendar year with the introduction of the [2011 Localism Act](https://www.legislation.gov.uk/ukpga/2011/20/contents/enacted) and the [2015 Deregulation Act](https://www.legislation.gov.uk/ukpga/2015/20/contents/enacted). Nevertheless, a number of studies [@jefferson-jones_can_2015] point out that this regulation is not always adhered to in practice. Most [Airbnb listings](https://www.london.gov.uk/programmes-strategies/housing-and-land/housing-and-land-publications/housing-research-note-short-term-and-holiday-letting-london) (77%) did respect the 90-day limit. Among the listings surpassing the 90-day limit, the [average estimated occupancy](https://commonslibrary.parliament.uk/research-briefings/cbp-8395/) was 145 nights a year. Of these lettings, 6,140 (or 55%) were entire homes and 5,000 (or 45%) were private rooms. Accordingly, much of the existing research [@shabrina_airbnb_2022] has focused on the role of Airbnb as the most prominent and prevalent online platform for short-term lets in the UK and internationally.
### 7.2 Airbnb Branding
To enhance the Airbnb platform strategically, leveraging text features for branding and recommendation algorithms is crucial. Based on the comparison between *Figure2b* and *Figure3b* and some perspectives from Question 5, the following strategies can be implemented:
#### 7.2.1 Positive feedback cycle:
In regions with lower occupancy rates, adjusting the recommendation algorithms can help balance occupancy rates across different areas. This proactive approach mitigates property vacancy concerns and boosts hosts' profitability, thereby fostering a dynamic equilibrium within London's housing market. Moreover, for listings with high rental profitability, additional positive feedback incentivizes competitive listings. This creates a positive feedback cycle that promotes business operations beneficial to both Airbnb and landlords.
#### 7.2.2 Negative Homogeneous listing:
Considering the potential contribution of housing homogeneity to market distortions [@zhou_asymmetric_2015; @Nieuwland_2018], in areas like London Bridge & West Bermondsey, where low income rates and property feature similarity coincide, the platform and housing department should explore the incorporation of text-based features. By leveraging these features, authorities can identify and filter out homogeneous listings in concentrated areas. This strategic approach could help the platform to brand homogeneous properties in rotation over time and to schedule their promotion around different peak-demand periods, while also promoting a more balanced housing landscape.
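
One simple way to flag homogeneous listings from text is to compare their descriptions pairwise. The sketch below is an assumption-laden illustration, not the method used in this report: it uses bag-of-words cosine similarity instead of the Word2Vec features discussed earlier, and the function names, the three sample descriptions and the 0.8 threshold are all hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Count lower-cased alphabetic tokens in a description."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(bag_a[token] * bag_b[token] for token in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(descriptions, threshold=0.8):
    """Return pairs of listing ids whose descriptions are at least `threshold` similar."""
    bags = {listing_id: bag_of_words(text) for listing_id, text in descriptions.items()}
    return [
        (first, second)
        for first, second in combinations(sorted(bags), 2)
        if cosine_similarity(bags[first], bags[second]) >= threshold
    ]


# Hypothetical descriptions for three listings
listings = {
    "A": "Modern flat near London Bridge with wifi and kitchen",
    "B": "Modern flat near London Bridge with kitchen and wifi",
    "C": "Quiet garden cottage with parking in Richmond",
}
pairs = similar_pairs(listings)
print(pairs)
```

In practice the comparison would be restricted to listings within the same ward, since homogeneity only matters where similar listings are spatially concentrated.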
#### 7.2.3 Airbnb's trade-off:
In pursuing the core interests of its business, Airbnb undoubtedly seeks to foster a positive cycle by promoting competitive listings to renters [@Hoffman2020]. However, this could inadvertently contribute to homogeneity, counteracting the intended positive cycle [@H_bscher_2022]. Hence, personalized guidance could help hosts in competitive areas to modify their amenities and descriptions to enhance their appeal to renters. As discussed in Question 5, once Airbnb has obtained valuable textual information, it has a social responsibility to establish a framework for communication and collaboration with hosts and to provide insights into market trends. Overall, the trade-off between promoting competitiveness and maintaining area diversity should be approached flexibly, by implementing a dynamic system that takes into account local preferences, seasonal variations, and emerging trends.
### 7.3 Government Regulatory Options
Furukawa & Onuki's tri-categorical definition [@furukawa_design_2019] indicates that effective policies should be less restrictive for Primary Hosted & Unhosted Short-term lets within appropriate timeframes, while regulating Nonprimary short-term lets more firmly to provide the right incentives to landlords to rent long-term.
#### 7.3.1 Tailored Policies Based on Spatial Distribution Features
Tailoring policies for diverse community types is essential. In high-density areas, consider limiting the addition of new listings to prevent overcrowding. In contrast, for areas with lower occupancy rates, policies can encourage landlords to adopt more proactive occupancy promotion strategies.
#### 7.3.2 Dynamic Policy Adjustments for Supply-Demand Balance
Utilize spatial distribution features to monitor market dynamics and make adjustments based on actual demand. In high-demand areas, policies can be more flexible, encouraging short-term rentals, while in oversupplied regions, stricter policies can reduce vacancy rates. Connect the identified branding opportunities with STL regulations to balance encouraging tourism against preventing negative impacts on housing markets. For areas with distinctive features, regulation should preserve their uniqueness while addressing their shortages.
#### 7.3.3 Encouraging Landlord Engagement in Community Development
Airbnb transforms residential communities into tourist spaces and changes the socio-cultural landscape of urban neighborhoods. It specifically propagates the experience of 'living like a local' [@Ferreri_2018], but this consumption of everyday local residential life has implications for the well-being of long-term tenants, including the disruption and erasure of long-term communities and housing insecurity [@Rozena_2021]. Critical urbanists [@Cocola_Gant_2019; @Freytag_2018] have accordingly linked Airbnb to touristification and gentrification, or 'Airbnbification' [@T_rnberg_2022]. Governments can consider incentivizing landlords to participate in community development, aiming to increase the 90-day occupancy rate. This not only reduces long-term property vacancies but also fosters community vitality and helps maintain supply-demand equilibrium.
#### 7.3.4 Create a Registration Service to Bridge Gaps in Data
In a context where data gaps limit research and decision-making outcomes [@Fonda2021], a registration service could provide some of the information necessary to bridge these gaps. With statistical analysis and modelling, regulatory decisions can be evidence-based and take into account the unique characteristics of each area. A collaborative effort between cities and Airbnb is suggested for the development of a centralized registration platform. A streamlined online monitoring and fine collection system could significantly enhance planning authorities' ability to balance housing prices and availability, while also improving community well-being.
## Reference