Spider for Lianjia, used to obtain information about second-hand housing in Shanghai.
- Spider/ - directory for Scrapy code and data
  - data/ - data obtained by Scrapy
    - data_with_coordinites.csv - data with coordinates
    - htmls.csv - all HTML pages
    - original_data.csv - data without coordinates
    - subway.csv - subway information
    - urls.csv - start URLs
    - valid_htmls.csv - valid HTML pages
  - spiders/ - Scrapy code
    - crawldata.py - Scrapy code to obtain data from the HTML pages
    - geturl.py - Scrapy code to obtain the HTML pages from the start URLs
  - items.py - Scrapy project code, defines items
  - middlewares.py - Scrapy project code, defines middlewares
  - pipelines.py - Scrapy project code, defines pipelines
  - settings.py - Scrapy project code, defines settings
- predict/ - data preprocessing and model training
  - data/ - preprocessed data
  - out/ - output directory
  - nn_pred.py - prediction
  - preprocess.py - data preprocessing
  - run.py - model definition and training
  - run.sh - run script
  - title_wordcloud.py - generates a word cloud from the titles
  - word_embedding.py - word embedding
- utils/ - utility functions
  - baidu_get_LLitude.py - get coordinates from Baidu Maps
  - gaode_get_LLitude.py - get coordinates from Gaode Maps
  - tencent_get_LLitude.py - get coordinates from Tencent Maps
  - BeautifulSoup.py - BeautifulSoup crawler script to obtain data from the valid HTML pages
  - del_invalid_urls.py - delete invalid HTML URLs
  - delete_used_urls.py - delete used HTML URLs
- scrapy.cfg - Scrapy project code, defines settings
- README.md - README file
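
The usage commands below pass `--model`, `--hidden_layer_sizes`, and `--max_iter` to `run.py`, and note that the last two only matter for the Multi-layer Perceptron. The following is a minimal sketch of how such options could be parsed with `argparse`. It is not the actual contents of `run.py`: the helper names (`parse_layer_sizes`, `build_parser`, `mlp_options`), the comma-separated format for layer sizes, and the default values are all assumptions made for illustration.

```python
import argparse

MLP_NAME = "Multi-layer Perceptron"  # the only model name mentioned in this README


def parse_layer_sizes(text):
    """Turn a string such as "128,64" into the tuple (128, 64).

    The comma-separated format is an assumption, not taken from run.py.
    """
    sizes = tuple(int(part) for part in text.split(",") if part.strip())
    if not sizes or any(size <= 0 for size in sizes):
        raise argparse.ArgumentTypeError("expected positive integers, e.g. 128,64")
    return sizes


def build_parser():
    """Build a parser for the three options shown in the usage commands."""
    parser = argparse.ArgumentParser(description="Train a prediction model (sketch).")
    parser.add_argument("--model", required=True, help="model name")
    # Default values below are assumptions chosen for this sketch.
    parser.add_argument("--hidden_layer_sizes", type=parse_layer_sizes, default=(100,),
                        help="comma-separated layer sizes, used only by the MLP")
    parser.add_argument("--max_iter", type=int, default=200,
                        help="maximum number of iterations, used only by the MLP")
    return parser


def mlp_options(args):
    """Return the MLP-only options, or an empty dict for any other model."""
    if args.model != MLP_NAME:
        return {}
    return {"hidden_layer_sizes": args.hidden_layer_sizes, "max_iter": args.max_iter}


# Example: parse an explicit argument list instead of sys.argv.
example = build_parser().parse_args(
    ["--model", MLP_NAME, "--hidden_layer_sizes", "128,64", "--max_iter", "500"]
)
print(mlp_options(example))  # {'hidden_layer_sizes': (128, 64), 'max_iter': 500}
```

Keeping the MLP-only options in a separate helper means other models simply ignore them, which matches the behavior described in the usage notes.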
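# Data preprocessing
python ./preprocess.py
# Word embedding
python ./word_embedding.py
# Prediction with traditional models (the last two arguments only take effect when the model is Multi-layer Perceptron)
python ./run.py --model [Model Name] --hidden_layer_sizes [hidden layer sizes] --max_iter [maximum number of iterations]
# Run all prediction models
chmod +x ./run.sh
./run.sh
# Neural network prediction
python ./nn_pred.py

- 王鑫 - 520021910700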
- 郑宇森 - 520021911173
- 江彦泽 - 520021910629