Spider

Spider for Lianjia，to obtain information about second-hand housing in Shanghai.

File Structure

Spider/ - directory for scrapy code and data.
- data/ - data obtained by scrapy
  - data_with_coordinites.csv - data with coordinates
  - htmls.csv - all htmls
  - original_data.csv - data without coordinates
  - subway.csv - subway information
  - urls.csv - start urls
  - valid_htmls.csv - valid htmls
- spiders/ - scrapy code
  - crawldata.py - scrapy code, to obtain data from htmls
  - geturl.py - scrapy code, to obtain htmls from start urls
- items.py - scrapy project code, define items
- middlewares.py - scrapy project code, define middlewares
- pipelines.py- scrapy project code, define pipelines
- settings.py - scrapy project code, define settings
predict - data preprocess and model train
- data/ - data after preprocess
- out/ - output directory
- nn_pred.py - prediction
- preprocess.py - data proprecess
- run.py - model definition and train
- run.sh - run script
- title_wordcloud.py - make wordcloud for titles
- word_embedding.py - word embedding
utils/ - tool function
- baidu_get_LLitude.py - get coordinates from baidu map
- gaode_get_LLitude.py - get coordinates from gaode map
- tencent_get_LLitude.py - get coordinates from tencent map
- BeautifulSoup.py - BeautifulSoup crawler script, to obtain data from valid htmls
- del_invalid_urls.py - delete invalid html urls
- delete_used_urls.py - delete used html urls
scrapy.cfg - scrapy project code, define settings
README.md - README file

Data Preprocess

# 数据预处理
python ./preprocess.py
# 词嵌入
python ./word_embedding.py

Model Prediction

# 传统模型预测（后两个参数只在Model为Multi-layer Perceptron时起效果）
python ./run.py --model [Model Name] --hidden_layer_sizes [隐藏层大小] --max_iter [最大迭代次数]
# 运行所有预测模型
chmod +x ./run.sh
./run.sh
# 神经网络预测
python ./nn_pred.py

Team Member

王鑫 - 520021910700
郑宇森 - 520021911173
江彦泽 - 520021910629

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spider

File Structure

Data Preprocess

Model Prediction

Team Member

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Spider		Spider
predict		predict
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
scrapy.cfg		scrapy.cfg

Folders and files

Latest commit

History

Repository files navigation

Spider

File Structure

Data Preprocess

Model Prediction

Team Member

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages