Commit 28dd571 ("minor reorganization", parent 76824fb)

File tree: 7 files changed (+30, −23 lines)

browserNavigator.py → BrowserNavigator/browserNavigator.py (+1, −1)

@@ -2,7 +2,7 @@
 import configparser
 import numpy as np
 from selenium.common.exceptions import NoSuchElementException
-from manageExcelFile import ManageExcelFile
+from ExcelFileHandler.manageExcelFile import ManageExcelFile
 
 config = configparser.ConfigParser()
 config.read('config.ini')
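The change above (and the matching ones below) switches to package-qualified absolute imports, which only resolve once each new directory is importable as a package. A minimal sketch of the mechanics, using a throwaway temp directory rather than the project's actual files (an empty `__init__.py` makes the package explicit; Python 3 would also discover it as a namespace package without one):

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway layout mirroring the commit's reorganization:
#   <root>/ExcelFileHandler/__init__.py
#   <root>/ExcelFileHandler/manageExcelFile.py
root = Path(tempfile.mkdtemp())
pkg = root / "ExcelFileHandler"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # marks the directory as a package
(pkg / "manageExcelFile.py").write_text(
    "class ManageExcelFile:\n    pass\n"
)

# With the project root on sys.path, the package-qualified import resolves.
sys.path.insert(0, str(root))
module = importlib.import_module("ExcelFileHandler.manageExcelFile")
print(module.ManageExcelFile.__name__)  # -> ManageExcelFile
```

This is why scripts are expected to run from the repository root: that puts the `BrowserNavigator` and `ExcelFileHandler` directories on `sys.path`.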

cookieManager.py → BrowserNavigator/cookieManager.py (+1, −1)

@@ -1,6 +1,6 @@
 import json
 from selenium import webdriver
-from browserNavigator import BrowserNavigator
+from BrowserNavigator.browserNavigator import BrowserNavigator
 
 
 class CookieManager:
File renamed without changes.

README.md (+20, −14)

@@ -16,7 +16,7 @@ As written in [Linkedin User Agreement](https://www.linkedin.com/legal/user-agre
 
 # LinkedIn Web Scraper
 
-Python Web Scraper for LinkedIn companies. The script fully simulate an human activity in order to get data from LinkedIn web pages. The purpose is store data from companies of a certain zone, such as:
+This is a LinkedIn Python Web Scraper for companies. The script fully simulate a human activity (using [Selenium](https://selenium-python.readthedocs.io) library) in order to get data from LinkedIn web pages. The purpose is store data from companies of a certain zone, such as:
 
 - Name
 - Overview
@@ -25,34 +25,40 @@ Python Web Scraper for LinkedIn companies. The script fully simulate an human ac
 - Industry
 - etc.
 
-After collected the above information, these will be stored into an .xls file.
+After collected the above information, these will be stored into an `.xls` file.
 
 ### Demo
 
 [![Watch the video](https://img.youtube.com/vi/TKkJEo-4NTg/maxresdefault.jpg)](https://youtu.be/TKkJEo-4NTg)
 
+# Table of Contents
+- [Usage](https://github.com/J4NN0/linkedin-web-scraper#usage)
+- [Troubleshooting](https://github.com/J4NN0/linkedin-web-scraper#troubleshooting)
+- [Resources](https://github.com/J4NN0/linkedin-web-scraper#resources)
+
 # Usage
 
-First of all, donwload the web driver you prefer (Firefox or Chrome) and put it inside the folder. Then put you credential inside the **config.ini** file and specify the web driver you donwloaded. Also others kind of parameters can be setted.
+First of all, download the web driver you prefer (either [Firefox](https://github.com/mozilla/geckodriver/releases) or [Chrome](https://chromedriver.chromium.org/downloads)) and put it inside project folder. After that, put your credentials in `config.ini` file and specify the `webdriver` you have downloaded. Also, others kind of parameters can be set.
 
-The method *get_companies_name(...)* requires a link (in this case a link of a company) and will return an array of links in which each link is the page of the company.
+Method `get_companies_name(...)` requires a link (in this case a link of a company) and will return an array of links in which each link is the LinkedIn company web page.
 
-After that, you can run *retrive_data(...)* that requires the array with the links and the name of the .xls file in which you want to store information that will be collected from each link for each company.
+After that, you can run `retrieve_data(...)` that requires the array with the links and the name of the `.xls` file in which you want to store all the information that will be collected from each link for each company.
 
-Class *ManageExcelFile* will handle the I/O operation for the .xls file.
+Class `ManageExcelFile` will handle the I/O operation to the `.xls` file.
 
-# Issues
+# Troubleshooting
 
-It could happen that, after the loggin phase, LinkedIn could ask you to perform some operations instead of rediricet you to the feed (https://www.linkedin.com/feed/) page.
+It could happen that, after the logging phase, LinkedIn could ask you to perform some actions/operations (e.g. "I'm not a robot", etc.) instead of redirecting you to the feed (https://www.linkedin.com/feed/) page.
 
-In this case just:
-1. Stop the script
-2. Log with a browser in your account
-3. Skip the required operation
-4. Re-run the code
+In this case:
+1. Stop the script.
+2. Log in with a browser in your account.
+3. Skip the required actions.
+4. Re-run the code.
 
-# Utility
+# Resources
 
 - [Chrome Webdriver](https://chromedriver.chromium.org/downloads)
+- [Firefox Webdriver](https://github.com/mozilla/geckodriver/releases)
 - [Selenium](https://selenium-python.readthedocs.io/installation.html)
 - [Scrapy](https://docs.scrapy.org/en/latest/intro/tutorial.html)
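The README's Usage section describes a two-step flow: `get_companies_name(...)` turns a seed link into an array of company-page links, then `retrieve_data(...)` visits each link and stores the results. A dependency-free sketch of that call order with hypothetical stand-in functions (the real project drives Selenium and writes `.xls` output through `ManageExcelFile`; the bodies below are canned data purely to illustrate the flow):

```python
# Hypothetical stand-ins for the project's Selenium-backed methods,
# kept only to show the documented call order.
def get_companies_name(seed_link):
    # Real version: navigates seed_link with Selenium and scrapes one
    # LinkedIn page URL per company. Here: three canned links.
    return [f"{seed_link}/company-{i}" for i in range(3)]

def retrieve_data(company_links, out_name):
    # Real version: visits every link and writes one row per company
    # to the named .xls file. Here: collect the rows in memory.
    return [{"url": link, "id": link.rsplit("-", 1)[-1]}
            for link in company_links]

links = get_companies_name("https://example.com/seed")
rows = retrieve_data(links, "companies.xls")
print(len(rows))  # -> 3
```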

ScrapyPackage/linkedin.py (+1, −1)

@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 import scrapy
-from cookieManager import CookieManager
+from BrowserNavigator.cookieManager import CookieManager
 
 
 class LinkedinSpider(scrapy.Spider):

config.ini (+5, −4)

@@ -1,12 +1,13 @@
 [BROWSER]
-;set or "Firefox" or "Chrome"
-WEBDRIVER = Chrome
+;set either "Firefox" or "Chrome"
+WEBDRIVER = <DOWNLOADED_WEBDRIVER>
 ;number of attempts before selenium stops working considering a possible connection issue or design
 ;issue.
 MAX_LOADING_ATTEMPTS = 30
 ;sleep time between click actions. Increase only if scrolling the project's page is giving issues.
 DEFAULT_SLEEP_TIME = 1
 
 [LOGIN]
-EMAIL = yourEmail
-PASSWORD = yourPassword
+;your linkedin credentials
+EMAIL = <YOUR_LINKEDIN_EMAIL>
+PASSWORD = <YOUR_LINKEDIN_PASSWORD>
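The angle-bracket placeholders above must be replaced with real values before running, since `configparser` hands them back as plain strings with no validation. A small sketch of how the project-style config is read (parsing an inline copy of the template with filled-in sample values, where the project's `config.read('config.ini')` loads the real file):

```python
import configparser

# Inline copy of the config.ini template with sample values filled in.
template = """
[BROWSER]
WEBDRIVER = Chrome
MAX_LOADING_ATTEMPTS = 30
DEFAULT_SLEEP_TIME = 1

[LOGIN]
EMAIL = user@example.com
PASSWORD = secret
"""
config = configparser.ConfigParser()
config.read_string(template)

# Options come back as strings; numeric ones need explicit conversion.
webdriver_name = config["BROWSER"]["WEBDRIVER"]
max_attempts = config["BROWSER"].getint("MAX_LOADING_ATTEMPTS")
print(webdriver_name, max_attempts)  # -> Chrome 30
```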

main.py (+2, −2)

@@ -1,8 +1,8 @@
 import configparser
 import time
 from selenium import webdriver
-from browserNavigator import BrowserNavigator
-from cookieManager import CookieManager
+from BrowserNavigator.browserNavigator import BrowserNavigator
+from BrowserNavigator.cookieManager import CookieManager
 
 
 def main():
