
Commit 1fa7d93

Web Scraping
Created a Python class to pull the table data from any website, given that the table's class name is provided
1 parent 72d7b0f commit 1fa7d93

File tree

1 file changed, +1 -0 lines changed


Web Crawling and Scrapping.ipynb

+1
@@ -0,0 +1 @@
1+
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Web Crawling and Scrapping.ipynb","version":"0.3.2","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"metadata":{"id":"uEP3S5vIDKxZ","colab_type":"text"},"cell_type":"markdown","source":["# **Web Scraping** <br >\n"," Web scraping is a technique used to extract data from websites through an automated process. It is a useful technique when you want to work with the data in a web page. "]},{"metadata":{"id":"_ju1yI5v-rIa","colab_type":"code","colab":{}},"cell_type":"code","source":["from IPython.core.interactiveshell import InteractiveShell\n","InteractiveShell.ast_node_interactivity = \"all\""],"execution_count":0,"outputs":[]},{"metadata":{"id":"zayIMyVTIp1g","colab_type":"text"},"cell_type":"markdown","source":["The class **WikiTableDataParser** is a web scraper with several functions:\n","\n","\n","1. get_webpage(url) - get the response from the given web page (url) \n","2. parse_url(url, tableClass) - process the response of the given url and return the data of every table with class tableClass as a dataframe \n","3. parse_html_table(table) - return the data present in a table as a dataframe \n","4. save_as_csv_local - save the dataframe as a CSV file on the local drive (current working directory)\n","5. get_folder_list_drive - get the list of folders in Google Drive\n","6. save_as_csv_drive - save the dataframe as a CSV file in Google Drive\n","7. 
save_as_csv_tkinter - save the dataframe as a CSV file via a tkinter UI\n","\n","\n"]},{"metadata":{"id":"J2ZG1g3sasJ6","colab_type":"code","colab":{}},"cell_type":"code","source":["# WikiTableDataParser fetches the data present in a table at a given URL.\n","# We need to give the URL along with the class name of the table in the web page.\n","# We can save the data from the web page to a CSV file in the local directory or in Google Drive.\n","# We can also use tkinter to save the data in the local directory.\n","\n","import requests\n","import pandas as pd\n","from bs4 import BeautifulSoup\n","from contextlib import closing\n","from requests.exceptions import RequestException\n","!pip install -U -q PyDrive\n","from pydrive.auth import GoogleAuth\n","from pydrive.drive import GoogleDrive\n","from google.colab import auth\n","from google.colab import files\n","from oauth2client.client import GoogleCredentials\n","\n","class WikiTableDataParser:\n"," \n"," def get_webpage(self,url):\n"," \n"," # Attempt to get the content at url by making an HTTP GET request with a timeout.\n"," # If the content-type of the response is some kind of HTML, return the\n"," # HTML content, otherwise return None.\n"," try:\n"," with closing(requests.get(url, timeout=5)) as resp:\n"," if self.is_good_response(resp):\n"," return resp.content\n"," else:\n"," return None\n","\n"," except RequestException as e:\n"," self.log_error('Error during requests to {0} : {1}'.format(url, str(e)))\n"," return None\n","\n","\n"," def is_good_response(self,resp):\n"," \n"," # Returns True if the response seems to be HTML, False otherwise.\n"," content_type = resp.headers['Content-Type'].lower()\n"," return (resp.status_code == 200 \n"," and content_type is not None \n"," and content_type.find('html') > -1)\n","\n","\n"," def log_error(self,e):\n"," \n"," # It is always a good idea to log errors. 
\n"," # This function just prints them, but you can\n"," # make it do anything.\n"," print(e)\n"," \n"," def parse_url(self, url, tableClass):\n"," response = self.get_webpage(url)\n"," # raise an error for an invalid URL\n"," if response is None:\n"," raise Exception(\"Invalid URL\")\n"," soup = BeautifulSoup(response, 'html.parser')\n"," tables = soup.find_all('table',{\"class\":tableClass})\n"," # raise an error if no tables with the given class are found \n"," if len(tables) <= 0:\n"," raise Exception(\"No tables found with class: \"+tableClass)\n"," return [(index,self.parse_html_table(table))\\\n"," for index,table in enumerate(tables)] \n","\n"," def parse_html_table(self, table):\n"," n_columns = 0\n"," n_rows = 0\n"," column_names = []\n","\n"," # Find the number of rows and columns in the table;\n"," # we also find the column titles if we can\n"," for row in table.find_all('tr'):\n","\n"," # Determine the number of rows in the table\n"," td_tags = row.find_all('td')\n"," if len(td_tags) > 0:\n"," n_rows += 1\n"," if n_columns == 0:\n"," # Set the number of columns for our table\n"," n_columns = len(td_tags)\n","\n"," # Handle column names if we find them\n"," th_tags = row.find_all('th') \n"," if len(th_tags) > 0 and len(column_names) == 0:\n"," for th in th_tags:\n"," column_names.append(th.get_text())\n","\n"," # raise an error if the column titles don't match the number of columns\n"," if len(column_names) > 0 and len(column_names) != n_columns:\n"," raise Exception(\"Column titles do not match the number of columns\")\n","\n"," columns = column_names if len(column_names) > 0 else range(0,n_columns)\n"," df = pd.DataFrame(columns = columns,\n"," index= range(0,n_rows))\n"," row_marker = 0\n"," for row in table.find_all('tr'):\n"," column_marker = 0\n"," columns = row.find_all('td')\n"," for column in columns:\n"," df.iat[row_marker,column_marker] = column.get_text()\n"," column_marker += 1\n"," if len(columns) > 0:\n"," row_marker += 1\n","\n"," # Convert to float if 
possible\n"," for col in df:\n"," try:\n"," df[col] = df[col].astype(float)\n"," except ValueError:\n"," pass\n","\n"," return df\n"," \n"," # Method to save the data in the local directory\n"," def save_as_csv_local(self, df, filename):\n"," try:\n"," df.to_csv(filename+\".csv\", index=False, header=True)\n"," return \"Created in the working directory with filename : \"+filename+\".csv\"\n"," except Exception as e:\n"," return e\n"," \n"," # Method to list the folders in Google Drive with their ids\n"," def get_folder_list_drive(self):\n"," try:\n"," auth.authenticate_user()\n"," gauth = GoogleAuth()\n"," gauth.credentials = GoogleCredentials.get_application_default()\n"," drive = GoogleDrive(gauth)\n"," file_list = drive.ListFile({'q': \"'root' in parents and trashed=false\"}).GetList()\n"," for file1 in file_list:\n"," print('title: %s, id: %s' % (file1['title'], file1['id']))\n"," return file_list\n"," except Exception as e:\n"," return e\n"," \n"," # Method to save the data in Google Drive\n"," def save_as_csv_drive(self, df, filename, folderId):\n"," try:\n"," auth.authenticate_user()\n"," gauth = GoogleAuth()\n"," gauth.credentials = GoogleCredentials.get_application_default()\n"," drive = GoogleDrive(gauth)\n"," df.to_csv(filename+\".csv\", index=False, header=True)\n"," file = drive.CreateFile({'parents':[{u'id':folderId}]})\n"," file.SetContentFile(filename+\".csv\")\n"," file.Upload()\n"," return \"Created in the drive with filename : \"+filename+\".csv\"\n"," except Exception as e:\n"," return e\n"," \n"," # Method to save the data to a file using a tkinter UI\n"," def save_as_csv_tkinter(self, df, filename):\n"," try:\n"," import tkinter as tk\n"," from tkinter import filedialog\n"," from pandas import DataFrame\n"," root = tk.Tk()\n"," canvas1 = tk.Canvas(root, width = 300, height = 300, bg = 'lightsteelblue2', relief = 'raised')\n"," canvas1.pack()\n"," def exportCSV():\n"," export_file_path = filedialog.asksaveasfilename(defaultextension='.csv')\n"," df.to_csv 
(export_file_path, index=False, header=True)\n"," saveAsButton_CSV = tk.Button(text='Export CSV', command=exportCSV, bg='green', fg='white', font=('helvetica', 12, 'bold'))\n"," canvas1.create_window(150, 150, window=saveAsButton_CSV)\n"," root.mainloop()\n"," return \"Created with filename : \"+filename+\".csv\"\n"," except Exception as e:\n"," return e \n"," "],"execution_count":0,"outputs":[]},{"metadata":{"id":"r9iOfR_yM-5N","colab_type":"text"},"cell_type":"markdown","source":["Create an object of the WikiTableDataParser class and call the parse_url function with the URL and table class name. <br>\n","parse_url returns a list of (index, dataframe) tuples. Loop over all the tables to find the one we need.<br>\n","\n","\n"]},{"metadata":{"id":"4_2r7xlIf8GW","colab_type":"code","colab":{}},"cell_type":"code","source":["url = \"https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)\"\n","#url = \"https://en.wikipedia.org/wiki/List_of_bus_routes_in_London\"\n","#url = \"https://www.fantasypros.com/nfl/reports/leaders/qb.php?year=2015\"\n","hp = WikiTableDataParser()\n","table_list = hp.parse_url(url, \"wikitable sortable plainrowheaders\")\n","#table_list = hp.parse_url(url, \"table\")\n","\n","print(\"Total number of tables in the page:\", len(table_list))\n","\n","for index,table in enumerate(table_list):\n"," print(\"index : \", index)\n"," table[1].head()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"uWvc0YYwOYgT","colab_type":"text"},"cell_type":"markdown","source":["After finding the index of the required table, fetch that dataframe from table_list using **table_list[index][1]**"]},{"metadata":{"id":"zT4REYLUiUQl","colab_type":"code","colab":{}},"cell_type":"code","source":["population_df = table_list[0][1]#[index][1]\n","population_df.info()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"z3fobpxDlEkK","colab_type":"text"},"cell_type":"markdown","source":["save_as_csv_local() - pass the dataframe and the 
filename you want to save under. The file will be saved in the working directory."]},{"metadata":{"id":"taKVMs9JlEHa","colab_type":"code","colab":{}},"cell_type":"code","source":["hp.save_as_csv_local(population_df,'mycsvfile2')"],"execution_count":0,"outputs":[]},{"metadata":{"id":"hOBlhMJYOzS7","colab_type":"text"},"cell_type":"markdown","source":["get_folder_list_drive() lists all the folders in our Google Drive along with their ids."]},{"metadata":{"id":"fiCfnO1n2gRI","colab_type":"code","colab":{}},"cell_type":"code","source":["hp.get_folder_list_drive()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"e_qwvbDbPAq6","colab_type":"text"},"cell_type":"markdown","source":["save_as_csv_drive() - pass the dataframe, the filename you want to save, and the folder id, to save the file in Google Drive."]},{"metadata":{"id":"WpJVD03a_Bzi","colab_type":"code","colab":{}},"cell_type":"code","source":["hp.save_as_csv_drive(population_df,'mycsvfile2',FolderID) # folder id in which you want to save the file"],"execution_count":0,"outputs":[]},{"metadata":{"id":"0a1uGGK1noY4","colab_type":"text"},"cell_type":"markdown","source":["save_as_csv_tkinter() - pass the dataframe and the filename; choose where to save it via the UI."]},{"metadata":{"id":"Bdshldjqnnri","colab_type":"code","colab":{}},"cell_type":"code","source":["hp.save_as_csv_tkinter(population_df,'mycsvfile2')"],"execution_count":0,"outputs":[]}]}
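The notebook's response check (HTTP 200 plus an HTML Content-Type) can be exercised without any network traffic. This is only a sketch: `FakeResponse` is a hypothetical stand-in for `requests.Response`, introduced here purely for illustration.

```python
# Sketch of the notebook's is_good_response check: accept a response only
# when the status code is 200 and the Content-Type header mentions HTML.
# FakeResponse is a hypothetical stand-in for requests.Response, so this
# runs offline.
class FakeResponse:
    def __init__(self, status_code, content_type):
        self.status_code = status_code
        self.headers = {"Content-Type": content_type}

def is_good_response(resp):
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type

html_ok = is_good_response(FakeResponse(200, "text/html; charset=utf-8"))
json_rejected = is_good_response(FakeResponse(200, "application/json"))
error_rejected = is_good_response(FakeResponse(404, "text/html"))
```

This check is why parse_url can raise on non-HTML endpoints: a JSON API response, even with status 200, fails the Content-Type test.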
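The core of parse_html_table (header cells become column names, body cells become rows, then a best-effort float conversion) can be sketched end-to-end on a small inline table. The helper name `html_table_to_df` and the sample HTML are illustrative assumptions, not part of the notebook's API.

```python
import pandas as pd
from bs4 import BeautifulSoup

def html_table_to_df(table_html):
    # Parse the first <table>, read <th> cells as column names and
    # <td> rows as data, mirroring the notebook's parse_html_table.
    table = BeautifulSoup(table_html, "html.parser").find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr") if tr.find_all("td")]
    df = pd.DataFrame(rows, columns=headers or None)
    # Best-effort numeric conversion, column by column.
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass
    return df

html = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>200</td></tr>
</table>
"""
df = html_table_to_df(html)
```

For simple pages, `pandas.read_html` is a one-line alternative that returns every `<table>` on the page as a list of dataframes.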

0 commit comments
