Strolling with Aman: A voyage to infinity.

Web Scraping with Python

The Internet holds a superabundance of information and knowledge. Web Scraping is the practice of extracting that knowledge in an organised, scalable and systematic way for data analysis and other research purposes.

Web Scraping automatically extracts data from a particular website and presents it in the form the user desires. We can scrape many things from the web, like stock updates or the price of a product on Amazon. In this post I’ll scrape news updates from my college’s website.

Getting Started

We can use almost any scripting language for web scraping, but the popular options are JavaScript and Python. Both have a rich set of scraping libraries, such as CheerioJS for JavaScript and Scrapy or BeautifulSoup for Python.

I’ll use Python and BeautifulSoup for this post because both are beginner-friendly.

  • For Mac and Linux users, Python comes preinstalled.
  • Windows users can install Python from the official website.

To install the BeautifulSoup library, simply enter the following command in a terminal.

pip install beautifulsoup4
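Once installed, a quick parse of an inline snippet confirms that the library works (the snippet below is just a throwaway example, not from the target site):

```python
from bs4 import BeautifulSoup

# Parse a tiny inline HTML snippet to confirm the install works
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

# soup.b finds the first <b> tag; get_text() returns its text
print(soup.b.get_text())  # prints "world"
```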

Finding Target from HTML

All the news & updates of IIT Bhubaneswar are listed on this webpage. Visit this webpage and press Ctrl + Shift + I to inspect its source code. Now click the element-picker (mouse) button in the top-left part of the inspector panel and hover over the list of news updates. We can see in the inspector panel that all news and updates on that webpage are inside a list with the class rectlist.

<ol class="rectlist">

Thus we have to scrape all the list items present in the aforementioned list.

Writing the script for scraping
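A minimal sketch of such a script is shown below. This is not the post’s original script: the URL is a placeholder for the institute’s news & updates page, and it additionally assumes the requests library (pip install requests) for downloading the page.

```python
from bs4 import BeautifulSoup

def extract_news(html):
    """Return the text of every news item inside the 'rectlist' list."""
    soup = BeautifulSoup(html, "html.parser")
    news_list = soup.find("ol", class_="rectlist")
    return [item.get_text(strip=True) for item in news_list.find_all("li")]

def scrape_news(url):
    """Download the page at `url` and print each news update."""
    import requests  # pip install requests
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    for news in extract_news(response.text):
        print(news)

# scrape_news("https://www.iitbbs.ac.in/")  # placeholder URL; use the news page
```

`extract_news` does the actual parsing, so it can be tested on a saved copy of the page without hitting the live site.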

After running this script we’ll get the required result in the terminal.

You can learn more about the BeautifulSoup library by visiting its official documentation.

Other Resources:

Update: I recently came across another very good resource for Web Scraping. I think you may also like it.

Warning: Scraping data from certain websites may violate their Terms and Conditions or Privacy Policy, so please check these before deploying any scraper. Most websites also have a file named robots.txt on their server which lists the pages that shouldn’t be crawled, so check this file before scraping any site. For example, Google’s robots.txt file is present at this link.
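Python’s standard library can read these robots.txt rules for you via urllib.robotparser. The sketch below parses an inline policy that I made up purely for illustration; in real use you would call rp.set_url("https://example.com/robots.txt") and rp.read() to fetch the live file.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy, for illustration only
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# can_fetch() tells us whether a given user agent may crawl a URL
print(rp.can_fetch("*", "https://example.com/search"))  # disallowed by the rules above
print(rp.can_fetch("*", "https://example.com/news"))    # allowed by the rules above
```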

You can comment below with any doubt, criticism or compliment. Also, if you like this post, please share it with your hacker friends. Don’t forget to bookmark my blog and keep visiting frequently. Have a nice day!