Python Urllib Module | URL Handling Module

Urllib is a Python 3 standard library package for handling URLs. The urllib module lets you access any website from a Python program. It supports several protocols, such as HTTP and FTP, but the most commonly used are HTTP and HTTPS. With urllib, we can make different types of HTTP requests, such as GET and POST requests.

The urllib package is made up of several modules, each handling a different part of working with URLs:
  • urllib.request => for opening and reading URLs.
  • urllib.parse => for parsing URLs.
  • urllib.error => contains the exceptions raised by urllib.request.
  • urllib.robotparser => for parsing robots.txt files.
In this article, we are going to cover all of these urllib modules in an easy and understandable manner, along with examples.

Urllib.request

urllib.request is the most important module of the urllib package. It defines several classes and functions for working with URLs.

The simplest way to use this module is to call the urlopen() function. This function accepts either a URL string or a Request object (an instance of the Request class from urllib.request). It opens the URL and returns a file-like object as the result, which can be read with its read() method.

Let's see an example (an HTTP GET request):

Example 1 
import urllib.request

url = 'https://mrjayideas.blogspot.com/'

resp = urllib.request.urlopen(url)  # open the URL

print(resp.read())  # read the response body (bytes)

This returns the HTML source code of the specified URL (as a bytes object).
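
Since the response object is file-like, urlopen() can also be used as a context manager, which closes the connection for you automatically; a minimal sketch:

import urllib.request

# the with-block closes the response when it ends
with urllib.request.urlopen('https://mrjayideas.blogspot.com/') as resp:
    html = resp.read()

print(len(html))  # number of bytes received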


Key points

1. urllib.request.urlopen is a function that returns an object of the http.client.HTTPResponse class.

2. The read() function we used above is a method of the HTTPResponse class. You can see the other methods of this class with the following command:

Example 2
import urllib.request

url = 'https://mrjayideas.blogspot.com/'

resp = urllib.request.urlopen(url)

print(dir(resp))

Which returns =>
['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']
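
A few of these are worth knowing right away: status, reason, geturl() and getheaders() give you the response metadata. A quick sketch:

import urllib.request

resp = urllib.request.urlopen('https://mrjayideas.blogspot.com/')

print(resp.status)        # HTTP status code, e.g. 200
print(resp.reason)        # reason phrase, e.g. 'OK'
print(resp.geturl())      # final URL, after any redirects
print(resp.getheaders())  # list of (header, value) tuples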
We will talk more about this (sending data, headers, etc.) later in this article.

Urllib.parse

This module is used, as the name says, for parsing URLs (uniform resource locators), i.e. splitting a URL string into its components such as scheme, network location, path, params, query, etc., or combining URL components back into a URL string.

Here we cover some of the functions of the urllib.parse module; in my experience, the most commonly used ones are urlparse(), urlunparse() and urlencode().

Functions of urllib.parse

  • urllib.parse.urlparse => separates a URL string into its components.
  • urllib.parse.urlunparse => joins components back into a URL string.
  • urllib.parse.urlencode => encodes a mapping object into a percent-encoded ASCII string; mainly used when sending POST requests.
  • urllib.parse.urlsplit => similar to urlparse(), but doesn't split params from the URL.
  • urllib.parse.urlunsplit => combines the tuple elements returned by urlsplit() into a URL string.
  • urllib.parse.urljoin => returns an absolute URL by combining a base URL with another URL.
  • urllib.parse.urldefrag => if the URL contains a fragment, returns the URL with the fragment removed, along with the fragment itself; otherwise returns the same URL with an empty fragment.
  • urllib.parse.unwrap => extracts the URL from a wrapped URL (e.g. <URL:scheme://host/path>).

Example of urllib.parse module

Example 3
import urllib.parse

url = 'https://mrjayideas.blogspot.com/'

parse = urllib.parse.urlparse(url)
unparse = urllib.parse.urlunparse(parse)
split = urllib.parse.urlsplit(url)

print(parse)
print(unparse)
print(split)

Returns => 
ParseResult(scheme='https', netloc='mrjayideas.blogspot.com', path='/', params='', query='', fragment='')
https://mrjayideas.blogspot.com/
SplitResult(scheme='https', netloc='mrjayideas.blogspot.com', path='/', query='', fragment='')
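
A URL with more components makes the split clearer. A quick sketch, using a made-up URL:

import urllib.parse

url = 'https://mrjayideas.blogspot.com/search?q=python#results'
p = urllib.parse.urlparse(url)

print(p.scheme)    # 'https'
print(p.netloc)    # 'mrjayideas.blogspot.com'
print(p.path)      # '/search'
print(p.query)     # 'q=python'
print(p.fragment)  # 'results'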


Example 4 (urlencode)

import urllib.parse

data = {
		'q':'hello world',
		's':'search'
}

enc = urllib.parse.urlencode(data)
print(enc)
Returns =>
q=hello+world&s=search
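
The other helpers from the list above work in the same spirit. Here is a small sketch of urljoin() and urldefrag(), using made-up URLs for illustration:

import urllib.parse

base = 'https://mrjayideas.blogspot.com/2021/01/post.html'

# urljoin resolves a (possibly relative) URL against a base URL
print(urllib.parse.urljoin(base, '../about.html'))
# => https://mrjayideas.blogspot.com/2021/about.html

# urldefrag splits off the fragment part
print(urllib.parse.urldefrag('https://mrjayideas.blogspot.com/page.html#intro'))
# => DefragResult(url='https://mrjayideas.blogspot.com/page.html', fragment='intro')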

Urllib.error

This module contains the exception classes raised by the urllib.request module. Whenever an error occurs while fetching a URL, urllib.request raises one of these exceptions.

The exceptions raised are:
  • URLError => the base class. It is raised when there is no network connection, the specified URL doesn't exist, or some other problem occurs while fetching the URL.
  • HTTPError => a subclass of URLError. Typical errors include 401 (authentication required), 403 (request forbidden), and 404 (page not found).

Example 5 (for URLError)
import urllib.request
import urllib.error

try:
    url = 'https://mrjayideas.blogspot.com/'

    resp = urllib.request.urlopen(url)

    print(resp.read())

except urllib.error.URLError as e:
    print(str(e))

Returns => 
<urlopen error [Errno 7] No address associated with hostname>
This raises a URLError because there is no internet connection.

Example 6 (for HTTPError)
import urllib.request
import urllib.error

try:
    url = 'https://mrjayideas.blogspot.com/xyz'

    resp = urllib.request.urlopen(url)

    print(resp.read())

except urllib.error.HTTPError as e:
    print(str(e))

Returns =>
HTTP Error 404: Not Found

This raises an HTTP 404 error because the URL doesn't exist.
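
Because HTTPError is a subclass of URLError, you can catch both in one try block. Put the HTTPError handler first, otherwise it will never be reached:

import urllib.request
import urllib.error

try:
    resp = urllib.request.urlopen('https://mrjayideas.blogspot.com/xyz')
    print(resp.read())
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code, e.reason)  # e.g. 404 Not Found
except urllib.error.URLError as e:
    print('URL error:', e.reason)  # e.g. DNS failure, no network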

Urllib.robotparser

This urllib module provides a single class, RobotFileParser. It is used to answer questions about whether or not a particular user agent can fetch a URL on the site that published the robots.txt file. The robots.txt file tells web scrapers which parts of a website they may access and which they may not.

Example 7
import urllib.robotparser

robo = urllib.robotparser.RobotFileParser()

robo.set_url('https://mrjayideas.blogspot.com/robots.txt')
robo.read()  # fetch and parse robots.txt; without this, can_fetch() always returns False

a = robo.can_fetch('*', 'https://mrjayideas.blogspot.com/')
b = robo.can_fetch('*', 'https://mrjayideas.blogspot.com/search')

print(a)
print(b)

Returns =>
True
False

Here the output reflects Blogspot's robots.txt, which allows all user agents to fetch the home page but disallows /search; the exact result depends on the site's current rules.
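
RobotFileParser also has crawl_delay() and request_rate() methods (Python 3.6+), which report the Crawl-delay and Request-rate rules from robots.txt for a given user agent; a short sketch:

import urllib.robotparser

robo = urllib.robotparser.RobotFileParser()
robo.set_url('https://mrjayideas.blogspot.com/robots.txt')
robo.read()

# both return None when robots.txt defines no such rule for the agent
print(robo.crawl_delay('*'))
print(robo.request_rate('*'))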


Other examples


Example 8 (sending data with a URL to make a POST request)

import urllib.request
import urllib.parse

url = 'https://pythonprogramming.net/search/'

data = {'q': 'machine learning'}

enc = urllib.parse.urlencode(data)  # percent-encode the form data
enc = enc.encode('utf-8')           # urlopen needs bytes, not str

req = urllib.request.Request(url, enc)
resp = urllib.request.urlopen(req)
print(resp.read())

Returns => the HTML source of the search results page.

The above example (Example 8) makes an HTTP POST request. Whenever we send data with the request like this, it automatically becomes a POST request.
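
If you want the data in the URL itself, i.e. a plain GET request, append the encoded string as a query string instead of passing it as data. A minimal sketch (whether the endpoint accepts GET this way depends on the site):

import urllib.request
import urllib.parse

url = 'https://pythonprogramming.net/search/'
data = {'q': 'machine learning'}

# appending the encoded data to the URL keeps the request a GET
get_url = url + '?' + urllib.parse.urlencode(data)

resp = urllib.request.urlopen(get_url)
print(resp.read())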

Example 9 (some websites block access when they are visited from a program; to overcome this problem we can set request 'headers'.)

import urllib.request
import urllib.parse

url = 'https://www.google.com/search?q=types+of+protocol'

# case 1
try:
    resp = urllib.request.urlopen(url)  # raises an HTTP 403 error because Google blocks programmatic access to its website
except Exception as e:
    print(str(e))

# case 2
try:
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    print(resp.read())
except Exception as e:
    print(str(e))

Returns => In case 1 it returns an HTTP 403 error, and in case 2, because we set a User-Agent header, it returns the HTML code.

Conclusion

I hope you liked this article. If you want to know more about the urllib module, I suggest you read its source code, which is part of CPython and available on GitHub at https://github.com/python/cpython/tree/main/Lib/urllib (note that the urllib3 project on GitHub is a separate third-party library, not this standard module). I also suggest you use the dir() function to discover all the functions and methods of this module.
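
For example, a quick way to list everything a submodule exposes:

import urllib.parse

# dir() lists every name (functions, classes, constants) the module exposes
print(dir(urllib.parse))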
