urllib is a Python 3 standard library package for working with URLs. It lets a Python program open and read resources on the web. It supports several protocols, such as HTTP and FTP, but HTTP and HTTPS are by far the most commonly used. With urllib, we can make different types of HTTP requests, such as GET, POST, and PUT.
The urllib package is made up of several modules for working with URLs :-
- urllib.request => for opening and reading URLs.
- urllib.parse => for parsing URLs.
- urllib.error => contains the exceptions raised by urllib.request.
- urllib.robotparser => for parsing robots.txt files.
In this article, we will go through each of these urllib modules in an easy and understandable manner, along with examples.
urllib.request
urllib.request is the most important module of the urllib package. It defines several classes and functions for working with URLs.
The simplest way to use this module is to call the urlopen() function. It accepts either a URL string or a Request object (an instance of the Request class of urllib.request). It opens the URL and returns a file-like object, whose read() method returns the response body.
Let's see an example (HTTP GET request) :-
Example 1
import urllib.request

url = 'https://mrjayideas.blogspot.com/'
resp = urllib.request.urlopen(url)
print(resp.read())
This prints the HTML source code of the specified URL as a bytes object.
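Since read() returns bytes, the result usually needs to be decoded before it can be treated as text. Here is a minimal sketch of that step. The helper name fetch_text is hypothetical, and the data: URL is only a stand-in so that the snippet runs without a network connection; an http or https URL can be passed in the same way.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Open a URL and return its body as a str (hypothetical helper)."""
    # The with-block closes the response even if reading fails.
    with urllib.request.urlopen(url) as resp:
        # Use the charset declared in the Content-Type header, if any.
        charset = resp.headers.get_content_charset() or default_charset
        return resp.read().decode(charset)


# A data: URL is used here as an offline stand-in for a real web page.
sample_url = "data:text/plain;charset=utf-8,hello%20world"
text = fetch_text(sample_url)
print(text)
```

Using the response as a context manager is a design choice that avoids leaving connections open when an exception occurs halfway through.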
Key points
1. urllib.request.urlopen is a function that, for HTTP and HTTPS URLs, returns an object of the http.client.HTTPResponse class.
2. The read() function used above is a method of the HTTPResponse class. You can see the other methods of this class with the following code :-
Example 2
import urllib.request

url = 'https://mrjayideas.blogspot.com/'
resp = urllib.request.urlopen(url)
print(dir(resp))
Which returns =>
['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']
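Most of the names in that listing are internal (they start with an underscore). As a rough sketch, the listing can be narrowed down to the public names by inspecting the HTTPResponse class itself, which also works without a network connection. Note that attributes such as status and headers are set on each response object when it is created, so they are not expected to show up when looking at the class alone.

```python
import http.client

# Keep only the names that do not start with an underscore.
public_names = [
    name for name in dir(http.client.HTTPResponse)
    if not name.startswith("_")
]
print(public_names)
```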
We will talk more about it (sending data, headers, etc.) later in this article.
urllib.parse
As the name says, this module is used for parsing URLs (Uniform Resource Locators), i.e. splitting a URL string into its components such as scheme, network location, path, params, query, etc., or combining URL components back into a URL string.
Here we cover some of the functions of the urllib.parse module. In my experience, the most commonly used ones are urlparse(), urlunparse() and urlencode().
Functions of urllib.parse
Function name | Use |
urllib.parse.urlparse | Splits a URL string into its components. |
urllib.parse.urlunparse | Joins the components back into a URL string. |
urllib.parse.urlencode | Encodes a mapping object (or a sequence of two-element tuples) into a percent-encoded ASCII string. Mainly used when sending a POST request. |
urllib.parse.urlsplit | Similar to urlparse(), but does not split the params from the URL. |
urllib.parse.urlunsplit | Combines the tuple elements returned by urlsplit() to form a URL string. |
urllib.parse.urljoin | Returns an absolute URL by combining a base URL with another URL. |
urllib.parse.urldefrag | Returns the URL with any fragment removed, along with the fragment itself; a URL without a fragment is returned unchanged. |
urllib.parse.unwrap | Extracts the URL from a wrapped URL. |
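Two of the less common functions in the table, urljoin() and urldefrag(), can be sketched as follows. The example.com addresses are placeholders chosen for illustration and are not from any real site.

```python
from urllib.parse import urljoin, urldefrag

base = "https://example.com/docs/intro.html"

# A relative name replaces the last path segment of the base URL.
sibling = urljoin(base, "setup.html")
# A path starting with "/" replaces the whole path of the base URL.
rooted = urljoin(base, "/about")

# urldefrag() returns the URL without the fragment, plus the fragment.
clean_url, fragment = urldefrag("https://example.com/docs/intro.html#install")

print(sibling)    # https://example.com/docs/setup.html
print(rooted)     # https://example.com/about
print(clean_url)  # https://example.com/docs/intro.html
print(fragment)   # install
```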
Example of urllib.parse module
Example 3
import urllib.parse

url = 'https://mrjayideas.blogspot.com/'
parse = urllib.parse.urlparse(url)
unparse = urllib.parse.urlunparse(parse)
split = urllib.parse.urlsplit(url)
print(parse)
print(unparse)
print(split)
Returns =>
ParseResult(scheme='https', netloc='mrjayideas.blogspot.com', path='/', params='', query='', fragment='')
https://mrjayideas.blogspot.com/
SplitResult(scheme='https', netloc='mrjayideas.blogspot.com', path='/', query='', fragment='')
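The URL in the example above has no params, query or fragment, so the difference between urlparse() and urlsplit() is not visible there. Here is a small sketch with a made-up URL (an assumption for illustration) that contains every component, and that reads the components by attribute name instead of printing the whole tuple.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL containing params (;type=a), a query and a fragment.
url = "https://example.com/path/page;type=a?q=hello&lang=en#top"

parsed = urlparse(url)
print(parsed.scheme)    # https
print(parsed.netloc)    # example.com
print(parsed.path)      # /path/page
print(parsed.params)    # type=a
print(parsed.query)     # q=hello&lang=en
print(parsed.fragment)  # top

# urlsplit() leaves the params attached to the path.
split = urlsplit(url)
print(split.path)       # /path/page;type=a
```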
Example 4 (urlencode)
import urllib.parse

data = {
    'q': 'hello world',
    's': 'search'
}
enc = urllib.parse.urlencode(data)
print(enc)
Returns =>
q=hello+world&s=search
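The encoded string is typically appended to a URL after a "?" to form a GET request, and parse_qs() from the same module can turn it back into a dictionary. Below is a short sketch of that round trip; the example.com search address is a placeholder, not a real endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"q": "hello world", "s": "search"}

# Build a full GET URL from a base address and the encoded parameters.
full_url = "https://example.com/search?" + urlencode(params)
print(full_url)  # https://example.com/search?q=hello+world&s=search

# Decode the query string again; every value comes back as a list.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)   # {'q': ['hello world'], 's': ['search']}
```

The values are lists because the same key may legally appear more than once in a query string.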
urllib.error
This module contains the exception classes raised by the urllib.request module. Whenever an error occurs while fetching a URL, urllib.request raises one of these exceptions.
The exceptions raised are :-
- URLError => It is the base class. It is raised when there is no network connection, the specified host doesn't exist, or there is some other problem in fetching the URL.
- HTTPError => It is a subclass of URLError, raised when the server returns an HTTP error status. Typical errors include 401 (authentication required), 403 (request forbidden), and 404 (page not found).
Example 5 (for URLError)
import urllib.request
import urllib.error

try:
    url = 'https://mrjayideas.blogspot.com/'
    resp = urllib.request.urlopen(url)
    print(resp.read())
except urllib.error.URLError as e:
    # Raised when the server cannot be reached at all
    print(str(e))
Returns =>
<urlopen error [Errno 7] No address associated with hostname>
This raises a URLError because there is no internet connection.
Example 6 (for HTTPError)
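import urllib.request
import urllib.error

try:
    url = 'https://mrjayideas.blogspot.com/xyz'
    resp = urllib.request.urlopen(url)
    print(resp.read())
except urllib.error.HTTPError as e:
    # Raised when the server answers with an error status such as 404
    print(str(e))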
Returns =>
HTTP Error 404: Not Found
This raises an HTTP 404 error because this URL doesn't exist.
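Because HTTPError is a subclass of URLError, a program that wants to treat the two cases differently has to catch HTTPError first. The following is a minimal sketch of that pattern. The function name fetch_or_report is hypothetical, and the made-up scheme foo:// is used only to trigger a URLError without needing a network connection.

```python
import urllib.error
import urllib.request


def fetch_or_report(url):
    """Return the response body, or a short message describing the failure."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        # The server was reached but answered with an error status.
        return f"server replied with status {e.code} ({e.reason})"
    except urllib.error.URLError as e:
        # The request never got a valid HTTP answer.
        return f"could not fetch the URL: {e.reason}"


# An unsupported scheme raises URLError before any network access.
message = fetch_or_report("foo://example.invalid/")
print(message)
```

If the two except clauses were swapped, the URLError branch would catch HTTP errors as well, and the status code would never be reported.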
urllib.robotparser
This urllib module contains only one class, RobotFileParser. It answers questions about whether or not a particular user agent can fetch a URL on the website that published the robots.txt file. The robots.txt file tells web scrapers which parts of a website they may access and which they may not.
Example 7
import urllib.robotparser

robo = urllib.robotparser.RobotFileParser()
robo.set_url('https://mrjayideas.blogspot.com/robots.txt')
robo.read()  # download and parse robots.txt; without this no rules are loaded

# Ask whether any user agent ('*') may fetch the home page
a = robo.can_fetch('*', 'https://mrjayideas.blogspot.com/')
print(a)
Returns =>
True or False, depending on the rules in the site's robots.txt at the time of the request. If read() (or parse()) is never called, can_fetch() always returns False, because no rules have been loaded.
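To see how the rules are evaluated without downloading anything, the parse() method can be given the lines of a robots.txt file directly. The rules and the example.com URLs in this sketch are invented for illustration and do not describe any real website.

```python
import urllib.robotparser

# A made-up robots.txt, split into lines as parse() expects.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

robo = urllib.robotparser.RobotFileParser()
robo.parse(rules)

home_allowed = robo.can_fetch("*", "https://example.com/index.html")
private_allowed = robo.can_fetch("*", "https://example.com/private/data.html")

print(home_allowed)     # True
print(private_allowed)  # False
```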
Other examples
Example 8 (sending data with url for making POST request)
import urllib.request
import urllib.parse

url = 'https://pythonprogramming.net/search/'
data = {'q': 'machine learning'}
enc = urllib.parse.urlencode(data)
enc = enc.encode('utf-8')  # the request body must be bytes
req = urllib.request.Request(url, enc)
resp = urllib.request.urlopen(req)
print(resp.read())
Returns => the HTML source code of the response page.
The above example (Example 8) uses an HTTP POST request. When we pass data to the Request, urllib automatically sends a POST request instead of a GET request.
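The switch from GET to POST can be checked without sending anything, because a Request object reports the method it would use through get_method(). Here is a small sketch; the example.com address is a placeholder, and the method="PUT" line shows the optional method argument for cases where neither default is wanted.

```python
import urllib.parse
import urllib.request

url = "https://example.com/search/"  # placeholder address
payload = urllib.parse.urlencode({"q": "machine learning"}).encode("utf-8")

get_req = urllib.request.Request(url)                # no data: GET
post_req = urllib.request.Request(url, data=payload) # data given: POST
put_req = urllib.request.Request(url, data=payload, method="PUT")

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(put_req.get_method())   # PUT
print(post_req.data)          # b'q=machine+learning'
```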
Example 9 (some websites block access when they are browsed by a program; to overcome this problem we can send 'headers'.)
import urllib.request
import urllib.error

url = 'https://www.google.com/search?q=types+of+protocol'

# case 1: no custom headers
try:
    # This typically raises an HTTP 403 error because Google blocks programmatic access to its website
    resp = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(str(e))

# case 2: send a browser-like User-Agent header
try:
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    print(resp.read())
except urllib.error.URLError as e:
    print(str(e))
Returns => In case 1 it returns an HTTP 403 error, and in case 2, because we send a User-Agent header, it returns the HTML code.
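Headers can also be added one at a time with add_header(), and inspected before the request is sent. The sketch below uses a placeholder URL and a made-up User-Agent value (my-script/0.1), both assumptions for illustration. One detail worth knowing is that Request stores header names in capitalized form, so the lookup key is 'User-agent'.

```python
import urllib.request

req = urllib.request.Request(
    "https://example.com/",                   # placeholder address
    headers={"User-Agent": "my-script/0.1"},  # made-up client name
)
req.add_header("Accept-Language", "en")

# Header names are normalized with str.capitalize() when stored.
user_agent = req.get_header("User-agent")
language = req.get_header("Accept-language")

print(user_agent)  # my-script/0.1
print(language)    # en
```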
Conclusion
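I hope you liked this article. If you want to know more about the urllib package, I suggest reading its source code, which is part of the CPython repository on GitHub: https://github.com/python/cpython/tree/main/Lib/urllib. Note that https://github.com/urllib3/urllib3 is a separate third-party library and not the source of the standard urllib package. I also suggest using the dir() function to list all the functions and methods of each module.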