Since version 3.0, Python can decode UTF-8 out of the box. This article explains Python’s Unicode support and how you can decode popular encodings like UTF-8.
Decode UTF-8 In Python
Unicode Support In Python
Since version 3.0, Python’s str type holds Unicode characters by default, regardless of whether you use the single, double, or triple-quoted syntax. A str is a sequence of Unicode code points rather than UTF-8 bytes; UTF-8 only comes into play when a string is converted to or from bytes.
UTF-8 is also the default encoding for Python source code. This allows Python programs to include Unicode characters in string literals. Additionally, identifiers such as variable names can use Unicode characters in Python 3.
>>> désignation = "présidente"
If you want to keep your codebase ASCII-only, or you can’t enter Unicode characters in your text editor, you can use escape sequences in string literals to represent those characters.
>>> string = "\u0152"
Note: depending on your platform and software, this character may be displayed either as a \u escape or as the actual Œ glyph (U+0152).
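As a small illustration of the escape sequences described above, the sketch below spells the same one-character string in four ways. The variable names are made up for this example; the escape forms themselves (\u, \U, and \N{...}) are standard Python string syntax.

```python
# Four spellings of the same one-character string (U+0152).
literal = "Œ"                                    # typed directly
short_escape = "\u0152"                          # \u takes 4 hex digits
long_escape = "\U00000152"                       # \U takes 8 hex digits
named_escape = "\N{LATIN CAPITAL LIGATURE OE}"   # Unicode character name

print(literal == short_escape == long_escape == named_escape)  # True
print(hex(ord(short_escape)))                    # 0x152
print(ascii(literal))                            # '\u0152'
```

The built-in ascii() function is handy when you need to see the escaped form of a string no matter how your terminal renders it.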
Python supports about 100 different encoding systems out of the box, including UTF-8. Some of them have more than one name. For instance, ‘8859’, ‘iso_8859_1’, and ‘latin-1’ are the names of the same encoding in Python (formally known as ISO/IEC 8859-1).
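One way to check that several names refer to the same encoding is to resolve them with codecs.lookup() from the standard library. The sketch below uses the three names mentioned above; the variable names are hypothetical.

```python
import codecs

aliases = ["8859", "iso_8859_1", "latin-1"]

# codecs.lookup() resolves each alias to a CodecInfo object;
# its .name attribute is the codec's canonical name.
canonical_names = {codecs.lookup(alias).name for alias in aliases}
print(canonical_names)   # a set with a single canonical name

# All three names produce the same bytes for the same text.
encoded = {"présidente".encode(alias) for alias in aliases}
print(encoded)           # a set with a single bytes object
```

If a name is not known to Python, codecs.lookup() raises a LookupError, so it can also serve as a quick validity check for user-supplied encoding names.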
str.encode() and bytes.decode()
When you want your Python program to work with Unicode data, input and output are your main problems. You need to get encoded text, such as the UTF-8 bytes of a line read from a file, into the program as strings, and convert those strings back to a form suitable for transmission and storage.
If your input sources and output destinations handle the encoding for you, you aren’t required to do anything. In those cases, you just need to check whether UTF-8 is supported natively by the libraries in your program.
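The built-in open() function is one example of a library call that handles the encoding for you when a file is opened in text mode. The following is a minimal sketch under that assumption; the file name example.txt and the variable names are hypothetical, and the file lives in a temporary directory so nothing is left behind.

```python
import os
import tempfile

text = "présidente\nsecond line\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary directory.
    path = os.path.join(tmp_dir, "example.txt")

    # Text mode: open() encodes str to UTF-8 bytes on write...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and decodes UTF-8 bytes back to str on read.
    with open(path, encoding="utf-8") as f:
        first_line = f.readline()

    # Binary mode returns the raw, still-encoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(repr(first_line))   # a str, already decoded
print(raw)                # a bytes object, not yet decoded
```

Passing encoding="utf-8" explicitly is a deliberate choice here: when the argument is omitted, open() may fall back to a platform-dependent default.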
Most of the time, you need to rely on the str.encode() and bytes.decode() methods. They are closely related but have opposite purposes.
While the str type is designed for representing human-readable text and can have any Unicode character, bytes objects in Python are sequences of single bytes without any encoding attached.
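The difference between the two types can be seen in a short sketch. The variable names are made up for this example, and the bytes literal is written out by hand as the two UTF-8 bytes that represent "é".

```python
text = "é"             # a str holding one Unicode character
data = b"\xc3\xa9"     # a bytes object holding the two UTF-8 bytes for "é"

print(len(text), len(data))   # 1 2
print(text[0])                # é   (indexing a str yields a str)
print(data[0])                # 195 (indexing bytes yields an int)
print(list(data))             # [195, 169]
```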
The str.encode() method returns a bytes object that contains an encoded version of a string. Its syntax:
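str.encode(encoding='utf-8', errors='strict')
The default value of the encoding parameter is ‘utf-8’. You can also pass the errors parameter to specify the error handling scheme. Its default value, ‘strict’, raises a UnicodeEncodeError when a character cannot be encoded.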
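To illustrate the errors parameter, the sketch below encodes a string to ASCII, which cannot represent "é", and compares a few of the standard error handlers. The variable names are hypothetical.

```python
text = "présidente"

# 'strict' (the default) raises when a character can't be encoded.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# Other handlers drop or substitute the offending character.
results = {
    handler: text.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}
for handler, encoded in results.items():
    print(handler, encoded)
# ignore             b'prsidente'
# replace            b'pr?sidente'
# backslashreplace   b'pr\\xe9sidente'
# xmlcharrefreplace  b'pr&#233;sidente'
```

Since UTF-8 can represent every Unicode character, these handlers mostly matter when you target a narrower encoding such as ASCII or Latin-1.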
>>> name = "ITTutoria"
>>> nameBytes = name.encode()
>>> type(nameBytes)
<class 'bytes'>
>>> print(nameBytes)
b'ITTutoria'
The encode() method above returns a bytes object that holds the UTF-8 representation of the “ITTutoria” string. We don’t need to provide the encoding since it uses UTF-8 by default.
We can see the difference between the str and bytes objects of the same string literal by using a non-ASCII character:
>>> designation = "présidente"
>>> bytesObj = designation.encode()
>>> print(designation)
présidente
>>> print(bytesObj)
b'pr\xc3\xa9sidente'
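The print() function shows the representation of the bytes object: bytes in the printable ASCII range appear as characters, while the others, such as the two bytes that encode é, appear as \x escapes. To convert those bytes back to str, we can use the bytes.decode() method. It decodes the given bytes and returns the resulting string. Like str.encode(), this method uses UTF-8 as the default character encoding.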
>>> b'ITTutoria'.decode()
'ITTutoria'
>>> b'pr\xc3\xa9sidente'.decode()
'présidente'
As you can see, the method has converted the two bytes literals back to strings, the second of which contains a non-ASCII Unicode character.
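Decoding only works when the bytes really are in the encoding you name. The sketch below shows what can happen otherwise, assuming some bytes that were produced with Latin-1 rather than UTF-8; the variable names are hypothetical.

```python
# Bytes produced with Latin-1, where "é" is the single byte 0xE9.
latin1_bytes = "présidente".encode("latin-1")

# Decoding them as UTF-8 fails under the default 'strict' handler.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode byte at position", exc.start)

# Naming the codec that produced the bytes recovers the text.
recovered = latin1_bytes.decode("latin-1")
print(recovered)          # présidente

# errors='replace' substitutes U+FFFD for each undecodable byte.
lossy = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(lossy))       # 'pr\ufffdsidente'
```

The 'replace' handler keeps the program running but loses information, so it is best suited to display or logging rather than to data you intend to store.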
Conclusion
You can easily decode UTF-8 in Python thanks to the language’s built-in support for Unicode. In fact, UTF-8 is the default encoding for Python source code as well as for the str.encode() and bytes.decode() methods.