Python String Prefixes and How to Deal

Python strings are one of those things that work most of the time without you having to understand their inner workings. Then one day they don’t, for no obvious reason. Once you start digging into the details, things get a little more complicated.

It gets more complicated still, because Python 2 and Python 3 handle strings differently. What didn’t work in Python 2 seems to work in Python 3, but now there are more things to consider.

Character encoding and decoding

Computers spend a lot of time switching between making data easy for humans to consume and making it easy for themselves to process. This happens a lot with strings. We should be able to read them, but computers should also be able to store and transmit them.

The process of converting strings into a form that can be stored and transmitted efficiently is called encoding: for example, turning a letter of the alphabet into a sequence of bits or bytes. Decoding is the reverse, turning those bytes back into characters.
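
As a minimal sketch of that round trip in Python 3 (the variable names are only illustrative), encoding a short string gives one number per character, and decoding gives the text back:

```python
# Encoding turns text (str) into bytes; decoding reverses it.
text = "hello"
encoded = text.encode("ascii")        # str -> bytes
decoded = encoded.decode("ascii")     # bytes -> str

# Show each byte as the eight bits the computer actually stores.
bits = " ".join(f"{byte:08b}" for byte in encoded)

print(encoded)         # b'hello'
print(list(encoded))   # [104, 101, 108, 108, 111]
print(bits)
print(decoded)         # hello
```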

ASCII

ASCII was one of the first standards for switching between human-readable text and data. It uses 7 bits to represent 128 characters. This was enough for all the English upper- and lower-case letters, the digits 0-9, punctuation, and control characters such as new line.

It was good enough to do most of what it was made for: a standard way to efficiently store or transmit a sequence of bytes and faithfully translate that sequence back into the characters it represents. But ASCII doesn’t work very well when you need to display less common characters or characters from other languages.

There were versions of ASCII that use all 8 bits in the byte. It doubles the number of characters that can be represented to 256, but that is still nowhere near the number of characters that are needed.
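
To make the limits concrete, here is a small sketch. It uses Latin-1 (ISO 8859-1) as one example of an 8-bit extension of ASCII; that choice is mine, and the article does not name a specific one.

```python
# Every ASCII code point fits in 7 bits (0 to 127).
codes = [ord(ch) for ch in "Az0\n"]
for ch, code in zip("Az0\n", codes):
    print(repr(ch), code, format(code, "07b"))

# Latin-1 is one 8-bit extension of ASCII: 'é' fits in a single byte.
latin1_bytes = "é".encode("latin-1")
print(latin1_bytes)   # b'\xe9'

# Characters outside its 256 slots cannot be represented at all.
try:
    "€".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError as exc:
    euro_fits = False
    print("cannot encode:", exc.reason)
```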

Unicode

Unicode is a standard that was built to deal with this issue. Its aim is to represent all characters in all human languages. Because of this, new scripts and characters are still being added to the standard. This includes symbols as well (like a smiley face).

There are multiple encodings of Unicode, such as UTF-8, UTF-16 and UTF-32. Although you can use all of these in Python, the most commonly used one is UTF-8. Python 3 uses UTF-8 as the default encoding for source files and for str.encode() and bytes.decode().
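
A quick sketch of the same text under three Unicode encodings that ship with Python. The codec names are standard; the loop and the variable names are just for illustration.

```python
text = "café"

sizes = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    data = text.encode(codec)
    sizes[codec] = len(data)
    print(f"{codec:>6}: {len(data):2d} bytes  {data!r}")
    # Whatever the encoding, decoding with the same codec restores the text.
    assert data.decode(codec) == text
```

The "utf-16" and "utf-32" codecs prepend a byte order mark when encoding, which is part of why their output is longer.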

Representing a larger number of characters means that each character needs to be stored in a larger data structure. Considering that the majority of characters in many texts are standard English characters, this could lead to a lot of waste. UTF-8 gets around this by making the first 128 characters the same as ASCII, which means they can be stored in a single byte. Anything beyond this is stored as multiple bytes.

It is efficient at storing data. ASCII characters are stored in one byte and non-ASCII characters are stored in two to four bytes. More than one byte is used only when it is needed. This minimizes wastage.
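
A quick way to check how many bytes UTF-8 actually uses is to encode a few characters and count. The sample characters below are my own picks; the count ranges from one byte up to four.

```python
samples = ["a", "é", "€", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```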

Python 2.x and 3.x differences in handling strings

In Python 2.x the default str type is a sequence of bytes and source files are assumed to be ASCII. So if you want to use non-ASCII characters, they need to be written as escaped bytes.

> print 'café'
SyntaxError: Non-ASCII character '\xc3' in file main.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

> # print 'café'
SyntaxError: Non-ASCII character '\xc3' in file main.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

> print b'caf\xc3\xa9'
café

> print b'caf\xc3\xa9'.decode('utf-8')
café

Without an encoding declaration, you can’t even have non-ASCII characters in comments.

In that situation, the only way to work with non-ASCII characters is to write them as escaped UTF-8 bytes and decode them when you need text.
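
The error message above points to PEP 263, which describes another option: declaring the source encoding at the top of the file. A minimal sketch, assuming the file itself is saved as UTF-8 (the variable name is illustrative):

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration; it has to be the first or
# second line of the file. Python 3 assumes UTF-8 and does not need it.
word = u'café'   # with the declaration, a comment containing é is fine too
print(word)
```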

The fundamental difference between Python 2.x and Python 3.x with regard to string processing is that Python 2.x strings are bytes that are treated as ASCII by default, whereas Python 3.x strings are Unicode, with UTF-8 as the default encoding.

This is an extract from the utf-8 character map.

Because of this, handling non ASCII characters in Python 3 is standard.

>>> print('café')
café

>>> 'café' == u'café'
True

In Python 3.x Unicode is standard, so the u prefix is accepted but makes no difference.

Different string prefixes

A bytes literal (the b prefix) can contain only ASCII characters. Any non-ASCII byte needs to be written as an escape such as \xc3. Encoding a string produces a bytes object, and decoding it gives the string back.

>>> s = 'café'.encode('utf-8')
>>> print(s)
b'caf\xc3\xa9'

>>> s.decode('utf-8')
'café'
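
The two types behave differently in ways that are easy to trip over. The sketch below uses illustrative variable names and shows three of them in Python 3: lengths differ, indexing bytes gives integers, and the types do not mix.

```python
text = "café"
data = text.encode("utf-8")

print(len(text), len(data))   # 4 characters, 5 bytes
print(text[3], data[3])       # a one-character str vs. the integer 195
print(type(text).__name__, type(data).__name__)

# Mixing the two types is an error rather than a silent conversion.
try:
    text + data
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)
```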

If you try to encode this to ASCII you will get an error, as expected. But there are ways of dealing with it, either by ignoring the error or by substituting the non-ASCII character with another one.

>>> 'café'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)
>>> 'café'.encode('ascii', 'ignore')
b'caf'
>>> 'café'.encode('ascii', 'replace')
b'caf?'
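
Python ships a few more error handlers besides these two. The sketch below compares four of them on the same input; the handler names are built in, the rest is illustrative.

```python
text = "caf\u00e9"   # 'café', written with an escape so this source stays ASCII

results = {
    handler: text.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}
for handler, data in results.items():
    print(f"{handler:>17}: {data!r}")

# Decoding accepts an errors argument too: each undecodable byte
# becomes U+FFFD, the Unicode replacement character.
decoded = b"caf\xc3\xa9".decode("ascii", errors="replace")
print(decoded)
```

Which handler is appropriate depends on whether losing or altering the character is acceptable for the data at hand.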

Other string prefixes

Although not related to encoding, there are other string prefixes in Python that are worth knowing.

Raw strings

Raw strings keep Python from interpreting escape sequences within the string. For instance, the new line (\n) is normally turned into a line break, and a hex escape such as \xe9 is turned into the character it stands for. Using the r prefix keeps this from happening.

>>> print('hello\nhi')
hello
hi
>>> print(r'hello\nhi')
hello\nhi

>>> 'caf\xe9'
'café'
>>> r'caf\xe9'
'caf\\xe9'
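
Two places where raw strings tend to earn their keep are regular expressions and Windows-style paths. A small sketch, with a made-up pattern and path:

```python
import re

# Without the r prefix, '\b' would be a backspace character,
# not the regex word-boundary marker.
pattern = re.compile(r"\bcaf\w\b")
match = pattern.search("one café please")
print(match.group())

# Raw and manually escaped spellings produce the same string.
path = r"C:\new\table.txt"
print(path == "C:\\new\\table.txt")

# r"\n" is two characters (backslash, n); "\n" is one (a line break).
print(len(r"\n"), len("\n"))
```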

f strings

Python 3.6 and later have a handy way of injecting expressions and variables into strings. Prefixing a string with f means you can include these within the string itself, inside curly braces.

>>> name = 'mc gee'
>>> f'My name is {name} and I am {40 + 10} years old'
'My name is mc gee and I am 50 years old'
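
The braces can hold method calls and format specifications as well. A short sketch along the same lines (the price value is made up for illustration):

```python
name = "mc gee"
age = 40 + 10
price = 1234.5

greeting = f"My name is {name.title()} and I am {age} years old"
formatted = f"{price:,.2f}"   # thousands separator, two decimal places
padded = f"[{name:>10}]"      # right-align in a field ten characters wide

print(greeting)
print(formatted)
print(padded)
```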

Summary

Knowing Python strings at a deeper level will make you a better programmer. More importantly, you won’t be caught off guard when you encounter some code with an unknown prefix. There are times when you get errors that completely break your assumptions about what you think you know about the language, and having this deeper level of understanding will help with that.

We have looked at the different string encoding systems and why they exist, as well as the different string prefixes in Python (u, b, r and f) and how and why they are used.
