Demystifying Python Pickle: Serialization and Deserialization Made Easy
What is pickling?
Pickling is the process of converting an object into a byte stream (serialization). This byte stream can then be stored in a file, a database, or transmitted over a network. The reverse process of converting a byte stream back into an object is called un-pickling (deserialization).
Why do we need serialization?
In general, serialization converts complex data structures, including Python built-in and user-defined objects, into a format that can be easily persisted or transmitted.
How to pickle and un-pickle a Python object?
The pickle module serialises a Python object into a byte stream, either written directly to a file or returned as a bytes object held in a variable; this is pickling. De-serialising the pickled object from a file or from a bytes object is un-pickling. The module provides four functions for this: dump and load work with file objects, while dumps and loads work with bytes objects.
import pickle
# This example serialises the object into a byte stream held in a variable
data = {
    'a': [1, 2.0, 3+4j],
    'b': ("character string", b"byte string"),
    'c': {None, True, False}
}
pickled_object = pickle.dumps(obj=data, protocol=pickle.HIGHEST_PROTOCOL)
print(pickled_object)
#output -> b'\x80\x05\x95w\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x01a\x94]\x94(K\x01G@\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x07complex\x94\x93\x94G@\x08\x00\x00\x00\x00\x00\x00G@\x10\x00\x00\x00\x00\x00\x00\x86\x94R\x94e\x8c\x01b\x94\x8c\x10character string\x94C\x0bbyte string\x94\x86\x94\x8c\x01c\x94\x8f\x94(\x89\x88N\x90u.'
deserialized_object = pickle.loads(pickled_object)
print(deserialized_object)
#output -> {'a': [1, 2.0, (3+4j)], 'b': ('character string', b'byte string'), 'c': {False, True, None}}
Protocols?? Wait, what?!
The pickle module uses a set of instructions known as opcodes to serialise and de-serialise objects. Which opcodes are available depends on the protocol version.
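To make the idea of opcodes a little more concrete, here is a minimal sketch that lists the opcodes inside a small pickle using the standard library's pickletools module. The sample list [1, 2] and the variable names are only illustrative choices and are not part of the examples in this post.

```python
import pickle
import pickletools

# Illustrative payload (an assumption for this sketch): a tiny list pickled with protocol 2
payload = pickle.dumps([1, 2], protocol=2)

# genops yields one (opcode, argument, position) triple per instruction in the stream
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(opcode_names)

# dis prints a human-readable disassembly of the same byte stream
pickletools.dis(payload)
```

The stream starts with a PROTO opcode that announces the protocol version and ends with STOP, which tells the un-pickler that the object is complete.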
The pickle module offers six protocols, numbered 0 to 5, with 5 being the most recent. Pickling with a higher protocol version is generally more efficient than pickling with a lower one.
However, a newer protocol version may not be readable by older Python versions: protocol 4 requires Python 3.4 or later, and protocol 5 requires Python 3.8 or later.
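If you are unsure which protocol a byte stream was written with, the first two bytes usually tell you. The sketch below is a rough illustration: detect_protocol is a hypothetical helper written for this post, not a function of the pickle module, and it relies on the fact that protocols 2 and later begin with the PROTO opcode (0x80) followed by the version number.

```python
import pickle

def detect_protocol(payload):
    """Hypothetical helper: return the protocol number from the header, or None."""
    # Protocols 2 and later start with the PROTO opcode (0x80) and one version byte
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    # Protocols 0 and 1 have no header, so they cannot be told apart this way
    return None

print("highest:", pickle.HIGHEST_PROTOCOL, "default:", pickle.DEFAULT_PROTOCOL)
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    detected = detect_protocol(pickle.dumps({"k": 1}, protocol=protocol))
    print(f"requested {protocol}, header says {detected}")
```

When the consumer might run an older Python version, passing an explicit lower protocol to dump or dumps is the simplest way to stay compatible.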
How is efficiency related to the protocol version during serialization?
The efficiency of pickling is influenced by the protocol version in several ways:
Newer protocol versions support more efficient serialization of common data types. For example, protocol version two introduced more efficient pickling of new-style classes, and protocol version four added support for pickling huge objects.
Newer protocol versions may take advantage of new features in the Python language or runtime. For example, protocol version five can use out-of-band data to accelerate pickling of large objects.
Newer protocol versions also bring data format optimizations, such as compact binary opcodes and, from protocol version four, framing of the byte stream, which make serialization and deserialization more efficient.
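One simple way to see the points above is to pickle the same object with every protocol and compare the sizes. This is a small sketch, assuming a sample list of 1000 integers chosen purely for illustration; the exact byte counts depend on the data and the Python version, so none are quoted here.

```python
import pickle

# Illustrative payload (an assumption for this sketch): a list of small integers
sample = list(range(1000))

# Pickle the same object once per available protocol and record the size
sizes = {
    protocol: len(pickle.dumps(sample, protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```

For data like this, the text-based protocol 0 produces a noticeably larger stream than the binary protocols, since each integer is written as ASCII digits instead of a compact binary value.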
But in practice, how do these protocol versions make a difference?
We will explore protocol versions using the same kind of example as before.
import pickle
# This example serialises the same object with two different protocol versions
data = {
    'a': [1, 2.0, 3+4j],
    'b': ("character string", b"this is bytes"),
    'c': {None, True, False}
}
pkld_obj_v_5 = pickle.dumps(obj=data, protocol=pickle.HIGHEST_PROTOCOL)
print(pkld_obj_v_5)
#output -> b'\x80\x05\x95y\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x01a\x94]\x94(K\x01G@\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x07complex\x94\x93\x94G@\x08\x00\x00\x00\x00\x00\x00G@\x10\x00\x00\x00\x00\x00\x00\x86\x94R\x94e\x8c\x01b\x94\x8c\x10character string\x94C\rthis is bytes\x94\x86\x94\x8c\x01c\x94\x8f\x94(\x89\x88N\x90u.'
# Protocol 1 writes no \x80 version header and stores booleans as I00 / I01
pkld_obj_v_1 = pickle.dumps(obj=data, protocol=1)
print(pkld_obj_v_1)
#output -> b'}q\x00(X\x01\x00\x00\x00aq\x01]q\x02(K\x01G@\x00\x00\x00\x00\x00\x00\x00c__builtin__\ncomplex\nq\x03(G@\x08\x00\x00\x00\x00\x00\x00G@\x10\x00\x00\x00\x00\x00\x00tq\x04Rq\x05eX\x01\x00\x00\x00bq\x06(X\x10\x00\x00\x00character stringq\x07c_codecs\nencode\nq\x08(X\r\x00\x00\x00this is bytesq\tX\x06\x00\x00\x00latin1q\ntq\x0bRq\x0ctq\rX\x01\x00\x00\x00cq\x0ec__builtin__\nset\nq\x0f(]q\x10(I00\nI01\nNetq\x11Rq\x12u.'
From the example above, we can clearly see that the serialized byte stream varies with the protocol version used.
Pickle is commonly used for persisting machine learning models, and these persisted models may need to be transmitted over a network. So how should I send a model over the network? Should I share it as an object, or as a file?
There are cases where objects need to be transmitted over the network, either as a file or as a Python object.
We have already learnt how to serialise and de-serialise a Python object held in a variable, using pickle.dumps and pickle.loads.
Now, to persist the model to a file, we will use pickle.dump and pickle.load.
import pickle
# This example serialises the object into a byte stream and stores it in a file
data = {
    'a': [1, 2.0, 3+4j],
    'b': ("character string", b"byte string"),
    'c': {None, True, False}
}
with open("learning.pkl", "wb") as f:
    # pickling the object & writing to a file
    pickle.dump(obj=data, file=f, protocol=pickle.HIGHEST_PROTOCOL)
with open("learning.pkl", "rb") as f:
    # reading the file & deserializing data
    deserialised_object = pickle.load(f)
print(deserialised_object)
#output -> {'a': [1, 2.0, (3+4j)], 'b': ('character string', b'byte string'), 'c': {False, True, None}}
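For the other case, sending an object over the network, the bytes returned by pickle.dumps can be written to a socket directly. The sketch below is only one possible approach under a few assumptions: send_object, recv_exactly and recv_object are hypothetical helpers, the 4-byte length prefix is a framing convention chosen for this example, and socket.socketpair stands in for a real network connection so the code runs on a single machine.

```python
import pickle
import socket
import struct

def send_object(sock, obj):
    """Hypothetical helper: pickle obj and send it with a 4-byte length prefix."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock, n):
    """Hypothetical helper: read exactly n bytes from the socket."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_object(sock):
    """Hypothetical helper: read the length prefix, then un-pickle the payload."""
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    # Only un-pickle bytes that come from a peer you trust
    return pickle.loads(recv_exactly(sock, length))

# socketpair gives two connected sockets in one process, standing in for a network link
left, right = socket.socketpair()
with left, right:
    send_object(left, {"name": "demo", "weights": [0.1, 0.2]})
    received = recv_object(right)
print(received)
```

The length prefix matters because a socket delivers a stream of bytes with no message boundaries, so the receiver needs to know how many bytes make up one pickled object.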
Security Considerations
Warning: Using Python’s pickle module can pose security risks, especially with untrusted data. Un-pickling can execute arbitrary code, so a maliciously crafted byte stream can compromise your system. Only un-pickle data that comes from a reliable and trusted source.
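One way to reduce the risk is to restrict which globals the un-pickler may load, by overriding Unpickler.find_class as the Python documentation suggests. The sketch below follows that pattern; RestrictedUnpickler, restricted_loads and the SAFE_BUILTINS allow-list are names made up for this post, and the allow-list itself is an assumption you would adapt to your own data. It limits what a pickle can do, but it is not a substitute for trusting the source.

```python
import builtins
import io
import pickle

# Assumed allow-list for this sketch: a few harmless built-in types
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow the listed built-ins; refuse every other global
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Hypothetical helper: like pickle.loads, but with the restricted un-pickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# A harmless pickle loads as usual
print(restricted_loads(pickle.dumps([1, 2, range(3)])))

# A hand-written pickle that tries to call os.system is rejected before anything runs
try:
    restricted_loads(b"cos\nsystem\n(S'echo hello'\ntR.")
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```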
Refer to the link for examples of pickle serialization and deserialization → link
Have you ever wondered what happens when we use pickle.dumps, or how pickle works internally? We will dig deeper into the internals of pickle in an upcoming blog post.