Python Serialization - Why Pickle

Pickling is a way to convert a Python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another Python script.

As for where the pickled information is stored, usually one would do:

import pickle

var = {1: 'a', 2: 'b'}
with open('filename', 'wb') as f:
    pickle.dump(var, f)

That would store the pickled version of our var dict in the 'filename' file. Then, in another script, you could load from this file into a variable and the dictionary would be recreated:

with open('filename', 'rb') as f:
    var = pickle.load(f)

Another use for pickling is if you need to transmit this dictionary over a network (perhaps with sockets). You first need to convert it into a character stream, then you can send it over a socket connection.

Also, there is no "compression" to speak of here; it's just a way to convert from one representation (an object in RAM) to another (a stream of bytes).
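A minimal in-memory round trip makes this concrete: pickle.dumps hands you the byte stream directly, and loading it back yields an equal, but separate, object.

```python
import pickle

var = {1: 'a', 2: 'b'}

# pickle.dumps returns the byte stream directly, without touching a file.
data = pickle.dumps(var)
print(type(data))        # <class 'bytes'>

# The stream is another representation, not a compressed copy; loading
# it back rebuilds an equal (but distinct) dictionary.
restored = pickle.loads(data)
print(restored == var)   # True
print(restored is var)   # False
```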

Python: why pickle?

Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:

  • Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here, you just want to dump the program's state as-is and load it later.
  • Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.
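That generality is easy to see with data JSON's model cannot represent directly; a small sketch with made-up program state:

```python
import pickle

# Made-up program state mixing types JSON has no direct equivalent for.
state = {
    'seen': {('a', 1), ('b', 2)},   # a set of tuples
    'blob': b'\x00\x01',            # raw bytes
    'step': 42,
}

payload = pickle.dumps(state)           # ready for a file, socket, or queue
print(pickle.loads(payload) == state)   # True
```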

In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.

You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: Pickle is indeed slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. It adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
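As a sketch of that custom-code cost, compare pickling a small class (a hypothetical Point) with the hand-written mapping the same class needs for JSON:

```python
import json
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)

# pickle handles the custom class with no extra code:
assert pickle.loads(pickle.dumps(p)).x == 1

# JSON needs custom code in both directions:
def point_to_json(point):
    return json.dumps({'x': point.x, 'y': point.y})

def point_from_json(text):
    d = json.loads(text)
    return Point(d['x'], d['y'])

assert point_from_json(point_to_json(p)).y == 2
```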

Is there any difference between Pickling and Serialization?

You are misreading the article. Pickling and serialisation are not synonymous, nor does the text claim them to be.

Paraphrasing slightly, the text says this:

This module implements an algorithm for turning an object into a series of bytes. This process is also called serializing the object.

I removed the module name, pickle, deliberately. The module implements a process, an algorithm, and that process is commonly known as serialisation.

There are other implementations of that process. You could use JSON or XML to serialise data to text. There is also the marshal module. Other languages have other serialization formats; the R language has one, so does Java. Etc.
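The same small dictionary can be run through several of these implementations; only the wire format differs. A quick comparison, assuming builtin-only data:

```python
import json
import marshal
import pickle

data = {'name': 'example', 'values': [1, 2, 3]}

as_json = json.dumps(data)         # human-readable text
as_marshal = marshal.dumps(data)   # CPython-internal binary format
as_pickle = pickle.dumps(data)     # pickle's own binary protocol

print(as_json)   # {"name": "example", "values": [1, 2, 3]}
# All three round-trip back to an equal dictionary:
print(json.loads(as_json) == marshal.loads(as_marshal)
      == pickle.loads(as_pickle) == data)   # True
```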

See the Wikipedia article on the subject:

In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

Python picked the name pickle because it modelled the process on how this was handled in Modula-3, where it was also called pickling. See Pickles: Why are they called that?

Why not use pickle instead of struct?

I think you have a misunderstanding of what struct does.

Struct

struct is not meant to store Python objects in a byte stream. What it does is produce a byte stream by transforming Python objects into structures that represent the data those objects contain; for instance, it uses a signed 32-bit representation for an integer. struct is, however, not designed to store a dictionary, since there are many different ways to serialize a dictionary.

It is used to construct a (binary) file that meets the criteria of a protocol. For instance, if you have a 3D model, you might want to write an exporter to the .3ds file format. This format follows a certain protocol (for instance, it will start with 0x4d4d). You cannot use pickle to dump to such a format, since pickle is itself a specific protocol.

The same holds for reading binary files into Python objects. You cannot run pickle over a .3ds file, since pickle does not know that protocol. It does not know what 0x4d4d at the beginning of the file means: it could be a 16-bit integer (19789), it could be a 2-character ASCII string ('MM'), etc. Most binary files are designed for one purpose, and you need to understand the protocol in order to read or write such files.
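struct makes that ambiguity explicit: you, not the library, decide how the two bytes are to be interpreted. A small sketch:

```python
import struct

header = b'MM'  # the same two bytes, 0x4d4d

# Interpreted as an unsigned 16-bit big-endian integer:
print(struct.unpack('>H', header)[0])   # 19789
# Interpreted as two ASCII characters:
print(header.decode('ascii'))           # MM
```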

Pickle

Pickle, on the other hand, is a tool designed to store Python objects in a binary stream, such that we can load these objects back when we need them. It defines a protocol. For instance, with protocol version 2 and later, pickle starts the stream with byte 128 (the PROTO opcode), followed by the protocol version. The next byte identifies the type of object we are going to pickle (for instance 75 for a small integer, 88 for a string, etc.).
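You can observe those bytes directly by dumping a small integer (protocol 2 shown here):

```python
import pickle

stream = pickle.dumps(42, protocol=2)
print(stream)      # b'\x80\x02K*.'
print(stream[0])   # 128 -- the PROTO opcode
print(stream[1])   # 2   -- the protocol version
print(stream[2])   # 75  -- 'K', the opcode for a small unsigned integer
```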

Pickle also has to serialize every object referenced by the object being pickled, and keep track of the objects it has already serialized, since there can be cyclic structures in the data. For instance, if we have two dictionaries:

d = {}
e = {'a': d}
d['a'] = e

then we cannot simply serialize d and then serialize e as part of d. We have to keep track of the fact that we already serialized d, since serializing e would otherwise result in serializing d again, and so on until we run out of memory.
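A quick check confirms that pickle's memoisation handles exactly this cycle:

```python
import pickle

d = {}
e = {'a': d}
d['a'] = e

# pickle memoises objects it has already written, so the cycle is
# stored once instead of recursing forever.
d2 = pickle.loads(pickle.dumps(d))
print(d2['a']['a'] is d2)   # True -- the cycle is rebuilt faithfully
```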

Pickle is thus a specific protocol for storing Python objects. We cannot use it to serialize to some other specific format that other (non-Python) programs can read.

Why is Pickle not serializing my array of classes?

Pickle only saves the instance attributes of a class, but Synonyms is a list defined at class level. You should create the list in an __init__ method:

import pickle

class Entry:
    def __init__(self, text):
        self.Text = text

class Term:
    def __init__(self):
        self.Main = None
        self.Synonyms = []

def Save():
    term = Term()
    term.Main = Entry("Dog")
    term.Synonyms.append(Entry("Canine"))
    term.Synonyms.append(Entry("Pursue"))
    term.Synonyms.append(Entry("Follow"))
    term.Synonyms.append(Entry("Plague"))

    terms = []
    terms.append(term)

    with open('output.pickle', 'wb') as p:
        pickle.dump(terms, p)

def Load():
    with open('output.pickle', 'rb') as p:
        loadedTerms = pickle.load(p)
    return loadedTerms
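The pitfall is easy to demonstrate in isolation: pickle stores only the instance's __dict__, so a list defined at class level never enters the stream. A stripped-down Term for illustration:

```python
import pickle

class Term:
    Synonyms = []            # class-level: shared by all instances, NOT pickled

    def __init__(self):
        self.Main = None     # instance-level: stored in __dict__, pickled

t = Term()
t.Synonyms.append('canine')  # mutates the shared class attribute

restored = pickle.loads(pickle.dumps(t))
print('Main' in restored.__dict__)      # True  -- instance attribute survived
print('Synonyms' in restored.__dict__)  # False -- class attribute was skipped
```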

Why is Python's pickle not serializing a method as a default argument?

pickle is loading your dictionary data before it has restored the attributes on your instance. As such the self.cond attribute is not yet set when __setitem__ is called for the dictionary key-value pairs.

Note that pickle will never call __init__; instead it'll create an entirely blank instance and restore the __dict__ attribute namespace on that directly.
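This is easy to verify with a class that records its constructor calls (Logged is a made-up example):

```python
import pickle

calls = []

class Logged:
    def __init__(self):
        calls.append('init')
        self.x = 1

obj = Logged()            # __init__ runs once, here
restored = pickle.loads(pickle.dumps(obj))

print(calls)              # ['init'] -- unpickling did not call __init__
print(restored.x)         # 1        -- the __dict__ was restored directly
```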

You have two options:

  • default to cond=None and ignore the condition if it is still set to None:

    class CustomDict(dict):
        def __init__(self, cond=None):
            super().__init__()
            self.cond = cond

        def __setitem__(self, key, value):
            if getattr(self, 'cond', None) is None or self.cond(value):
                dict.__setitem__(self, key, value)

    The getattr() there is needed because a blank instance has no cond attribute at all (it is not set to None, the attribute is entirely missing). You could add cond = None to the class:

    class CustomDict(dict):
        cond = None

    and then just test for if self.cond is None or self.cond(value):.

  • Define a custom __reduce__ method to control how the initial object is created when restored:

    def _default_cond(v):
        return v is not None

    class CustomDict(dict):
        def __init__(self, cond=_default_cond):
            super().__init__()
            self.cond = cond

        def __setitem__(self, key, value):
            if self.cond(value):
                dict.__setitem__(self, key, value)

        def __reduce__(self):
            return (CustomDict, (self.cond,), None, None, iter(self.items()))

    __reduce__ is expected to return a tuple with:

    • A callable that can be pickled directly (here the class does fine)
    • A tuple of positional arguments for that callable; on unpickling the first element is called passing in the second as arguments, so by setting this to (self.cond,) we ensure that the new instance is created with cond passed in as an argument and now CustomDict.__init__() will be called.
    • The next 2 positions are for a __setstate__ method (ignored here) and for list-like types, so we set these to None.
    • The last element is an iterator for the key-value pairs that pickle then will restore for us.

    Note that I replaced the default value for cond with a function here too so you don't have to rely on dill for the pickling.

Limitations of python pickle? Can it serialize anything?

From my experience, the tensorflow library has some issues with objects that are not picklable. See for example this github issue.

I actually don't know where exactly this issue comes from, but it definitely is a negative example.

Is there an easy way to pickle a Python function (or otherwise serialize its code)?

You could serialise the function's bytecode and then reconstruct it on the receiving side. The marshal module can be used to serialise code objects, which can then be reassembled into a function:

import marshal
def foo(x): return x*x
code_string = marshal.dumps(foo.__code__)

Then in the remote process (after transferring code_string):

import marshal, types

code = marshal.loads(code_string)
func = types.FunctionType(code, globals(), "some_func_name")

func(10) # gives 100

A few caveats:

  • marshal's format (and Python bytecode in general) may not be compatible between major Python versions.

  • This will only work for the CPython implementation.

  • If the function references globals (including imported modules, other functions etc) that you need to pick up, you'll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process's global namespace.

  • You'll probably need to do a bit more to support more complex cases, like closures or generator functions.
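To make the globals caveat concrete: if the function body uses an imported module, the receiving side must supply it in the globals dict it builds the function with. A sketch (area is a made-up example):

```python
import marshal
import math
import types

def area(r):
    return math.pi * r * r

code_string = marshal.dumps(area.__code__)

# "Receiving" side: the bytecode references the name 'math', so we must
# provide it ourselves when rebuilding the function.
code = marshal.loads(code_string)
func = types.FunctionType(code, {'math': math}, 'area')
print(func(1.0))   # 3.141592653589793
```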


