Pickle or JSON

Pickle or JSON?

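If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle, which gives you really fast Python object serialization. (cPickle is the Python 2 name; in Python 3 the standard pickle module uses the C implementation automatically.)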

If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
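As a rough illustration of the two options, here is a minimal sketch that round-trips the same dictionary through both modules. The `data` dictionary and variable names are made up for the example.

```python
import json
import pickle

# Hypothetical sample data that fits both formats.
data = {"name": "example", "scores": [1, 2, 3], "nested": {"ok": True}}

# Binary, Python-only: pickle.dumps returns bytes.
pickled = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
restored_from_pickle = pickle.loads(pickled)

# Text, interoperable: json.dumps returns a str.
encoded = json.dumps(data)
restored_from_json = json.loads(encoded)
```

Both round trips give back an equal dictionary here; the difference is in what the intermediate bytes or text can be used for.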

Is there a faster way to store a big dictionary than pickle or a regular Python file?

Python File

Using a Python file means the dictionary is cached easily: if you "import" it multiple times, it only has to be parsed once. However, Python syntax is complicated, so the parser that loads the file may not be well optimized for the limited complexity of the data you're saving (unless you're including arbitrary Python objects and code). A Python file is easy to view, edit, and use, but it's not easy to transport.

EDIT: to clarify, raw Python files are easy for a human to modify, but very hard for a computer to edit. If your code edits the data and you ever want that to be reflected in the dictionary, you're pretty much up a creek: instead, use one of the methods below.

Pickle File

If you use a pickle file, you either re-load the file each time you use it or need some management code to cache it after reading it the first time. Like arbitrary Python code, pickle files can be quite complex, and the loader for them might not be optimized for your particular data types since, like raw Python files, they can store almost any Python object. They're also hard for a human to view and edit, and you might encounter portability issues if you move the data around. Pickle files are readable only by Python, and loading one can execute arbitrary code, so only unpickle files you trust.
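The "management code" for caching can be quite small. Here is a minimal sketch using `functools.lru_cache`; the function name `load_pickled` and the demo file are assumptions made for illustration, not part of any library.

```python
import functools
import os
import pickle
import tempfile


@functools.lru_cache(maxsize=None)
def load_pickled(path):
    """Read and unpickle `path` once; later calls return the cached object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demo: write a small (hypothetical) dictionary, then load it twice.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "big_dict.pkl")
with open(demo_path, "wb") as fh:
    pickle.dump({"a": 1, "b": 2}, fh)

first = load_pickled(demo_path)
second = load_pickled(demo_path)  # served from the cache, no file read
```

Note that every caller gets the same dictionary object, so a mutation made in one place is visible everywhere; copy the result if that matters.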

JSON File

If all you're storing is simple objects (dictionaries, lists, strings, booleans, numbers), consider using the JSON file format. Python has a built-in json module that's just as easy to use as pickle, so there's no added complexity. JSON files are easy to store, view, edit, and compress (if desired), and look almost exactly like a Python dictionary. The format is highly portable (most common languages support reading/writing JSON files these days), and if you need to improve file loading speed, the third-party ujson module is a faster, largely drop-in replacement for the standard json module. Since the JSON file format is fairly restricted, I'd expect its parsers and writers to be quite a bit faster than the regular Python or pickle parsers (especially using ujson), though it's worth benchmarking with your own data.
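A minimal sketch of the "drop-in" idea, assuming ujson may or may not be installed: try the faster package first and fall back to the standard library. The alias `json_impl` and the `settings` dictionary are made up for the example.

```python
try:
    import ujson as json_impl  # optional third-party package
except ImportError:
    import json as json_impl  # standard library fallback

# Hypothetical data made only of JSON-friendly types.
settings = {"retries": 3, "hosts": ["a.example", "b.example"], "debug": False}

text = json_impl.dumps(settings)
loaded = json_impl.loads(text)
```

Only `dumps` and `loads` are used here, which both modules provide; less common keyword arguments may differ between the two.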

Python: why pickle?

Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:

  • Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here; you just want to dump the program's state as-is and load it later.
  • Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.

In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.

You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: pickle can indeed be slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, although this depends on the Python version and pickle protocol, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. That code adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
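To give a feel for that custom code, here is a minimal sketch for one small class, using the `default` and `object_hook` parameters of the standard json module. The `Point` class, the `encode`/`decode` helpers, and the `"__type__"` tag are all hypothetical choices made for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Point:
    x: float
    y: float


def encode(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, Point):
        return {"__type__": "Point", **asdict(obj)}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode(d):
    """Called by json.loads for every decoded JSON object (dict)."""
    if d.get("__type__") == "Point":
        return Point(d["x"], d["y"])
    return d


shapes = {"origin": Point(0.0, 0.0), "tip": Point(1.5, -2.0)}
text = json.dumps(shapes, default=encode)
restored = json.loads(text, object_hook=decode)
```

Every additional class needs its own branch in both helpers, which is the engineering effort and run-time overhead referred to above.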

Advantages and disadvantages of the pickle, JSON and CSV methods for saving a dictionary

pickle:

On the plus side, it can handle arbitrary objects (with varying levels of work). On the minus side, the format is not human-readable, and it shouldn't be used with untrusted input. There are versioning issues, too: several different protocols are defined, and older Python versions cannot read the newer ones.
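The untrusted-input warning follows from how unpickling works: the byte stream can name a callable for the loader to invoke. Here is a deliberately harmless sketch, with a made-up `Payload` class, where loading calls the built-in `sorted` instead of rebuilding the object. A malicious file could name a far more dangerous function in the same way.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle tells the loader to call sorted()."""

    def __reduce__(self):
        # (callable, args): pickle.loads will call sorted([3, 1, 2]).
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the result of the call, not a Payload instance
```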

json:

It's easy to move back and forth between JSON and basic container (dict, list) and value (string, number) objects. It's also generally human-readable (subject to "pretty" formatting), widely used and well-supported by most (all?) languages. It can't handle arbitrary objects like pickling can, though.

csv:

Arguably the simplest format, but won't handle nesting well while remaining readable and easy to parse (it's probably best suited to persisting a simple table). There's generally more work to convert back and forth than JSON or pickle, too.
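For a flat dictionary, the conversion work looks roughly like the sketch below, which writes key/value rows and reads them back. The `prices` data and the column names are made up; note that CSV stores everything as text, so the values have to be converted back to `float` by hand.

```python
import csv
import io

# Hypothetical flat dictionary (no nesting).
prices = {"apple": 1.25, "banana": 0.5}

# Write one row per item; an in-memory buffer stands in for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["key", "value"])
for key, value in prices.items():
    writer.writerow([key, value])

# Read the rows back and restore the value type manually.
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)  # skip the header row
restored = {key: float(value) for key, value in reader}
```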

Multithreading environment and modules like pickle or json

Using pickle and json will work fine in a multi-threaded environment, but serializing is probably not thread-safe, so make sure the data you're pickling can't change while it is being serialized, for example by using a lock. The catch is that you will be restricted to the kinds of data you can actually save to disk.
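A minimal sketch of the lock idea, assuming every writer and the serializer agree to use the same lock. The names `shared_state`, `state_lock`, `increment`, and `snapshot` are invented for the example.

```python
import pickle
import threading

shared_state = {"count": 0}
state_lock = threading.Lock()


def increment():
    # Writers hold the lock while they modify the dictionary.
    with state_lock:
        shared_state["count"] += 1


def snapshot():
    """Serialize shared_state while no writer can modify it."""
    with state_lock:
        return pickle.dumps(shared_state)


threads = [threading.Thread(target=increment) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

restored = pickle.loads(snapshot())
```

The lock lives outside the dictionary, so it is never part of the data being pickled.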

Not all objects are serialisable, as you have found. The simplest approach is to make sure your dictionary only has values that are compatible with pickle or the json serialiser. For example, you seem to have stored a lock object in your dictionary that is making pickle fail. You might want to create a new dictionary with only the values that can be pickled, and then pickle that.
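One way to build that filtered dictionary is to try pickling each value and skip the ones that fail. This is only a sketch: the helper name `picklable_subset` and the sample `state` are assumptions, and testing every value this way costs an extra serialization pass.

```python
import pickle
import threading


def picklable_subset(mapping):
    """Return a new dict containing only the values pickle can serialize."""
    result = {}
    for key, value in mapping.items():
        try:
            pickle.dumps(value)
        except (pickle.PicklingError, TypeError, AttributeError):
            continue  # skip values such as locks that cannot be pickled
        result[key] = value
    return result


# Hypothetical state that accidentally contains a lock.
state = {"name": "worker-1", "jobs": [1, 2, 3], "lock": threading.Lock()}
clean = picklable_subset(state)
blob = pickle.dumps(clean)
```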

Alternatively, if you want to create a custom object to store your data, you can tell pickle exactly how to pickle it. This is more advanced and probably unnecessary in your case, but you can find more documentation here: https://docs.python.org/3.4/library/pickle.html#pickling-class-instances
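As a minimal sketch of that more advanced route, a class can define `__getstate__` and `__setstate__` to leave the lock out of the pickle and create a fresh one on load. The `Counter` class here is hypothetical and only illustrates the hooks described in the linked documentation.

```python
import pickle
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a new lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


counter = Counter()
counter.increment()
clone = pickle.loads(pickle.dumps(counter))
clone.increment()  # works because the clone got its own lock
```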


