July 7, 2007

Serializing arbitrary Python objects to JSON using __dict__

Filed under: Software Blog — marcstober @ 9:58 am

Python is my favorite programming language and although I don’t (officially) use it at work, I keep it around for writing quick utilities for my own use. Lately I’ve been using it for some simple code generation.

Anyhow, as my little code generation utility got fancier, I needed to serialize some objects into a text file, edit the text file, and deserialize it back into objects. Basically a poor man’s configuration editor.

I’ve done this before with the Microsoft .NET Framework by using the System.Xml.Serialization.XmlSerializer class. There are a lot of quirks and subtleties to XML Serialization, but for this sort of simple task, it works well.

I don’t mind XML, but I thought I’d try serialization using JSON. In part just to try out the latest fad, but even more so because I’ve always wanted to try building something with YAML, and now that JSON is YAML, that seemed like the way to go. I was initially turned on to YAML because of its bulleted lists, which looked a lot like a to-do list I’d type out for myself, and it seemed like magic that this same format could be machine-readable data without any further translation (the same feeling you get from Python in general). Maybe it was magic, because I’d never really gotten YAML figured out (not that I spent so much time on it), and while JSON doesn’t have the bulleted lists, it’s easier to understand overall. Essentially, JSON reduces all structured data to ordered lists or key-value pairs; it’s one of those things that seems so simple but that zillions of other data formats tried and failed to do.

The next question is which JSON parser to use. You can do just about anything with Python but there’s a bit of a dichotomy between the stuff in the standard library that follows the one-right-way principle, and newer things where there are still a few ways competing and you have to get a feel for what’s been accepted by the community. I first tried json-py, then after reading this, simplejson which promised to be more “extensible” (though to be fair, json-py is at least worth more than you pay for it).

It turned out I couldn’t do what I wanted to do in either (or any) JSON implementation: I got errors that my object is “not JSON serializable.” It seemed that it wasn’t going to be quite so easy to serialize my arbitrary object to JSON. This led to a long bout of searching the web and comp.lang.python for an answer.

(Fortunately by this point it was 5:00 on Friday. I couldn’t really justify figuring out a new object-serialization scheme as part of my day’s work, so it became a weekend-morning programming project of my own, which also means I feel more free to spend time blogging about it.)

I found a couple of interesting things along the way. First, what I am really trying to do is pickle, the standard Python way of “preserving” objects (i.e., in vinegar?), but it doesn’t use a human-readable format (or just barely; the pickles are strings). Second, there is an XML pickle, although I didn’t try it.
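To see what "just barely" readable means, here is a small sketch comparing the two formats on the same data. It assumes a modern Python 3 and the standard library's json module (a descendant of simplejson) rather than the setup used in this post, and it asks for pickle protocol 0, the text-based one, explicitly. The sample dictionary is only for illustration.

```python
import json
import pickle

data = {'name': 'Marc'}

# Protocol 0 is pickle's original text-based format; newer protocols are binary.
as_pickle = pickle.dumps(data, protocol=0)
as_json = json.dumps(data)

print(as_pickle)  # ASCII, but full of pickle opcodes rather than plain data
print(as_json)    # the kind of text you could comfortably edit by hand
```

Both round-trip the dictionary, but only the JSON version is something you would want to open in a text editor.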

I was thinking I would have to write some subclass of simplejson’s objects to do what I wanted, which really wasn’t what I wanted to do; the whole point was *not* to write my own serialization/deserialization logic. Then I realized that the reason it seemed so easy with .NET is that XmlSerializer didn’t just take an arbitrary object; it also needed to be explicitly told that object’s type. The type information wasn’t contained in or implied by the XML file; it was specified by the code calling the serialization and deserialization. In fact, when I’d first encountered this, I had thought that the need to specify a type made XmlSerializer seem a little less “magic” than the general idea of dehydrating and rehydrating objects to/from text. Anyhow, once I realized that my objects needed to be translated to and from built-in data types in my own code, and I was okay with that, I found a really Pythonic way to do it, using __dict__:

import simplejson

class Person:
    def __init__(self, name=None):
        if name:
            self.name = name

people = [ Person('Marc'), Person('Rachel') ]

# Fails with an error that the Person object "is not JSON serializable"
#s = simplejson.dumps(people)

# This is what we want: serialize each object's attribute dictionary.
s = simplejson.dumps([p.__dict__ for p in people])
print(s)

# Deserialize (this gives back plain dictionaries, not Person objects)
clones = simplejson.loads(s)
print(clones)

# Now give our clones some life
for clone in clones:
    p = Person()
    p.__dict__ = clone
    print(p)
    print(p.name)

Of course, this doesn’t work for arbitrary object graphs, but it satisfies the “80/20 rule” of what I need most of the time.
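For objects that hold other objects, one possible extension is sketched below. It is not part of the original utility: it assumes Python 3 and the standard library's json module (simplejson accepts the same default argument), and the Address class, the to_plain and person_from_plain helpers, and the sample city are all made up for illustration.

```python
import json

class Address:
    def __init__(self, city=None):
        self.city = city

class Person:
    def __init__(self, name=None, address=None):
        self.name = name
        self.address = address

def to_plain(obj):
    # json.dumps calls this for any object it cannot serialize on its own,
    # so nested objects are converted one level at a time.
    return obj.__dict__

def person_from_plain(d):
    # The caller has to know which class each dictionary stands for.
    address = Address(**d['address']) if d.get('address') else None
    return Person(d.get('name'), address)

people = [Person('Marc', Address('Springfield'))]
s = json.dumps(people, default=to_plain, sort_keys=True)
print(s)

restored = [person_from_plain(d) for d in json.loads(s)]
print(restored[0].address.city)
```

Serializing gets easier this way, but deserializing still needs code that knows the types, which is the same trade-off as with XmlSerializer. Objects that refer back to each other in a cycle are still out of reach.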

Note that there are probably some security risks (at least) in rehydrating objects using __dict__ like this, so make sure you only use this technique with trusted data, or come up with some other defensive mechanism.
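One simple defensive mechanism might be to copy over only the attributes you expect instead of assigning __dict__ wholesale. The sketch below assumes Python 3 and the standard library's json module; ALLOWED_FIELDS and person_from_dict are hypothetical names, not anything provided by simplejson.

```python
import json

class Person:
    def __init__(self, name=None):
        if name:
            self.name = name

# Hypothetical whitelist of attributes a Person may receive from a file.
ALLOWED_FIELDS = {'name'}

def person_from_dict(data):
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object')
    unexpected = set(data) - ALLOWED_FIELDS
    if unexpected:
        raise ValueError('unexpected fields: %s' % sorted(unexpected))
    p = Person()
    # Set known attributes one by one rather than replacing __dict__.
    for key in ALLOWED_FIELDS & set(data):
        setattr(p, key, data[key])
    return p

good = person_from_dict(json.loads('{"name": "Marc"}'))
print(good.name)

try:
    person_from_dict(json.loads('{"name": "Eve", "__class__": "oops"}'))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

This only guards against unexpected attribute names. It does not check the values themselves, so it is a starting point rather than a complete answer.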

Comments welcome.