A few days ago, Google open sourced one of its key data serialization formats, protocol buffers. There's already been some chat about how they're similar to or different from other wire formats, but I thought it would be worth posting some tips I've come across over the years about how to make them do useful things.
Don't expect any deep insights into computer science here, just a few notes about working with these libraries.
When to use protocol buffers
Basically, use them when you have data you may want to serialize. Don't use them for structs in inner loops, since there's a performance overhead of checking the has_* bits during each access, but if you aren't writing anything that needs to be that tightly optimized, you can just go ahead. PB's are excellent things to stick in tables, but you should also record somewhere which PB type is actually being stored, since the wire format doesn't carry that information. A lot of the time that's obvious from context, so you needn't worry.
If you have data that needs to be compressed, use common sense. A posting list from an inverted index should not be expressed as a repeated message inside a protocol buffer; it should be written with a good compression function so you can scan it quickly. But it does make sense to have a protocol buffer whose fields describe the compression type, and which has one string field that holds the compressed payload; that makes adding new compression types and various other list-processing tasks a hell of a lot easier.
Things to remember about the wire format
- The basic format of PB's is a sequence of entries, one for each tag that shows up. Each entry starts with a header, which combines the tag number being encoded with the wire type it's being written as, (tag number << 3) | wire type, written out as a varint. The payload follows immediately. So if you have a 'required int32 foo = 1', the header will be a combination of 1 (the tag number) and the wire type for varints (WIRETYPE_VARINT, value 0, in the open-source C++ library).
- As an interesting side effect, this means that for tag numbers 1 through 15, the header is always a single byte: the wire type takes the low three bits, which leaves four bits of a one-byte varint for the tag number. So if you care about space usage of your proto on the wire, and one field is going to be very common and/or repeated a lot, make sure to give it a low tag number. Give fields that are going to be used for debugging only really high tag numbers.
- If a PB includes another PB, the wire format for an included message is the same as that for a string -- it's just that the string in question is another PB, serialized. So you can nest things, pass serialized PB's around, and play various games in the obvious fashion.
- Another interesting thing is that the order in which PB tags appear on the wire isn't fixed, and parsers accept them in any order, so you can play various games by smashing two serialized PB's together. For example, if you need to generate N serialized PB's of the same type which are identical except for one field, set up the PB without that field and serialize it to a string. Then clear the PB and, N times over, fill in just that one field, serialize it, and append the result to a copy of the fixed string. This is much faster than doing a full serialization over and over.
- You can also use this to put together protocol buffers based on other kinds of user input, like CGI arguments; read in a bunch of key/value pairs from wherever, use the PB's introspection methods to get a tag number from the key and the protocol definition, then use the low-level serialization methods to write out (tag, wire format, value) to a string. Deserialize when you're done. It's a little ugly, but it's fairly efficient and typically only needs to be done once or twice. Of course, if you're accepting CGI arguments from the outside world, sanitizing them is your own lookout.
- One thing I've come to truly appreciate is how useful the ASCII serializations can be. Common tricks: Use the short ASCII format (one-line) as a way to interpret something users type. Command-line flags are one good application, as are config files.
Another neat trick applies when you're writing complicated types, like query trees, that require a special parser. (You can write tree structures entirely in PB's, but the resulting ASCII format is not very easy to read. Write your own. I'm personally fond of s-expressions.) You'll often want to be able to add elaborate annotations to each node of a tree; use a protocol buffer to represent those, and (de)serialize them with the short ASCII representation. It ends up looking something like (AND [onetag:3 anothertag:"foo"]wombat [onetag:4]soufflé). What's nice is that you can then add a new field to the protocol message, and it's instantly added to both the serialized and internal (tree) representations, without having to muck about with the s-expression parser.
Another version of this trick: At one point I had to write a rather complex data serializer and compressor. Its wire format was a protocol buffer, the first part of which was data about how the main field (a string) was compressed; the functions that generated it took this protocol buffer, minus the payload field, and then serialized data into it. So for unit tests, one could simply have a test function that took a string argument which was the ASCII representation of the non-payload part of the PB; the function then made up some data, serialized and deserialized it, etc. This meant that a lot of the body of the test looked like TestCompression("format:UNCOMPRESSED bytesperentry:8"); and other similarly legible stuff. Honestly, these minor legibility improvements make life a lot simpler than one would guess.
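To make the wire-format notes earlier in this section a bit more concrete, here is a minimal pure-Python sketch of the header encoding, the one-byte headers for tags 1 through 15, and the concatenation trick. It doesn't use the protobuf library at all; the helper names (encode_varint, encode_header and so on) are made up for this example, and it only handles non-negative varints and byte strings.

```python
# Hand-rolled sketch of the protocol buffer wire format. The helper names
# are hypothetical; a real program would use the generated message classes.

WIRETYPE_VARINT = 0            # payload is a varint
WIRETYPE_LENGTH_DELIMITED = 2  # payload is a length, then that many bytes


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_header(tag: int, wire_type: int) -> bytes:
    """The header is (tag number << 3) | wire type, written as a varint."""
    return encode_varint((tag << 3) | wire_type)


def encode_varint_field(tag: int, value: int) -> bytes:
    return encode_header(tag, WIRETYPE_VARINT) + encode_varint(value)


def encode_bytes_field(tag: int, payload: bytes) -> bytes:
    # Strings and nested serialized messages share this encoding.
    return (encode_header(tag, WIRETYPE_LENGTH_DELIMITED)
            + encode_varint(len(payload)) + payload)


# 'required int32 foo = 1' with foo set to 150: header byte, then the varint.
foo_entry = encode_varint_field(1, 150)

# Header sizes for a few tag numbers: one byte up to 15, two bytes from 16.
header_sizes = {tag: len(encode_header(tag, WIRETYPE_VARINT))
                for tag in (1, 15, 16)}

# Concatenation trick: serialize the shared part once, then append only the
# field that varies. Tag 2 holds the shared string, tag 1 the varying number.
fixed_part = encode_bytes_field(2, b"shared")
variants = [fixed_part + encode_varint_field(1, n) for n in range(3)]
```

Each element of variants is a complete serialized message, even though only the last few bytes were produced inside the loop.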
Important useful thing about protos which may or may not be initially obvious
If the protocol deserializer comes across a tag number which isn't in its copy of the protocol definition, it will just keep it as uninterpreted data and pass it along when it reserializes the proto. So if you have three servers, and A sends a message to B which processes it and then sends it on to C, and you want to add a new field which A uses to communicate something to C, you don't need to update the B server; it will just pass the updated protocol message along to C. I can't even begin to tell you how much more pleasant this can make your life.
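The pass-through behaviour can be imitated in a few lines of plain Python. This is a sketch of the idea rather than how the real library is implemented: server B here is a hypothetical relay that only knows tag 1 (a hop counter, invented for the example), bumps it, and copies every entry it doesn't recognize back out byte for byte. Only varint and length-delimited entries are handled.

```python
# Toy relay that preserves unknown fields. All names are hypothetical; the
# real library does this for you when it parses and reserializes a message.

def read_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def split_entries(buf: bytes):
    """Yield (tag, raw_entry) pairs; raw_entry includes the header bytes."""
    pos = 0
    while pos < len(buf):
        start = pos
        header, pos = read_varint(buf, pos)
        tag, wire_type = header >> 3, header & 7
        if wire_type == 0:            # varint payload
            _, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited payload
            length, pos = read_varint(buf, pos)
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        yield tag, buf[start:pos]


def relay_through_b(message: bytes) -> bytes:
    """Server B: increment the hop counter (tag 1), keep everything else."""
    hops = 0
    unknown = []
    for tag, raw in split_entries(message):
        if tag == 1:
            # The header for tag 1 is a single byte, so the value starts at 1.
            hops, _ = read_varint(raw, 1)
        else:
            unknown.append(raw)       # uninterpreted, passed along as-is
    return b"\x08" + write_varint(hops + 1) + b"".join(unknown)


# Server A sends hop count 1 plus a new string field (tag 3) that B has
# never heard of; it still arrives intact at C.
from_a = b"\x08\x01" + b"\x1a\x05hello"
to_c = relay_through_b(from_a)
```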
One more thing
The constructors and destructors of protocol messages are fairly expensive. Reuse message objects whenever possible (clearing them between uses), even at the expense of having a slightly broader variable scoping than you would like.