I'd be really interested in hearing why it wasn't faster! I expect the answer is along the lines of: "Well theoretically the zero-copy design should be faster, but in practice factors X and Y dominate performance and Protobuf wins on those." I'd love to know exactly what X and Y are...
(I'm the original author of proto2/c++, but I'm mostly interested for any lessons that might apply to my work on Cap'n Proto...)
The C++ proto implementation is just already tuned to an absurd degree and it is hard to beat. Any place where copying was an important problem has already been eradicated with aliasing (ctype declarations) so flatbuffers' supposed advantage isn't there to begin with. It's much more important to eliminate branches, stores, and other costs in generated code.
I'm guessing you were trying to use it with Stubby?
Admittedly the networked-RPC use case is not a particularly compelling one for zero-copy (the mmaped-file case is much more so, and maybe even shared-memory RPC).
Still, I'd expect that not having to parse varints nor allocate memory would count for something. Wish I could see the test setup.
(I'm the original author of proto2/c++, but I'm mostly interested for any lessons that might apply to my work on Cap'n Proto...)