gobエンコーディング頑張って調べる回 - ちりもつもればミルキーウェイ

はじめに

以前gobについてどういうところが嬉しいんだっけ？みたいなのをまとめました https://convto.hatenablog.com/entry/2022/05/22/104428

メリットもわかったところであとはバイナリとダンスするだけなので、いくつかに分けてgobと戯れてそのうちしっかり理解したいぞみたいな趣旨です

適当にドキュメント読んで、小さい構造でgobバイナリ生成して様子をみてみる！
今回は小さいstructでどうencodeされてるか理解することをスコープにしてやっていく

だいたいパッケージドキュメントにあるのでそのへんみる https://pkg.go.dev/encoding/gob

バイナリを眺める

とりあえずなにはともあれ小さいコードを書いてgobバイナリをみてみよう

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"

    "github.com/convto/bit"
)

func main() {
    type item struct {
        Name  string
        Price int
    }
    banana := item{Name: "banana", Price: 100}

    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    enc.Encode(banana)

    fmt.Println(bit.Dump(buf.Bytes()))

    // Output:
    // 00000000: 00100101 11111111 10000001 00000011 00000001 00000001  %.....
    // 00000006: 00000100 01101001 01110100 01100101 01101101 00000001  .item.
    // 0000000c: 11111111 10000010 00000000 00000001 00000010 00000001  ......
    // 00000012: 00000100 01001110 01100001 01101101 01100101 00000001  .Name.
    // 00000018: 00001100 00000000 00000001 00000101 01010000 01110010  ....Pr
    // 0000001e: 01101001 01100011 01100101 00000001 00000100 00000000  ice...
    // 00000024: 00000000 00000000 00001110 11111111 10000010 00000001  ......
    // 0000002a: 00000110 01100010 01100001 01101110 01100001 01101110  .banan
    // 00000030: 01100001 00000001 11111111 11001000 00000000           a....
}

https://go.dev/play/p/fUnb8TgydCc

実行すると

00000000: 00100101 11111111 10000001 00000011 00000001 00000001  %.....
00000006: 00000100 01101001 01110100 01100101 01101101 00000001  .item.
0000000c: 11111111 10000010 00000000 00000001 00000010 00000001  ......
00000012: 00000100 01001110 01100001 01101101 01100101 00000001  .Name.
00000018: 00001100 00000000 00000001 00000101 01010000 01110010  ....Pr
0000001e: 01101001 01100011 01100101 00000001 00000100 00000000  ice...
00000024: 00000000 00000000 00001110 11111111 10000010 00000001  ......
0000002a: 00000110 01100010 01100001 01101110 01100001 01101110  .banan
00000030: 01100001 00000001 11111111 11001000 00000000           a....

が得られます。
これだけでもエンコーディングについていくつかの情報が得られます

型とフィールドの名前がバイナリに埋め込まれている
stringはそのまま見えてる(ので多分big endian)
型情報を先に書いてて、値はその後ろから送られてきそう
(self-describingだからそれはそうとはいえ)メタ情報が多くてバイト効率あんましよくなさそう

このくらい目星たててから仕様をみるといい感じに読める

gobのエンコーディング仕様ながめる

エンコーディング仕様をざっとながめる。
https://pkg.go.dev/encoding/gob#hdr-Encoding_Details あたりに詳細が記載されてる。(ど頭に大抵の人には役にたたんみたいなこと書いてあるが)

今回はさっきの例で書いた

type item struct {
    name  string
    price int
}

の小さい構造をパースしたいので

数値
文字列
struct

の仕様に目を通すぞ！

以降それぞれの仕様をみていく過程でいくつか引用がありますが、特別な言及が無い限りすべて https://pkg.go.dev/encoding/gob#hdr-Encoding_Details からの引用です

struct

まずはstructの仕様みていくぞ！

なんでいきなりcomposite型の仕様からみるんやもっと簡単なやつからやってくれないと脳が追いつかへんでと思ってるそこのあなた！

その疑問はもっともなのでまず取り急ぎ説明すると、なんとgobは処理の簡単のために渡された値がstructでないときは単一フィールドのstructとみなす仕様があるので、structのエンコード仕様がわからんとなにも始まらんのです

If a value is passed to Encode and the type is not a struct (or pointer to struct, etc.), for simplicity of processing it is represented as a struct of one field. The only visible effect of this is to encode a zero byte after the value, just as after the last field of an encoded struct, so that the decode algorithm knows when the top-level value is complete.

さっき生バイト列みたときに型情報がさきで、valueが後！みたいな形式であろう目星はついてるので、そのへんを意識しつつ仕様をみるぞ
基本的には複雑な構造向けの説明は端折って、今回解き明かしたいシンプルな構造をパースするための知識をかいつまんで紹介する

Structs are sent as a sequence of (field number, field value) pairs.

field numberと値のペアの形でおくられるっぽい

The field number is defined by the type of the encoded struct: the first field of the encoded type is field 0, the second is field 1, etc. When encoding a value, the field numbers are delta encoded for efficiency and the fields are always sent in order of increasing field number; the deltas are therefore unsigned.

field numberはstructの型から作られる。最初のフィールドは0でつぎは1。iotaと雰囲気が似てるな(constDeclのなかのconstSpecのindex番号なので)

ほんで効率のためにdelta encodingでおくられるよう。

Finally, after all the fields have been sent a terminating mark denotes the end of the struct. That mark is a delta=0 value, which has representation (00).

全フィールドおわったら終端として 00 がくるとのこと。

エンコード時は (typeid, encoded-v) のペアで送るっぽい。
で、typeidとか管理するうえでtypeをどう取り扱うかを考えねばならなくて、それがこの辺に記載されてる

To define a type, the encoder chooses an unused, positive type id and sends the pair (-type id, encoded-type) where encoded-type is the gob encoding of a wireType description

未使用のtypeidを適当にかっぱらってきて型構造とくっつけて管理してるっぽいな！

ドキュメントには型定義がズラーとでててて、今回の例に関係ありそうなのは以下

type CommonType struct {
    Name string // the name of the struct type
    Id  int    // the id of the type, repeated so it's inside the type
}

type structType struct {
    CommonType
    Field []*fieldType // the fields of the struct.
}

type fieldType struct {
    Name string // the name of the field.
    Id   int    // the type id of the field, which must be already defined
}

For simplicity in setup, the connection is defined to understand these types a priori, as well as the basic gob types int, uint, etc. Their ids are:

basicな型については事前にtypeid準備されてるぽい

bool        1
int         2
uint        3
float       4
[]byte      5
string      6
complex     7
interface   8
// gap for reserved ids.
WireType    16
ArrayType   17
CommonType  18
SliceType   19
StructType  20
FieldType   21
// 22 is slice of fieldType.
MapType     23

Finally, each message created by a call to Encode is preceded by an encoded unsigned integer count of the number of bytes remaining in the message.

ほんでメッセージごとに残りバイト数がご丁寧に書かれておるらしい。サンキュー

サマるとこれや！というのが記載されてた。 * は0回以上の繰り返し

(byteCount (-type id, encoding of a wireType)* (type id, encoding of a value))*

byte長が先頭で、そのあとに(-type id, encoding of a wireType) のペアが並んで、そこで定義されたtypeに紐づく値として (type id, encoding of a value) のペアがダラーと並ぶよう。
バイナリ眺めたときと同じで最初に型情報、そのあとにvalueみたいな雰囲気っぽいな！

なんか軽いサンプルとかあるともっと理解できるけど、だいたい把握した

数値

uint/intで仕様違うっぽいのでみていく。
ここを区別する理由って zigzag encoding とかそれに準ずるなにかをつかってるってことかな

uint

まずは符号なし整数から

An unsigned integer is sent one of two ways. If it is less than 128, it is sent as a byte with that value. Otherwise it is sent as a minimal-length big-endian (high byte first) byte stream holding the value, preceded by one byte holding the byte count, negated. Thus 0 is transmitted as (00), 7 is transmitted as (07) and 256 is transmitted as (FE 01 00).

符号なし整数は128以下だったら1byteでおくるといっている。で、そうじゃなければbig-endianのバイトストリームでくる。

このときにはじめのbyteはstreamのbyte数を返すっぽいけど、なにやらひっくり返すっぽい。これなんでかなーとおもうけど、多分先頭1bitが0なら(128以下で)1byteで完結！そうじゃなければbyte数をあらわす！みたいな立て付けにしてる都合で複数byteの値を送る時のbyte countを示すbyteは絶対先頭1bitが立ってなきゃアカンからだとおもう。先頭1bitがたってるとそこだけ無視して云々みたいなのも面倒だから2の補数表現で表した負数を反転させてbyte数として扱ってるんじゃないかな

挙げられてる例を整理すると

00000000
↓
0

00000111
↓
7

11111110 00000001 00000000
↓
1byte目は後続のバイト長。2の補数表現で評価すると-2で、反転して2なのでbyte streamの長さは2
で先の2byteを見ると1<<8なので256

これでかんがえると、例えば

512は 1<<9 なので 11111110 00000010 00000000
1<<25-1 とかなら25bitに1が立ってて 25/3 = 3 .. 1で追加で4byte必要なので 11111100 00000001 11111111 11111111 11111111

となるわけですね。uintは完全に理解した

int

つぎは符号付き整数。スーパーわかりやすく仕様がかいてある

var u uint
if i < 0 {
    u = (^uint(i) << 1) | 1 // complement i, bit 0 is 1
} else {
    u = (uint(i) << 1) // do not complement i, bit 0 is 0
}
encodeUnsigned(u)

つまり

負の値だったらbit反転して末尾bitに1つける
正の値だったら末尾bitに0つける

ということで、末尾bitが正負の判定になってる

実際の実装見てもそんなかんじ
https://cs.opensource.google/go/go/+/refs/tags/go1.18.2:src/encoding/gob/encode.go;l=123-131;drc=d4dd7acd2e9654e9769f4fa131540ad4e991adb5

たんなるzigzag encodingだとよく

uint64(n<<1) ^ uint64(n>>63)

みたいなことして終わりにしてるので、それにくらべると手間がかかってる。
なんだけど、やってることは同一じゃないかこれ

シュッとbit演算で値つくったほうが計算負荷低そうだけど、一応なんか考慮してるでみたいな記載がある。よくわからんのでこれは今後の宿題としよう

The low bit is therefore analogous to a sign bit, but making it the complement bit instead guarantees that the largest negative integer is not a special case. For example, -129=^128=(^256>>1) encodes as (FE 01 01).

まあだいたいzigzagということで符号付き整数も概ね理解した

最小で試して中身を詳しくみてみよう

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"

    "github.com/convto/bit"
)

func main() {
    test := 100
    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    enc.Encode(test)

    fmt.Println(bit.Dump(buf.Bytes()))
    // Output:
    // 00000000: 00000100 00000100 00000000 11111111 11001000           .....
}

https://go.dev/play/p/FXJ1z7gsWYS

数値のエンコーディング仕様をふくめて考えると、

先頭1byteはメッセージのbyte長。今回は4
つぎはtype id
- 符号付き整数で送られるっぽいが多分intのメッセージと同様のルール
- 末尾0だから正の値、のこりは0000010だから2！
struct以外の値連携なので単一フィールドのstructとして扱われる。00000000はstruct情報の終端表現
つぎはbyte streamの長さで-1だからひっくり返して1
- uintのときは128以下は1byteで表現だったけど、intのときは末尾1bitが正負判定にされるから、絶対値64以上で先頭が立ってbyte countと判別できなくなるっぽい。だから値自体は1byteで収まってもbyte countが送られるときあるぞと
で値は11001000で末尾0だから正の値、右に1bitシフトで1100100だから値は100

文字列

The low bit is therefore analogous to a sign bit, but making it the complement bit instead guarantees that the largest negative integer is not a special case. For example, -129=^128=(^256>>1) encodes as (FE 01 01).

とあるので、最初にバイト長きて、そのあとその数ぶんのバイト列が投げ込まれる模様。

最小で試して中身を詳しくみてみよう

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"

    "github.com/convto/bit"
)

func main() {
    test := "banana"
    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    enc.Encode(test)

    fmt.Println(bit.Dump(buf.Bytes()))
    // Output:
    // 00000000: 00001001 00001100 00000000 00000110 01100010 01100001  ....ba
    // 00000006: 01101110 01100001 01101110 01100001                    nana
}

https://go.dev/play/p/SC8bSjjP-aR

stringのエンコーディング仕様的に、今回の出力は

先頭1bitはメッセージのbyte count。今回は9
つぎはtype id
- さっきと同様にintのメッセージと同様のルールと想定
- 末尾0だから正の値、のこりは0000110だから6！
struct以外の値連携なので単一フィールドのstructとして扱われる。00000000はstruct情報の終端表現
4byteめはstringのbyte長。今回は6
5~10byteまではstringがそのまま渡ってきていて、dump出力からも確認できる

という感じだな！

余談だけどsliceとかmapも先頭に後続のbyte長あるのはおなじっぽい。

最初のバイナリを仕様にあてはめて整理してみる

最初のコードとバイナリをおさらいすると

type item struct {
    Name  string
    Price int
}
banana := item{Name: "banana", Price: 100}

00000000: 00100101 11111111 10000001 00000011 00000001 00000001  %.....
00000006: 00000100 01101001 01110100 01100101 01101101 00000001  .item.
0000000c: 11111111 10000010 00000000 00000001 00000010 00000001  ......
00000012: 00000100 01001110 01100001 01101101 01100101 00000001  .Name.
00000018: 00001100 00000000 00000001 00000101 01010000 01110010  ....Pr
0000001e: 01101001 01100011 01100101 00000001 00000100 00000000  ice...
00000024: 00000000 00000000 00001110 11111111 10000010 00000001  ......
0000002a: 00000110 01100010 01100001 01101110 01100001 01101110  .banan
00000030: 01100001 00000001 11111111 11001000 00000000           a....

いまの僕はこの構造に関わる主要な仕様をざっくり読んだので、なにをいってるのかわかるようになってるはず

まず1byte目にbyte countがあるはず！ 00100101 はとりあえず37byte読めということだな。

1byte目+37byteまでで区切ると

00000000: 00100101 11111111 10000001 00000011 00000001 00000001  %.....
00000006: 00000100 01101001 01110100 01100101 01101101 00000001  .item.
0000000c: 11111111 10000010 00000000 00000001 00000010 00000001  ......
00000012: 00000100 01001110 01100001 01101101 01100101 00000001  .Name.
00000018: 00001100 00000000 00000001 00000101 01010000 01110010  ....Pr
0000001e: 01101001 01100011 01100101 00000001 00000100 00000000  ice...
00000024: 00000000 00000000                                      ..

みたいなbyte列が得られた。さっきの流れで行くとここが型情報や！

さっきのstructのドキュメントみるにおそらくこういう構造が表現されているはず！(あとでもうちょい丁寧に整理するかも)

wireType {
    StructType {
        CommonType {
            Name: "item",
            Elem: typeID,
        },
        Field: []*fieldType{
            {Name: "Name", Id: 6},
            {Name: "Price", Id: 2}
        }
    }
}

で、値はこの部分だ

00000024:                   00001110 11111111 10000010 00000001  ....
0000002a: 00000110 01100010 01100001 01101110 01100001 01101110  .banan
00000030: 01100001 00000001 11111111 11001000 00000000           a....

この部分はさっきみたintとstringの合体で、さっきのを見返すとbyteよめますね！

まとめ

gobのencoding仕様をみながら簡単なメッセージのバイナリを理解するとこまでできたぞ！

そうじてバイト効率はそんなによくないが、self-describing + Go1での互換が担保されてるのが大きくて、元の構造を失ってもメッセージ本体から完全な内容を復元できるのはかなりユニークじゃないだろうか！
client同士で構造の定義を共有できないときにstreamでポコポコやりとりしたい用途とかにめっちゃ適切なきがするぞ！