A short introduction to Avro binary encoding
When developing applications that process Avro data, a basic understanding of Avro schemas and of the Avro binary encoding is helpful. I discussed a small topic on Avro schema here. The focus of this post is the Avro binary encoding.
Avro file structure
The structure of a binary Avro file can be described with the following informal production rules:
Avro file := header + block 1 [+ ... + block n]
header := 'Obj' + 0x01 (4 B) + metadata + sync marker (16 B)
block := object count (long) + objects byte size after compression (long) + serialized objects + sync marker (16 B)
object := field 1 [+ ... + field n]
field := (length or structural info if needed) + binary encoded field value
The above rules, translated into plain English, are:
- An Avro file consists of a header and n blocks.
- The header consists of four magic bytes (the ASCII characters Obj followed by the byte 0x01), metadata, and a sync marker. The metadata are persisted as key-value pairs. The most important ones among them are the schema and the compression codec (with the keys avro.schema and avro.codec).
- A block starts with the number of objects it contains, followed by the total size of those objects in bytes after compression, then by the serialized objects, and finally ends with the sync marker. The sync marker has two major purposes:
  - It enables detection of corrupt blocks and helps to ensure data integrity.
  - It permits efficient splitting of files for MapReduce processing.
- An object is serialized as the sequence of its fields in binary form.
- A field's binary form consists of:
  - the field length or structural information, which is optional and present only if it cannot be derived from the schema or from the encoding itself (e.g., a string is prefixed with its length, whereas an int or long field needs no prefix because its variable-length encoding marks its own end), and
  - the binary encoded field value.
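The layout described by the rules above can be walked with a few lines of Python. The following is a minimal sketch, not the reference implementation: the function names read_long, read_bytes, and parse_container are made up for illustration, block payloads are returned as raw bytes (still compressed if a codec other than null is in use), and no schema-driven decoding of the objects is attempted.

```python
import io


def read_long(stream):
    """Decode one Avro int/long: a variable-length, zigzag-encoded integer."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data inside a varint")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo the zigzag mapping


def read_bytes(stream):
    """Decode a length-prefixed bytes/string value."""
    length = read_long(stream)
    return stream.read(length)


def parse_container(data):
    """Split an Avro container file into metadata, sync marker, and raw blocks."""
    stream = io.BytesIO(data)
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro container file")

    # The metadata is an Avro map: groups of key-value pairs, ended by a 0 count.
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:  # negative count: a byte size follows, skipped here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)

    sync = stream.read(16)

    blocks = []
    while stream.tell() < len(data):
        object_count = read_long(stream)
        byte_size = read_long(stream)
        payload = stream.read(byte_size)
        if stream.read(16) != sync:
            raise ValueError("sync marker mismatch, block is corrupt")
        blocks.append((object_count, payload))
    return metadata, sync, blocks
```

Comparing the 16 bytes after each block with the marker from the header is what makes the corruption check in the list above possible. A reader that starts in the middle of a file can use the same marker to find the next block boundary.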
Example
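Let's say we have the following simple Avro schema stored in a file Person.avsc:
{"type":"record","name":"Person","fields":[{"name":"name","type":"string"}]}
And some data conforming to this schema, in JSON format, stored in a file person.json:
{"name":"John"}
{"name":"Alice"}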
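The data can be converted from JSON to an Avro binary file with the fromjson command of avro-tools, for example like this (the jar file name depends on the installed version):
$ java -jar avro-tools-<version>.jar fromjson --schema-file Person.avsc person.json > person.avro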
On Linux, the Avro binary file can be viewed with xxd:
$ xxd person.avro
0000000: 4f62 6a01 0416 6176 726f 2e73 6368 656d Obj...avro.schem
0000010: 6198 017b 2274 7970 6522 3a22 7265 636f a..{"type":"reco
0000020: 7264 222c 226e 616d 6522 3a22 5065 7273 rd","name":"Pers
0000030: 6f6e 222c 2266 6965 6c64 7322 3a5b 7b22 on","fields":[{"
0000040: 6e61 6d65 223a 226e 616d 6522 2c22 7479 name":"name","ty
0000050: 7065 223a 2273 7472 696e 6722 7d5d 7d14 pe":"string"}]}.
0000060: 6176 726f 2e63 6f64 6563 086e 756c 6c00 avro.codec.null.
0000070: fa4b c7d2 52a1 aa57 92cb cdfd 20d8 c341 .K..R..W.... ..A
0000080: 0416 084a 6f68 6e0a 416c 6963 65fa 4bc7 ...John.Alice.K.
0000090: d252 a1aa 5792 cbcd fd20 d8c3 41 .R..W.... ..A
The string area on the right-hand side reveals two key-value pairs with the keys avro.schema and avro.codec. The sync marker is fa4b c7d2 52a1 aa57 92cb cdfd 20d8 c341, which starts at the byte addresses 0x0000070 and 0x000008D. The bytes 0x08 at address 0x0000082 and 0x0A at address 0x0000087 encode the lengths of the strings following them (John and Alice). Avro stores such lengths as zigzag-encoded variable-length integers, so 0x08 stands for a length of 4 and 0x0A for a length of 5.
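The length bytes in the dump above can be checked by hand. Avro writes int and long values, including string lengths, as zigzag-encoded variable-length integers, which is why a 4-character string is prefixed with 0x08 rather than 0x04. The sketch below decodes the block bytes copied from addresses 0x80 to 0x8C of the dump. The helper names decode_long and decode_block are illustrative only, and decode_block assumes the one-string-field Person record and the null codec of this example.

```python
# Block bytes copied from the xxd dump (addresses 0x80..0x8c, without the sync marker).
BLOCK = bytes.fromhex("0416084a6f686e0a416c696365")


def decode_long(buf, pos):
    """Decode one zigzag varint from buf at pos; return (value, next_pos)."""
    raw = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def decode_block(buf):
    """Decode a block of Person records (one string field each), null codec."""
    count, pos = decode_long(buf, 0)   # number of objects in the block
    size, pos = decode_long(buf, pos)  # byte size of the serialized objects
    names = []
    for _ in range(count):
        length, pos = decode_long(buf, pos)
        names.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return count, size, names


print(decode_long(b"\x08", 0)[0])      # 4, the length prefix of "John"
print(decode_long(b"\x0a", 0)[0])      # 5, the length prefix of "Alice"
print(decode_long(b"\x98\x01", 0)[0])  # 76, the two-byte length of the schema JSON
print(decode_block(BLOCK))             # (2, 11, ['John', 'Alice'])
```

The same decoding explains the first two bytes of the block: 0x04 is the object count 2, and 0x16 is the payload size 11, that is, one length byte plus four characters for John and one length byte plus five characters for Alice.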