Avro schema field name and type name differences

An Avro schema supports nested record definitions, i.e., a record whose fields are themselves records. One question I had while working on an Avro schema was: why do I have to specify a name twice in a subrecord definition, once for the field and once for the nested record?


Let’s take a look at the “address” part of this sample schema:

{
    "namespace": "io.github.ouyi.avro",
    "fields": [
        {
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "fields": [
                    {
                        "name": "street",
                        "type": "string"
                    },
                    {
                        "name": "city",
                        "type": "string"
                    }
                ],
                "name": "Address",
                "type": "record"
            }
        }
    ],
    "name": "Company",
    "type": "record"
}

It turns out that the first one, “address”, is the field name, and the second one, “Address” (note the uppercase “A”), is the type name of the nested record.

The difference between a field name and a type name becomes clear when the Avro schema is compiled into Java classes. In our example, “address” is used as the name of a member field in the Company class, and “Address” is used as the class name for the generated Address class.

Our sample schema (Company.avsc) can be compiled to Java classes as follows:

$ java -jar "$AVRO_TOOLS_JAR" compile schema Company.avsc output

The generated classes are:

$ tree output/
output/
`-- io
    `-- github
        `-- ouyi
            `-- avro
                |-- Address.java
                `-- Company.java

The Company class has a member field “address” of the type “Address”:

private io.github.ouyi.avro.Address address;

Given the Java naming conventions, it makes sense for the field name to start with a lowercase letter (“address”) and for the type name to start with an uppercase letter (“Address”).
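To make the mapping concrete, here is a hand-written, simplified sketch of the two classes. It is not the code that avro-tools generates: the real classes extend Avro's SpecificRecordBase, come with builders and an embedded schema, and by default use CharSequence rather than String for string fields. The wrapper class FieldVsTypeName and the constructors are assumptions made only to keep the example self-contained.

```java
import java.lang.reflect.Field;

// Hypothetical wrapper class so the sketch compiles as a single file.
public class FieldVsTypeName {

    // The schema's type name "Address" becomes the class name.
    public static class Address {
        private final String street;
        private final String city;

        public Address(String street, String city) {
            this.street = street;
            this.city = city;
        }

        public String getStreet() { return street; }
        public String getCity() { return city; }
    }

    // The schema's type name "Company" becomes the class name of the outer record.
    public static class Company {
        private final int id;
        private final String name;
        // The schema's field name "address" becomes the member name;
        // the member's type is the class generated from the type name "Address".
        private final Address address;

        public Company(int id, String name, Address address) {
            this.id = id;
            this.name = name;
            this.address = address;
        }

        public int getId() { return id; }
        public String getName() { return name; }
        public Address getAddress() { return address; }
    }

    public static void main(String[] args) throws Exception {
        Company company = new Company(1, "msft",
                new Address("MS Redmond Campus", "Redmond"));

        // Look up the member by its field name and print its type name.
        Field field = Company.class.getDeclaredField("address");
        System.out.println(field.getName() + " : " + field.getType().getSimpleName());
        // prints "address : Address"

        System.out.println(company.getAddress().getCity());
        // prints "Redmond"
    }
}
```

The reflection lookup in main shows both names side by side: the member is found under the field name, while its declared type carries the type name.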

For completeness, the following are a few JSON records based on our sample schema:

{"id":1,"name":"msft","address": {"street": "MS Redmond Campus", "city": "Redmond"}}
{"id":2,"name":"aol","address": {"street": "770 Broadway", "city": "NY City"}}
{"id":3,"name":"google","address": {"street": "Googleplex", "city": "Mountain View"}}

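Note that in contrast to an Avro data file (object container file), which stores the schema in its header, these JSON records do not carry the schema (type information). Therefore, we only see the lowercase field names here, and the type name “Address” does not appear at all.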