Avro vs JSON: What are the differences?


JSON is a data format that has been around for what seems like ages, and it is widely used as an export format and as the common exchange format for web APIs everywhere. You might ask yourself: why would I bother using anything else when JSON is so common?

Well, the answer is that there are some cases where using JSON can be a bad idea.

To make our decision, we should first understand what Avro has that JSON doesn’t.

Avro is a data exchange format developed for use in Apache Hadoop. Although it was released around 9 years ago, it is only now starting to see wider usage in the data community.

People are starting to use it more and more for the following reasons:

Data integrity

Avro has support for schemas, which basically means that an Avro file also describes the shape of the data, and every time an Avro file is written or read, the data is checked against this shape.

This simple feature allows us to exchange data safely, without worrying that a field has gone missing or holds a value of the wrong type.

For example, if we had to store a list of people, we could specify a schema as a list of records that would look something like this:

  • Person (Record)
    • ID (integer)
    • FirstName (String)
    • LastName (String)
    • Birthdate (Date)

Now when we write to an Avro file with this schema, it won’t let us write anything other than an integer in the ID field, or anything other than a string in the FirstName field, and so on.
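
To make the idea concrete, here is a minimal sketch of the Person schema written in Avro's JSON schema syntax, together with a toy check of a record against it. The names PERSON_SCHEMA, python_type_for and check_record are made up for this illustration, and the checker is a simplified stand-in for what an Avro library does when writing, not a real Avro implementation.

```python
from datetime import date

# Avro schemas are themselves written in JSON. This one mirrors the Person
# record above; a date is stored as an int with the "date" logical type.
PERSON_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "ID", "type": "int"},
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"},
        {"name": "Birthdate", "type": {"type": "int", "logicalType": "date"}},
    ],
}


def python_type_for(avro_type):
    """Map the few Avro types used here to Python types (illustration only)."""
    if isinstance(avro_type, dict) and avro_type.get("logicalType") == "date":
        return date
    return {"int": int, "string": str}[avro_type]


def check_record(record, schema):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        expected = python_type_for(field["type"])
        if name not in record:
            problems.append(f"{name}: missing")
        elif type(record[name]) is not expected:
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


good = {"ID": 1, "FirstName": "Ada", "LastName": "Lovelace",
        "Birthdate": date(1990, 5, 17)}
bad = {"ID": "one", "FirstName": "Ada", "LastName": "Lovelace",
       "Birthdate": date(1990, 5, 17)}

print(check_record(good, PERSON_SCHEMA))
print(check_record(bad, PERSON_SCHEMA))
```

The first record passes, while the second is reported because its ID is a string. With JSON alone, both would be written without complaint.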

Then, when you send this Avro file to your colleague and they want to read it, all they have to do is check the schema to understand the shape of the data, and the Avro format will ensure that all the data is consistent.

JSON, on the other hand, allows you to encode whatever you want, which is great when you need flexibility, but it can cause all sorts of issues down the line.

Smaller size

Avro stores its data in a compact binary format, which means that data stored in Avro will usually be considerably smaller than the same data stored in JSON. Field names are kept once in the schema instead of being repeated in every record, and numbers are written as a few bytes rather than as text.

This is a big reason why Avro is gaining in popularity: people working with huge amounts of data can store more information using less storage with Avro, which can also save you money.

One downside of having your data in a binary format is that it is not human readable. You will always need an application to read your data, so that’s something to take into account.
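
To see where the size difference comes from, the sketch below hand-encodes one Person record following the rules in the Avro specification (integers as zig-zag variable-length values, strings as a length followed by UTF-8 bytes, dates as days since 1970-01-01) and compares it with the same record as JSON. The helper names encode_long, encode_string and encode_person are invented for this example. It covers only the body of a single record: a real Avro file also carries a header containing the schema, so the saving shows up when a file holds many records.

```python
import json
from datetime import date


def encode_long(n: int) -> bytes:
    """Zig-zag encode n, then write it as a base-128 variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag keeps small negative numbers short
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, high bit means "more follows"
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(p: dict) -> bytes:
    """Concatenate the fields in schema order; no field names are written."""
    days = (p["Birthdate"] - date(1970, 1, 1)).days
    return (
        encode_long(p["ID"])
        + encode_string(p["FirstName"])
        + encode_string(p["LastName"])
        + encode_long(days)
    )


person = {"ID": 1, "FirstName": "Ada", "LastName": "Lovelace",
          "Birthdate": date(1990, 5, 17)}

binary = encode_person(person)
as_json = json.dumps(
    {**person, "Birthdate": person["Birthdate"].isoformat()}
).encode("utf-8")

print(len(binary), "bytes in the binary encoding")
print(len(as_json), "bytes as JSON")
```

Most of the JSON bytes are spent on field names, quotes and punctuation that the binary encoding leaves to the schema.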

Richer data types

In JSON you can only store the following types of data: objects, arrays, strings, numbers, booleans and null.

With Avro, you get all of those types and more. For example, it supports dates and raw binary data, which can be very useful depending on what you’re storing.
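
The limitation on the JSON side is easy to reproduce with Python's standard json module. The sketch below uses a made-up record with a date and some binary data: encoding it directly fails, and the usual workaround is to turn both values into strings by a convention (ISO 8601 and Base64 here, chosen for this example) that the reader has to know about and undo.

```python
import base64
import json
from datetime import date

record = {"Birthdate": date(1990, 5, 17), "Photo": b"\x89PNG\r\n"}

# JSON has no date or binary type, so the standard encoder rejects the record.
try:
    json.dumps(record)
    json_error = None
except TypeError as exc:
    json_error = exc

# Workaround: agree on a string convention for each value.
encoded = json.dumps({
    "Birthdate": record["Birthdate"].isoformat(),
    "Photo": base64.b64encode(record["Photo"]).decode("ascii"),
})

# The reader gets plain strings back and must know how to convert them.
loaded = json.loads(encoded)
restored = {
    "Birthdate": date.fromisoformat(loaded["Birthdate"]),
    "Photo": base64.b64decode(loaded["Photo"]),
}

print("direct encoding failed:", json_error is not None)
print("round trip matches:", restored == record)
```

With a schema that declares a date and a bytes field, that convention lives in the file itself instead of in an agreement between writer and reader.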

When to use Avro?

The truth is that for smaller files, the savings in size don’t really matter that much. And losing human readability can be annoying, especially because these small files are often edited directly in a text editor.

We suggest that you use Avro when you’re storing a large amount of data (more than a few megabytes) and when your data follows a consistent shape.

Files of that size are too large to edit comfortably in a text editor anyway, so the human readability aspect is not important.