Representing Dataset-JSON as NDJSON

Introduction

The purpose of the NDJSON (newline-delimited JSON) representation of Dataset-JSON is to simplify streaming large datasets. With NDJSON, a dataset can be read or written one row at a time without loading the entire dataset into memory. Most programming languages have libraries that can read a large JSON dataset as a stream, but where such a library is unavailable or performs poorly, the NDJSON format lets a program read and write one row at a time. The NDJSON format is an alternative to the JSON format, and both are part of the Dataset-JSON standard.

NDJSON and JSON datasets have the same content. The only difference is that NDJSON content is written or read one line of valid JSON at a time.

In a data-exchange scenario, the sender and receiver agree on whether to use the JSON or NDJSON representation of Dataset-JSON. Given the relative simplicity of the Dataset-JSON specification, converting between the two formats is straightforward. The NDJSON example datasets were generated by converting existing JSON Dataset-JSON example datasets.
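As a rough illustration of such a conversion, the Python sketch below round-trips a dataset between the two representations. It rests on assumptions not stated in this section: that the JSON representation is a single object whose data rows sit under a top-level "rows" key, and that the NDJSON representation carries everything else on its first line, followed by one row per line. The function names are hypothetical.

```python
import json


def json_to_ndjson_lines(dataset):
    """Yield NDJSON lines for a Dataset-JSON object (hypothetical helper).

    ASSUMPTION: rows live under a top-level "rows" key, and every other
    attribute is written as one metadata object on the first line.
    """
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    # json.dumps without indent never emits a raw newline, so each
    # value stays on a single line.
    yield json.dumps(metadata, ensure_ascii=False)
    for row in dataset.get("rows", []):
        yield json.dumps(row, ensure_ascii=False)


def ndjson_lines_to_json(lines):
    """Rebuild the single JSON object from NDJSON lines (same assumptions)."""
    non_empty = (line for line in lines if line.strip())
    dataset = json.loads(next(non_empty))  # first line: metadata object
    dataset["rows"] = [json.loads(line) for line in non_empty]
    return dataset
```

Writing the yielded lines to a file with "\n" after each one would produce the NDJSON file. Reading that file back line by line and passing the lines to ndjson_lines_to_json would recover the original object under the assumptions above.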

NDJSON Dataset-JSON datasets have a file extension of .ndjson (e.g., ae.ndjson). The JSON format uses the .json extension. This is one way to determine the format used for a dataset file.
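A minimal Python sketch of this extension check follows. The function name and the case-insensitive comparison are assumptions made for illustration; the extension is only a hint and does not validate the file content.

```python
from pathlib import Path


def dataset_json_format(path):
    """Guess the Dataset-JSON representation from the file extension.

    Returns "ndjson", "json", or None when the extension is neither.
    ASSUMPTION: the comparison is case-insensitive.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".ndjson":
        return "ndjson"
    if suffix == ".json":
        return "json"
    return None  # unknown extension; the caller decides how to proceed
```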

The Dataset-JSON NDJSON format

NDJSON is a standard for delimiting JSON in stream protocols. In NDJSON, each line is valid JSON. The JSON is delimited by the newline character (\n or 0x0A), which may be preceded by a carriage return character (\r or 0x0D). UTF-8 encoding is expected.
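The delimiting rules above can be sketched in Python as follows. This is an illustrative helper and not part of the standard. The function name is hypothetical, and skipping empty lines (such as the one left by a trailing newline) is an assumption.

```python
import json


def parse_ndjson_bytes(data):
    """Parse UTF-8 encoded NDJSON bytes into a list of JSON values."""
    values = []
    # Splitting on 0x0A is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence, and JSON strings must escape it.
    for raw_line in data.split(b"\n"):
        # Drop an optional carriage return (0x0D) preceding the newline.
        if raw_line.endswith(b"\r"):
            raw_line = raw_line[:-1]
        if not raw_line:
            continue  # ASSUMPTION: empty lines are ignored
        values.append(json.loads(raw_line.decode("utf-8")))
    return values
```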

The Dataset-JSON NDJSON format is created from the Dataset-JSON standard by:

NDJSON

Each row can be parsed and processed as standalone JSON text.
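Because each line stands alone, a reader can process a dataset as a stream. The Python generator below is a minimal sketch of that idea. It accepts any iterable of text lines, such as an open file handle, and holds only one line in memory at a time. The function name is hypothetical.

```python
import json


def iter_json_lines(lines):
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        text = line.strip()  # removes the trailing "\n" and any "\r"
        if text:
            yield json.loads(text)


# Usage sketch (ae.ndjson is the example file name used in this document):
#
# with open("ae.ndjson", encoding="utf-8") as handle:
#     for value in iter_json_lines(handle):
#         print(value)  # one standalone JSON value per line
```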

Examples

The NDJSON example datasets have been converted from the JSON versions, so they contain the same content. The examples are available in the examples folder of the DataExchange-DatasetJson repository and use the .ndjson extension.