| Title | CDISC Compressed Dataset-JSON Specification |
|---|---|
| Version | 1.1 |
| Prepared by | CDISC Data Exchange Standards Team |
| Notes to Readers | This is the specification for Version 1.0 of CDISC Compressed Dataset-JSON. |
| Revision History | |

The DSJC (Dataset-JSON Compressed) format is a standardized method for compressing Dataset-JSON content. It specifically uses the Dataset-NDJSON format as the base structure and applies zLib compression to create a more compact representation of the data. This specification describes the DSJC format in a language-neutral manner to ensure interoperability across different platforms and programming environments.
DSJC is defined as a direct zLib compression stream of Dataset-NDJSON format content without additional headers, signatures, or metadata beyond what the zLib format itself provides.
The Dataset-NDJSON format serves as the base format for DSJC because it provides better streaming capabilities. It consists of a first line holding the dataset metadata as a single JSON object, followed by one line per record, each a JSON array of values.
Whitespace should be kept to the minimum required by the Dataset-NDJSON format. Avoid spaces between JSON array elements and within attribute definitions.
Example of Dataset-NDJSON before compression:
| {"datasetJSONCreationDateTime":"2023-01-01T12:00:00","datasetJSONVersion":"1.1","records":3,"name":"ADSL","label":"Subject Level Analysis Dataset","columns":[...]} |
| ["SUBJ001",45,"M"] |
| ["SUBJ002",52,"F"] |
| ["SUBJ003",38,"M"] |
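The minimal-whitespace guidance can be illustrated with a short Python sketch. The helper name `to_dataset_ndjson`, the abbreviated metadata object (it omits `columns` and other attributes), and the trailing newline after the last record are assumptions made for illustration and are not taken from this specification.

```python
import json


def to_dataset_ndjson(metadata: dict, rows: list) -> str:
    """Serialize a metadata object and data rows as newline-delimited JSON.

    Hypothetical helper: one JSON value per line, with no optional whitespace.
    """
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    compact = {"separators": (",", ":"), "ensure_ascii": False}
    lines = [json.dumps(metadata, **compact)]
    lines.extend(json.dumps(row, **compact) for row in rows)
    # Assumption: the last line is also terminated by a newline.
    return "\n".join(lines) + "\n"


# Abbreviated, illustrative metadata; a real file carries the full attribute set.
metadata = {"datasetJSONVersion": "1.1", "records": 2, "name": "ADSL"}
rows = [["SUBJ001", 45, "M"], ["SUBJ002", 52, "F"]]
ndjson_text = to_dataset_ndjson(metadata, rows)
```

Passing explicit separators matters here because the default `json.dumps` output places a space after every comma and colon, which adds bytes before compression without carrying any information.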
DSJC applies standard zLib compression to the entire Dataset-NDJSON content:
A DSJC file consists solely of the zLib-compressed byte stream without any additional headers, signatures, or metadata wrapping. The zLib format itself contains sufficient information for decompressors to reconstruct the original Dataset-NDJSON content.
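As a rough illustration of what "no additional wrapping" means in practice, the sketch below uses Python's standard `zlib` module and checks the header and trailer fields that the zLib format (RFC 1950) itself defines. The sample content is abbreviated and hypothetical, and the checks describe the zLib container only, not any DSJC-specific signature.

```python
import zlib

# Abbreviated, illustrative Dataset-NDJSON content encoded as UTF-8 bytes.
ndjson_bytes = b'{"name":"ADSL","records":1}\n["SUBJ001",45,"M"]\n'

# The whole file body is just the zlib stream produced here.
dsjc_bytes = zlib.compress(ndjson_bytes)

# RFC 1950 header: the low nibble of the first byte is 8 (deflate), and the
# first two bytes read as a big-endian integer are a multiple of 31.
is_deflate = dsjc_bytes[0] & 0x0F == 8
header_ok = (dsjc_bytes[0] * 256 + dsjc_bytes[1]) % 31 == 0

# RFC 1950 trailer: the last four bytes hold the Adler-32 checksum of the
# uncompressed data, which lets a decompressor detect corruption.
trailer_ok = int.from_bytes(dsjc_bytes[-4:], "big") == zlib.adler32(ndjson_bytes)

# Decompression needs nothing beyond the stream itself.
restored = zlib.decompress(dsjc_bytes)
```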
Files following this specification should use the .dsjc file extension. The recommended MIME type is:
| application/vnd.cdisc.dataset-json.compressed |
The process to create a DSJC file consists of:
The process to read a DSJC file consists of:
When processing large datasets, implementers should consider streaming approaches for both compression and decompression to avoid loading the entire dataset into memory.
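One possible streaming approach, sketched with Python's standard `zlib` module, is shown below. The function names, the 64 KiB chunk size, and the assumption that the content is UTF-8 encoded are illustrative choices rather than requirements of this specification.

```python
import io
import zlib

CHUNK_SIZE = 64 * 1024  # illustrative buffer size, not mandated anywhere


def compress_stream(src, dst, level=9):
    """Compress a binary stream chunk by chunk into a zlib stream."""
    compressor = zlib.compressobj(level)
    while chunk := src.read(CHUNK_SIZE):
        dst.write(compressor.compress(chunk))
    # flush() emits any buffered data plus the zlib trailer.
    dst.write(compressor.flush())


def iter_ndjson_lines(src):
    """Yield decoded NDJSON lines from a zlib stream without loading it whole."""
    decompressor = zlib.decompressobj()
    pending = b""
    while chunk := src.read(CHUNK_SIZE):
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8")
    pending += decompressor.flush()
    if pending:
        yield pending.decode("utf-8")


# Usage with in-memory streams standing in for files.
source = io.BytesIO(b'{"name":"ADSL"}\n["SUBJ001",45,"M"]\n')
packed = io.BytesIO()
compress_stream(source, packed)
packed.seek(0)
lines = list(iter_ndjson_lines(packed))
```

Splitting on the newline byte before decoding is safe for UTF-8 because that byte never occurs inside a multi-byte sequence, so each yielded line can be parsed as JSON independently.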
Implementers may choose different compression levels based on their specific use case. Because the main purpose of the Dataset-JSON format is data exchange, the maximum compression level (9) is recommended by default.
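The effect of the compression level can be explored with a small Python sketch. The sample data is synthetic and highly repetitive, so the sizes it produces say nothing about real datasets; the point is only that the level changes how the stream is produced, not what it decompresses to.

```python
import zlib

# Synthetic, illustrative content: one metadata line and 1000 identical rows.
row = b'["SUBJ001",45,"M"]\n'
ndjson_bytes = b'{"name":"ADSL"}\n' + row * 1000

fastest = zlib.compress(ndjson_bytes, level=1)   # favors speed
smallest = zlib.compress(ndjson_bytes, level=9)  # favors size, the recommended default

# Any reader can decompress either stream; the level is not needed to read.
same_content = zlib.decompress(fastest) == zlib.decompress(smallest) == ndjson_bytes

sizes = {"original": len(ndjson_bytes), "level 1": len(fastest), "level 9": len(smallest)}
```

Since every level yields a valid zLib stream, a producer can trade compression time against file size without affecting consumers.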
Implementations should gracefully handle the following error cases:
To ensure maximum compatibility, implementations should:
The DSJC format offers several benefits:
Implementers should be aware of the following limitations:
A conforming implementation must:
| CDISC | Clinical Data Interchange Standards Consortium |
| CPU | Central processing unit. The primary processor in a given computer. |
| DSJC | Dataset-JSON Compressed |
| JSON | JavaScript Object Notation |
| MIME | Multipurpose Internet Mail Extensions. A MIME type is a two-part identifier for file and content formats that defines how different types of data, such as text, images, or other binary files, can be formatted and sent over the internet. |
| NDJSON | Newline delimited JSON |
| UTF-8 | Unicode Transformation Format 8-bit. A character encoding standard used for electronic communication. |
| zLib | Software library used for data compression as well as a data format |