CDISC Compressed Dataset-JSON v1.1 Specification (DSJC)

Title CDISC Compressed Dataset-JSON Specification
Version 1.1
Prepared by CDISC Data Exchange Standards Team
Notes to Readers This is the specification for Version 1.1 of CDISC Compressed Dataset-JSON.
Revision History
Date Version Summary of Changes
2025-06-19 1.1 Final
2025-03-18 1.1 Draft

Introduction

The DSJC (Dataset-JSON Compressed) format is a standardized method for compressing Dataset-JSON content. It specifically uses the Dataset-NDJSON format as the base structure and applies zLib compression to create a more compact representation of the data. This specification describes the DSJC format in a language-neutral manner to ensure interoperability across different platforms and programming environments.

Format Definition

DSJC is defined as a direct zLib compression stream of Dataset-NDJSON format content without additional headers, signatures, or metadata beyond what the zLib format itself provides.

Base Format: Dataset-NDJSON

Dataset-NDJSON serves as the base format for DSJC because its line-oriented structure supports streaming better than a single Dataset-JSON document. It consists of:

  1. First Line: A JSON object containing Dataset-JSON metadata (without the rows property)
  2. Subsequent Lines: Each line is a valid JSON array or object representing a single data record

Whitespace should be reduced to the minimum required by the Dataset-NDJSON format. Avoid spaces between JSON array elements and in attribute definitions (for example, after a colon or comma).

Example of Dataset-NDJSON before compression:

{"datasetJSONCreationDateTime":"2023-01-01T12:00:00","datasetJSONVersion":"1.1","records":3,"name":"ADSL","label":"Subject Level Analysis Dataset","columns":[...]}
["SUBJ001",45,"M"]
["SUBJ002",52,"F"]
["SUBJ003",38,"M"]
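The whitespace guidance above can be illustrated with a minimal Python sketch. The helper name to_ndjson and the abbreviated metadata are hypothetical, and terminating the last line with a newline is an assumption of this sketch rather than a stated requirement.

```python
import json

def to_ndjson(metadata, rows):
    """Serialize a metadata object and data rows as compact NDJSON text."""
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    compact = {"separators": (",", ":"), "ensure_ascii": False}
    lines = [json.dumps(metadata, **compact)]
    lines.extend(json.dumps(row, **compact) for row in rows)
    # One JSON value per line; the trailing newline is an assumption.
    return "\n".join(lines) + "\n"

# Abbreviated, illustrative metadata; a real file carries the full metadata object.
metadata = {"datasetJSONVersion": "1.1", "records": 3, "name": "ADSL"}
rows = [["SUBJ001", 45, "M"], ["SUBJ002", 52, "F"], ["SUBJ003", 38, "M"]]
ndjson_text = to_ndjson(metadata, rows)
```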

Compression

DSJC applies standard zLib compression to the entire Dataset-NDJSON content:

  1. Compression Algorithm: zLib (DEFLATE)
  2. Compression Level: Implementers may choose an appropriate compression level, with 9 (maximum compression) recommended for storage-optimized scenarios and lower levels for performance-sensitive applications
  3. Window Size: 15 bits (32 KB window) is recommended
  4. Strategy: Default compression strategy is recommended
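
As an illustration, the recommended parameters map directly onto the arguments of Python's standard zlib module. The function name compress_dsjc is hypothetical, and the memory level shown is the library default, which this specification does not address.

```python
import zlib

def compress_dsjc(ndjson_bytes, level=9):
    """Compress Dataset-NDJSON bytes with the recommended zlib parameters."""
    compressor = zlib.compressobj(
        level=level,                       # 9 = maximum compression
        method=zlib.DEFLATED,              # DEFLATE algorithm
        wbits=15,                          # 32 KB window, zlib header and trailer
        memLevel=zlib.DEF_MEM_LEVEL,       # library default (not covered by the spec)
        strategy=zlib.Z_DEFAULT_STRATEGY,  # default strategy
    )
    return compressor.compress(ndjson_bytes) + compressor.flush()
```

A lower level, such as compress_dsjc(data, level=1), trades file size for speed without changing how the result is decompressed.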

File Format

A DSJC file consists solely of the zLib-compressed byte stream without any additional headers, signatures, or metadata wrapping. The zLib format itself contains sufficient information for decompressors to reconstruct the original Dataset-NDJSON content.
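
Because there is no DSJC-specific signature, the only leading bytes a reader can inspect are the two-byte zlib header defined in RFC 1950. The following sketch shows one way to do a quick plausibility check before decompressing; the function name is hypothetical, and such a check is not required by this specification.

```python
import zlib

def looks_like_zlib_stream(data):
    """Return True if the first two bytes form a valid zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method_is_deflate = (cmf & 0x0F) == 8      # CM = 8 means DEFLATE
    window_ok = (cmf >> 4) <= 7                # CINFO = 7 means a 32 KB window
    check_ok = ((cmf << 8) | flg) % 31 == 0    # FCHECK makes the pair a multiple of 31
    return method_is_deflate and window_ok and check_ok

payload = zlib.compress(b'{"name":"ADSL"}\n["SUBJ001",45,"M"]\n', 9)
is_plausible = looks_like_zlib_stream(payload)
```

A passing check only indicates a well-formed header; the content must still be decompressed and parsed to confirm it is Dataset-NDJSON.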

File Extension and MIME Type

Files following this specification should use the .dsjc file extension. The recommended MIME type is:

application/vnd.cdisc.dataset-json.compressed

Processing Overview

Creating a DSJC File

The process to create a DSJC file consists of:

  1. Construct a Dataset-NDJSON representation of the data: a metadata line followed by one line per data record
  2. Apply zLib compression to the entire Dataset-NDJSON content
  3. Write the compressed byte stream to a file
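
The three steps above can be sketched in Python as follows. The function name write_dsjc and its parameters are hypothetical, and UTF-8 encoding is assumed; this version builds the whole content in memory, so it suits small datasets only.

```python
import json
import zlib

def write_dsjc(path, metadata, rows, level=9):
    """Write metadata and rows to a DSJC file in one pass (in-memory sketch)."""
    compact = {"separators": (",", ":"), "ensure_ascii": False}
    # Step 1: build the Dataset-NDJSON content, metadata line first.
    lines = [json.dumps(metadata, **compact)]
    lines += [json.dumps(row, **compact) for row in rows]
    ndjson_bytes = ("\n".join(lines) + "\n").encode("utf-8")
    # Step 2: compress the entire content as a single zlib stream.
    compressed = zlib.compress(ndjson_bytes, level)
    # Step 3: write the raw stream with no extra header or wrapper.
    with open(path, "wb") as f:
        f.write(compressed)
```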

Reading a DSJC File

The process to read a DSJC file consists of:

  1. Open the file and apply zLib decompression to the content
  2. Process the decompressed content as Dataset-NDJSON: parse the first line as the metadata object and each subsequent line as a data record
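
A minimal Python sketch of the reading steps is shown below. The function name read_dsjc is hypothetical, UTF-8 encoding is assumed, and the whole file is decompressed at once for brevity.

```python
import json
import zlib

def read_dsjc(path):
    """Return (metadata, rows) from a DSJC file (in-memory sketch)."""
    # Step 1: decompress the whole zlib stream.
    with open(path, "rb") as f:
        ndjson_text = zlib.decompress(f.read()).decode("utf-8")
    # Step 2: split on "\n" only; str.splitlines() would also break on
    # Unicode line separators that may legally appear inside JSON strings.
    lines = [line for line in ndjson_text.split("\n") if line]
    metadata = json.loads(lines[0])
    rows = [json.loads(line) for line in lines[1:]]
    return metadata, rows
```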

Implementation Considerations

Memory Efficiency

When processing large datasets, implementers should consider streaming approaches for both compression and decompression to avoid loading the entire dataset into memory.
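
One possible streaming approach, sketched in Python with the incremental zlib objects, is shown below. The function names and the 64 KB chunk size are assumptions of this sketch; a production reader might also bound the amount of decompressed data produced per chunk.

```python
import itertools
import json
import zlib

CHUNK_SIZE = 64 * 1024  # arbitrary read size chosen for this sketch

def write_dsjc_streaming(path, metadata, rows, level=9):
    """Compress the metadata line and each row as it is produced."""
    compressor = zlib.compressobj(level)
    with open(path, "wb") as f:
        for item in itertools.chain([metadata], rows):
            line = json.dumps(item, separators=(",", ":"), ensure_ascii=False) + "\n"
            f.write(compressor.compress(line.encode("utf-8")))
        f.write(compressor.flush())  # emit buffered data and the zlib trailer

def iter_dsjc_records(path):
    """Yield the metadata object, then each row, one line at a time."""
    decompressor = zlib.decompressobj()
    pending = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            pending += decompressor.decompress(chunk)
            # Keep the last, possibly incomplete, line for the next chunk.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                if line:
                    yield json.loads(line)
    pending += decompressor.flush()
    for line in pending.split(b"\n"):
        if line:
            yield json.loads(line)
```

Splitting the decompressed bytes on the newline byte is safe for UTF-8 content because that byte never occurs inside a multi-byte character.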

Compression Level Selection

Implementers may choose different compression levels based on their specific use case. Because the main purpose of the Dataset-JSON format is data exchange, the maximum compression level (9) is recommended by default.
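
To inform the choice, an implementer could measure the effect of the level on representative data. The sketch below uses a hypothetical helper and synthetic sample content; actual sizes and timings depend on the dataset.

```python
import zlib

def compressed_sizes(ndjson_bytes, levels=(1, 6, 9)):
    """Return a mapping of compression level to compressed size in bytes."""
    return {level: len(zlib.compress(ndjson_bytes, level)) for level in levels}

# Synthetic, highly repetitive sample; real datasets will compress differently.
sample = b'{"name":"ADSL","records":1000}\n' + b'["SUBJ001",45,"M"]\n' * 1000
sizes = compressed_sizes(sample)
```

Whatever level is used to write a file, readers decompress it the same way.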

Error Handling

Implementations should gracefully handle the following error cases:

Compatibility

To ensure maximum compatibility, implementations should:

Benefits

The DSJC format offers several benefits:

  1. Reduced Size: Significant reduction in file size compared to uncompressed Dataset-JSON
  2. Simplified Implementation: Relies on zLib libraries that are widely available, including in SAS, R, Python, and other languages
  3. Streaming Support: Enables record-by-record processing without decompressing or loading the entire dataset
  4. Platform Independence: Works consistently across different operating systems and programming languages

Limitations

Implementers should be aware of the following limitations:

  1. Processing Overhead: Compression and decompression require additional CPU resources
  2. Random Access: DSJC does not provide random access to individual records; reaching a record requires decompressing all content that precedes it

Conformance

A conforming implementation must:

  1. Correctly compress Dataset-NDJSON content using the zLib compression algorithm
  2. Successfully decompress DSJC files and interpret the content as Dataset-NDJSON
  3. Handle errors gracefully as specified in the Error Handling section
  4. Preserve all semantic information from the original Dataset-JSON content
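
One way to test the last requirement is a round trip from a Dataset-JSON object to DSJC and back, comparing parsed values rather than bytes. The function names and the abbreviated sample dataset below are hypothetical, and the comparison treats attribute order as not significant, which is an assumption of this sketch.

```python
import json
import zlib

def dataset_json_to_dsjc(dataset):
    """Convert a parsed Dataset-JSON object (with "rows") to DSJC bytes."""
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    items = [metadata] + list(dataset.get("rows", []))
    text = "".join(
        json.dumps(item, separators=(",", ":"), ensure_ascii=False) + "\n"
        for item in items
    )
    return zlib.compress(text.encode("utf-8"), 9)

def dsjc_to_dataset_json(payload):
    """Rebuild a Dataset-JSON object from DSJC bytes."""
    text = zlib.decompress(payload).decode("utf-8")
    lines = [line for line in text.split("\n") if line]
    dataset = json.loads(lines[0])
    dataset["rows"] = [json.loads(line) for line in lines[1:]]
    return dataset

# Abbreviated, illustrative dataset; real metadata has more attributes.
original = {
    "datasetJSONVersion": "1.1",
    "name": "ADSL",
    "records": 2,
    "rows": [["SUBJ001", 45, "M"], ["SUBJ002", None, "F"]],
}
restored = dsjc_to_dataset_json(dataset_json_to_dsjc(original))
round_trip_ok = restored == original
```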