CDISC Compressed Dataset-JSON v1.1 Specification (DSJC)

Title

CDISC Compressed Dataset-JSON Specification

Version

1.1

Prepared by

CDISC Data Exchange Standards Team

Notes to Readers

This is the specification for Version 1.1 of CDISC Compressed Dataset-JSON.

Revision History

Date	Version	Summary of Changes
2025-12-11	1.1	Final

Introduction
Format Definition
Base Format: Dataset-NDJSON
Compression
File Format
File Extension and MIME Type
Processing Overview
- Creating a DSJC File
- Reading a DSJC File
Implementation Considerations
Benefits
Limitations
Conformance
Glossary and Abbreviations

Introduction

The DSJC (Dataset-JSON Compressed) format is a standardized method for compressing Dataset-JSON content. It specifically uses the Dataset-NDJSON format as the base structure and applies zLib compression to create a more compact representation of the data. This specification describes the DSJC format in a language-neutral manner to ensure interoperability across different platforms and programming environments.

Format Definition

DSJC is defined as a direct zLib compression stream of Dataset-NDJSON format content without additional headers, signatures, or metadata beyond what the zLib format itself provides.

Base Format: Dataset-NDJSON

The Dataset-NDJSON format serves as the base format for DSJC because its line-oriented layout allows records to be streamed one at a time, which a single Dataset-JSON document does not. It consists of:

First line: A JSON object containing Dataset-JSON metadata (without the rows property)
Subsequent lines: Each line is a valid JSON array or object representing a single data record

Whitespace should be reduced to the minimum required by the Dataset-NDJSON format. Avoid spaces between JSON array elements and within attribute definitions.
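The whitespace guidance can be illustrated with a minimal Python sketch. The helper name to_ndjson_line is hypothetical and not part of this specification; it assumes the standard json module, whose default output inserts a space after each comma and colon.

```python
import json

def to_ndjson_line(value):
    """Serialize one metadata object or one record without optional whitespace.

    Hypothetical helper for illustration; not defined by the specification.
    """
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    # ensure_ascii=False keeps non-ASCII characters unescaped for UTF-8 output.
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)

row = ["SUBJ001", 45, "M"]
compact = to_ndjson_line(row)   # no spaces between array elements
default = json.dumps(row)       # default formatting, one space after each comma
```

Both strings parse to the same record; only the compact form follows the recommendation above.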

Example of Dataset-NDJSON before compression:

{"datasetJSONCreationDateTime":"2023-01-01T12:00:00","datasetJSONVersion":"1.1","records":3,"name":"ADSL","label":"Subject Level Analysis Dataset","columns":[...]}
["SUBJ001",45,"M"]
["SUBJ002",52,"F"]
["SUBJ003",38,"M"]

Compression

DSJC applies standard zLib compression to the entire Dataset-NDJSON content:

Compression algorithm: zLib (DEFLATE)
Compression level: Implementers may choose an appropriate compression level, with 9 (maximum compression) recommended for storage-optimized scenarios and lower levels for performance-sensitive applications
Window size: 15 bits (32 KB window) is recommended
Strategy: Default compression strategy is recommended
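As an illustration, the recommended parameters map onto Python's standard zlib module roughly as follows. This is a sketch, not a normative implementation; the function name compress_ndjson is an assumption made for this example.

```python
import zlib

def compress_ndjson(ndjson_bytes: bytes, level: int = 9) -> bytes:
    """Compress Dataset-NDJSON bytes with the recommended parameters.

    Hypothetical helper for illustration; not defined by the specification.
    """
    compressor = zlib.compressobj(
        level=level,                       # 9 = maximum compression
        method=zlib.DEFLATED,              # DEFLATE algorithm
        wbits=15,                          # 15-bit (32 KB) window, zlib header and trailer
        strategy=zlib.Z_DEFAULT_STRATEGY,  # default strategy
    )
    return compressor.compress(ndjson_bytes) + compressor.flush()
```

A positive wbits value in the range 9 to 15 makes the module emit the zLib wrapper rather than a raw DEFLATE or gzip stream, which matches the format definition above.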

File Format

A DSJC file consists solely of the zLib-compressed byte stream without any additional headers, signatures, or metadata wrapping. The zLib format itself contains sufficient information for decompressors to reconstruct the original Dataset-NDJSON content.
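Because DSJC defines no signature of its own, the only leading bytes a reader can inspect are those of the zLib wrapper. The following Python sketch shows a heuristic check based on the two-byte header layout in RFC 1950. The function name looks_like_zlib_stream is hypothetical, and passing the check does not guarantee that the rest of the stream is valid.

```python
import zlib

def looks_like_zlib_stream(data: bytes) -> bool:
    """Heuristic check of the two-byte zlib header (CMF, FLG) from RFC 1950.

    Hypothetical helper for illustration; not defined by the specification.
    """
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # The low four bits of CMF hold the compression method; 8 means DEFLATE.
    if cmf & 0x0F != 8:
        return False
    # The high four bits hold log2(window size) - 8; values above 7 are not allowed.
    if cmf >> 4 > 7:
        return False
    # CMF and FLG, read as a 16-bit big-endian integer, must be a multiple of 31.
    return (cmf * 256 + flg) % 31 == 0

sample = zlib.compress(b'{"name":"ADSL"}\n["SUBJ001",45,"M"]\n', 9)
```

Such a check can help distinguish a DSJC file from uncompressed Dataset-NDJSON or a gzip file before attempting full decompression.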

File Extension and MIME Type

Files following this specification should use the .dsjc file extension. The recommended MIME type is:

application/vnd.cdisc.dataset-json.compressed

Processing Overview

Creating a DSJC File

The process to create a DSJC file consists of:

Construct a Dataset-NDJSON format representation of the data:
- First line contains dataset metadata as a JSON object
- Each subsequent line contains a single record as a JSON array or object
Apply zLib compression to the entire Dataset-NDJSON content
Write the compressed byte stream to a file
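The steps above can be sketched in Python as follows. The function name write_dsjc and its parameters are assumptions made for this example, and the whole dataset is held in memory for simplicity.

```python
import json
import zlib

def write_dsjc(path, metadata, rows, level=9):
    """Write metadata and records to a DSJC file (illustrative sketch only)."""
    # Step 1: build the Dataset-NDJSON text. The metadata line omits "rows".
    meta = {key: value for key, value in metadata.items() if key != "rows"}
    lines = [json.dumps(meta, separators=(",", ":"), ensure_ascii=False)]
    for row in rows:
        lines.append(json.dumps(row, separators=(",", ":"), ensure_ascii=False))
    ndjson = "\n".join(lines) + "\n"
    # Step 2: apply zlib compression to the entire UTF-8 encoded content.
    compressed = zlib.compress(ndjson.encode("utf-8"), level)
    # Step 3: write the compressed byte stream, with no extra header or wrapper.
    with open(path, "wb") as handle:
        handle.write(compressed)
```

For example, write_dsjc("adsl.dsjc", metadata, rows) would produce a file whose decompressed content has one metadata line followed by one line per record.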

Reading a DSJC File

The process to read a DSJC file consists of:

Open the file and apply zLib decompression to the content
Process the decompressed content as Dataset-NDJSON:
- First line: Parse as JSON object to extract metadata
- Subsequent lines: Parse each line as a JSON array or object representing a data record
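A minimal Python sketch of the reading steps is shown below. The function name read_dsjc is hypothetical, and this version decompresses the whole file at once, so it suits small datasets only.

```python
import json
import zlib

def read_dsjc(path):
    """Read a DSJC file and return (metadata, rows). Illustrative sketch only."""
    # Step 1: open the file and decompress its entire content.
    with open(path, "rb") as handle:
        text = zlib.decompress(handle.read()).decode("utf-8")
    # Split on "\n" only; str.splitlines() would also split on Unicode line
    # separators, which may legally appear inside JSON string values.
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        raise ValueError("decompressed content has no metadata line")
    # Step 2: the first line is the metadata object, the rest are records.
    metadata = json.loads(lines[0])
    rows = [json.loads(line) for line in lines[1:]]
    return metadata, rows
```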

Implementation Considerations

Memory Efficiency

When processing large datasets, implementers should consider streaming approaches for both compression and decompression to avoid loading the entire dataset into memory.
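One possible streaming approach for decompression is sketched below in Python. The generator name iter_dsjc_lines and the chunk_size parameter are assumptions made for this example. The sketch keeps only the current chunk and one partial line in memory, although a single highly compressible chunk can still expand considerably.

```python
import json
import zlib

def iter_dsjc_lines(path, chunk_size=64 * 1024):
    """Yield parsed lines (metadata first, then records) from a DSJC file.

    Hypothetical helper for illustration; not defined by the specification.
    """
    decompressor = zlib.decompressobj()
    pending = b""  # bytes of the last, possibly incomplete, line
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            pending += decompressor.decompress(chunk)
            # Everything before the final "\n" is complete; keep the remainder.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                if line.strip():
                    yield json.loads(line)
    pending += decompressor.flush()
    if not decompressor.eof:
        raise ValueError("zlib stream ended before its end-of-stream marker")
    for line in pending.split(b"\n"):
        if line.strip():
            yield json.loads(line)
```

The first value yielded is the metadata object; each later value is one data record. Splitting on the newline byte is safe because UTF-8 never uses that byte inside a multi-byte character.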

Compression Level Selection

Implementers may choose different compression levels based on their specific use case. As the main purpose of the Dataset-JSON format is data exchange, the maximum compression level (9) is recommended by default.

Level 1-3: Faster compression, lower compression ratio
Level 4-6: Balanced compression speed and ratio
Level 7-9: Higher compression ratio, slower compression
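To choose a level for a particular dataset, an implementer could measure the compressed size at a few levels, as in the following Python sketch. The function name compressed_sizes is hypothetical, and actual sizes and timings depend on the data and on the zLib build in use.

```python
import zlib

def compressed_sizes(ndjson_bytes, levels=(1, 6, 9)):
    """Return a mapping of compression level to compressed size in bytes.

    Hypothetical helper for illustration; not defined by the specification.
    """
    return {level: len(zlib.compress(ndjson_bytes, level)) for level in levels}

# Synthetic, highly repetitive sample data; real datasets will behave differently.
sample = "".join('["SUBJ%03d",45,"M"]\n' % i for i in range(500)).encode("utf-8")
sizes = compressed_sizes(sample)
```

Comparing the values in sizes against len(sample) gives the compression ratio for each level.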

Error Handling

Implementations should gracefully handle the following error cases:

Invalid or corrupt zLib compressed data
Malformed JSON in the decompressed content
Inconsistency between metadata and actual data records
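The three error cases above could be handled as in the following Python sketch. The exception class DsjcError and the function load_dsjc_bytes are hypothetical names. The consistency check assumes the metadata carries a records attribute giving the number of data records, as in the example earlier in this document.

```python
import json
import zlib

class DsjcError(Exception):
    """Raised when DSJC content cannot be read (hypothetical class)."""

def load_dsjc_bytes(data: bytes):
    """Decompress and parse DSJC bytes, mapping failures to DsjcError."""
    # Case 1: invalid or corrupt zlib data (including truncated streams).
    try:
        text = zlib.decompress(data).decode("utf-8")
    except zlib.error as exc:
        raise DsjcError(f"invalid or corrupt zlib data: {exc}") from exc
    except UnicodeDecodeError as exc:
        raise DsjcError(f"content is not valid UTF-8: {exc}") from exc
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        raise DsjcError("decompressed content has no metadata line")
    # Case 2: malformed JSON in the decompressed content.
    try:
        metadata = json.loads(lines[0])
        rows = [json.loads(line) for line in lines[1:]]
    except json.JSONDecodeError as exc:
        raise DsjcError(f"malformed JSON: {exc}") from exc
    if not isinstance(metadata, dict):
        raise DsjcError("first line is not a JSON object")
    # Case 3: metadata disagrees with the number of records actually present.
    declared = metadata.get("records")
    if declared is not None and declared != len(rows):
        raise DsjcError(f"metadata declares {declared} records, found {len(rows)}")
    return metadata, rows
```

Raising a single exception type lets callers report a failed file without needing to know which library detected the problem.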

Compatibility

To ensure maximum compatibility, implementations should:

Use standard zLib libraries available on most platforms
Follow the Dataset-NDJSON format strictly, with proper line delimiters
Ensure proper handling of character encodings (UTF-8 recommended)

Benefits

The DSJC format offers several benefits:

Reduced size: Significant reduction in file size compared to uncompressed Dataset-JSON
Simplified implementation: Leverages widely available zLib libraries, available in SAS, R, Python, and other languages
Streaming support: Enables record-by-record processing without decompressing or loading the entire dataset
Platform independence: Works consistently across different operating systems and programming languages

Limitations

Implementers should be aware of the following limitations:

Processing overhead: Compression and decompression require additional CPU resources
Random access: DSJC does not provide direct access to a specific record; all content preceding that record must be decompressed first

Conformance

A conforming implementation must:

Correctly compress Dataset-NDJSON content using the zLib compression algorithm
Successfully decompress DSJC files and interpret the content as Dataset-NDJSON
Handle errors gracefully, as specified in Error Handling
Preserve all semantic information from the original Dataset-JSON content

Glossary and Abbreviations

The following table lists some of the abbreviations and terms used in this document.

CDISC	Clinical Data Interchange Standards Consortium
CPU	Central processing unit. The primary processor in a given computer.
DSJC	Dataset-JSON Compressed
JSON	JavaScript Object Notation
MIME	Multipurpose Internet Mail Extensions. A MIME type is a two-part identifier (type/subtype) that tells software how to interpret content, such as text, images, or other binary files, sent over the internet.
NDJSON	Newline delimited JSON
UTF-8	Unicode Transformation Format 8-bit. A character encoding standard used for electronic communication.
zLib	Software library for data compression, and the compressed data format that the library defines