CDISC Compressed Dataset-JSON v1.1 Specification (DSJC)
Title: CDISC Compressed Dataset-JSON Specification
Version: 1.1
Prepared by: CDISC Data Exchange Standards Team
Notes to Readers:
This is the specification for Version 1.1 of CDISC Compressed
Dataset-JSON.
Revision History
Date | Version | Summary of Changes
2025-06-19 | 1.1 | Final
2025-03-18 | 1.1 | Draft
Introduction
The DSJC (Dataset-JSON Compressed) format is a standardized method for
compressing Dataset-JSON content. It specifically uses the Dataset-NDJSON
format as the base structure and applies zLib compression to create a more
compact representation of the data. This specification describes the DSJC
format in a language-neutral manner to ensure interoperability across
different platforms and programming environments.
DSJC is defined as a direct zLib compression stream of Dataset-NDJSON
format content without additional headers, signatures, or metadata beyond
what the zLib format itself provides.
The Dataset-NDJSON format serves as the base format for DSJC because it
can be read and written one line at a time, which suits streaming better
than a single Dataset-JSON object. It consists of:
- First Line: A JSON object containing Dataset-JSON
metadata (without the rows property)
- Subsequent Lines: Each line is a valid JSON array or
object representing a single data record
Whitespace should be kept to the minimum required by the Dataset-NDJSON
format. Avoid spaces between JSON array elements and between object keys
and values.
Example of Dataset-NDJSON before compression:
{"datasetJSONCreationDateTime":"2023-01-01T12:00:00","datasetJSONVersion":"1.1","records":3,"name":"ADSL","label":"Subject Level Analysis Dataset","columns":[...]}
["SUBJ001",45,"M"]
["SUBJ002",52,"F"]
["SUBJ003",38,"M"]
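The layout and whitespace guidance above can be sketched with Python's standard `json` module. This is a minimal illustration, not part of the specification: the function name `to_ndjson` and the abbreviated sample metadata are hypothetical, and the trailing newline after the last record is an assumption.

```python
import json

def to_ndjson(metadata: dict, rows: list) -> str:
    """Serialize metadata and rows as Dataset-NDJSON text without optional whitespace."""
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    compact = {"separators": (",", ":"), "ensure_ascii": False}
    lines = [json.dumps(metadata, **compact)]
    lines.extend(json.dumps(row, **compact) for row in rows)
    # One JSON value per line; a trailing newline is assumed here.
    return "\n".join(lines) + "\n"

# Hypothetical sample values, abbreviated from the example above.
metadata = {"datasetJSONVersion": "1.1", "records": 2, "name": "ADSL"}
rows = [["SUBJ001", 45, "M"], ["SUBJ002", 52, "F"]]
ndjson_text = to_ndjson(metadata, rows)
print(ndjson_text, end="")
```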
Compression
DSJC applies standard zLib compression to the entire Dataset-NDJSON
content:
- Compression Algorithm: zLib (DEFLATE)
- Compression Level: Implementers may choose an
appropriate compression level, with 9 (maximum compression) recommended for
storage-optimized scenarios and lower levels for performance-sensitive
applications
- Window Size: 15 bits (32 KB window) is
recommended
- Strategy: Default compression strategy is
recommended
A DSJC file consists solely of the zLib-compressed byte stream without any
additional headers, signatures, or metadata wrapping. The zLib format itself
contains sufficient information for decompressors to reconstruct the original
Dataset-NDJSON content.
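As an illustration of the parameters listed above, the following sketch uses Python's standard `zlib` module. The function name `compress_dsjc` and the sample bytes are hypothetical. A positive `wbits` value of 15 selects the zLib container with a 32 KB window; negative values would produce a raw DEFLATE stream and values above 16 a gzip container, neither of which matches the description above.

```python
import zlib

def compress_dsjc(ndjson_bytes: bytes, level: int = 9) -> bytes:
    """Compress Dataset-NDJSON bytes into a bare zLib stream."""
    compressor = zlib.compressobj(
        level=level,                      # 9 = maximum compression
        method=zlib.DEFLATED,             # the only method zlib defines
        wbits=15,                         # 32 KB window, zLib header and trailer
        strategy=zlib.Z_DEFAULT_STRATEGY, # default strategy
    )
    return compressor.compress(ndjson_bytes) + compressor.flush()

# Hypothetical sample content for illustration.
ndjson_bytes = b'{"records":1,"name":"ADSL"}\n["SUBJ001",45,"M"]\n'
payload = compress_dsjc(ndjson_bytes)

# The two-byte zLib header is the only framing in the file.
print(payload[:2].hex())
```

The low four bits of the first byte identify the DEFLATE method, and the two header bytes read as a big-endian integer are a multiple of 31, which is how a decompressor recognizes a zLib stream without any extra signature.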
File Extension and MIME Type
Files following this specification should use the .dsjc file extension.
The recommended MIME type is:
application/vnd.cdisc.dataset-json.compressed
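An application that serves or receives DSJC files may need to associate the extension with the MIME type itself. A minimal sketch using Python's standard `mimetypes` module follows; the file name `adsl.dsjc` is hypothetical, and the registration only affects the running process.

```python
import mimetypes

DSJC_MIME_TYPE = "application/vnd.cdisc.dataset-json.compressed"

# Register the mapping for this process only; system MIME tables are not modified.
mimetypes.add_type(DSJC_MIME_TYPE, ".dsjc")

guessed_type, _encoding = mimetypes.guess_type("adsl.dsjc")
print(guessed_type)
```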
Processing Overview
Creating a DSJC File
The process to create a DSJC file consists of:
- Construct a Dataset-NDJSON format representation of the data:
  - First line contains dataset metadata as a JSON object
  - Each subsequent line contains a single record as a JSON array or
    object
- Apply zLib compression to the entire Dataset-NDJSON content
- Write the compressed byte stream to a file
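The creation steps above can be sketched as follows. This is one possible approach rather than a reference implementation: `write_dsjc` and the sample data are hypothetical names, and the sketch compresses line by line so the full Dataset-NDJSON text is never held in memory.

```python
import itertools
import json
import tempfile
import zlib
from pathlib import Path

def write_dsjc(path, metadata, rows, level=9):
    """Write metadata and rows to `path` as a DSJC file, one line at a time."""
    compressor = zlib.compressobj(level)
    with open(path, "wb") as out:
        # Step 1: the metadata object comes first, then one line per record.
        for item in itertools.chain([metadata], rows):
            line = json.dumps(item, separators=(",", ":"), ensure_ascii=False) + "\n"
            # Step 2: feed each UTF-8 encoded line to the zLib compressor.
            out.write(compressor.compress(line.encode("utf-8")))
        # Step 3: flush the remaining compressed bytes to finish the stream.
        out.write(compressor.flush())

# Hypothetical sample data for illustration.
metadata = {"datasetJSONVersion": "1.1", "records": 2, "name": "ADSL"}
rows = [["SUBJ001", 45, "M"], ["SUBJ002", 52, "F"]]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "adsl.dsjc"
    write_dsjc(target, metadata, rows)
    # Decompress again to confirm the file holds the expected NDJSON text.
    restored = zlib.decompress(target.read_bytes()).decode("utf-8")

print(restored, end="")
```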
Reading a DSJC File
The process to read a DSJC file consists of:
- Open the file and apply zLib decompression to the content
- Process the decompressed content as Dataset-NDJSON:
  - First line: Parse as JSON object to extract metadata
  - Subsequent lines: Parse each line as a JSON array or object
    representing a data record
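The reading steps above can be sketched in the same way. `read_dsjc` is a hypothetical helper that loads the whole file at once, which is only reasonable for small datasets. It splits on `"\n"` explicitly because `str.splitlines()` would also split on other Unicode line separators that may legally appear inside JSON strings.

```python
import json
import tempfile
import zlib
from pathlib import Path

def read_dsjc(path):
    """Return (metadata, records) from a DSJC file, loading it fully into memory."""
    # Step 1: read the file and apply zLib decompression.
    text = zlib.decompress(Path(path).read_bytes()).decode("utf-8")
    # Step 2: treat the result as Dataset-NDJSON, ignoring empty lines.
    lines = [line for line in text.split("\n") if line]
    metadata = json.loads(lines[0])                      # first line: metadata object
    records = [json.loads(line) for line in lines[1:]]   # remaining lines: records
    return metadata, records

# Hypothetical sample file created only for this demonstration.
sample = (
    b'{"datasetJSONVersion":"1.1","records":2,"name":"ADSL"}\n'
    b'["SUBJ001",45,"M"]\n'
    b'["SUBJ002",52,"F"]\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "adsl.dsjc"
    path.write_bytes(zlib.compress(sample, 9))
    metadata, records = read_dsjc(path)

print(metadata["name"], len(records))
```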
Implementation Considerations
Memory Efficiency
When processing large datasets, implementers should consider streaming
approaches for both compression and decompression to avoid loading the entire
dataset into memory.
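A streaming reader might look like the following sketch, which decompresses fixed-size chunks and yields one parsed line at a time. The generator name `iter_dsjc_lines` and the chunk size are assumptions for illustration. For highly compressible input, a single chunk can still expand considerably; the `max_length` argument of `decompress` could be used to bound that, which this sketch omits for brevity.

```python
import json
import tempfile
import zlib
from pathlib import Path

def iter_dsjc_lines(path, chunk_size=64 * 1024):
    """Yield each parsed NDJSON line (metadata first) without loading the whole file."""
    decompressor = zlib.decompressobj()
    pending = b""  # bytes of a line that is not yet complete
    with open(path, "rb") as source:
        while chunk := source.read(chunk_size):
            pending += decompressor.decompress(chunk)
            # Everything before the last newline is complete; keep the remainder.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                if line:
                    yield json.loads(line)
    pending += decompressor.flush()
    if pending.strip():
        # Final line without a trailing newline.
        yield json.loads(pending)

# Hypothetical sample data; a tiny chunk size exercises the chunk boundaries.
sample = b'{"records":2,"name":"ADSL"}\n["SUBJ001",45,"M"]\n["SUBJ002",52,"F"]'
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "adsl.dsjc"
    path.write_bytes(zlib.compress(sample, 9))
    items = list(iter_dsjc_lines(path, chunk_size=7))

print(items)
```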
Compression Level Selection
Implementers may choose different compression levels based on their
specific use case. Because the main purpose of the Dataset-JSON format is
data exchange, the maximum compression level (9) is recommended by default.
- Level 1-3: Faster compression, lower compression ratio
- Level 4-6: Balanced compression speed and ratio
- Level 7-9: Higher compression ratio, slower compression
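The trade-off can be measured on representative data before choosing a level. The sketch below uses synthetic, highly repetitive records, so the ratios it prints are not indicative of real datasets; the record layout is hypothetical.

```python
import json
import time
import zlib

# Synthetic records for illustration only; real data will compress differently.
lines = [
    json.dumps([f"SUBJ{i:04d}", 30 + i % 40, "MF"[i % 2]], separators=(",", ":"))
    for i in range(5000)
]
data = ("\n".join(lines) + "\n").encode("utf-8")

sizes = {}
for level in (1, 6, 9):
    started = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - started
    sizes[level] = len(compressed)
    print(f"level {level}: {len(compressed)} bytes "
          f"({len(compressed) / len(data):.1%} of original) in {elapsed * 1000:.2f} ms")
```

Whatever level is chosen, the output is a standard zLib stream, so readers do not need to know which level the writer used.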
Error Handling
Implementations should gracefully handle the following error cases:
- Invalid or corrupt zLib compressed data
- Malformed JSON in the decompressed content
- Inconsistency between metadata and actual data records
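One way to surface the three error cases above is to map them to a single exception type. In this sketch, `DsjcError` and `load_dsjc_bytes` are hypothetical names, and the consistency check assumes the metadata carries a `records` count as in the example earlier in this document.

```python
import json
import zlib

class DsjcError(Exception):
    """Hypothetical exception type for any problem found while reading DSJC content."""

def load_dsjc_bytes(payload: bytes):
    """Decompress and parse DSJC bytes, raising DsjcError for each error case."""
    try:
        text = zlib.decompress(payload).decode("utf-8")
    except zlib.error as exc:
        raise DsjcError(f"invalid or corrupt zLib data: {exc}") from exc
    except UnicodeDecodeError as exc:
        raise DsjcError(f"decompressed content is not valid UTF-8: {exc}") from exc

    lines = [line for line in text.split("\n") if line]
    if not lines:
        raise DsjcError("decompressed content is empty")
    try:
        metadata = json.loads(lines[0])
        records = [json.loads(line) for line in lines[1:]]
    except json.JSONDecodeError as exc:
        raise DsjcError(f"malformed JSON in decompressed content: {exc}") from exc
    if not isinstance(metadata, dict):
        raise DsjcError("first line is not a JSON object")

    # Compare the declared record count, when present, with the lines actually read.
    expected = metadata.get("records")
    if expected is not None and expected != len(records):
        raise DsjcError(f"metadata declares {expected} records, found {len(records)}")
    return metadata, records

# Hypothetical payloads, one per error case.
for bad in (
    b"not zlib data",
    zlib.compress(b'{"records":1}\n["SUBJ001",45\n'),
    zlib.compress(b'{"records":2}\n["SUBJ001",45,"M"]\n'),
):
    try:
        load_dsjc_bytes(bad)
    except DsjcError as exc:
        print("rejected:", exc)
```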
Compatibility
To ensure maximum compatibility, implementations should:
- Use standard zLib libraries available on most platforms
- Follow the Dataset-NDJSON format strictly, with proper line
delimiters
- Ensure proper handling of character encodings (UTF-8 recommended)
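The encoding and line-delimiter points can be checked with a short round trip. The record values below are hypothetical. JSON serialization escapes a newline inside a string value as `\n` (backslash and letter n), so the only literal newline bytes in the content are the line delimiters.

```python
import json
import zlib

# Hypothetical record with a non-ASCII character and an embedded newline.
record = ["SUBJ001", "Müller", "first line\nsecond line"]

line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
encoded = (line + "\n").encode("utf-8")

# The embedded newline is escaped, so exactly one delimiter byte is present.
print(encoded.count(b"\n"))

payload = zlib.compress(encoded, 9)
restored = json.loads(zlib.decompress(payload).decode("utf-8"))
print(restored == record)
```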
Benefits
The DSJC format offers several benefits:
- Reduced Size: Significant reduction in file size compared to
uncompressed Dataset-JSON
- Simplified Implementation: Leverages widely available zLib libraries,
available in SAS, R, Python, and other languages
- Streaming Support: Enables record-by-record processing without
decompressing or loading the entire dataset
- Platform Independence: Works consistently across different operating
systems and programming languages
Limitations
Implementers should be aware of the following limitations:
- Processing Overhead: Compression and decompression require additional
CPU resources
- Random Access: DSJC does not provide direct random access to specific
records; reaching a record requires decompressing all preceding content
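The random-access limitation can be seen in a sketch that fetches a single record: it can stop early once the record is found, but it still has to decompress everything before it. The function name `nth_record` and its 1-based numbering are assumptions for illustration.

```python
import json
import zlib

def nth_record(payload: bytes, n: int, chunk_size: int = 4096):
    """Return the n-th data record (1-based), decompressing only as far as needed."""
    decompressor = zlib.decompressobj()
    pending = b""
    line_no = 0  # line 0 is the metadata object, so record n is line n
    for start in range(0, len(payload), chunk_size):
        pending += decompressor.decompress(payload[start:start + chunk_size])
        *complete, pending = pending.split(b"\n")
        for line in complete:
            if line_no == n:
                return json.loads(line)  # stop here; later content stays compressed
            line_no += 1
    pending += decompressor.flush()
    if pending.strip() and line_no == n:
        return json.loads(pending)  # last line without a trailing newline
    raise IndexError(f"record {n} not found")

# Hypothetical sample content for illustration.
sample = (
    b'{"records":3,"name":"ADSL"}\n'
    b'["SUBJ001",45,"M"]\n["SUBJ002",52,"F"]\n["SUBJ003",38,"M"]\n'
)
payload = zlib.compress(sample, 9)
second = nth_record(payload, 2)
print(second)
```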
Conformance
A conforming implementation must:
- Correctly compress Dataset-NDJSON content using the zLib compression
algorithm
- Successfully decompress DSJC files and interpret the content as
Dataset-NDJSON
- Handle errors gracefully as specified in the Error Handling section
- Preserve all semantic information from the original Dataset-JSON
content