Title: CDISC Dataset-JSON Specification
Version: 1.1
Prepared by: CDISC Data Exchange Standards Team
Notes to Readers: This is the specification for Version 1.1 of CDISC Dataset-JSON.

Revision History

Date | Version | Summary of Changes
2024-12-05 | 1.1 | Final
2023-08-23 | 1.0 | Final

Introduction

Dataset-JSON is a data exchange standard for sharing tabular data using JSON. It is designed to meet a wide range of data exchange scenarios, including regulatory submissions and API-based data exchange. Each Dataset-JSON dataset can optionally reference a Define-XML document containing more complete metadata for the dataset. One aim of Dataset-JSON is to address as many of the relevant requirements in the PHUSE 2017 Transport for the Next Generation paper as possible, including the efficient use of storage space.

Dataset-JSON uses lowerCamelCase notation for attribute names.

The JSON standard does not allow specifying or controlling the order of attributes. However, since most JSON encoders and decoders allow control over attribute order, it is strongly recommended to follow the attribute order documented in this specification. Due to the potentially large size of Dataset-JSON datasets, adhering to the specified attribute order allows software using streaming approaches to read the file more efficiently and quickly.
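As a minimal sketch of how an encoder can control attribute order, the Python example below relies on dictionaries keeping insertion order and on json.dumps writing keys in that order. TOP_LEVEL_ORDER follows the Top-level Metadata Attributes table in this specification; the helper name order_attributes, the sample input, and the choice to place unknown attributes last are assumptions made for illustration.

```python
import json

# Attribute order taken from the Top-level Metadata Attributes table.
TOP_LEVEL_ORDER = [
    "datasetJSONCreationDateTime", "datasetJSONVersion", "fileOID",
    "dbLastModifiedDateTime", "originator", "sourceSystem", "studyOID",
    "metaDataVersionOID", "metaDataRef", "itemGroupOID", "records",
    "name", "label", "columns", "rows",
]


def order_attributes(dataset: dict) -> dict:
    """Return a copy of dataset with its keys in the documented order."""
    ordered = {key: dataset[key] for key in TOP_LEVEL_ORDER if key in dataset}
    # Assumption of this sketch: attributes not listed in the table go last.
    for key, value in dataset.items():
        ordered.setdefault(key, value)
    return ordered


# Hypothetical input whose keys arrive in an arbitrary order.
unordered = {
    "rows": [],
    "columns": [],
    "label": "Demographics",
    "name": "DM",
    "records": 0,
    "itemGroupOID": "IG.DM",
    "datasetJSONVersion": "1.1.0",
    "datasetJSONCreationDateTime": "2023-03-22T11:53:27",
}

encoded = json.dumps(order_attributes(unordered), indent=2)
print(encoded)
```

Writing the metadata attributes before rows means a streaming reader has seen the column definitions by the time it reaches the data.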

Each Dataset-JSON file must contain only one dataset. Dataset-JSON uses the file extension .json.

Although adapted from the Dataset-XML Version 1.0 specification, Dataset-JSON uses the JSON format and includes many enhancements.

The LinkML model representation, the JSON schema, both for JSON and NDJSON representations, and examples can be found at the GitHub repository for the Dataset-JSON Version 1.1 standard.

The specification and user's guide are for both clinical and non-clinical research data.

Top-level Metadata Attributes

The following table summarizes the technical and dataset attributes at the top level of the Dataset-JSON object.

Attribute order Attribute JSON Schema data type Enumeration Allowed string pattern Usage Description
1 datasetJSONCreationDateTime string YYYY-MM-DDThh:mm:ss(.n+)?(((+|-)hh:mm)|Z)? Required The date/time the Dataset-JSON file was created.
2 datasetJSONVersion string 1.1(.(0|([1-9][0-9]*)))? Required The version of the Dataset-JSON standard used to create the dataset.
3 fileOID string "minLength": 1 Optional A unique identifier for this dataset. See the ODM specification for OID considerations.
4 dbLastModifiedDateTime string YYYY-MM-DDThh:mm:ss(.n+)?(((+|-)hh:mm)|Z)? Optional The date/time the source database was last modified before creating the Dataset-JSON file. dbLastModifiedDateTime must be on or before the date/time the Dataset-JSON file was created (datasetJSONCreationDateTime).
5 originator string Optional The organization that generated the Dataset-JSON dataset.
6 sourceSystem object Optional The information system from which the content of this dataset was sourced.
7 sourceSystem.name string Required The name of the sourceSystem above. The sourceSystem itself is an optional attribute. However, when the sourceSystem is present, the name is a required attribute of the sourceSystem.
8 sourceSystem.version string Required The version of the sourceSystem above. The sourceSystem itself is an optional attribute. However, when the sourceSystem is present, the version is a required attribute of the sourceSystem.
9 studyOID string "minLength": 1 Optional Unique identifier for the study that may also function as a foreign key to a Study/@OID in an associated Define-XML document, or to any studyOID references that are used as keys in other documents; i.e., documents created based on the ICH eCTD STF Specification. See the ODM specification for OID considerations.
10 metaDataVersionOID string "minLength": 1 Optional Unique identifier for the metadata version that may also function as a foreign key to a MetaDataVersion/@OID in an associated Define-XML file. See the ODM specification for OID considerations.
11 metaDataRef string Optional URI for the metadata file describing the dataset (e.g., a Define-XML file).
12 itemGroupOID string "minLength": 1 Required Unique identifier for the dataset that may also function as a foreign key to an ItemGroupDef/@OID in an associated Define-XML file. See the ODM specification for OID considerations.
13 records integer "minimum": 0 Required The total number of records in a dataset. Since "rows" is an optional object, records=0 allows for the transfer of metadata without sending data.
14 name string "minLength": 1 Required The human-readable name for the dataset.
15 label string Required A short description of the dataset.
16 columns array Required An array of metadata objects that describe the dataset variables.
17 rows array Optional An array of data record arrays that represent the dataset rows.

The following example illustrates the basic Dataset-JSON structure.

{
  "datasetJSONCreationDateTime": "2023-03-22T11:53:27",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-02-15T10:23:15",
  "originator": "Sponsor XYZ",
  "sourceSystem": { "name": "Software ABC", "version": "1.0.0" },
  "studyOID": "xxx",
  "metaDataVersionOID": "xxx.y",
  "metaDataRef": "https://metadata.location.org/api.link",
  "itemGroupOID": "IG.DM",
  "records": 100,
  "name": "DM",
  "label": "Demographics",
  "columns": [ ... ],
  "rows": [ ... ]
}
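The usage rules in the table above can be checked mechanically. The following Python sketch is one possible reading of them, not a replacement for the published JSON schema: the function name check_top_level, the message texts, and the regular-expression rendering of the date/time pattern are assumptions. The date/time comparison is skipped when a value cannot be parsed or when only one of the two values carries a time zone offset.

```python
import re
from datetime import datetime

REQUIRED = ["datasetJSONCreationDateTime", "datasetJSONVersion", "itemGroupOID",
            "records", "name", "label", "columns"]

# Regex reading of YYYY-MM-DDThh:mm:ss(.n+)?(((+|-)hh:mm)|Z)? (an interpretation).
DATETIME_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(([+-]\d{2}:\d{2})|Z)?")
VERSION_RE = re.compile(r"1\.1(\.(0|([1-9][0-9]*)))?")


def _parse(value):
    """Parse a date/time string, or return None if this sketch cannot."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def check_top_level(ds: dict) -> list:
    """Return a list of problems found in the top-level attributes."""
    problems = [f"missing required attribute: {a}" for a in REQUIRED if a not in ds]
    for attr in ("datasetJSONCreationDateTime", "dbLastModifiedDateTime"):
        value = ds.get(attr)
        if attr in ds and not (isinstance(value, str) and DATETIME_RE.fullmatch(value)):
            problems.append(f"{attr} does not match the date/time pattern")
    version = ds.get("datasetJSONVersion")
    if "datasetJSONVersion" in ds and not (
            isinstance(version, str) and VERSION_RE.fullmatch(version)):
        problems.append("datasetJSONVersion does not match the version pattern")
    records = ds.get("records")
    if "records" in ds and not (
            isinstance(records, int) and not isinstance(records, bool) and records >= 0):
        problems.append("records must be an integer >= 0")
    # name and version are required only when sourceSystem is present.
    for attr in ("name", "version"):
        if "sourceSystem" in ds and attr not in ds["sourceSystem"]:
            problems.append(f"sourceSystem.{attr} is required when sourceSystem is present")
    created = _parse(ds.get("datasetJSONCreationDateTime"))
    modified = _parse(ds.get("dbLastModifiedDateTime"))
    # Compare only when both parsed and both are naive or both carry an offset.
    if created and modified and (created.tzinfo is None) == (modified.tzinfo is None):
        if modified > created:
            problems.append("dbLastModifiedDateTime is after datasetJSONCreationDateTime")
    return problems


# Values from the example above; columns is elided there, so an empty
# placeholder list is used and the optional rows attribute is left out.
example = {
    "datasetJSONCreationDateTime": "2023-03-22T11:53:27",
    "datasetJSONVersion": "1.1.0",
    "fileOID": "www.sponsor.xyz.org.project123.final",
    "dbLastModifiedDateTime": "2023-02-15T10:23:15",
    "originator": "Sponsor XYZ",
    "sourceSystem": {"name": "Software ABC", "version": "1.0.0"},
    "studyOID": "xxx",
    "metaDataVersionOID": "xxx.y",
    "metaDataRef": "https://metadata.location.org/api.link",
    "itemGroupOID": "IG.DM",
    "records": 100,
    "name": "DM",
    "label": "Demographics",
    "columns": [],
}
print(check_top_level(example))
```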

Column Metadata

columns is an array of basic information about dataset variables. The order of the elements in the array must be the same as the order of variables in the described dataset.

Attribute order Attribute JSON Schema data type Enumeration Allowed string pattern Usage Description
1 itemOID string "minLength": 1 Required Unique identifier for the variable that may also function as a foreign key to an ItemDef/@OID in an associated Define-XML file. See the ODM specification for OID considerations.
2 name string "minLength": 1 Required Variable name
3 label string Required Variable description
4 dataType string ["string", "integer", "decimal", "float", "double", "boolean", "datetime", "date", "time", "URI"] Required Logical data type of the variable. The dataType attribute represents the planned specificity of the data.

See the ODM Data Formats specification for details.

Note: Decimal numbers represented as a string in JSON must use the dot (.) as the decimal separator. When a thousand separator is used in a decimal represented as string, the comma is used.

Note: The boolean data type in JSON only supports true and false. Currently, CDISC standards like ADaM, SDTM, and SEND do not support the boolean data type. These standards instead use flags like "Y" and "N".

5 targetDataType string ["integer", "decimal"] Optional Indicates the data type into which the receiving system must transform the associated Dataset-JSON variable. The variable with the data type attribute of dataType must be converted into the targetDataType when transforming the Dataset-JSON dataset into a format for operational use (e.g., SAS dataset, R dataframe, loading into a system's data store). Only specify targetDataType when it is different from the dataType attribute or the JSON data type and the data needs to be transformed by the receiving system. See the Supported Column Data Type Combinations table for details on usage. See the User's Guide for additional information.
6 length integer "minimum":1 Optional Specifies the number of characters allowed for the variable value when it is represented as a text. See the definition and notes of the ItemDef Length attribute in the ODM specification for more details. Just like in the Define-XML specification, the variable lengths are planned lengths.
7 displayFormat string Optional A SAS display format value used for data visualization of numeric float and date values.
8 keySequence integer "minimum":1 Optional Indicates that this item is a key variable in the dataset structure. It also provides an ordering for the keys.

The following example shows the columns metadata structure.

"columns": [ { "itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1, }, { "itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2 }, ... ]

Supported Column Data Type Combinations

The JSON data type mentioned in the "Supported Column Data Type Combinations" table is the data type of the actual data in the "rows" arrays in the Dataset-JSON file. This JSON data type is defined by the Dataset-JSON Schema, but only includes "string", "integer", "boolean", "number", and null. It does not include "array" or "object".

dataType (logical) | JSON data type | targetDataType | Comment
string | string | |
integer | integer | |
decimal | string | decimal | decimal is exchanged as a string and uses "." as the decimal separator
float | number | |
double | number | |
boolean | boolean | |
datetime | string | | ISO 8601 datetime as a string
date | string | | ISO 8601 date as a string
time | string | | ISO 8601 time as a string
datetime | string | integer | ISO 8601 datetime as an integer (use case: ADaM)
date | string | integer | ISO 8601 date as an integer (use case: ADaM)
time | string | integer | ISO 8601 time as an integer (use case: ADaM)
URI | string | |
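The table above can be turned into a lookup that tells a reader which JSON data type to expect for a column. The sketch below is one strict reading of the table, in which any combination not listed is rejected; the names SUPPORTED, expected_json_type, and value_matches are hypothetical. A null value is accepted for every column because missing values are represented by null.

```python
# (dataType, targetDataType) -> JSON data type, transcribed from the table above.
SUPPORTED = {
    ("string", None): "string",
    ("integer", None): "integer",
    ("decimal", "decimal"): "string",
    ("float", None): "number",
    ("double", None): "number",
    ("boolean", None): "boolean",
    ("datetime", None): "string",
    ("date", None): "string",
    ("time", None): "string",
    ("datetime", "integer"): "string",
    ("date", "integer"): "string",
    ("time", "integer"): "string",
    ("URI", None): "string",
}


def expected_json_type(column: dict) -> str:
    """Return the JSON data type expected in rows for this column."""
    key = (column["dataType"], column.get("targetDataType"))
    if key not in SUPPORTED:
        raise ValueError(f"unsupported dataType/targetDataType combination: {key}")
    return SUPPORTED[key]


def value_matches(column: dict, value) -> bool:
    """Check a decoded JSON value against the column's expected JSON type."""
    if value is None:  # JSON null marks a missing value
        return True
    kind = expected_json_type(column)
    if kind == "string":
        return isinstance(value, str)
    if kind == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return False
    if kind == "integer":
        return isinstance(value, int)
    return isinstance(value, (int, float))  # "number"


age = {"dataType": "integer"}
trtsdt = {"dataType": "date", "targetDataType": "integer"}
print(expected_json_type(age), value_matches(age, 56))
print(expected_json_type(trtsdt), value_matches(trtsdt, "2014-01-02"))
```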

Date/Time Variables

Timing variables (datetime, date, time) are stored as ISO 8601 strings in the JSON format. The targetDataType attribute needs to be specified when it is different from the dataType attribute or the JSON data type.

For example, consider the following AE dataset. The AESTDTC and AEENDTC variables have "date" as their logical data type (dataType attribute), and since the values are already ISO 8601 strings in the data, no conversion is needed to a string in JSON format. The targetDataType for the date and datetime variables does not need to be mentioned for SDTM datasets as the logical type is the same as the JSON data type.

"columns": [ {"itemOID": "IT.AE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12}, {"itemOID": "IT.AE.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.AE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 1}, ... {"itemOID": "IT.AE.AESTDTC", "name": "AESTDTC", "label": "Start Date/Time of Adverse Event", "dataType": "date", "keySequence": 4}, {"itemOID": "IT.AE.AEENDTC", "name": "AEENDTC", "label": "End Date/Time of Adverse Event", "dataType": "date"}, ... ] "rows": [ ["CDISCPILOT01", "AE", "CDISC001", ..., "2012-12-02", "2013-05-20", ...] ]

For ADaM datasets, the targetDataType must be set to integer. Consider the following example of ADAE data. The targetDataType attribute is specified because the logical data type is different from the JSON data type. Also note that the displayFormat attribute has been specified as "E8601DA." so that SAS can display these numeric values in a user-friendly way.

"columns": [ {"itemOID": "IT.ADAE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12}, {"itemOID": "IT.ADAE.SITEID", "name": "SITEID", "label": "Study Site Identifier", "dataType": "string", "length": 3}, {"itemOID": "IT.ADAE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 11, "keySequence": 1}, ... {"itemOID": "IT.ADAE.TRTSDT", "name": "TRTSDT", "label": "Date of First Exposure to Treatment", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA."}, {"itemOID": "IT.ADAE.TRTEDT", "name": "TRTEDT", "label": "Date of Last Exposure to Treatment", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA."}, {"itemOID": "IT.ADAE.ASTDT", "name": "ASTDT", "label": "Analysis Start Date", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA.", "keySequence": 3}, ... ] "rows": [ ["CDISCPILOT01", 701", "CDISC001", ..., "2014-01-02", "2014-07-02", "2014-01-03", ...] ... ]

Decimal Variables

Some decimal variables can be exchanged as a string in JSON to represent terminating decimal fractions without rounding. In that case, the targetDataType is "decimal" and the JSON data type is "string".

For example:

"columns": [ ... {"itemOID": "IT.ADSL.BMIBL", "name": "BMIBL", "label": "Baseline BMI (kg/m^2)", "dataType": "decimal", "targetDataType": "decimal", "length": 16}, {"itemOID": "IT.ADSL.HEIGHTBL", "name": "HEIGHTBL", "label": "Baseline Height (cm)", "dataType": "decimal", "targetDataType": "decimal", "length": 5}, ... ] "rows": [ [ ..., "30.8983333232059", "162.9", ... ], [ ..., "28.977529926378", "171.1", ... ], ... ]

Row Data

rows is an array of records with variable values. Each record itself is also represented as an array of variable values.

"columns": [ {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ... {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ... ], "rows": [ ["MyStudy", "DM", "CDISC001", ..., 56, "YEARS", ...], ["MyStudy", "DM", "CDISC002", ..., 26, "YEARS", ...], ... ]

Missing values are represented by null. Empty strings are represented by "".

Example:

"columns": [ {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ... {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ... ], "rows": [ ["MyStudy", "DM", "CDISC001", ..., null, null, ...], ["MyStudy", "DM", "CDISC002", ..., null, "", ...], ... ]

A Full Example of a Dataset-JSON File

{
  "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-05-31T00:00:00",
  "originator": "Sponsor XYZ",
  "sourceSystem": { "name": "Software ABC", "version": "1.0.0" },
  "studyOID": "cdisc.com.CDISCPILOT01",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
  "itemGroupOID": "IG.DM",
  "records": 18,
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
    {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
    {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
    ...
    {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"},
    {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5},
    ...
  ],
  "rows": [
    ["CDISCPILOT01", "DM", "CDISC001", ..., 84, "YEARS", ...],
    ["CDISCPILOT01", "DM", "CDISC002", ..., 76, "YEARS", ...],
    ["CDISCPILOT01", "DM", "CDISC003", ..., 61, "YEARS", ...],
    ...
  ]
}
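A receiver may want to confirm that a complete file is internally consistent. The sketch below assumes that, when rows is present, records is meant to equal the number of row arrays and that every row holds one value per entry in columns; the function name check_rows and the small dataset are hypothetical. A file without rows is treated as a metadata-only exchange and is not flagged.

```python
def check_rows(ds: dict) -> list:
    """Return a list of inconsistencies between records, columns, and rows."""
    problems = []
    if "rows" not in ds:
        return problems  # metadata-only exchange, nothing to compare
    rows, width = ds["rows"], len(ds["columns"])
    if len(rows) != ds["records"]:
        problems.append(f"records is {ds['records']} but rows holds {len(rows)} arrays")
    for index, row in enumerate(rows):
        if len(row) != width:
            problems.append(f"row {index} has {len(row)} values, expected {width}")
    return problems


# Hypothetical two-column dataset; other top-level attributes are omitted for brevity.
dataset = {
    "records": 2,
    "columns": [
        {"itemOID": "IT.DM.USUBJID", "name": "USUBJID",
         "label": "Unique Subject Identifier", "dataType": "string", "length": 8},
        {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"},
    ],
    "rows": [["CDISC001", 84], ["CDISC002", 76]],
}
print(check_rows(dataset))
```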

NDJSON Representation of Dataset-JSON

The purpose of the NDJSON, or new-line delimited JSON, representation of Dataset-JSON is to simplify streaming large datasets. With NDJSON, a dataset can easily be read or written one row at a time without loading the entire dataset into memory. The NDJSON and JSON dataset content are the same. In a data-exchange scenario, the sender and receiver determine whether to use the JSON or the NDJSON representation of Dataset-JSON.

NDJSON is a standard for delimiting JSON in stream protocols. In NDJSON, each line is valid JSON. JSON is delimited by the new-line character (\n or 0x0A), which may be preceded by a carriage return character (\r or 0x0D). UTF-8 encoding is expected. NDJSON uses the file extension .ndjson.

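The Dataset-JSON NDJSON format is created from the Dataset-JSON standard by writing the metadata (the top-level attributes, including columns) as a single JSON object on the first line, followed by each record from rows as a JSON array on its own line, as shown in the example that follows.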

A Full Example of an NDJSON Dataset-JSON File

{"datasetJSONCreationDateTime": "2023-06-28T15:38:43", "datasetJSONVersion": "1.1.0", "fileOID": "www.sponsor.xyz.org.project123.final", "dbLastModifiedDateTime": "2023-05-31T00:00:00", "originator": "Sponsor XYZ", "sourceSystem": {"name": "Software ABC", "version": "1.0.0"}, "studyOID": "cdisc.com.CDISCPILOT01", "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7", "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml", "itemGroupOID": "IG.DM", "records": 18, "name": "DM", "label": "Demographics", "columns": [{"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ..., {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ...]}
["CDISCPILOT01", "DM", "CDISC001", ..., 84, "YEARS", ...]
["CDISCPILOT01", "DM", "CDISC002", ..., 76, "YEARS", ...]
["CDISCPILOT01", "DM", "CDISC003", ..., 61, "YEARS", ...]
...

Each row in an NDJSON file can be parsed and processed as stand-alone JSON.
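To illustrate row-at-a-time processing, the sketch below writes a dataset in the layout of the example above (the metadata object on the first line, then one row array per line) and reads it back lazily. The function names write_ndjson and read_ndjson and the small dataset are hypothetical; an in-memory buffer stands in for a .ndjson file opened with UTF-8 encoding.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def write_ndjson(dataset: dict, out: TextIO) -> None:
    """Write the metadata on line 1, then each row on its own line."""
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    out.write(json.dumps(metadata) + "\n")
    for row in dataset.get("rows", []):
        out.write(json.dumps(row) + "\n")


def read_ndjson(stream: TextIO) -> Tuple[dict, Iterator[list]]:
    """Return the metadata and a lazy iterator over the rows."""
    metadata = json.loads(stream.readline())

    def rows() -> Iterator[list]:
        for line in stream:
            if line.strip():  # tolerate a trailing blank line
                yield json.loads(line)

    return metadata, rows()


dataset = {
    "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
    "datasetJSONVersion": "1.1.0",
    "itemGroupOID": "IG.DM",
    "records": 2,
    "name": "DM",
    "label": "Demographics",
    "columns": [
        {"itemOID": "IT.DM.USUBJID", "name": "USUBJID",
         "label": "Unique Subject Identifier", "dataType": "string", "length": 8},
        {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"},
    ],
    "rows": [["CDISC001", 84], ["CDISC002", 76]],
}

buffer = io.StringIO()
write_ndjson(dataset, buffer)
buffer.seek(0)

metadata, rows = read_ndjson(buffer)
names = [column["name"] for column in metadata["columns"]]
for row in rows:  # only one row is held in memory at a time
    print(dict(zip(names, row)))
```

Because json.dumps escapes line breaks inside strings, each metadata object or row stays on a single line.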

Glossary and Abbreviations

Term Stands for, plus Reference to CDISC Standard or source of information
ADaM Analysis Dataset Model. CDISC Foundational standard for modeling data: https://www.cdisc.org/standards/foundational/adam
API Application Programming Interface
Define-XML CDISC Data Exchange standard for sharing metadata: https://www.cdisc.org/standards/data-exchange/define-xml
GitHub GitHub is a web-based platform that allows developers to store, share, and collaborate on code
JSON JavaScript Object Notation
ICH International Council for Harmonisation: https://www.ich.org, https://www.ich.org/page/study-tagging-file-specification-and-related-files
LinkML Linked Data Modeling Language: https://linkml.io/linkml/
NDJSON Newline delimited JSON
ODM Operational Data Model: https://www.cdisc.org/standards/data-exchange/odm
SDTM Study Data Tabulation Model. CDISC Foundational standard for modeling data: https://www.cdisc.org/standards/foundational/sdtm
URI Uniform Resource Identifier
URL Uniform Resource Locator