Title | CDISC Dataset-JSON Specification |
---|---|
Version | 1.1 |
Prepared by | CDISC Data Exchange Standards Team |
Notes to Readers | This is the specification for Version 1.1 of CDISC Dataset-JSON. |
Revision History | |
Dataset-JSON is a data exchange standard for sharing tabular data using JSON. It is designed to meet a wide range of data exchange scenarios, including regulatory submissions and API-based data exchange. Each Dataset-JSON dataset can optionally reference a Define-XML document containing more complete metadata for the dataset. One aim of Dataset-JSON is to address as many of the relevant requirements in the PHUSE 2017 Transport for the Next Generation paper as possible, including the efficient use of storage space.
Dataset-JSON uses lowerCamelCase notation for attribute names.
The JSON standard does not allow specifying or controlling the order of attributes. However, since most JSON encoders and decoders allow control over attribute order, it is strongly recommended to follow the attribute order documented in this specification. Due to the potentially large size of Dataset-JSON datasets, adhering to the specified attribute order allows software using streaming approaches to read the file more efficiently and quickly.
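As a minimal sketch of how an encoder can keep the documented attribute order, the Python example below relies on two facts: `dict` preserves insertion order (Python 3.7+), and `json.dumps` writes keys in that order unless `sort_keys=True` is passed. The names `ATTRIBUTE_ORDER` and `ordered_dataset` are illustrative and not part of the specification.

```python
import json

# Top-level attribute order as documented in this specification.
ATTRIBUTE_ORDER = [
    "datasetJSONCreationDateTime", "datasetJSONVersion", "fileOID",
    "dbLastModifiedDateTime", "originator", "sourceSystem", "studyOID",
    "metaDataVersionOID", "metaDataRef", "itemGroupOID", "records",
    "name", "label", "columns", "rows",
]


def ordered_dataset(dataset: dict) -> dict:
    """Return a copy with known attributes in the documented order.

    Keys that are not in ATTRIBUTE_ORDER are appended at the end
    (a choice made for this sketch only).
    """
    known = {key: dataset[key] for key in ATTRIBUTE_ORDER if key in dataset}
    extra = {key: value for key, value in dataset.items() if key not in known}
    return {**known, **extra}


# A dataset whose keys were assembled in an arbitrary order.
shuffled = {
    "rows": [], "name": "DM", "columns": [], "records": 0,
    "label": "Demographics", "itemGroupOID": "IG.DM",
    "datasetJSONVersion": "1.1.0",
    "datasetJSONCreationDateTime": "2023-03-22T11:53:27",
}

# json.dumps keeps the dict order, so the metadata precedes "rows".
text = json.dumps(ordered_dataset(shuffled))
```

Writing `rows` last means a streaming reader has seen all metadata, including `columns`, before the first data record arrives.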
A Dataset-JSON file must contain only one dataset. Dataset-JSON uses the file extension .json.
Although adapted from the Dataset-XML Version 1.0 specification, Dataset-JSON uses the JSON format and includes many enhancements.
The LinkML model representation, the JSON schemas for both the JSON and NDJSON representations, and examples can be found at the GitHub repository for the Dataset-JSON Version 1.1 standard.
The specification and user's guide are for both clinical and non-clinical research data.
The following table summarizes the technical and dataset attributes at the top level of the Dataset-JSON object.
Attribute order | Attribute | JSON Schema data type | Enumeration | Allowed string pattern | Usage | Description |
---|---|---|---|---|---|---|
1 | datasetJSONCreationDateTime | string | YYYY-MM-DDThh:mm:ss(.n+)?(((+|-)hh:mm)|Z)? | Required | The date/time the Dataset-JSON file was created. | |
2 | datasetJSONVersion | string | 1.1(.(0|([1-9][0-9]*)))? | Required | The version of the Dataset-JSON standard used to create the dataset. | |
3 | fileOID | string | "minLength": 1 | Optional | A unique identifier for this dataset. See the ODM specification for OID considerations. | |
4 | dbLastModifiedDateTime | string | YYYY-MM-DDThh:mm:ss(.n+)?(((+|-)hh:mm)|Z)? | Optional | The date/time the source database was last modified before creating the Dataset-JSON file. dbLastModifiedDateTime must be on or before the date/time the Dataset-JSON file was created (datasetJSONCreationDateTime). | |
5 | originator | string | Optional | The organization that generated the Dataset-JSON dataset. | ||
6 | sourceSystem | object | Optional | The information system from which the content of this dataset was sourced. | ||
7 | sourceSystem.name | string | Required | The name of the sourceSystem above. The sourceSystem itself is an optional attribute. However, when the sourceSystem is present, the name is a required attribute of the sourceSystem. | ||
8 | sourceSystem.version | string | Required | The version of the sourceSystem above. The sourceSystem itself is an optional attribute. However, when the sourceSystem is present, the version is a required attribute of the sourceSystem. | ||
9 | studyOID | string | "minLength": 1 | Optional | Unique identifier for the study that may also function as a foreign key to a Study/@OID in an associated Define-XML document, or to any studyOID references that are used as keys in other documents; i.e., documents created based on the ICH eCTD STF Specification. See the ODM specification for OID considerations. | |
10 | metaDataVersionOID | string | "minLength": 1 | Optional | Unique identifier for the metadata version that may also function as a foreign key to a MetaDataVersion/@OID in an associated Define-XML file. See the ODM specification for OID considerations. | |
11 | metaDataRef | string | Optional | URI for the metadata file describing the dataset (e.g., a Define-XML file). | ||
12 | itemGroupOID | string | "minLength": 1 | Required | Unique identifier for the dataset that may also function as a foreign key to an ItemGroupDef/@OID in an associated Define-XML file. See the ODM specification for OID considerations. | |
13 | records | integer | "minimum": 0 | Required | The total number of records in a dataset. Since "rows" is an optional object, records=0 allows for the transfer of metadata without sending data. | |
14 | name | string | "minLength": 1 | Required | The human-readable name for the dataset. | |
15 | label | string | Required | A short description of the dataset. | ||
16 | columns | array | Required | An array of metadata objects that describe the dataset variables. | ||
17 | rows | array | Optional | An array of data record arrays that represent the dataset rows. |
The following example illustrates the basic Dataset-JSON structure.
{ "datasetJSONCreationDateTime": "2023-03-22T11:53:27", "datasetJSONVersion": "1.1.0", "fileOID": "www.sponsor.xyz.org.project123.final", "dbLastModifiedDateTime": "2023-02-15T10:23:15", "originator": "Sponsor XYZ", "sourceSystem": { "name": "Software ABC", "version": "1.0.0" }, "studyOID": "xxx", "metaDataVersionOID": "xxx.y", "metaDataRef": "https://metadata.location.org/api.link", "itemGroupOID": "IG.DM", "records": 100, "name": "DM", "label": "Demographics", "columns": [ ... ], "rows": [ ... ] } |
columns is an array of basic information about dataset variables. The order of the elements in the array must be the same as the order of variables in the described dataset.
Attribute order | Attribute | JSON Schema data type | Enumeration | Allowed string pattern | Usage | Description |
---|---|---|---|---|---|---|
1 | itemOID | string | "minLength": 1 | Required | Unique identifier for the variable that may also function as a foreign key to an ItemDef/@OID in an associated Define-XML file. See the ODM specification for OID considerations. | |
2 | name | string | "minLength": 1 | Required | Variable name | |
3 | label | string | Required | Variable description | ||
4 | dataType | string | ["string", "integer", "decimal", "float", "double", "boolean", "datetime", "date", "time", "URI"] | Required | Logical data type of the variable. The dataType attribute represents the planned specificity of the data. See the ODM Data Formats specification for details. Note: Decimal numbers represented as a string in JSON must use the dot (.) as the decimal separator. When a thousand separator is used in a decimal represented as string, the comma is used. Note: The boolean data type in JSON only supports true and false. Currently, CDISC standards like ADaM, SDTM, and SEND do not support the boolean data type. These standards instead use flags like "Y" and "N". | |
5 | targetDataType | string | ["integer", "decimal"] | Optional | Indicates the data type into which the receiving system must transform the associated Dataset-JSON variable. The variable with the data type attribute of dataType must be converted into the targetDataType when transforming the Dataset-JSON dataset into a format for operational use (e.g., SAS dataset, R dataframe, loading into a system's data store). Only specify targetDataType when it is different from the dataType attribute or the JSON data type and the data needs to be transformed by the receiving system. See the Supported Column Data Type Combinations table for details on usage. See the User's Guide for additional information. | |
6 | length | integer | "minimum":1 | Optional | Specifies the number of characters allowed for the variable value when it is represented as a text. See the definition and notes of the ItemDef Length attribute in the ODM specification for more details. Just like in the Define-XML specification, the variable lengths are planned lengths. | |
7 | displayFormat | string | Optional | A SAS display format value used for data visualization of numeric float and date values. | ||
8 | keySequence | integer | "minimum":1 | Optional | Indicates that this item is a key variable in the dataset structure. It also provides an ordering for the keys. |
The following example shows the columns metadata structure.
"columns": [ { "itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1, }, { "itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2 }, ... ] |
The JSON data type mentioned in the "Supported Column Data Type Combinations" table is the data type of the actual data in the "rows" arrays in the Dataset-JSON file. This JSON data type is defined by the Dataset-JSON schema and is limited to "string", "integer", "boolean", "number", and null. It does not include "array" or "object".
dataType (logical) | JSON data type | targetDataType | Comment |
---|---|---|---|
string | string | | |
integer | integer | | |
decimal | string | decimal | decimal is exchanged as a string and uses "." as the decimal separator |
float | number | | |
double | number | | |
boolean | boolean | | |
datetime | string | | ISO 8601 datetime as a string |
date | string | | ISO 8601 date as a string |
time | string | | ISO 8601 time as a string |
datetime | string | integer | ISO 8601 datetime as an integer (use case: ADaM) |
date | string | integer | ISO 8601 date as an integer (use case: ADaM) |
time | string | integer | ISO 8601 time as an integer (use case: ADaM) |
URI | string | | |
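The combinations in the table can be turned into a simple per-value check. The sketch below maps each logical dataType to the Python type that `json.loads` produces for the corresponding JSON data type; `JSON_TYPE` and `value_matches` are illustrative names. Accepting Python `int` for "number" is a choice made here because a JSON number without a fraction is decoded as `int`.

```python
# Python type expected for a non-null value of each logical dataType.
JSON_TYPE = {
    "string": str,
    "integer": int,
    "decimal": str,           # exchanged as a string
    "float": (int, float),    # JSON number
    "double": (int, float),   # JSON number
    "boolean": bool,
    "datetime": str,          # ISO 8601 string
    "date": str,              # ISO 8601 string
    "time": str,              # ISO 8601 string
    "URI": str,
}


def value_matches(data_type: str, value) -> bool:
    """Check one decoded row value against the column's logical dataType."""
    if value is None:
        return True  # null marks a missing value for any dataType
    expected = JSON_TYPE[data_type]
    if expected is not bool and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


checks = [
    value_matches("integer", 56),
    value_matches("decimal", "30.9"),
    value_matches("date", "2012-12-02"),
    value_matches("decimal", 30.9),
]
```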
Timing variables (datetime, date, time) are stored as ISO 8601 strings in the JSON format. The targetDataType attribute needs to be specified when it differs from the dataType attribute or the JSON data type.
For example, consider the following AE dataset. The AESTDTC and AEENDTC variables have "date" as their logical data type (dataType attribute). Because the values are already ISO 8601 strings in the data, no conversion is needed to represent them as strings in JSON. The targetDataType does not need to be specified for the date and datetime variables in SDTM datasets, as the logical type is the same as the JSON data type.
"columns": [ {"itemOID": "IT.AE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12}, {"itemOID": "IT.AE.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.AE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 1}, ... {"itemOID": "IT.AE.AESTDTC", "name": "AESTDTC", "label": "Start Date/Time of Adverse Event", "dataType": "date", "keySequence": 4}, {"itemOID": "IT.AE.AEENDTC", "name": "AEENDTC", "label": "End Date/Time of Adverse Event", "dataType": "date"}, ... ] "rows": [ ["CDISCPILOT01", "AE", "CDISC001", ..., "2012-12-02", "2013-05-20", ...] ] |
For numeric timing variables in ADaM datasets, the targetDataType must be set to integer. Consider the following example of ADAE data. The targetDataType attribute is specified because the logical data type is different from the JSON data type. Also note that the displayFormat attribute has been specified as "E8601DA." so that SAS can display these numeric values in a user-friendly way.
"columns": [ {"itemOID": "IT.ADAE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12}, {"itemOID": "IT.ADAE.SITEID", "name": "SITEID", "label": "Study Site Identifier", "dataType": "string", "length": 3}, {"itemOID": "IT.ADAE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 11, "keySequence": 1}, ... {"itemOID": "IT.ADAE.TRTSDT", "name": "TRTSDT", "label": "Date of First Exposure to Treatment", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA."}, {"itemOID": "IT.ADAE.TRTEDT", "name": "TRTEDT", "label": "Date of Last Exposure to Treatment", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA."}, {"itemOID": "IT.ADAE.ASTDT", "name": "ASTDT", "label": "Analysis Start Date", "dataType": "date", "targetDataType": "integer", "displayFormat": "E8601DA.", "keySequence": 3}, ... ] "rows": [ ["CDISCPILOT01", 701", "CDISC001", ..., "2014-01-02", "2014-07-02", "2014-01-03", ...] ... ] |
Some decimal variables can be exchanged as a string in JSON to represent terminating decimal fractions without rounding. In that case, the targetDataType is "decimal" and the JSON data type is "string".
For example:
"columns": [ ... {"itemOID": "IT.ADSL.BMIBL", "name": "BMIBL", "label": "Baseline BMI (kg/m^2)", "dataType": "decimal", "targetDataType": "decimal", "length": 16}, {"itemOID": "IT.ADSL.HEIGHTBL", "name": "HEIGHTBL", "label": "Baseline Height (cm)", "dataType": "decimal", "targetDataType": "decimal", "length": 5}, ... ] "rows": [ [ ..., "30.8983333232059", "162.9", ... ], [ ..., "28.977529926378", "171.1", ... ], ... ] |
rows is an array of records containing variable values. Each record is itself represented as an array of variable values.
"columns": [ {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ... {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ... ], "rows": [ ["MyStudy", "DM", "CDISC001", ..., 56, "YEARS", ...], ["MyStudy", "DM", "CDISC002", ..., 26, "YEARS", ...], ... ] |
Missing values are represented by null. Empty strings are represented by "".
Example:
"columns": [ {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ... {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ... ], "rows": [ ["MyStudy", "DM", "CDISC001", ..., null, null, ...], ["MyStudy", "DM", "CDISC002", ..., null, "", ...], ... ] |
{ "datasetJSONCreationDateTime": "2023-06-28T15:38:43", "datasetJSONVersion": "1.1.0", "fileOID": "www.sponsor.xyz.org.project123.final", "dbLastModifiedDateTime": "2023-05-31T00:00:00", "originator": "Sponsor XYZ", "sourceSystem": { "name": "Software ABC", "version": "1.0.0" }, "studyOID": "cdisc.com.CDISCPILOT01", "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7", "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml", "itemGroupOID": "IG.DM", "records": 18, "name": "DM", "label": "Demographics", "columns": [ {"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ... {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ... ], "rows": [ ["CDISCPILOT01", "DM", "CDISC001", ..., 84, "YEARS", ...], ["CDISCPILOT01", "DM", "CDISC002", ..., 76, "YEARS", ...], ["CDISCPILOT01", "DM", "CDISC003", ..., 61, "YEARS", ...], ... ] } |
The purpose of the NDJSON, or new-line delimited JSON, representation of Dataset-JSON is to simplify streaming large datasets. With NDJSON, a dataset can easily be read or written one row at a time without loading the entire dataset into memory. The NDJSON and JSON dataset content are the same. In a data-exchange scenario, the sender and receiver determine whether to use the JSON or the NDJSON representation of Dataset-JSON.
NDJSON is a standard for delimiting JSON in stream protocols. In NDJSON, each line is valid JSON. JSON is delimited by the new-line character (\n or 0x0A), which may be preceded by a carriage return character (\r or 0x0D). UTF-8 encoding is expected. NDJSON uses the file extension .ndjson.
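The Dataset-JSON NDJSON format is created from the Dataset-JSON standard by placing the dataset metadata (all top-level attributes, including columns but not rows) as a single JSON object on the first line, followed by one line per data record, each represented as a JSON array, as in the following example: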
{"datasetJSONCreationDateTime": "2023-06-28T15:38:43", "datasetJSONVersion": "1.1.0", "fileOID": "www.sponsor.xyz.org.project123.final", "dbLastModifiedDateTime": "2023-05-31T00:00:00", "originator": "Sponsor XYZ", "sourceSystem": {"name": "Software ABC", "version": "1.0.0"}, "studyOID": "cdisc.com.CDISCPILOT01", "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7", "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml", "itemGroupOID": "IG.DM", "records": 18, "name": "DM", "label": "Demographics", "columns": [{"itemOID": "IT.DM.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1}, {"itemOID": "IT.DM.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2}, {"itemOID": "IT.DM.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2}, ..., {"itemOID": "IT.DM.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}, {"itemOID": "IT.DM.AGEU", "name": "AGEU", "label": "Age Units", "dataType": "string", "length": 5}, ...]} ["CDISCPILOT01", "DM", "CDISC001", ..., 84, "YEARS", ...] ["CDISCPILOT01", "DM", "CDISC002", ..., 76, "YEARS", ...] ["CDISCPILOT01", "DM", "CDISC003", ..., 61, "YEARS", ...] ... |
Each row in an NDJSON file can be parsed and processed as stand-alone JSON.
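To illustrate the row-at-a-time processing described above, the sketch below converts a Dataset-JSON object to the NDJSON layout (metadata object on the first line, one row array per following line) and reads it back lazily. The function names `to_ndjson` and `read_ndjson` are hypothetical, and the small dataset stands in for a real file; an in-memory `io.StringIO` is used so the example runs without touching the file system.

```python
import io
import json


def to_ndjson(dataset: dict) -> str:
    """Serialize a Dataset-JSON object in the NDJSON layout."""
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    # json.dumps without indent never emits raw newlines, so each
    # document stays on a single line.
    lines = [json.dumps(metadata)]
    lines += [json.dumps(row) for row in dataset.get("rows", [])]
    return "\n".join(lines) + "\n"


def read_ndjson(stream):
    """Return (metadata, row iterator) from a text stream.

    Rows are decoded one line at a time, so the whole dataset never
    has to be held in memory.
    """
    metadata = json.loads(next(stream))

    def rows():
        for line in stream:
            if line.strip():  # tolerate a blank trailing line
                yield json.loads(line)

    return metadata, rows()


dataset = {
    "datasetJSONVersion": "1.1.0", "itemGroupOID": "IG.DM", "records": 2,
    "name": "DM", "label": "Demographics",
    "columns": [{"itemOID": "IT.DM.USUBJID", "name": "USUBJID",
                 "label": "Unique Subject Identifier", "dataType": "string"},
                {"itemOID": "IT.DM.AGE", "name": "AGE",
                 "label": "Age", "dataType": "integer"}],
    "rows": [["CDISC001", 84], ["CDISC002", 76]],
}

text = to_ndjson(dataset)
metadata, row_iter = read_ndjson(io.StringIO(text))
ages = [row[1] for row in row_iter]
```

With a real file, the same reader could be applied to an open file object, for example one opened with UTF-8 encoding on a `.ndjson` file.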
Term | Stands for, plus Reference to CDISC Standard or source of information |
---|---|
ADaM | Analysis Dataset Model. CDISC Foundational standard for modeling data: https://www.cdisc.org/standards/foundational/adam |
API | Application Programming Interface |
Define-XML | CDISC Data Exchange standard for sharing metadata: https://www.cdisc.org/standards/data-exchange/define-xml |
GitHub | GitHub is a web-based platform that allows developers to store, share, and collaborate on code |
JSON | JavaScript Object Notation |
ICH | International Council for Harmonisation: https://www.ich.org, https://www.ich.org/page/study-tagging-file-specification-and-related-files |
LinkML | Linked Data Modeling Language: https://linkml.io/linkml/ |
NDJSON | Newline delimited JSON |
ODM | Operational Data Model: https://www.cdisc.org/standards/data-exchange/odm |
SDTM | Study Data Tabulation Model. CDISC Foundational standard for modeling data: https://www.cdisc.org/standards/foundational/sdtm |
URI | Uniform Resource Identifier |
URL | Uniform Resource Locator |