Sub-challenge 1: Human blood gene signature as exposure response marker


Humans are constantly exposed to individual chemicals or mixtures of chemicals (e.g., cigarette smoke, pollutants, pesticides, drugs) that may trigger molecular changes in their cells. Identifying specific response markers is important for assessing the exposure status of an individual. Blood is an easily accessible matrix; however, it is a complex biofluid to analyze.


The aim of this sub-challenge is to verify that robust and sparse (maximum of 40 genes) human-specific gene signatures can be extracted from whole blood gene expression data to predict smoking exposure status (smoker vs. non-current smoker) or cessation status (former smoker vs. never smoker) in humans.


Human blood gene expression datasets from two independent clinical studies are provided for training and testing. The test dataset includes additional samples (verification data, see below) that are used only for verification purposes and will not be considered for scoring. The human blood samples were obtained from our clinical studies or a banked repository:

  • Human train dataset (dset1): The Queen Ann Street Medical Center (QASMC) clinical case–control study was conducted at The Heart and Lung Centre (London, UK), according to Good Clinical Practices. (Study description available here).
  • Human test dataset (dset2): Blood samples were obtained from a banked repository (BioServe Biotechnologies Ltd., Beltsville, MD, USA) based on well-defined inclusion criteria, and are referenced as BLD-SMK-01.
  • Human verification dataset (dset3a and b): A series of reduced exposure studies comparing our RRP, THS 2.2 (a heat-not-burn technology also called Platform 1), with conventional cigarettes and cessation. Two are five-day confinement studies conducted in Europe (study description: ZRHR-REXC-03-EU) and Japan (study description: ZRHR-REXC-04-JP).

In addition to the Informed Consent Form (ICF) for participation in these studies, subjects were provided with information and asked for their consent to the collection of blood samples for bio-banking and transcriptomics profiling. The blood samples for transcriptomics and the data related to these samples were anonymized. Data and samples were initially single- or double-coded, and the link between the subjects’ identifiers and the unique code(s) was subsequently deleted.

The schema below provides a description of the composition of the datasets. The datasets provided for training are described in more detail in the “Data provided” sections and the Technical Document.

Scientific questions

  • Are gene expression changes in blood sufficiently informative to extract gene signatures predictive of smoking exposure or cessation status in humans?
  • How do human clinical samples from the verification set classify?

Classification models

Participants are requested to develop inductive rather than transductive signature models to predict the sample class (for details see the “Background: Microarray-based phenotype prediction” section). Human-specific signature model(s) will be developed to predict smoking exposure status (smoker vs. non-current smoker) and cessation status (former smoker vs. never smoker) in humans. The gene signature must be sparse, with a maximum of 40 genes.
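The sparsity and inductive requirements above can be illustrated with a minimal sketch. This is not the method prescribed by the challenge: the signal-to-noise ranking, the nearest-centroid classifier, the function names, and the toy gene names and values are all assumptions chosen for illustration. The point it shows is that the signature (at most 40 genes) and the model parameters are fixed from training data only, so each new sample is classified independently of the other test samples.

```python
from statistics import mean, stdev

MAX_GENES = 40  # sparsity limit stated in the challenge description


def select_signature(expr, labels, max_genes=MAX_GENES):
    """Rank genes by a simple signal-to-noise score and keep the top ones.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels aligned with the samples
    (each class needs at least two samples for stdev to be defined)
    """
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        spread = stdev(pos) + stdev(neg)
        scores[gene] = abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_genes]


def fit_centroids(expr, labels, signature):
    """Per-class mean expression over the signature genes (training data only)."""
    return {
        cls: {g: mean(v for v, y in zip(expr[g], labels) if y == cls)
              for g in signature}
        for cls in (0, 1)
    }


def predict_one(sample, centroids):
    """Assign one sample (dict gene -> value) to the nearest class centroid.

    Only parameters frozen at training time are used, so the prediction
    does not depend on any other unlabeled sample (inductive setting).
    """
    def squared_distance(cls):
        return sum((sample[g] - c) ** 2 for g, c in centroids[cls].items())
    return min(centroids, key=squared_distance)


# Hypothetical toy data: two genes, six training samples (0 = class A, 1 = class B).
train_expr = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
train_labels = [0, 0, 0, 1, 1, 1]

signature = select_signature(train_expr, train_labels, max_genes=1)
centroids = fit_centroids(train_expr, train_labels, signature)
new_sample_class = predict_one({"GENE_A": 4.9, "GENE_B": 2.0}, centroids)
```

In a real submission the ranking and classifier would be replaced by the participant's own method; the sketch only fixes the order of operations (select genes, fit on training data, then predict sample by sample).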

Stepwise class predictions

Participants are requested to proceed with the class predictions stepwise as follows:



Step 1: The first trained signature model will be applied to unlabeled sample data (test and verification sets) to classify samples as smoker or non-current smoker (the latter including samples from former smokers and never smokers), with an associated confidence level.

Step 2: The second trained signature model will be applied exclusively to samples predicted as non-current smokers in step 1 to classify those samples as former smoker or never smoker, with an associated confidence level.

Participants are free to use either two separate 2-class prediction models, one for each step, or a single 3-class prediction model.
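The two-step procedure can be sketched as follows, for the case of two separate 2-class models. This is an illustrative sketch only: the function name stepwise_predict, the 0.5 decision threshold, the use of a class probability as the confidence level, and the toy logistic models on hypothetical genes are assumptions, not part of the challenge specification.

```python
import math


def stepwise_predict(samples, exposure_model, cessation_model, threshold=0.5):
    """Two-step class prediction.

    samples:         dict mapping sample id -> feature dict
    exposure_model:  callable returning P(smoker) for one sample
    cessation_model: callable returning P(former smoker) for one sample
    Returns a dict mapping sample id -> (predicted class, confidence).
    """
    results = {}
    for sample_id, features in samples.items():
        # Step 1: smoker vs. non-current smoker.
        p_smoker = exposure_model(features)
        if p_smoker >= threshold:
            results[sample_id] = ("smoker", p_smoker)
            continue
        # Step 2: runs only on samples predicted as non-current smokers.
        p_former = cessation_model(features)
        if p_former >= threshold:
            results[sample_id] = ("former smoker", p_former)
        else:
            results[sample_id] = ("never smoker", 1.0 - p_former)
    return results


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical stand-ins for two trained models, each using one made-up gene.
def exposure_model(sample):
    return logistic(sample["GENE_A"] - 3.0)


def cessation_model(sample):
    return logistic(sample["GENE_B"] - 3.0)


samples = {
    "S1": {"GENE_A": 5.0, "GENE_B": 1.0},
    "S2": {"GENE_A": 1.0, "GENE_B": 5.0},
    "S3": {"GENE_A": 1.0, "GENE_B": 1.0},
}
predictions = stepwise_predict(samples, exposure_model, cessation_model)
```

With a single 3-class model, the same output format could be produced directly by taking the class with the highest predicted probability; the stepwise wrapper is only needed when the two decisions are made by separate models.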
