Diagnostic Signature Challenge - COPD Sub-Challenge

Team BCM CCEM AUPR_avg Rank-sum Rank
Team055 0.429375 0.371 0.405669 119 40
Team056 0.576383 0.59883 0.665135 41 12
Team063 0.585862 0.550414 0.722787 40 10
Team065 0.499604 0.411556 0.551641 104 37
Team071 0.500165 0.49827 0.520699 97 34
Team080 0.579107 0.597216 0.619236 50 17
Team081 0.568906 0.531252 0.731183 50 17
Team091 0.601917 0.571365 0.671969 41 12
Team106 0.5801 0.654617 0.655479 32 7
Team112 0.690719 0.645012 0.921685 11 3
Team114 0.575417 0.514185 0.697888 54 21
Team115 0.561 0.634013 0.645484 49 15
Team120 0.512204 0.480675 0.509676 100 35
Team122 0.682729 0.730237 0.941521 5 1
Team132 0.540635 0.521963 0.6354 75 25
Team149 0.635417 0.65 0.609744 34 9
Team158 0.578526 0.581698 0.636255 50 17
Team161 0.659955 0.683489 0.664695 21 4
Team163 0.566875 0.559625 0.634164 62 22
Team164 0.613333 0.534 0.810666 33 8
Team170 0.536979 0.492375 0.572607 86 29
Team171 0.480849 0.429977 0.478731 112 38
Team187 0.631503 0.583577 0.803358 25 6
Team202 0.604167 0.525 0.649405 53 20
Team203 0.544488 0.526563 0.547881 81 28
Team208 0.622242 0.625851 0.792707 21 4
Team210 0.50399 0.533248 0.496798 92 30
Team212 0.515156 0.473363 0.565757 93 31
Team221 0.701016 0.649335 0.937059 8 2
Team227 0.572917 0.5625 0.561221 65 24
Team235 0.502502 0.526501 0.514328 93 31
Team241 0.604167 0.6125 0.620833 41 12
Team251 0.495089 0.489323 0.522459 101 36
Team261 0.570494 0.603051 0.692334 40 10
Team269 0.418056 0.379576 0.424578 118 39
Team273 0.54923 0.437466 0.654894 77 26
Team276 0.562677 0.577838 0.604819 64 23
Team284 0.550759 0.591397 0.686726 49 15
Team290 0.51576 0.411588 0.694893 77 26
Team291 0.525368 0.497451 0.501597 95 33

The aim of this sub-challenge was to verify that it is possible to extract a robust diagnostic signature for Chronic Obstructive Pulmonary Disease (COPD) from gene expression data.

Participants were asked to develop and submit a classifier that can stratify patients into one of two phenotype groups — COPD or Control. The classifier was built by using publicly available small airways gene expression data with related clinical, demographic and batch information, and was tested on independent samples extracted from large airways (see Figures 1 and 2).

 

Overview of the COPD Sub-Challenge

as communicated to the Participants

 

Synopsis

The aim of this sub-challenge is to verify that it is possible to extract a robust diagnostic signature for Chronic Obstructive Pulmonary Disease (COPD) from gene expression data. Participants are asked to develop and submit a classifier that can stratify patients into one of two phenotype groups — COPD or Control. The classifier will be built by using publicly available small airways gene expression data with related clinical, demographic and batch information, and will be tested on independent samples extracted from large airways (see Figures 1 and 2).  

 

 

Figure 1: COPD is a disease that is manifested in the small airways. The challenge is to produce a COPD signature that is valid in large airways where sample collection is easier to perform.

 

Background

COPD encompasses chronic obstructive bronchiolitis with obstruction of small airways and emphysema with enlargement of airspaces and destruction of lung parenchyma, loss of lung elasticity, and closure of small airways. Although the disease is manifested in the small airways, the challenge is to produce a COPD signature that is valid in large airways, where sample collection is easier to perform (see Figure 1).

COPD causes a progressive airflow limitation that is not fully reversible and is associated with abnormal inflammatory responses to noxious particles or gases [1]. COPD is a major cause of chronic morbidity and mortality throughout the world, and its prevalence varies across countries and population groups. In developed countries, smoking is a contributing factor to the disease. In the European Union, COPD accounts for 56% (€38.6 billion) of the costs of treating respiratory disease, and in the US, the direct costs are $18 billion [2].

COPD treatment is still in the active research and development phase. Pharmacotherapy decreases symptoms and complications and includes the use of long-acting bronchodilators and inhaled glucocorticosteroids. However, none of the existing medications offers a cure for or prevention of the long-term decline in lung function.

The Global Initiative for Chronic Obstructive Lung Disease (GOLD) characterizes COPD patients into GOLD Stage 1-4 depending on the severity of disease (with GOLD Stage 4 being the most severe). Diagnosis is based on spirometry (a test which measures expiratory air flow) with or without a bronchodilator (to differentiate from asthma) and through questionnaires related to respiratory symptoms.  

Historically, a GOLD Stage 0 characterized a higher risk population who did not present the clearer symptoms used to describe stage 1 [4]. Since not all of these patients will eventually develop COPD, we did not include them in this challenge. In addition, subjects that suffer from alpha1-antitrypsin deficiency represent a unique group of COPD patients and were also excluded.

In summary, the COPD phenotype refers to GOLD stages 1-4 while Controls are asymptomatic subjects that have no consistent symptoms [4].

To date, gene expression profiling for COPD has been performed on the small airway, which differs somewhat from the large airway [5]. However, in smokers with and without COPD, the patterns of inflammatory processes are similar in both large and small airways [6].

 

The Sub-challenge

The sub-challenge is to identify a classifier that can distinguish between COPD and Control subjects in large airway tissue gene expression data. Publicly available training data are derived from both large and small airways, whereas the test data consist of large airway data only (Figure 2). While gene signatures are the typical components of classifiers built from gene expression data, we believe that there is room to explore other biologically interpretable signatures that go beyond over- or under-expressed genes.

 

Figure 2: Schematic diagram of the COPD challenge. The training data (blue outline) consist of data from large airways (green symbols) and small airways (orange symbols), whereas the test data (yellow outline) consist of large airway data only.

 

The Data

Training data can be obtained from any publicly available source. For convenience, we include a list of third party publicly available datasets that participants may be able to use for training purposes:

Tissue        Smoking status  Control  COPD
Small airway  Smoker          91       39
Small airway  Non-smoker      81       n/a
Large airway  Smoker          76       n/a
Large airway  Non-smoker      49       n/a
Total                         297      39

Table 1: Composition of possible training datasets. Each cell displays the number of samples available for the corresponding phenotype. A file with the accession codes for the training data set can be downloaded from the download tab.

Additional details (including the dataset IDs and class labels) are provided as a separate file “COPD Metadata Training.xls”. The corresponding sample sets can be downloaded from the Gene Expression Omnibus (GEO) Database by searching for the appropriate dataset and sample IDs. We note that we do not control these sites and that the use of the data available on those sites may be subject to restrictions.  

For testing (including preparation of your submission), we provide participants with gene expression data from 40 large airway samples without revealing their diagnosis, together with the following clinical information: smoking status and dose, gender, age, race/ethnicity, height, and weight. An Excel file “COPD Clinical Info.xls” containing this information is available for download together with the test data. We note that your use of this data is subject to the restrictions described in the Challenge Rules. You must accept these Challenge Rules to participate in the Challenge.

Data for testing were generated using the Affymetrix® GeneChip Human Genome U133 Plus 2.0 platform. The dataset is available for download both as “raw” data in the manufacturer’s CEL file format and as a table of quantified gene expression values. Raw CEL files were converted to gene expression values using the MAS5 algorithm implemented in Expression Console™, which is available for download if you choose to register on the third party site. Participants are also encouraged to use their preferred normalization method (RMA, GCRMA, etc.) if they so choose. Additional third party methods for data normalization are available in the Bioconductor package for the R statistical computing environment. Again, please note that we do not control these sites and that the use of the materials available on these sites may be subject to restrictions.

 

Format for Submission of Predictions

Challenge participants should upload their prediction for each sub-challenge separately, with the following naming convention:

COPDDiagnostic_<Team name>_predictions.txt

For each sample ID, participants should provide the confidence score of the prediction that the sample belongs to the COPD or the Control class. Each confidence score should have a value between 0 and 1, with 1 being the most confident and 0 the least confident. The confidence scores across the predicted classes must sum to 1 for each sample. Please provide a tab-separated (\t) text file, including the header line, as indicated in the following example:

Sample ID  COPD_confidence  Control_confidence
COPD_1     0.20             0.80
COPD_2     0.95             0.05
COPD_3     0.94             0.06
...
COPD_39    0                1
COPD_40    0.85             0.15

Note: participants should provide class confidence predictions for all test samples.
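The format rules above can be checked before upload. The following is a minimal sketch, not an official validator: the function names, the six-decimal output precision, and the tolerance are assumptions made for illustration, and the sample IDs follow the example table.

```python
import csv
import io

HEADER = ["Sample ID", "COPD_confidence", "Control_confidence"]


def prediction_filename(team_name):
    """Build the file name required by the naming convention."""
    return f"COPDDiagnostic_{team_name}_predictions.txt"


def write_predictions(copd_confidence, handle):
    """Write a tab-separated prediction table.

    copd_confidence maps each sample ID to the confidence that the
    sample is COPD; the Control confidence is derived as 1 - COPD so
    that the two columns always sum to 1.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for sample_id, copd in copd_confidence.items():
        if not 0.0 <= copd <= 1.0:
            raise ValueError(f"{sample_id}: confidence {copd} is outside [0, 1]")
        copd = round(copd, 6)  # assumed precision, not a challenge requirement
        writer.writerow([sample_id, f"{copd:.6f}", f"{1.0 - copd:.6f}"])


def validate_predictions(text, expected_ids, tol=1e-6):
    """Check header, value range, row sums, and coverage of all samples."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != HEADER:
        raise ValueError("missing or malformed header line")
    seen = set()
    for sample_id, copd, control in rows[1:]:
        copd, control = float(copd), float(control)
        if not (0.0 <= copd <= 1.0 and 0.0 <= control <= 1.0):
            raise ValueError(f"{sample_id}: confidence outside [0, 1]")
        if abs(copd + control - 1.0) > tol:
            raise ValueError(f"{sample_id}: confidences do not sum to 1")
        seen.add(sample_id)
    missing = set(expected_ids) - seen
    if missing:
        raise ValueError(f"no prediction for: {sorted(missing)}")
    return True
```

Deriving the Control column from the COPD column, rather than writing two independently rounded numbers, avoids rows that sum to 0.99 or 1.01 after rounding.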

 

Submission of Write-up

The complete details of the method should be provided including:

  1. Raw gene expression data processing (when relevant)
  2. Batch effect correction (when relevant)
  3. Feature selection (when relevant)
  4. Classification algorithm(s) with pseudo-code or scripts

In addition, participants must provide any other details that would allow easy reproduction and assessment of the results.

The write-up file should be submitted for each sub-challenge, with the following naming convention:

COPDDiagnostic_<Team name>_writeup.txt

In addition to plain text, the write-up can also be submitted as a Word document.

The submission must include all details described in this section as well as those set forth in the Challenge Rules document.

Please note that by agreeing to the Challenge Rules document, you have granted certain rights and permissions in your submission and method.

 

Credits

The identity of the provider of the test data sets will be disclosed following the submission deadline.

 

References

  1. Balkissoon R, Lommatzsch S, Carolan B, Make B: Chronic obstructive pulmonary disease: a concise review. Med Clin N Am. 2011, 95:1125-1141.
  2. Global Strategy for the Diagnosis, Management, and Prevention of Chronic Obstructive Pulmonary Disease (Dec. 2011). www.goldcopd.org/guidelines-global-strategy-for-diagnosis-management.html
  3. Carolan BJ, Heguy A, Harvey BG, Leopold PL, Ferris B, Crystal RG: Up-regulation of expression of the ubiquitin carboxyl-terminal hydrolase L1 gene in human airway epithelium of cigarette smokers. Cancer Res. 2006, 66(22):10729-10740.
  4. Mannino DM: GOLD Stage 0 COPD. Is it Real? Does it Matter? Chest. 2006, 130(2):309-310.
  5. Harvey BG, Heguy A, Leopold PL, Carolan BJ, Ferris B, Crystal RG: Modification of gene expression of the small airway epithelium in response to cigarette smoking. J Mol Med (Berl). 2007, 85(1):39-53. www.ncbi.nlm.nih.gov/pubmed/17115125
  6. Isajevs S, Taivans I, Svirina D, Strazda G, Kopeika U: Patterns of inflammatory responses in large and small airways in smokers with and without chronic obstructive pulmonary disease. Respiration. 2011, 81(5):362-371. www.ncbi.nlm.nih.gov/pubmed/21228544

 

 

Scoring Legend:

Belief Confusion Matrix: A matrix whose element {i,j} is the average confidence that a subject belonging to class i is in class j. Each prediction has its own belief confusion matrix. The perfect belief confusion matrix is the identity matrix.
BCM (Belief Confusion Metric): This metric measures the trace of the difference between a prediction's belief confusion matrix and the perfect belief confusion matrix. The final value is normalized to be between 0 and 1.
CCEM (Correct Class Enrichment Metric): To compute this metric we add the confidence of the subjects whose classes were correctly predicted and subtract the confidence of the subjects whose classes were incorrectly predicted. In other words, this is a measure of enrichment of the correctly classified subjects. The final value is normalized to be between 0 and 1.
AUPR: For each class in a sub-challenge, a list of subjects is created, ordered by the confidence that each subject belongs to that class. From this list we compute the Precision-Recall curve for each class and extract the area under that curve. Precision is a measure of exactness (the fraction of subjects assigned to the class that truly belong to it), whereas recall is a measure of completeness (the fraction of subjects in the class that are recovered).
AUPR_avg: There are as many AUPRs as classes in a sub-challenge. The AUPR_avg metric is the arithmetic mean of the AUPR across the classes.
Rank-sum: For each team, the rank-sum is the sum of that team's ranks in the three metrics BCM, CCEM and AUPR_avg.
Rank: The final rank of each team, obtained by ranking the teams by their rank-sum (lowest rank-sum first).
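
The definitions above can be illustrated with a short sketch. It is not the organizers' scoring code and rests on several assumptions. The normalization of BCM and CCEM is not specified in this legend, so only the belief confusion matrix itself is computed. The area under the Precision-Recall curve is estimated with average precision, which is one common estimator and may differ from the interpolation actually used. Ties share the best position, which matches the Rank column of the results table (two teams with rank-sum 40 both rank 10, and the next rank is 12). All function names are hypothetical.

```python
def belief_confusion_matrix(true_labels, confidences, classes):
    """Element [i][j]: mean confidence in class j over subjects of true class i.

    confidences is a list of dicts mapping class name -> confidence.
    Assumes every class has at least one subject.
    """
    matrix = []
    for i in classes:
        members = [c for label, c in zip(true_labels, confidences) if label == i]
        matrix.append([sum(c[j] for c in members) / len(members) for j in classes])
    return matrix


def average_precision(true_labels, scores, positive):
    """Estimate the area under the PR curve for one class.

    Subjects are ordered by decreasing confidence; precision is
    accumulated at each position where a true member of the class appears.
    """
    ranked = sorted(zip(scores, true_labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for label in true_labels if label == positive)
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            hits += 1
            total += hits / k
    return total / n_pos


def aupr_avg(true_labels, confidences, classes):
    """Arithmetic mean of the per-class area estimates."""
    per_class = [
        average_precision(true_labels, [c[k] for c in confidences], k)
        for k in classes
    ]
    return sum(per_class) / len(per_class)


def competition_rank(values, higher_is_better=True):
    """Rank values so that ties share the best position (1, 2, 2, 4)."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]


def rank_teams(metrics):
    """metrics maps team -> (BCM, CCEM, AUPR_avg); higher is better for all.

    Returns team -> (rank_sum, final_rank).
    """
    teams = list(metrics)
    per_metric = [
        competition_rank([metrics[t][m] for t in teams]) for m in range(3)
    ]
    rank_sums = [sum(r[i] for r in per_metric) for i in range(len(teams))]
    final = competition_rank(rank_sums, higher_is_better=False)
    return {t: (rank_sums[i], final[i]) for i, t in enumerate(teams)}
```

A prediction that assigns confidence 1 to the correct class of every subject yields the identity matrix from `belief_confusion_matrix` and an `aupr_avg` of 1.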

Click here to open the Scoring Metrics document.