Diagnostic Signature Challenge - Lung Cancer Sub-Challenge

Team BCM CCEM AUPR_avg Rank-sum Rank
Team036 0.479442 0.5094 0.45764 12 1
Team050 0.38028 0.427125 0.39073 97 34
Team056 0.416721 0.447432 0.453125 62 21
Team063 0.426929 0.489219 0.488789 19 4
Team071 0.397204 0.441024 0.39464 88 33
Team080 0.422852 0.481674 0.4474 39 8
Team081 0.388939 0.475686 0.436177 60 20
Team091 0.218116 0.298969 0.235759 133 45
Team101 0.44699 0.46 0.405019 57 19
Team106 0.391653 0.475599 0.457485 55 18
Team112 0.319136 0.4465 0.37218 106 37
Team114 0.443276 0.482525 0.463826 18 3
Team115 0.340316 0.486455 0.432406 62 21
Team120 0.242106 0.409951 0.215594 125 42
Team122 0.413293 0.467503 0.480596 43 11
Team132 0.391701 0.486527 0.447608 46 13
Team140 0.23979 0.347628 0.235912 128 43
Team149 0.304006 0.409933 0.29367 117 39
Team158 0.40595 0.461487 0.432694 67 27
Team161 0.431094 0.481357 0.495694 21 5
Team163 0.424452 0.470733 0.390114 66 24
Team164 0.433596 0.467179 0.435277 48 15
Team170 0.426366 0.448713 0.417239 66 24
Team171 0.357506 0.464892 0.410405 81 31
Team181 0.421599 0.461667 0.413803 66 24
Team187 0.440378 0.459268 0.460959 41 9
Team202 0.354593 0.473333 0.384511 86 32
Team203 0.421638 0.45269 0.373379 80 29
Team208 0.199729 0.289904 0.205327 138 46
Team212 0.259865 0.42689 0.283447 116 38
Team221 0.459301 0.491775 0.453696 17 2
Team227 0.4742 0.48 0.427536 34 7
Team235 0.388707 0.433197 0.388485 98 35
Team241 0.443484 0.465417 0.415739 50 16
Team245 0.403277 0.476718 0.479604 41 9
Team251 0.439791 0.450933 0.450508 50 16
Team253 0.412821 0.475627 0.410118 63 23
Team261 0.436644 0.439998 0.390587 74 28
Team269 0.233314 0.293159 0.245493 131 44
Team273 0.431174 0.480051 0.462177 27 6
Team276 0.24533 0.368783 0.253939 122 41
Team284 0.405455 0.4168 0.434218 80 29
Team290 0.408335 0.475963 0.458489 43 11
Team291 0.331303 0.333352 0.316339 118 40
Team294 0.353915 0.353823 0.395685 105 36
Team297 0.377977 0.47588 0.495539 47 14
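The Rank-sum and Rank columns follow from the three metric columns, as described in the Scoring Legend at the bottom of this page: each team is ranked per metric (higher is better), the three ranks are summed, and teams are then ordered by that sum (lower is better). The following is a minimal sketch of that calculation, not the organizers' scoring code. The function names are hypothetical, and the tie handling (tied teams share the best rank and the next rank is skipped) is an assumption inferred from the table, where two teams with rank-sum 62 share rank 21 and the next team is ranked 23. How ties within a single metric were handled is not stated in the source.

```python
def competition_ranks(values, higher_is_better=True):
    """Rank values so that ties share the best rank and the next rank is skipped."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]


def rank_teams(scores):
    """scores maps team -> (BCM, CCEM, AUPR_avg); returns team -> (rank_sum, rank)."""
    teams = list(scores)
    rank_sums = [0] * len(teams)
    for metric in range(3):
        column = [scores[team][metric] for team in teams]
        for i, rank in enumerate(competition_ranks(column)):
            rank_sums[i] += rank
    # A lower rank-sum is better, so the final ranking sorts in ascending order.
    final = competition_ranks(rank_sums, higher_is_better=False)
    return {team: (rank_sums[i], final[i]) for i, team in enumerate(teams)}


# Three rows copied from the table above. Ranks computed on a subset differ
# from the published ones, which were computed over all 46 teams.
subset = {
    "Team036": (0.479442, 0.5094, 0.45764),
    "Team221": (0.459301, 0.491775, 0.453696),
    "Team114": (0.443276, 0.482525, 0.463826),
}
result = rank_teams(subset)
```

Over the full table, Team036 is first in BCM, first in CCEM and tenth in AUPR_avg, which is consistent with its published rank-sum of 12.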

The aim of this IMPROVER sub-challenge was to verify that it is possible to extract a robust diagnostic signature from gene expression data that can identify stages of different types of lung cancer.

Participants were asked to develop and submit a classifier that can stratify lung cancer patients into one of four groups: Stage 1 of Adenocarcinoma (AC Stage 1), Stage 2 of Adenocarcinoma (AC Stage 2), Stage 1 of Squamous cell carcinoma (SCC Stage 1) or Stage 2 of Squamous cell carcinoma (SCC Stage 2). The classifier could be built using any publicly available gene expression data with related histopathological information and was tested on an independent dataset.

Overview of the Lung Cancer Sub-Challenge

as communicated to the Participants

Synopsis

The aim of this sub-challenge is to verify that it is possible to extract a robust diagnostic signature from gene expression data that can identify stages of different types of lung cancer.

Participants are asked to develop and submit a classifier that can stratify lung cancer patients into one of four groups: Stage 1 of Adenocarcinoma (AC Stage 1), Stage 2 of Adenocarcinoma (AC Stage 2), Stage 1 of Squamous cell carcinoma (SCC Stage 1) or Stage 2 of Squamous cell carcinoma (SCC Stage 2). The classifier can be built using any publicly available gene expression data with related histopathological information and will be tested on an independent dataset.

Background

In 2006, medical expenses from cancer care in the United States were an estimated $104.1 billion. As the population ages, costs are expected to continue to increase as cancer prevalence rises and expensive, targeted treatment strategies become the standard of care. According to the World Health Organization (WHO), global cancer deaths will increase from 7.4 million to 11.8 million between 2004 and 2030, and cancer will be the leading cause of death, followed by heart disease and stroke [1].

Non-small cell lung cancer (NSCLC) accounts for approximately 85% of all lung cancers. NSCLC is divided into adenocarcinoma (AC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC) histologies [4] (see Figure 1).

Figure 1: (A) Distribution of lung cancer subtypes in a study with smoking status at the time of diagnosis. The pie chart shows the distribution of the non-small cell lung cancer (NSCLC) subtypes, SCC (squamous cell lung cancer), AC (adenocarcinoma) and LCC (large cell lung cancer), together with small cell lung cancer (SCLC). The distribution of current (red) and former (green) smokers is shown as a histogram for each subtype. (Source: table 1 in [2]) (B) Schematic of the tissues involved in squamous cell carcinoma and adenocarcinoma.

NSCLC stage is generally defined by the TNM system [3]. The T category describes the original (primary) tumor: its size and whether it has spread to surrounding tissue. The N category signifies any lymph node involvement (in and around the lungs), and the M category indicates whether the cancer has spread to other parts of the body, i.e. metastasized.

According to the overall TNM staging, stage 1 lung cancer is small and localized to only one area of the lung. Stage 2 and 3 cancers are larger and may have grown into the surrounding tissues and there may be cancer cells in the lymph nodes. Stage 4 cancer has spread to another body part.

Lung cancer is currently diagnosed with X-ray or computed tomography (CT) screening. Treatment is a combination of surgery and chemotherapy, depending on stage. Some recent reports describe novel biomarkers in non-small cell lung cancer identified through gene expression profiling [5,6].

The Sub-challenge

The sub-challenge is to classify Adenocarcinoma (AC) and Squamous Cell Carcinoma (SCC) and their respective stages (I and II) based on the transcriptome of tumor samples. While gene signatures are the typical components of classifiers built from gene expression data, we believe that there is room to explore other biologically interpretable signatures that go beyond over- or under-expressed genes.

Figure 2: Schematic diagram of the lung cancer sub-challenge

 

The Data

Training data can be obtained from any publicly available source. For convenience, we include a list of third party publicly available datasets that participants may be able to use for training purposes:

            Tumor stage
Diagnosis   1     2     3     4     Total
AC          51    20    7     3     81
SCC         57    17    7     2     83
Total       108   37    14    5     164

Table 1: Composition of possible training datasets. Each cell displays the number of samples available for the corresponding phenotypes. A file with the accession codes for the training data set can be downloaded from the download tab.

Additional details (including the dataset IDs and class labels) are provided as a separate file “LC Metadata Training.xls”. The corresponding sample sets can be downloaded from the Gene Expression Omnibus (GEO) Database by searching for the appropriate dataset and sample IDs. We note that we do not control these sites and that the use of the data available on those sites may be subject to restrictions.

For testing (including preparation of your submission), we provide participants with gene expression data from 150 lung tissue samples without revealing their diagnosis, together with the following clinical information: age, gender, race/ethnicity, height, weight, body mass index, smoking status and alcohol use. An Excel file “LC Clinical Info.xls” containing this information is available for download together with the test data. We note that your use of this data is subject to the restrictions described in the Challenge Rules. You must accept these Challenge Rules to participate in the Challenge.

Data for testing was generated using the Affymetrix® GeneChip Human Genome U133 Plus 2.0 platform and is available for download both as “raw” data in the manufacturer’s CEL file format and as a table of quantified gene expression values. Raw CEL files were converted to gene expression values using the MAS5 algorithm implemented in Expression Console™, which is available for download if you choose to register on the third party site. Participants are also encouraged to use their preferred normalization method (RMA, GCRMA, etc.) if they so choose. Additional third party methods for data normalization are available in the Bioconductor package for the R statistical computing environment. Again, please note that we do not control these sites and that the use of the materials available on these sites may be subject to restrictions.

 

Format for Submission of Predictions

Challenge participants should upload their predictions for each sub-challenge separately, with the following naming convention:

LCStage_<Team name>_predictions.txt

For each sample ID, participants should provide the confidence score of the prediction that a sample is classified as Stage 1 of Adenocarcinoma (AC Stage 1), Stage 2 of Adenocarcinoma (AC Stage 2), Stage 1 of Squamous cell carcinoma (SCC Stage 1) or Stage 2 of Squamous cell carcinoma (SCC Stage 2). The confidence of the classification should have a value between 0 and 1, with 1 being the most confident and 0 the least confident. The sum of confidences over all class labels and for each sample ID must equal 1. Please provide a tab separated (\t) text file as indicated in the following example:

Sample ID   AC_Stage_1_confidence   AC_Stage_2_confidence   SCC_Stage_1_confidence   SCC_Stage_2_confidence
lung_1      0.20                    0.50                    0.10                     0.20
lung_2      0.95                    0.03                    0.01                     0.01
lung_3      0.94                    0.06                    0.00                     0.00
...
lung_149    0.00                    1.00                    0.00                     0.00
lung_150    0.25                    0.15                    0.25                     0.35

Note: participants must provide class confidence predictions for all test samples.
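The format rules above (tab-separated columns, four confidences per sample, each between 0 and 1 and summing to 1) can be checked before upload. The following is a minimal sketch, assuming the column names from the example above. The function names, the four-decimal formatting and the tolerance are hypothetical choices, not requirements stated in the challenge.

```python
import csv
import io

CLASSES = ["AC_Stage_1", "AC_Stage_2", "SCC_Stage_1", "SCC_Stage_2"]
HEADER = ["Sample ID"] + [f"{name}_confidence" for name in CLASSES]


def normalize(scores):
    """Rescale non-negative classifier scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0 or any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative with a positive sum")
    return [s / total for s in scores]


def write_predictions(rows, handle):
    """Write (sample_id, [four scores]) pairs as a tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for sample_id, scores in rows:
        if len(scores) != len(CLASSES):
            raise ValueError(f"{sample_id}: expected {len(CLASSES)} confidences")
        # Four decimals is an arbitrary choice; rounding can shift the row
        # sum slightly, hence the tolerance in check_predictions.
        writer.writerow([sample_id] + [f"{p:.4f}" for p in normalize(scores)])


def check_predictions(handle, tol=1e-3):
    """Return the sample IDs whose confidences are out of range or do not sum to 1."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    bad = []
    for sample_id, *fields in reader:
        values = [float(v) for v in fields]
        in_range = all(0.0 <= v <= 1.0 for v in values)
        if len(values) != len(CLASSES) or not in_range or abs(sum(values) - 1.0) > tol:
            bad.append(sample_id)
    return bad


buffer = io.StringIO()
write_predictions(
    [("lung_1", [0.20, 0.50, 0.10, 0.20]), ("lung_2", [2, 1, 1, 0])],
    buffer,
)
text = buffer.getvalue()
```

In practice the handle would be a file opened for writing under the LCStage_<Team name>_predictions.txt naming convention, and a complete submission would contain one row for each of the 150 test samples.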

 

Submission of Write-up

The complete details of the method should be provided including:

  1. Raw gene expression processing (when relevant)
  2. Batch effect correction (when relevant)
  3. Feature selection (when relevant)
  4. Classification algorithm(s) with pseudo-code or scripts

In addition, participants must provide any other details that would allow easy reproduction of the results.

The write-up file should be submitted for each sub-challenge, with the following naming convention:

LCStage_<Team name>_writeup.txt

In addition to plain text, the write-up can also be submitted as a Word document.

The submission must include all details listed in this section as well as those set forth in the Challenge Rules document.

Please note that by agreeing to the Challenge Rules document, you have granted certain rights and permissions in your submission and method.

Credits

The identity of the provider of the test data sets will be disclosed following the submission deadline.

References

  1. John R and Ross H. The Global Economic Cost of Cancer. American Cancer Society and the LIVESTRONG organization. 2010.
  2. Kenfield SA, Wei EK, Stampfer MJ, Rosner BA, and Colditz GA: Comparison of aspects of smoking among four histologic types of lung cancer. Tob Control 2008, 17(3): 198-204.
  3. AJCC Cancer Staging Manual, Seventh Edition, 2010. http://www.cancerstaging.org
  4. Lung Cancer Alliance, 2012 http://www.lungcanceralliance.org/facing/facts.html
  5. Kuner R, Muley T, Meister M, Ruschhaupt M, Buness A, Xu EC et al.: Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer 2009, 63: 32-38.
  6. Sanchez-Palencia A, Gomez-Morales M, Gomez-Capilla JA, Pedraza V, Boyero L, Rosell R et al.: Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer. Int J Cancer 2011, 129: 355-364.

 

Scoring Legend:

Belief Confusion Matrix: A matrix whose element {i,j} is the average confidence that a subject belonging to class i is in class j. Each prediction has its own belief confusion matrix. The perfect belief confusion matrix is the identity matrix.
BCM (Belief Confusion Metric): This metric measures the trace of the difference between a prediction's Belief Confusion Matrix and the perfect Belief Confusion Matrix (the identity matrix). The final value is normalized to be between 0 and 1.
CCEM (Correct Class Enrichment Metric): To compute this metric we add the confidence of the subjects whose classes were correctly predicted and subtract the confidence of the subjects whose classes were incorrectly predicted. In other words, this is a measure of enrichment of the correctly classified subjects. The final value is normalized to be between 0 and 1.
AUPR: For each class in a sub-challenge, a list of subjects is created, ordered according to the confidence that the subject belongs to that class. From this list we compute the precision-recall curve for each class, and from that curve the area under the precision-recall curve (AUPR). Precision is a measure of specificity whereas recall is a measure of completeness.
AUPR_avg: There are as many AUPRs as classes in a sub-challenge. The AUPR_avg metric is the arithmetic mean of the AUPR across the classes.
Rank-sum: For each team, the rank-sum is the sum of that team's ranks in the three metrics BCM, CCEM and AUPR_avg.
Rank: The team's overall rank when teams are ordered by rank-sum, with the lowest rank-sum ranked first.
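
The Belief Confusion Matrix and AUPR_avg definitions above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the organizers' scoring code: the function names are hypothetical, classes are encoded as integers starting at 0, and the area under the precision-recall curve is approximated by average precision (the mean of the precision values at each correctly retrieved subject), since the legend does not say how the curve is integrated or how tied confidences are ordered. The normalization used for BCM and CCEM is not given here, so the sketch stops at the matrix itself.

```python
def belief_confusion_matrix(true_labels, confidences, n_classes):
    """Element [i][j] is the mean confidence given to class j over subjects of true class i."""
    sums = [[0.0] * n_classes for _ in range(n_classes)]
    counts = [0] * n_classes
    for label, row in zip(true_labels, confidences):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(n_classes)
    ]


def average_precision(is_positive, scores):
    """Approximate the area under the precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    n_positive = sum(is_positive)
    hits, area = 0, 0.0
    for rank, k in enumerate(order, start=1):
        if is_positive[k]:
            hits += 1
            area += hits / rank  # precision at this recall step
    return area / n_positive if n_positive else 0.0


def aupr_avg(true_labels, confidences, n_classes):
    """Arithmetic mean of the per-class AUPR values."""
    per_class = []
    for c in range(n_classes):
        is_positive = [label == c for label in true_labels]
        scores = [row[c] for row in confidences]
        per_class.append(average_precision(is_positive, scores))
    return sum(per_class) / n_classes


# Toy example with two classes and four subjects; the sub-challenge itself
# has four classes and 150 test samples.
labels = [0, 0, 1, 1]
preds = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]
matrix = belief_confusion_matrix(labels, preds, 2)
mean_aupr = aupr_avg(labels, preds, 2)
```

For a prediction that assigns confidence 1 to the true class of every subject, belief_confusion_matrix returns the identity matrix, matching the perfect case described in the legend.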
