Computational Challenge

Objective of the challenge

The microbiomics challenge organized as part of the sbv IMPROVER project will start with a first phase titled “Microbiota composition prediction”. It aims at identifying state-of-the-art computational microbiome analysis pipeline(s) that can be used as off-the-shelf solutions for scientists to best recover the composition and relative abundance of bacterial communities present in a sample. The challenge is summarized in the figure below.

  The microbiomics challenge phase 1: “Microbiota composition prediction”.

Participants are provided with shotgun sequencing data from mouse microbiome samples and are asked to predict, at the phylum, genus, and species level, the composition and relative abundance of bacterial communities present in each sample. Participants have the freedom to use any private/public datasets to set up and test their approach. To allow for an unbiased analysis, little information is provided on the samples. Detailed information will be given after the challenge closure.

Challenge rules & submission compliance
Participants’ eligibility for scoring of their predictions is conditional on their compliance with the challenge and submission rules:

  1. Submission completeness, including all prediction files and a description of the computational approach in a write-up (see the Write-up instructions section below)
  2. Compliance with the data format for the predictions (templates are provided here as a guide for participants)
  3. Compliance with the rules described below

Why participate in this challenge?

Help the scientific community benchmark computational methods objectively and establish standards and best practices in computational microbiome analysis. In addition, you will:

  • Gain early access to new benchmarking datasets
  • Receive an independent assessment of your method(s)
  • Contribute to writing peer-reviewed scientific article(s) describing the outcome of the challenge
  • Grow your professional network by engaging with researchers from around the world
  • Have the possibility to win a travel bursary to a symposium taking place at the end of the next phase of the challenge (venue and time to be confirmed)

Data provided

The organizers provide participants with 19 samples (shotgun sequencing data, 2x150 paired-end reads, Phred33 quality scores) as a multi-file .tar archive for download (each archive contains a subset of the samples). The files are available for download from this page.

File md5 sums
Dataset_part01.tar: 4860d2ce2b8bd757a39458bb5cbba240
Dataset_part02.tar: 8dab04d03ab1ece12328a96d4515f655
Dataset_part03.tar: 4560876c2c190dace965f7e34b03561f
Dataset_part04.tar: 63cadd129e4a580eedb20604dcaf8ac9
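
The checksums above can be used to confirm that each archive downloaded intact. The following is a minimal sketch, assuming the four .tar files sit in a single local directory; the function names md5_of_file and verify_archives are illustrative and not part of any challenge tooling.

```python
import hashlib
from pathlib import Path

# MD5 sums copied from the list above.
EXPECTED_MD5 = {
    "Dataset_part01.tar": "4860d2ce2b8bd757a39458bb5cbba240",
    "Dataset_part02.tar": "8dab04d03ab1ece12328a96d4515f655",
    "Dataset_part03.tar": "4560876c2c190dace965f7e34b03561f",
    "Dataset_part04.tar": "63cadd129e4a580eedb20604dcaf8ac9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Map each archive name to True (match), False (mismatch), or None (file not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == expected)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_archives().items():
        print(f"{name}: {labels[ok]}")
```

An archive reported as MISMATCH is best downloaded again before extraction.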

Each dataset in the archive is provided in compressed .gz format. For each sample, two .fastq files are provided, one per read of the pair. The naming convention used is as follows:

sample<sample number>_S<sample number>_L001_R<pair number 1|2>_001.fastq.gz 
Example: sample01_S01_L001_R1_001.fastq.gz (pair 1) – sample01_S01_L001_R2_001.fastq.gz (pair 2)
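
The naming convention above lends itself to automatic pairing of the R1 and R2 files of each sample. The sketch below is one possible way to do this; the regular expression is derived from the pattern and example shown above, and the helper name pair_fastq_files is hypothetical.

```python
import re
from collections import defaultdict

# Pattern derived from the naming convention above:
# sample<sample number>_S<sample number>_L001_R<1|2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^sample(?P<sample>\d+)_S(?P<index>\d+)_L001_R(?P<pair>[12])_001\.fastq\.gz$"
)


def pair_fastq_files(filenames):
    """Group file names by sample and return {sample_id: (R1_file, R2_file)}.

    Names that do not follow the convention are ignored, and samples with
    only one of the two mates are left out of the result.
    """
    mates = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue
        sample_id = "sample" + match["sample"]
        mates[sample_id]["R" + match["pair"]] = name
    return {
        sample_id: (found["R1"], found["R2"])
        for sample_id, found in sorted(mates.items())
        if "R1" in found and "R2" in found
    }
```

The returned sample identifiers (sample01, sample02, ...) have the same form as the values expected in the @SampleID field of the prediction files.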

How to submit your predictions?

Participants are asked to predict the taxonomic composition of the provided samples at the phylum, genus, and species level. The taxonomic composition should be expressed as the relative abundance (percentage) of each taxon in the sample’s microbiome. Submissions must comply with the Bioboxes Profiling Format.

Below is an example of this file format. Note that (i) the entries after the 5th row (blank row) are separated by a TAB; (ii) only the taxonomic ranks of phylum, genus, and species will be considered for scoring; (iii) the field TAXPATHSN is optional.

The NCBI taxonomy resource used for the challenge is dated 14-07-2017 (download here). This archive contains information about the taxonomy source identifiers adopted for the challenge. It includes classification and nomenclature for all organisms found in the dataset (and more). Inside the archive, participants will find the “nodes.dmp” and “names.dmp” files. These files were used to build the submission template below, which can be downloaded here.

A sample submission file in the correct format, valid for scoring purposes, can be downloaded here.
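
The “nodes.dmp” file mentioned above can be used to look up the phylum, genus, and species of any taxonomy ID. The sketch below assumes the standard NCBI taxdump layout, in which fields are delimited by a tab, a pipe, and a tab, and the first three fields are the taxonomy ID, the parent taxonomy ID, and the rank. The function names are illustrative, and the toy lines in the usage example are a simplified excerpt with intermediate ranks omitted, not real file content.

```python
def parse_nodes(lines):
    """Parse nodes.dmp-style lines into {taxid: (parent_taxid, rank)}."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue
        # Splitting on "|" and stripping whitespace handles the "\t|\t" delimiter.
        fields = [field.strip() for field in line.split("|")]
        taxid, parent, rank = int(fields[0]), int(fields[1]), fields[2]
        nodes[taxid] = (parent, rank)
    return nodes


def scored_lineage(taxid, nodes, ranks=("phylum", "genus", "species")):
    """Walk from taxid up to the root and return {rank: taxid} for the requested ranks."""
    found = {}
    current = taxid
    while True:
        parent, rank = nodes[current]
        if rank in ranks:
            found[rank] = current
        if parent == current:  # the root node is its own parent
            break
        current = parent
    return found


# Toy excerpt (assumption: simplified tree, intermediate ranks left out).
toy_lines = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "1224\t|\t2\t|\tphylum\t|",
    "561\t|\t1224\t|\tgenus\t|",
    "562\t|\t561\t|\tspecies\t|",
]
toy_nodes = parse_nodes(toy_lines)
print(scored_lineage(562, toy_nodes))
```

With the real file, parse_nodes can be fed an open file handle, for example `with open("nodes.dmp") as handle: nodes = parse_nodes(handle)`.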



Example of a rejected submission: the sum of percentages for the species rank is greater than 100. The submission will be rejected by the system.

During submission, participants should:

  • Submit a zip archive containing a prediction file per sample, according to the standard defined above. Make sure to modify the field @SampleID to match one of the samples in the dataset provided (sample01, sample02, …)
  • Express predicted relative abundance as percentages [0-100] of reads (e.g., 100 times the number of reads assigned to a taxon divided by the total number of reads assigned to the microbiome) at the species, genus, and phylum taxonomic levels, and store the values in the PERCENTAGE column.
  • Ensure that the percentages given for all taxa of the same rank (species, genus, phylum) sum to <= 100 (if the sum is < 100, organizers will consider the rest as unassigned species/genus/phylum).
  • Note that a percentage of 0 (zero) will be assumed for every taxonomy ID that appears in the submission template but not in a submission file.
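
The percentage calculation and the per-rank sum rule above can be checked locally before submitting. The following is a rough sketch under stated assumptions: header lines of a prediction file start with “@”, data rows are TAB-separated with the rank in the second column and PERCENTAGE in the last column, and a small floating-point tolerance is acceptable. The function names and the tolerance are hypothetical; the organizers’ validation may be stricter.

```python
from collections import defaultdict

SCORED_RANKS = ("phylum", "genus", "species")


def to_percentages(read_counts, total_reads):
    """Convert {taxid: assigned reads} for one rank into {taxid: percentage}.

    total_reads is the total number of reads assigned to the microbiome,
    so the percentages may sum to less than 100 when some reads are
    unassigned at this rank.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {taxid: 100.0 * n / total_reads for taxid, n in read_counts.items()}


def rank_sums(profile_lines):
    """Sum the PERCENTAGE column per rank over the data rows of one prediction file."""
    sums = defaultdict(float)
    for line in profile_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        rank, percentage = fields[1], float(fields[-1])
        sums[rank] += percentage
    return dict(sums)


def rank_violations(profile_lines, tolerance=1e-6):
    """Return {rank: sum} for every scored rank whose percentages exceed 100."""
    return {
        rank: total
        for rank, total in rank_sums(profile_lines).items()
        if rank in SCORED_RANKS and total > 100.0 + tolerance
    }
```

An empty dictionary from rank_violations indicates that no scored rank exceeds 100 in that file; it does not check the other formatting rules.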

Write-up instructions

Complete details of the method(s) should be provided, including (if part of the pipeline):

  • Quality control procedure
  • Low-quality read filtering
  • Host genome contamination removal procedure
  • Taxonomic assignment procedure

In addition, participants must provide any other details that would allow easy reproduction and assessment of the results, such as:

  • Name and version of tools
  • Parameters
  • Output parsing script to generate the submitted prediction files

The write-up document should be submitted as a plain text file or as a PDF file.

