Semantic Web Challenge (ISWC 2013) - Big Data Track Winners

The Cancer Genome Atlas (TCGA) initiative compiles raw and processed data files on more than 30 different cancer types, making this large body of data available to oncologists and biomedical researchers for analysis and discovery. TCGA publishes genomic information at various levels (e.g., exon expression, protein expression, microRNA sequences, and methylation). Although the data can be freely downloaded and analyzed by anyone, exploring such a knowledge base requires powerful machines and computational tools. Moreover, with terabytes of data involved, exploiting TCGA fully requires high-performance computational resources and efficient processing techniques.
Given the exponential growth and heterogeneous nature of the TCGA dataset, the authors devised a methodology that exposes it in a distributed and semantically aware fashion. This requires 1) downloading and processing text archives; 2) mapping data to genomic coordinates where necessary; 3) converting data into a machine-readable format; 4) distributing the data across multiple servers; 5) supporting integration with other biomedical datasets; and 6) providing a federated query interface on top of it so that data from multiple servers can be easily queried, merged, and returned to biomedical researchers.
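As a rough illustration of step 6, the sketch below builds a federated query that asks several servers for the same graph pattern and merges the answers. It assumes the servers expose SPARQL 1.1 endpoints, which the description above does not state explicitly; the function name, the endpoint URLs, and the graph pattern are hypothetical and are not taken from the authors' system.

```python
from typing import List


def build_federated_query(endpoints: List[str], pattern: str, limit: int = 100) -> str:
    """Return a SPARQL 1.1 query that evaluates `pattern` at every endpoint.

    Each endpoint gets its own SERVICE block; the blocks are joined with
    UNION so that results from all servers come back in one result set.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    blocks = [f"  {{ SERVICE <{url}> {{ {pattern} }} }}" for url in endpoints]
    body = "\n  UNION\n".join(blocks)
    return f"SELECT * WHERE {{\n{body}\n}}\nLIMIT {limit}"


# Hypothetical endpoints, standing in for servers that each hold part of the data.
servers = [
    "http://example.org/tcga-a/sparql",
    "http://example.org/tcga-b/sparql",
]
query = build_federated_query(servers, "?patient ?property ?value .", limit=10)
print(query)
```

The resulting string can be submitted to any SPARQL 1.1 engine that supports the SERVICE keyword. A dedicated federation engine would typically also decide which servers are relevant to a query, which this sketch does not attempt.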
With this work, the authors participated in the Semantic Web Challenge held at the International Semantic Web Conference (ISWC) 2013 and won the 'Big Data Track' award. Their challenge paper, titled "Fostering Serendipity through Big Linked Data", showed how the high volume and velocity of recently published biomedical research papers from PubMed (a biomedical literature database) can be intelligently and semantically integrated with the TCGA dataset, thus supporting biomedical researchers in their work. As a proof of concept, they developed a visualization framework that facilitates exploration of the TCGA dataset in conjunction with PubMed publications.
They have received extensive feedback from many researchers, which encourages them to work on further visualizations that exploit TCGA data. Work in progress includes an interactive human genome browser built on top of the TCGA dataset, which will enable biomedical researchers to visualize patient genomic information, select genomic regions of interest, and discover their correlation with TCGA and other biomedical datasets within the visualization.
The team behind this work consists of researchers from INSIGHT NUI Galway (Ireland) {Aftab Iqbal, Maulik Kamdar, Stefan Decker}, AKSW, University of Leipzig (Germany) {Muhammad Saleem, Axel-Cyrille Ngonga}, Foundation Medicine (United States) {Helena F. Deus} and University of Alabama (United States) {Jonas S. Almeida}.

Publication Date

December 6th 2013