MapReduce is now reaching mainstream science
Programming and systems design is not exactly the core competency of life scientists, which is why seeing people leverage frameworks like Hadoop, designed for the kinds of systems described above, is encouraging, but that is hardly mainstream.
I guess things are changing very fast, and neither of us can anticipate the trends. Very recently, Brian Bockelman wrote a guest post on the Cloudera blog about how Hadoop is being used to process the results of high-energy physics experiments (High Energy Hadoop). Currently, the High Energy Physics (HEP) team at the University of Nebraska is using the Hadoop Distributed File System (HDFS) as a generic distributed file system, which they find a superior option for their data management layer because of its manageability, reliability, usability and scalability. He further writes:
I believe that the physicists’ transformation and reduction of data is very similar to the MapReduce paradigm, and there might someday be small explorations into using the MapReduce components of Hadoop–but that’s pretty far off.
High-energy physics is not the only field where experimentation with Hadoop is going on; similar experiments are under way in bioinformatics, chemoinformatics and statistics. Rajarshi Guha recently posted several simple chemoinformatics use cases using Hadoop, such as substructure searching, SD file parsing and atom counting. More recently, Saptarshi Guha announced the development and release of RHIPE, a Java package that integrates the R environment with Hadoop. Saptarshi suggests that
Using RHIPE, it is possible to write map-reduce algorithms using the R language and start them from within R. RHIPE is built on Hadoop and so benefits from Hadoop’s fault tolerance, distributed file system and job scheduling features. For the R user, there is rhlapply which runs an lapply across the cluster. For the Hadoop user, there is rhmr which runs a general map-reduce program.
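The map-reduce pattern behind a use case like atom counting can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop or RHIPE API: the function names (`map_atoms`, `shuffle`, `reduce_atoms`, `count_atoms`) are hypothetical, and the input is assumed to be simple molecular formula strings without brackets or charges.

```python
import re
from collections import defaultdict

# Assumption: formulas look like "C2H6O" (element symbol, optional count).
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def map_atoms(formula):
    """Map step: emit one (element, count) pair per token in a formula."""
    for element, count in FORMULA_TOKEN.findall(formula):
        yield element, int(count) if count else 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_atoms(element, counts):
    """Reduce step: sum all counts emitted for one element."""
    return element, sum(counts)


def count_atoms(formulas):
    """Run map, shuffle and reduce sequentially over a list of formulas."""
    pairs = (pair for formula in formulas for pair in map_atoms(formula))
    return dict(reduce_atoms(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_atoms(["C2H6O", "H2O", "CO2"]))
```

On a real cluster the map calls would run on different nodes and the framework would handle the grouping; here everything runs in one process so the data flow is easy to follow.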
Coming to bioinformatics, Michael C. Schatz recently published his CloudBurst paper. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes. It is implemented on Hadoop to parallelize execution across multiple compute nodes. Similarly, Brad Chapman posted a few examples of a MapReduce implementation of Generic Feature Format (GFF) parsing for Biopython using Disco, another implementation of the distributed MapReduce framework, written in Erlang and Python. I have been trying to implement a protein-protein interaction database using HBase, the Hadoop database (an open source implementation of Google’s BigTable) that runs on top of HDFS, and I will share it very soon.
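To give a feel for how GFF parsing splits into map and reduce steps, here is a minimal sketch in plain Python. It does not use Disco or Biopython and is not Brad Chapman's code; the function names are hypothetical, and it assumes the file has already been split into chunks of tab-separated, nine-column GFF lines.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count feature types (column 3) in one chunk of GFF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # ignore malformed records in this sketch
        counts[fields[2]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts from two chunks."""
    return left + right


def count_feature_types(chunks):
    """Apply the map step to every chunk, then fold the partial results."""
    return reduce(reduce_counts, map(map_chunk, chunks), Counter())


if __name__ == "__main__":
    chunk_a = [
        "##gff-version 3\n",
        "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "chr1\tdemo\texon\t1\t50\t.\t+\t.\tParent=g1\n",
    ]
    chunk_b = ["chr2\tdemo\tgene\t5\t80\t.\t-\t.\tID=g2\n"]
    print(count_feature_types([chunk_a, chunk_b]))
```

Because each chunk is processed independently, the map calls could be handed to separate workers, with only the small per-chunk counters travelling back to be merged.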
In summary, there are lots of new developments applying MapReduce to mainstream scientific research, although the work is still in its infancy and open to discussion about how much we can gain in performance and efficiency.
Updates
- You can find here and here how D.E. Shaw Research is implementing HiMach, a framework inspired by MapReduce for molecular dynamics. HiMach allows users to write trajectory analysis programs sequentially, and carries out the parallel execution of the programs automatically.
- mpiBLAST in Amazon EC2 and RunBLAST
- Let me know if you know of any other applications of MapReduce or MapReduce-inspired frameworks in science.

There’s also the work by DE Shaw which I wrote about last year http://mndoci.com/blog/2008/11/24/mapping-and-reducing-md-trajectories-with-himach/
Remember, people, the Byzantine Empire always falls.
Just write it in “C”, keep it simple and embarrassingly parallel and you’ll sleep at night.
Writing it in R/Python/Perl or whatever and praying for a government-subsidized “gigagantic cloud” is betting on a long shot.
Thanks for the comments. No one will disagree with that, but sometimes we need to write or learn something new, and for that sake it’s a fine idea.
@Deepak thanks for that link; that seems like an interesting real-time application.
Thank you for the article, I had heard of this.
Also, the BigTable paper is interesting, and I am looking forward to seeing your code. Are you writing in Python?