MapReduce is now reaching mainstream science

2009 May 6
by abhishektiwari
Most of you will be aware of MapReduce, a framework developed by Google to analyze large data sets in parallel on clusters of computers. It targets problems that can be split up and distributed across a large number of machines. There are several implementations of MapReduce, and Apache Hadoop is one of them. Not long ago Deepak wrote an interesting post, The future of big compute for big science, and suggested that
Programming and systems design is not exactly the core competency of life scientists, which is why seeing people leverage frameworks like Hadoop, designed for the kinds of systems described above is encouraging, but that is hardly mainstream.

I guess things are changing very fast and none of us can anticipate the trends. Very recently Brian Bockelman wrote a guest post on the Cloudera blog (High Energy Hadoop) about how Hadoop is being used to process the results of high-energy physics experiments. The High Energy Physics (HEP) team at the University of Nebraska currently uses the Hadoop Distributed File System (HDFS) as a generic distributed file system, which they find a superior option for their data management layer because of its manageability, reliability, usability and scalability. He further writes,

I believe that the physicists’ transformation and reduction of data is very similar to the MapReduce paradigm, and there might someday be small explorations into using the MapReduce components of Hadoop–but that’s pretty far off.

High Energy Physics is not the only field where experimentation with Hadoop is going on; similar experiments are under way in bioinformatics, chemoinformatics and statistics. Rajarshi Guha recently posted several simple chemoinformatics use cases for Hadoop, such as substructure searching, SD file parsing and atom counting. More recently, Saptarshi Guha announced the development and release of RHIPE, a Java package that integrates the R environment with Hadoop. Saptarshi explains that

Using RHIPE, it is possible to write map-reduce algorithms using the R language and start them from within R. RHIPE is built on Hadoop and so benefits from Hadoop’s fault tolerance, distributed file system and job scheduling features. For the R user, there is rhlapply which runs an lapply across the cluster. For the Hadoop user, there is rhmr which runs a general map-reduce program.
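
To make the map-reduce idea concrete, here is a minimal sketch of the atom counting use case mentioned above, written in plain Python rather than Hadoop or RHIPE. It is only an illustration of the pattern: the molecular formulas stand in for parsed SD file records, and the names map_atoms, shuffle, reduce_atoms and map_reduce are hypothetical helpers, not part of any of the libraries discussed here.

```python
import re
from collections import defaultdict

# Hypothetical input: simple molecular formulas standing in for SD file records.
FORMULAS = ["C2H6O", "CH4", "C6H6"]

# One element symbol (e.g. "C" or "Cl") followed by an optional count.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def map_atoms(formula):
    """Map step: emit one (element, count) pair per element in a formula."""
    for symbol, digits in ELEMENT.findall(formula):
        yield (symbol, int(digits) if digits else 1)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_atoms(symbol, counts):
    """Reduce step: sum all counts emitted for one element."""
    return symbol, sum(counts)


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


atom_totals = map_reduce(FORMULAS, map_atoms, reduce_atoms)
print(atom_totals)
```

On a real cluster the map calls would run on different nodes and the grouping would be handled by the framework; here everything runs in one process so the three phases are easy to follow.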

Coming to bioinformatics, Michael C. Schatz recently published his CloudBurst paper. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes. It is implemented on Hadoop, which parallelizes execution across multiple compute nodes. Similarly, Brad Chapman posted a few examples of a MapReduce implementation of Generic Feature Format (GFF) parsing for Biopython using Disco, another implementation of the distributed MapReduce framework, written in Erlang and Python. I have been trying to implement a protein-protein interaction database using HBase, the Hadoop database (an open source implementation of Google’s BigTable) that runs on top of HDFS, and I will share it very soon.
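
As a rough illustration of how GFF parsing fits the MapReduce pattern, the sketch below counts feature types per chunk of lines (map) and then merges the partial counts (reduce). It uses only the Python standard library, not Disco or Biopython, and the sample GFF lines and the helper names chunks, map_chunk and merge are made up for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical GFF lines (tab-separated, nine columns), invented for illustration.
GFF_LINES = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=gene1",
    "chr1\tdemo\texon\t500\t900\t.\t+\t.\tParent=gene1",
    "chr2\tdemo\tgene\t50\t400\t.\t-\t.\tID=gene2",
]


def chunks(lines, size):
    """Split the input into fixed-size chunks, one per (imaginary) worker."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def map_chunk(lines):
    """Map step: count feature types (column 3) in one chunk of GFF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        counts[columns[2]] += 1
    return counts


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Each chunk is mapped independently, then the partial results are merged.
partials = map(map_chunk, chunks(GFF_LINES, 2))
feature_counts = reduce(merge, partials, Counter())
print(dict(feature_counts))
```

Because each chunk is processed without reference to any other, the built-in map could in principle be swapped for a distributed one; that independence is what makes this kind of parsing a natural fit for MapReduce.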
In summary, there are many new developments applying MapReduce to mainstream scientific research, although the work is still in its infancy and how much we can gain in performance and efficiency remains open to discussion.

Updates

  • You can find here and here how D.E. Shaw Research is implementing HiMach, a framework inspired by MapReduce for molecular dynamics. HiMach allows users to write trajectory analysis programs sequentially, and carries out the parallel execution of those programs automatically.
  • mpiBLAST in Amazon EC2 and RunBLAST
  • Let me know if you know of any other applications of MapReduce or MapReduce-inspired frameworks in science.
6 Responses
  1. May 6, 2009

    There’s also the work by DE Shaw which I wrote about last year http://mndoci.com/blog/2008/11/24/mapping-and-reducing-md-trajectories-with-himach/

  2. dissentingopinion
    May 6, 2009

    Remember, people, the Byzantine Empire always falls.

    Just write it in “C”, keep it simple and embarrassingly parallel and you’ll sleep at night.

    Writing it in R/Python/Perl whatever and praying that a government subsidized “gigagantic cloud” is betting on a long shot.

  3. May 6, 2009

    MapReduce is now reaching mainstream science: Most of you will be aware about MapReduce, a framework developed b.. http://tinyurl.com/c42p7d

  4. May 6, 2009

    Thanks for the comments. No one will disagree with that :-) but sometimes we need to write or learn something new, and for that sake it’s a fine idea.

  5. May 6, 2009

    @Deepak thanks for that link, that seems an interesting real time application

  6. May 14, 2009

    Thank you for the article, I had heard of this.
    Also, the BigTable paper is interesting, and I am looking forward to seeing your code. Are you writing it in Python?
