by Steve Meloan
Related Story: Exploring the New Frontier, Part 2
In short, genomics, and the companion study of proteomics -- the mapping and understanding of the proteins coded for in a given genome -- present genetic researchers with computational tasks on a scale never before seen. Without the aid of massive databases and networked computational systems, it would effectively be impossible to process and interpret the avalanche of biological data now being generated on an almost daily basis. And with genomic research centers scattered across the globe, in both private and academic settings, using myriad hardware platforms and operating systems (on everything from desktops to supercomputers), the secure, network-aware, cross-platform power of Java technology is increasingly proving indispensable to this ongoing task of reverse-engineering our inner workings.
Physiome Sciences, Inc.'s computer-based biological simulation technologies, and Bioinformatics Solutions Inc.'s PatternHunter, a genomic search and analysis facility, are just two examples of the growing adoption of Java technology in the fields of bioinformatics and computational biology.
Part 1: Bioinformatics Solutions Inc./PatternHunter
Using the cross-platform and object-oriented ease of development of the Java programming language, the PatternHunter program has taken the science of genomic search and analysis a quantum leap forward. PatternHunter, now owned by Bioinformatics Solutions Inc., processes in hours what earlier facilities took days or weeks to accomplish. PatternHunter delivers high-end genomic analysis to the computer desktop. Because it is written in the Java language, it is also cross-platform compatible.
The Old Way
In the early days of genomic study, researchers dealt with smaller, and more specific, processing tasks. "The goal of searches in those days was to compare a short sequence against a long sequence, to find whether the short sequence was contained within the long sequence," says Ming Li, UCSB computer science professor, and co-developer of PatternHunter. With the completion of the human genome, however, such research goals began to expand computationally. In their quest to decipher the base pair code of various genomes, researchers sought to locate approximate similarities (or homologies) between either entire chromosomes within a given genome, or between the genomes of completely separate species. Such analysis is sometimes referred to as comparative genomics.
Scalability Problems
But these enormous computational tasks began to present scalability problems for many of the previously established genomic search facilities -- both in terms of processing time and memory consumption. Long-established programs like FASTA, Blast, and MegaBlast began to bog down when faced with the growing processing challenges of comparative genomics. A further scalability problem was one of search accuracy. Genomic searching using smaller 'seed' sizes meant processing tasks on the order of billions of comparisons -- while simultaneously risking a flood of meaningless sequence similarities.
Yet larger seed sizes, while computationally less intensive, risked missing important approximate similarities (distant homologies).
With these diverse challenges in mind, Ming Li, and a series of post-doctoral researchers, began searching for newer and better ways to facilitate genomic search and analysis. The first version of PatternHunter was developed at the University of Waterloo by Dr. Bin Ma, under the guidance of Dr. Li. "That initial version was completed around July of 2000," says Li, "and was written in C++."
The goal of the PatternHunter program was to address many of the increasingly apparent failings of previous facilities when confronted with large sequence comparison tasks. "We wanted to be the best in quality, and the best in speed," says Li, "to be scalable to analyzing complete genomes against complete genomes. To find patterns and homologies at that large a scale would have taken days or years for other programs to accomplish. Yet there was an increasing need for that."
Ma began developing and refining innovative new search algorithms -- ones that used non-consecutive seeds. "In the past, people tried to search for large, contiguous blocks," says Li. "But you miss a lot of potentially interesting matches that way. With our seed model, we leave certain 'don't care' spaces. It turns out that a very simple change like that makes a big difference in terms of processing speed and accuracy."
The Java Technology Way
Next, Li engaged another post-doctoral student, John Tromp, to take the initial version of PatternHunter to the next development level, revamping many of the behind-the-scenes data structures. To Li's amazement, Tromp urged that this new version of the program be written in the Java language. "I told him -- 'But if you do it in [the] Java [programming language], it will be too slow,'" says Li.
Tromp assured his advisor otherwise, maintaining that the Java language's strong typing, ease of development and debugging, along with facilities such as Sun's Java HotSpot technology, would all make for the best possible version of PatternHunter. "John said that if you want to develop the best program, start with Java [technology]," reports Li. "And he did it! In a half year, working together with Bin, he developed the best program of its type in the world. It uses very little memory, and is scalable to entire genomes. No other program can say that. And the program also offers many times the homology finding power of other such facilities."
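The non-consecutive seed idea that Li describes can be made concrete with a toy example. What follows is a minimal sketch under stated assumptions, not PatternHunter's actual code: the class name `SpacedSeedSketch`, the methods `key` and `findHits`, the hash-map index, the mask `110101`, and the sample sequences are all hypothetical and chosen only for illustration. A mask is a string in which `1` marks a position that must match and `0` marks a "don't care" position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of seed matching with "don't care" positions (hypothetical, not PatternHunter code). */
final class SpacedSeedSketch {

    /** Builds a lookup key from the window at 'start', keeping only the bases under a '1' in the mask. */
    static String key(String seq, int start, String mask) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < mask.length(); i++) {
            if (mask.charAt(i) == '1') {
                sb.append(seq.charAt(start + i));
            }
        }
        return sb.toString();
    }

    /** Returns {queryPos, targetPos} pairs whose windows agree at every '1' position of the mask. */
    static List<int[]> findHits(String query, String target, String mask) {
        // Index every window of the target by its masked key.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int t = 0; t + mask.length() <= target.length(); t++) {
            index.computeIfAbsent(key(target, t, mask), k -> new ArrayList<>()).add(t);
        }
        // Look up every window of the query in that index.
        List<int[]> hits = new ArrayList<>();
        for (int q = 0; q + mask.length() <= query.length(); q++) {
            List<Integer> positions = index.get(key(query, q, mask));
            if (positions == null) {
                continue;
            }
            for (int t : positions) {
                hits.add(new int[] {q, t});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String target = "ACGTCAG";
        String query = "ACGACAG"; // differs from target only at index 3 (T vs. A)

        // Contiguous seed of weight 4: every window covers the mismatch.
        System.out.println("contiguous 1111 hits: " + findHits(query, target, "1111").size());   // prints 0
        // Spaced seed of weight 4: the window at index 1 skips over the mismatch.
        System.out.println("spaced 110101 hits: " + findHits(query, target, "110101").size());   // prints 1
    }
}
```

Both masks in this example require four matching bases. The single mismatch falls inside every contiguous window, so the contiguous seed finds nothing, while the spaced window that starts at index 1 treats the mismatched base as a "don't care" position and still reports a hit. A real tool would go on to extend and score each hit; that step is omitted here.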
While much of the power of this second generation of PatternHunter is due to its advanced algorithm and data structures, Li and Tromp strongly credit the Java language with helping to facilitate these design goals. "In C, you work slow, and you make errors," says Li. "And it's difficult to debug. With [the] Java [programming language], you write clean, compact code, without introducing a lot of bugs. The Java language version of PatternHunter is just 40 KB -- only 1% the size of Blast, while offering a large portion of its functionality."
"I simply felt that I could be more productive using [the] Java [programming language]," adds Tromp, who used IBM's VisualAge for Java during his development. "The language forces you to be more methodical, and to adopt a structured programming approach. And it also helps you to avoid many common coding mistakes, like running beyond array boundaries."
Quite simply, Tromp recognized that his choice of algorithm and data structures was of paramount importance to the program's ultimate performance goals. "And whatever language helps you to express your algorithms and data structures most clearly, is the language you should use," says Tromp. "For me, that language was [the] Java [programming language]."
Meeting Industry Needs
Beyond its small size and memory footprint, the Java language-based version of PatternHunter is also a quantum leap forward in terms of processing power and accuracy. The program is now being used in the work to sequence the mouse genome. "The mouse genome is somewhat similar to the human genome," explains Tromp. "Using PatternHunter, they can take small sequenced portions of the mouse genome and compare it to the human genome in order to get clues as to how those pieces should be assembled."
"They have 16 million reads of the mouse genome, and each read is about 500 bases long," says Li. "Altogether, that's about 9 billion bases -- a 3x coverage of the mouse genome, which is 3 billion bases in total length.
They need to compare those readings of the mouse genome against the human genome. So that amounts to comparing 9 billion base pairs against 3 billion base pairs -- a massive job. At similar sensitivities, to find all significant homologies, other programs would take many years to finish this task. But PatternHunter takes only about 20 days, on a single PC."
PatternHunter's cross-platform compatibility is also a big plus in the diverse hardware and platform realms of biotechnology. "These labs typically use everything from Windows, to Solaris, to Macs, to Linux," says Tromp. "We can distribute PatternHunter as a single JAR file, and people can run it immediately. There are no installation worries." "PatternHunter runs on any computer, on any platform, and on any genome," adds Li, enthusiastically. "Any genome, anywhere!"
PatternHunter "Swings"
The basic PatternHunter program uses only the core elements of the Java language, without any of the related GUI classes. Both the program's input and output data are simply the ASCII "A," "C," "T," and "G" representations of the genetic alphabet. But interpreting results in this format can be more than a little difficult, particularly when dealing with potentially millions of characters. For this reason, the next phase of PatternHunter's evolution was the addition of a graphics-rich GUI front end, using Java technology's Swing component set. Li enlisted one of his graduate students, Lawrence Miller, for this task. "The Swing-based utility program I'm developing with Dr. Li sits at a higher application level," says Miller, "allowing geneticists to better visually interpret the data coming out of PatternHunter."
In addition, red markers along each source and object bar indicate sequence areas with scientific literature annotation information -- such as the fact that they are known genes. "A red dot indicates an annotation from some literature source such as GenBank," says Miller. "The unique NCBI (National Center for Biotechnology Information) ID number for that information can then be accessed from within the program." Additional windows display the actual size ranges represented by the various colors, as well as the raw ASCII data of the given alignments.
Finally, sliders on both sets of screens allow for variable filtering of the visuals, based upon both the length of alignments displayed, and the "score," which is a mathematical indication of the relative degree of similarity between two regions. "When you adjust the sliders, you can see the lines melting away, or coming back into view," says Miller. "It gives you a better visual sense of the associations."
Miller has nothing but praise for the design of the Swing component set. "The screens were approximately two man-months' worth of work," he says, "and a tremendous learning opportunity. I was already an experienced Java programmer, but I hadn't dealt much with Swing. Yet it was extremely intuitive in its design. I looked over the API, and it was sufficiently consistent with the rest of the Java language that I was able to simply do the things I needed to do."
The Future
For now, PatternHunter continues to be utilized in the assembly phase of the mouse genome project. Meanwhile, Bioinformatics Solutions and various genomic industry firms are debating the terms of using and licensing the program.
Part 2 of this series describes Physiome Sciences, Inc.'s computer-based biological simulation technologies. It will appear as a feature story on java.sun.com next month.
See Also
Exploring the New Frontier, Part 2
Bioinformatics Solutions Inc.
PatternHunter Product Information
Java Foundation Classes/Swing Components (JFC/Swing)
Java HotSpot Technology
IBM's VisualAge for Java
GenBank