Everyone uses web crawlers, at least indirectly! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers (also known as spiders, robots, or wanderers) are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found. However, you might want to use a crawler directly. You might even want to write your own! Here are some possible reasons:
This article explains what web crawlers are. It includes a web-crawling demo program, written in the Java programming language, that you can run from your browser. The demo traverses the Web automatically, shows a running list of files it has found, and updates the list each time it finds a new one. You can specify what type of file you want to find. The Java language source code for this demo application is provided as a programming example.
How Web Crawlers Work
Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is to automate the process of following links.
Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes, such as those mentioned in the previous section.
Muscle Fish uses a crawler to search the Web for audio files. This is a straightforward task, as shown by the demo in the next section. It turns out that searching for audio files is not very different from searching for any other kind of file. On the other hand, indexing audio is anything but straightforward. Most search engines, if they handle audio at all, index only textual information that's associated with the sound file. Muscle Fish's approach is to acoustically analyze the audio itself.
This feature lets you search for sound files based on how they actually sound; you're not limited to searching for whatever words happen to be located nearby on the same web page. (A forthcoming article and demo program will show this feature.)
A Web-Crawling Demo Program
The simple application shown below crawls the Web, searching for a specified type of file.
To run the demo, follow these steps:
If you let the search run without stopping, it will eventually stop on its own once it has found 50 files. At this point, it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.) The application will also stop automatically if it encounters a dead end, meaning that it has traversed all the files that are directly or indirectly available from the starting position you specified. If this happens, the application reports "done." The next time you click Search, the list of files gets cleared, and the search process starts over again.
Notice that there's a pull-down menu that lets you specify what type of file you want to find. The default is HTML text files. You can also choose "audio/basic," "audio/au," "audio/aiff," "audio/wav," "video/mpeg," or "video/x-avi."
A Look at the Code
Take a look at the Java-language source code for this demo. The code occupies fewer than 400 lines, including comments. It is a testament to the JDK's elegance that this application took only a few person-hours to write from scratch. (Muscle Fish had never written a crawler before, nor was any pre-existing web-crawler code borrowed or studied.) Here's a pseudocode summary of the algorithm:
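A crawl loop of the kind described in this article can be sketched in Java roughly as follows. This is a minimal illustration, not the demo's actual source code: the class and method names (CrawlSketch, extractLinks, crawl), the example.com URLs, and the in-memory map that stands in for real HTTP requests are all assumptions made for the example. The searchLimit parameter mirrors the role of the SEARCH_LIMIT constant mentioned above. Links are found with a simple regular expression, which is only an approximation; a production crawler would use a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a breadth-first crawl loop; not the demo's source. */
class CrawlSketch {
    // Simplified pattern for href="..." attributes (fragment part is dropped).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URLs of the links found in one page. */
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the URL of the page itself.
                links.add(URI.create(baseUrl).resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip links that are not valid URIs.
            }
        }
        return links;
    }

    /**
     * Crawls from start and returns the pages found, at most searchLimit of them.
     * The pages map (URL to HTML) is an assumption that replaces real HTTP requests.
     */
    static List<String> crawl(String start, Map<String, String> pages, int searchLimit) {
        Deque<String> toSearch = new ArrayDeque<>(); // URLs waiting to be parsed
        Set<String> seen = new HashSet<>();          // URLs already queued once
        List<String> found = new ArrayList<>();
        toSearch.add(start);
        seen.add(start);
        while (!toSearch.isEmpty() && found.size() < searchLimit) {
            String url = toSearch.remove();
            String html = pages.get(url);
            if (html == null) {
                continue; // page could not be fetched
            }
            found.add(url);
            for (String link : extractLinks(url, html)) {
                if (seen.add(link)) { // add() returns false for URLs seen before
                    toSearch.add(link);
                }
            }
        }
        return found; // an empty queue corresponds to the "dead end" case
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/index.html",
                  "<a href=\"a.html\">A</a> <a href=\"/b.html\">B</a>");
        pages.put("http://example.com/a.html", "<a href=\"index.html\">Home</a>");
        pages.put("http://example.com/b.html", "no links here");
        for (String url : crawl("http://example.com/index.html", pages, 50)) {
            System.out.println(url);
        }
    }
}
```

The set of seen URLs is what keeps the loop from revisiting pages that link back to each other. Replacing the map lookup with an HTTP request, and checking the returned content type against the file type being searched for, would bring the sketch closer to what the demo is described as doing.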
This demo tries to respect the robots exclusion standard, meaning that it avoids sites where it's unwelcome. Any site can exclude web crawlers from all or part of its filesystem by putting certain statements in a file called robots.txt.
Where to Go from Here
This simple programming example might have given you some ideas about how to write a full-fledged web crawler. Muscle Fish can't provide technical support for running this demo program or for writing crawlers. However, there are various resources on the Web for people interested in crawlers. The Web Robots Pages is a good starting point, and it contains links to other important sites.
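As one starting point for experimenting, the robots exclusion check mentioned earlier might be sketched as follows. This is an illustration under stated assumptions, not the demo's code: the names RobotsSketch, disallowedPrefixes, and isAllowed are hypothetical, the exclusion file's contents are passed in as a string rather than fetched from a site, and only the rules addressed to all crawlers ("User-agent: *") with "Disallow" path prefixes are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a robots exclusion check; not the demo's source. */
class RobotsSketch {
    /** Collects the Disallow path prefixes that apply to all crawlers. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToUs = false;  // true inside a "User-agent: *" record
        boolean inAgentLines = false; // true while reading consecutive User-agent lines
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    appliesToUs = false; // a new record starts here
                }
                inAgentLines = true;
                if (value.equals("*")) {
                    appliesToUs = true;
                }
            } else {
                inAgentLines = false;
                // An empty Disallow value excludes nothing, so it is skipped.
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns true if the path does not start with any disallowed prefix. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would typically run such a check once per site, before requesting any page from it, and skip every URL whose path is not allowed.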
Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold are members of Muscle Fish, LLC, a software consulting firm in Berkeley, California. Muscle Fish specializes in audio and music technology, and produces software that searches for sound based on its acoustical content.