1. Exploring the Deep Web
Brunvand, Amy, Kate Holvoet, Peter Kraus, and David Morrison. "Exploring the Deep Web." PPT--Download. 2005. University of Utah Government Doc. Libraria. 31 Oct. 2007
- The Deep Web is the “hidden” part of the Web, inaccessible to conventional search engines and, consequently, to most users.
- Sometimes called the “Invisible Web,” it includes information contained in searchable databases that can be reached only by a direct query or a specialized search engine.
- This information is contained in dynamic webpages that are generated on request from a database. Such pages have no persistent or static URL.
- The Surface Web, by contrast, consists of webpages with static or persistent URLs that can be detected by a search engine crawler.
- Once detected, the URL is added to that search engine’s database and can appear as a result when that search engine is queried.
- 550 billion documents
- 500 times the content of the surface Web
- Google has identified 1.2 billion documents
- An Internet search typically covers about 0.03% (1/3000) of available content.
- The Deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the Surface Web.
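The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide, which are not independently verified here; the variable names are illustrative.

```python
# Figures quoted in the slides above (taken as given, not independently verified).
deep_web_docs = 550e9      # documents in the Deep Web
google_docs = 1.2e9        # documents Google has identified
deep_web_tb = 7500         # terabytes of information in the Deep Web
surface_web_tb = 19        # terabytes of information in the Surface Web

# Ratios implied by those figures.
doc_ratio = deep_web_docs / google_docs
size_ratio = deep_web_tb / surface_web_tb
search_share_pct = 1 / 3000 * 100  # 1/3000 expressed as a percentage

print(f"Deep Web documents per Google-identified document: {doc_ratio:.0f}")
print(f"Deep Web terabytes per Surface Web terabyte: {size_ratio:.0f}")
print(f"Share of content covered by a typical search: {search_share_pct:.3f}%")
```

Both ratios come out in the hundreds (roughly 458 and 395), the same order of magnitude as the "500 times" figure quoted above.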
- Searchable databases
- Downloadable files & spreadsheets
- Image and multi-media files
- Data sets
- Various file formats such as .pdf
- Lots of government information
- A search engine “Spider” or “Crawler” seeks out webpage documents by following one hyperlink to the next, adding each page to its catalog as it crawls along. This requires that each page have a static or persistent URL.
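The crawling process described above can be sketched on a tiny in-memory "Web". This is a minimal illustration, not how any particular search engine works; the `PAGES` dictionary, the `example.madeup` URLs, and the `crawl` function are all hypothetical.

```python
from collections import deque

# Hypothetical miniature Web: each static URL maps to the links found on that page.
PAGES = {
    "http://example.madeup/": ["http://example.madeup/a.html", "http://example.madeup/b.html"],
    "http://example.madeup/a.html": ["http://example.madeup/c.html"],
    "http://example.madeup/b.html": ["http://example.madeup/"],
    "http://example.madeup/c.html": [],
    "http://example.madeup/orphan.html": [],  # exists, but no page links to it
}

def crawl(start_url, pages):
    """Follow hyperlinks breadth-first and return the catalog of URLs visited."""
    catalog = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # link target does not exist; nothing to catalog
        catalog.append(url)
        for link in pages[url]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalog

catalog = crawl("http://example.madeup/", PAGES)
print(catalog)
```

The page `orphan.html` never enters the catalog because no hyperlink leads to it. A crawler of this kind can only find pages that something else links to.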
- People, not an automated software program, collect and index URLs in the search engine’s catalog.
- Higher quality sources
- Selected and organized by subject experts
- Dynamic display
- Customized data sets
- Some data is visual, and not word searchable
- Regular search engines miss vast resources available in the Deep Web
- A search conducted in a Deep Web site on a specific subject will generally yield a greater number of more relevant results than the same search run in a general search engine.
- I have a collection of information about famous people. It contains names, birthdays, claims to fame, and other details.
- The information is kept in a searchable database called “Famous People”.
- I have a webpage with a search feature that lets me search my database.
- This webpage has a unique, unchanging Web address. This is known as a static or persistent URL.
- http://www.famouspeople.madeup/blurbs.php?famous=actor=1347%1583=
- The results of the search are returned on a webpage similar to the one shown at the right.
- The URL shown below reflects both the criteria used in the search and the location in the database where the information was found.
- The URLs shown below are known as dynamic URLs. The information displayed on each webpage is based on a query or search of the database.
- These pages will not be picked up and indexed by search engine crawlers.
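To make the idea of a dynamic page concrete, here is a minimal sketch of a results page being generated on request from the query string of a URL. The script name `blurbs.php` and the `famous` parameter come from the made-up URL in the slides. The simplified query string, the placeholder records, and the function names are assumptions for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical stand-in for the "Famous People" database; records are placeholders.
FAMOUS_PEOPLE = [
    {"name": "Person A", "category": "actor"},
    {"name": "Person B", "category": "author"},
    {"name": "Person C", "category": "actor"},
]

def is_dynamic(url):
    """Treat a URL as dynamic if it carries a query string."""
    return bool(urlsplit(url).query)

def render_results(url):
    """Build a results page from the search criteria embedded in a dynamic URL."""
    query = parse_qs(urlsplit(url).query)
    category = query.get("famous", [""])[0]
    matches = [p["name"] for p in FAMOUS_PEOPLE if p["category"] == category]
    lines = [f"Results for '{category}':"] + [f"- {name}" for name in matches]
    return "\n".join(lines)

page = render_results("http://www.famouspeople.madeup/blurbs.php?famous=actor")
print(page)
```

The page text does not exist until the request is made. A different value in the query string produces a different page from the same script.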
- As in the example above, once the URL of the result of a database query is put on a static webpage, it can be discovered by a search engine crawler and indexed into that search engine.
- Once this happens, it can be called up by that regular search engine even though it was once only Deep Web content.
- Let’s look at an example using the Famous People Database.
- Once a report is retrieved from the Famous People database, the URL for that report can be used as a link on a static webpage.
- The static page can be indexed by a search engine. Because it contains a link to a Deep Web resource, Deep Web content appears on the surface from time to time.
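This surfacing effect can be shown with a small reachability check. In this sketch the URLs, the `links` dictionary, and the `reachable` function are all hypothetical. The report URL is invisible to a link-following crawler until someone places it on a static page.

```python
def reachable(start, links):
    """Return every URL a crawler can reach by following links from start."""
    found, stack = set(), [start]
    while stack:
        url = stack.pop()
        if url in found:
            continue
        found.add(url)
        stack.extend(links.get(url, []))
    return found

START = "http://directory.madeup/"                                  # hypothetical crawl starting point
FAN_PAGE = "http://fansite.madeup/favorites.html"                   # hypothetical static page
REPORT = "http://www.famouspeople.madeup/blurbs.php?famous=actor"   # hypothetical dynamic report URL

# Initially, no static page links to the database report.
links = {START: [FAN_PAGE], FAN_PAGE: []}
before = reachable(START, links)

# Someone copies the report URL onto the static page as a link.
links[FAN_PAGE].append(REPORT)
after = reachable(START, links)

print(REPORT in before, REPORT in after)
```

Only the link on the static page changes between the two runs, and that alone is enough to bring the report within the crawler's reach.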
- Every search engine has a unique set of rules regarding how much coverage to give any given website. Some only index the first or “home” page, while others drill down into subsequent layers.
- Search engines also vary on how often their crawlers will return to sites to update entries.
- No single search engine indexes the entire Web or even comes close to a large percentage of it!
- “Would you use an encyclopedia to look up a phone number?” Chris Sherman of About.com asks.
- When using a Deep Web index, such as CompletePlanet, Lycos or DirectSearch, you are first searching through a collection of databases, NOT looking for a specific piece of information.
- Each database is its own searchable collection of information. Once you find one you want to search, you will then conduct another search within that particular database to find the information you want.
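The two-stage search described above can be sketched as follows. The index contents, database names, and function names are hypothetical placeholders. Real Deep Web indexes such as those named above have their own interfaces, which this sketch does not attempt to reproduce.

```python
# Hypothetical Deep Web index: database name -> (subject description, records).
INDEX = {
    "Famous People": (
        "biographies of actors, authors and athletes",
        ["Person A, actor", "Person B, author"],
    ),
    "Government Reports": (
        "government documents and statistics",
        ["Report X, census", "Report Y, budget"],
    ),
}

def find_databases(subject, index):
    """Stage 1: return the names of databases whose description mentions the subject."""
    return [
        name
        for name, (description, _records) in index.items()
        if subject.lower() in description.lower()
    ]

def search_database(name, term, index):
    """Stage 2: search inside one chosen database for a specific term."""
    _description, records = index[name]
    return [record for record in records if term.lower() in record.lower()]

databases = find_databases("actors", INDEX)           # first search: which database?
hits = search_database(databases[0], "actor", INDEX)  # second search: inside that database
print(databases, hits)
```

The first call returns a database to search, not an answer. The specific information only appears after the second search, run inside the chosen database.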