Deep Web Seminar Report


1. Exploring the Deep Web
Brunvand, Amy, Kate Holvoet, Peter Kraus, and David Morrison. "Exploring the Deep Web." PPT--Download. 2005. University of Utah Government Doc. Libraria. 31 Oct. 2007.
Kuhler, Denise. "Mining the Deep Web-With specialty search engines." University of Missouri System-. Jan. 2004. MOREnet. 31 Oct. 2007.
2. What is the Deep Web?
  • The Deep Web is the “hidden” part of the Web.
  • It is inaccessible to conventional search engines and, consequently, to most users.
  • Sometimes called the “Invisible Web”, it includes information contained in searchable databases that can only be reached by a direct query or a specialized search engine.
  • The information is contained in dynamic webpages that are generated upon request to a database. Such pages have no persistent or static URL.
3. The Surface Web
  • Webpages with static or persistent URLs that can be detected by a search engine crawler.
  • Once detected, the URL is added to that search engine’s database and can appear as a result of a search run on that engine.
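The detection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real search engine is built: the `LINKS` table is a hypothetical in-memory stand-in for the Web, the `example.madeup` URLs are invented, and `crawl` is an illustrative name. The crawler starts from one known page and can only reach pages that some already-visited page links to.

```python
from collections import deque

# Hypothetical in-memory "Web": each static URL maps to the URLs it links to.
LINKS = {
    "http://www.example.madeup/index.html": [
        "http://www.example.madeup/a.html",
        "http://www.example.madeup/b.html",
    ],
    "http://www.example.madeup/a.html": ["http://www.example.madeup/b.html"],
    "http://www.example.madeup/b.html": [],
    # No page links here, so a crawler never discovers it.
    "http://www.example.madeup/orphan.html": [],
}


def crawl(seed, links):
    """Follow hyperlinks breadth-first from `seed` and return the URLs indexed."""
    index = []          # the search engine's catalog, in discovery order
    seen = {seed}       # URLs already queued, to avoid indexing a page twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        index.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


catalog = crawl("http://www.example.madeup/index.html", LINKS)
for url in catalog:
    print(url)
```

The page `orphan.html` never enters the catalog because no visited page links to it, which is the same reason database-generated pages stay out of reach.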
4. How big is the Deep Web?
  • 550 billion documents
  • 500 times the content of the surface Web
  • Google has identified 1.2 billion documents
  • An Internet search typically searches about 0.03% (roughly 1/3000) of available content.
  • The Deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the Surface Web.
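As a rough check on how these figures relate to one another, the ratios can be worked out directly. The sketch below only divides the numbers quoted on this slide; the variable names are illustrative, and the inputs are the presentation's estimates rather than independently verified measurements.

```python
# Figures quoted on the slide (estimates, taken as given).
deep_web_terabytes = 7_500
surface_web_terabytes = 19
deep_web_documents = 550e9      # 550 billion
google_documents = 1.2e9        # 1.2 billion

# Ratio of Deep Web to Surface Web by storage size.
size_ratio = deep_web_terabytes / surface_web_terabytes

# Ratio of Deep Web documents to documents identified by Google.
doc_ratio = deep_web_documents / google_documents

# Share of available content covered by a typical search, as a percentage.
searched_percent = (1 / 3000) * 100

print(f"Size ratio: about {size_ratio:.0f} to 1")
print(f"Document ratio: about {doc_ratio:.0f} to 1")
print(f"Share searched: about {searched_percent:.3f}%")
```

Both ratios come out in the hundreds, the same order of magnitude as the "500 times" figure, though the measures are different and do not match exactly.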
5. What’s in the Deep Web?
  • Searchable databases
  • Downloadable files & spreadsheets
  • Image and multi-media files
  • Data sets
  • Various file formats such as .pdf
  • Lots of government information
6. How is the Deep Web different from the Surface Web?
Surface webpages are added to search engines in one of two ways:
  • A search engine “spider” or “crawler” seeks out webpage documents by going from one hyperlink to another and adding each page to its catalog as it crawls along. This requires that each page have a static or persistent URL.
  • People, not an automated software program, collect and index URLs in the search engine’s catalog.
7. Why use the Deep Web?
  • Higher quality sources
    • Selected and organized by subject experts
  • Dynamic display
  • Customized data sets
  • Some data is visual, and not word searchable
  • Regular search engines miss vast resources available in the Deep Web
  • A search conducted in a Deep Web site on a specific subject will generally yield a greater number of more relevant results than the same search run in a general search engine.
8. Famous people
  • I have a collection of information about famous people. It contains their names, birthdays, claims to fame and other details.
  • The information is kept in a searchable database called “Famous People”.
F_NAME     L_NAME    BIRTHDAY          FAME
Bill       Cosby     1937              Entertainer, actor, author
Benjamin   Franklin  January 17, 1706  Author, publisher, scientist, statesman
Paul       Newman    January 26, 1925  Actor, Humanitarian
Christine  Everett   1954              Tennis Champion
9. Static URL
  • I have a webpage with a search feature that lets me search my database.
  • This webpage has a unique, unchanging Web address. This is known as a static or persistent URL.
http://www.famouspeople.madeup/index.html
10. Search results
  • The results of the search are returned on a webpage.
  • The URL shown below reflects both the criteria used in the search and the location in the database where the information was found.
  • http://www.famouspeople.madeup/blurbs.php?famous=actor=1347%1583=
Each result links to a report generated by the database containing information about that famous person.
11. Individual report
http://www.famouspeople.madeup/blurbs.php?famous=bill%cosby=13473=
The report that the database generates for each specific person will have a dynamic URL.
12. Dynamic URL
  • The URLs shown below are known as dynamic URLs. The information displayed on each webpage is based on a query or search of the database.
  • These pages will not be picked up and indexed by search engine crawlers.
http://www.famouspeople.madeup/blurbs.php?famous=actor=1347%1583=
http://www.famouspeople.madeup/blurbs.php?famous=bill%cosby=13473=
13. Deep Web content occasionally shows up on the surface. Why?
  • As in the example above, once the URL of the result of a database query is put on a static webpage, it can be discovered by a search engine crawler and indexed into that search engine.
  • Once this happens, it can be called up by that regular search engine even though it was once only Deep Web content.
  • Let’s look at an example using the Famous People Database.
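A minimal Python sketch of that example follows. It assumes a hypothetical static page (`STATIC_PAGE_HTML`) that happens to link to one of the made-up Famous People report URLs, and it treats "has a query string" as a simple stand-in for "dynamic URL", which is a simplification. `LinkCollector` and `is_dynamic` are illustrative names; only the standard library's `html.parser` and `urllib.parse` modules are used.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, as a crawler reading the page would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_dynamic(url):
    """Assumption: a URL carrying a query string was generated by a database query."""
    return bool(urlsplit(url).query)


# Hypothetical static page that links to a database-generated report.
STATIC_PAGE_HTML = """
<html><body>
<a href="http://www.famouspeople.madeup/index.html">Famous People home</a>
<a href="http://www.famouspeople.madeup/blurbs.php?famous=bill%cosby=13473=">Bill Cosby report</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(STATIC_PAGE_HTML)

# Dynamic URLs the crawler learned about only because a static page links to them.
surfaced = [url for url in collector.links if is_dynamic(url)]
for url in surfaced:
    print("Deep Web link found on a static page:", url)
```

The crawler never queried the database itself. It found the report URL only because someone had copied it into a page the crawler could already reach.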
14. Bringing the Deep Web to the surface
  • Once a report is retrieved from the Famous People database, the URL for that report can be used as a link on a static webpage.
  • The static page can be indexed by a search engine. Since it contains a link to a Deep Web resource, the Deep Web will appear on the surface from time to time.
Static page links to a Deep Web resource.
15. Search engines sometimes miss Surface Web content
  • Every search engine has a unique set of rules regarding how much coverage to give any given website. Some only index the first or “home” page, while others drill down into subsequent layers.
  • Search engines also vary on how often their crawlers will return to sites to update entries.
  • No single search engine indexes the entire Web, or even a large percentage of it.
16. Using the right tool for the job
  • “Would you use an encyclopedia to look up a phone number?” Chris Sherman of About.com asks.
He continues, “Why attempt to pull a needle from a large haystack with material from all branches of knowledge when a specialized tool allows you to limit your search in specific ways as it relates to the type of information being searched?”
17. Searching Deep Web vs. Surface Web
  • When using a Deep Web index, such as CompletePlanet, Lycos or DirectSearch, you are first searching through a collection of databases, NOT looking for a specific piece of information.
  • Each database is its own searchable collection of information. Once you find one you want to search, you will then conduct another search within that particular database to find the information you want.
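The two-step search described above can be sketched as follows. This is a toy illustration only: the `DIRECTORY` contents, the "Tide Tables" entry and the function names are hypothetical, and a real Deep Web index would send the second search to the chosen database's own search form rather than to an in-memory list.

```python
# Hypothetical Deep Web index: a listing of databases, each with a short
# description and (for this sketch only) its records held in memory.
DIRECTORY = {
    "Famous People": {
        "description": "Names, birthdays and claims to fame of famous people",
        "records": [
            "Bill Cosby: entertainer, actor, author",
            "Benjamin Franklin: author, publisher, scientist, statesman",
            "Paul Newman: actor, humanitarian",
        ],
    },
    "Tide Tables": {
        "description": "Daily high and low tide predictions for coastal stations",
        "records": ["Station 1: high tide 06:12", "Station 2: high tide 07:40"],
    },
}


def find_databases(keyword, directory):
    """Step 1: search the index for databases whose description matches."""
    keyword = keyword.lower()
    return [
        name
        for name, entry in directory.items()
        if keyword in entry["description"].lower()
    ]


def search_database(name, keyword, directory):
    """Step 2: search inside the chosen database for the actual information."""
    keyword = keyword.lower()
    return [r for r in directory[name]["records"] if keyword in r.lower()]


matches = find_databases("famous", DIRECTORY)
print("Databases found:", matches)
for record in search_database(matches[0], "actor", DIRECTORY):
    print(record)
```

The first call returns databases, not facts. Only the second call, run inside one chosen database, returns the information itself.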
18. CompletePlanet: http://completeplanet.com
CompletePlanet is a listing of search engines and databases. When you type in a keyword, you are looking for databases or search engines containing that keyword.