Before a search engine can retrieve any documents, those documents have to be organized. Just as a library systematically organizes its books so that readers can find them easily, a search engine must organize the pages it covers.
Of course, this task proves a bit difficult because of the heterogeneity of web pages. Should the page be organized according to its title or its content? How much of the content should be considered? Should pictures be considered or disregarded?
This process of organizing pages is called indexing. Some engines do this manually, which means they pay lots and lots of people to sit at computers and index as many pages as they can (which is not very many). Since this method is obviously expensive in both time and money, most engines opt for automatic indexing using web crawlers.
Web crawlers are computer programs that "crawl" the web, pull up webpages, consider the content, and index each page. Web crawlers can index from 3 to 10 million pages per day. Unfortunately, automatic indexing is not nearly as accurate as manual indexing. For example, webmasters can easily trick many web crawlers by inserting popular search words such as "football" or "movies" into a page and making them the same color as the background; this is referred to as spamming (Berry & Brown).
Once the web pages have been indexed, the engine can go back and retrieve them in response to a query from a user, similar to a reader searching a card catalog in a library.
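To make the index-then-retrieve idea concrete, here is a minimal sketch of one common indexing structure, an inverted index, which maps each word to the pages that contain it. This is only an illustration: the `pages` collection and the helper names `tokenize`, `build_index`, and `lookup` are made up for this example, and real crawlers and engines do far more work than this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def lookup(index, word):
    """Return the sorted page ids that contain the word (empty if none)."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical toy collection standing in for crawled pages.
pages = {
    "page1": "Football scores and football news",
    "page2": "New movies in theaters this week",
    "page3": "Movies about football",
}

index = build_index(pages)
print(lookup(index, "football"))  # ['page1', 'page3']
print(lookup(index, "movies"))    # ['page2', 'page3']
```

Notice that the index only records which words appear on a page. It has no idea whether the words are visible to a reader, which is one reason hidden keywords can fool a simple automatic indexer.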
Several different types of search engines have been created so far, each with its own advantages and disadvantages. The following table provides a quick overview of these types:
|Type of Search Engine||Description||Advantages||Disadvantages|
|Boolean||Uses connectors (AND, OR, NOT)||
|Probabilistic Model||Underlying algorithm guesses at relevance||
|Vector Space Model||Uses linear algebra to find connections in document collection||
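As a rough illustration of the Boolean row in the table, the connectors can be sketched with Python set operations. The `docs` collection and the `containing` helper are hypothetical and exist only for this example; this is one simple way to picture Boolean matching, not a description of any particular engine.

```python
# Hypothetical toy collection: each page id maps to the set of words it contains.
docs = {
    "page1": {"football", "scores", "news"},
    "page2": {"movies", "theaters", "news"},
    "page3": {"movies", "football"},
}


def containing(term):
    """Return the set of page ids whose word set includes the term."""
    return {page_id for page_id, words in docs.items() if term in words}


# Each connector corresponds to a set operation on the matching pages.
both = containing("football") & containing("movies")         # football AND movies
either = containing("football") | containing("movies")       # football OR movies
news_not_movies = containing("news") - containing("movies")  # news AND NOT movies

print(sorted(both))             # ['page3']
print(sorted(either))           # ['page1', 'page2', 'page3']
print(sorted(news_not_movies))  # ['page1']
```

A page either matches the query or it does not, so this sketch returns unordered sets with no notion of one page being more relevant than another.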
The benefits of the vector space model (VSM) far outweigh its computational expense. This is why Company Enterprises has decided to build a search engine based on VSM, and why the remainder of your training will focus on this model.
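As a first taste of the idea, here is a minimal sketch of vector space retrieval. It assumes each page and each query is turned into a vector of raw word counts, and that pages are ranked by the cosine of the angle between their vector and the query vector. Both choices are common but are assumptions made for this example, and the names `term_vector`, `cosine`, `rank`, and the toy `docs` collection are hypothetical.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection standing in for indexed pages.
docs = {
    "page1": "football scores football news",
    "page2": "new movies this week",
    "page3": "movies about football",
}

# The vocabulary fixes one vector dimension per distinct word in the collection.
vocabulary = sorted({word for text in docs.values() for word in text.lower().split()})


def rank(query):
    """Return (page id, score) pairs, best match first."""
    q = term_vector(query, vocabulary)
    scores = {
        page_id: cosine(q, term_vector(text, vocabulary))
        for page_id, text in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for page_id, score in rank("football movies"):
    print(page_id, round(score, 3))
```

For the query "football movies", page3 ranks first because it contains both words and little else, page1 comes next, and page2 comes last. Unlike a Boolean match, every page receives a score, so results can be ordered by how closely they point in the same direction as the query.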
This material is based upon work supported by the National Science Foundation under Grant No. 0546622. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.