Monday, January 11, 2010

HOW GOOGLE WORKS

Google runs on a distributed network of thousands of computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations are performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

• Googlebot, a web crawler that finds and fetches web pages.
• The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
• The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

1. Googlebot
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with a web browser. Googlebot can request thousands of different pages simultaneously. Googlebot finds pages in two ways: through an Add URL form, www.google.com/addurl.html, and by following the links it finds while crawling the web.
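The link-following side of this can be sketched in a few lines. This is only an illustration under stated assumptions, not Googlebot's actual code: the toy in-memory web (TOY_WEB), the example URLs, and the function names crawl and fetch_links are all hypothetical. The seed list stands in for URLs submitted through the Add URL form.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links):
    """Breadth-first crawl: visit the seeds, then every page reachable by links."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


pages = crawl(["a.example/"], fetch_links)
print(pages)
```

A real crawler would fetch many pages at once, whereas this sketch visits them one at a time to keep the traversal easy to follow.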
To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls are known as fresh crawls. Newspaper pages are downloaded daily, while pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl, in which Googlebot harvests links from every page it encounters to cover broad reaches of the web. The combination of the two types of crawls allows Google both to make efficient use of its resources and to keep its index reasonably current.
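One way to picture "a rate roughly proportional to how often the pages change" is a priority queue keyed by each page's next due time. The sketch below is an assumption-laden illustration, not Google's scheduler: the function name schedule_fresh_crawls, the URLs, and the change intervals (24 hours for a newspaper page, 6 hours for a stock quote page) are made up for the example.

```python
import heapq


def schedule_fresh_crawls(change_interval_hours, horizon_hours):
    """Return (hour, url) fetches up to the horizon.

    Each page is revisited once per its estimated change interval, so pages
    that change more often are fetched proportionally more often.
    """
    heap = [(interval, url) for url, interval in change_interval_hours.items()]
    heapq.heapify(heap)
    fetches = []
    while heap and heap[0][0] <= horizon_hours:
        due, url = heapq.heappop(heap)
        fetches.append((due, url))
        # Re-queue the page for its next visit.
        heapq.heappush(heap, (due + change_interval_hours[url], url))
    return fetches


# Hypothetical change intervals, in hours.
intervals = {"news.example/front": 24, "stocks.example/quote": 6}
plan = schedule_fresh_crawls(intervals, horizon_hours=24)
for hour, url in plan:
    print(hour, url)
```

Over one day the stock quote page is fetched four times and the newspaper page once, which mirrors the daily versus more frequent downloads described above.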

2. Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
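The structure described here is commonly called an inverted index. A minimal sketch, assuming whitespace-separated lowercase words and small integer document ids (both simplifications; build_index and lookup are hypothetical names), might look like this:

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def lookup(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets))


documents = {
    1: "google crawls the web",
    2: "the indexer sorts every word",
    3: "the web is large",
}
index = build_index(documents)

print(sorted(index))             # terms in alphabetical order
print(dict(index["web"]))        # documents and positions for one term
print(lookup(index, "the web"))  # documents containing both terms
```

A Python dict finds terms by hashing rather than by alphabetical order, but the effect is the same as in the text: a query term leads directly to its list of documents without scanning every page.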

3. Google’s Query Processor
The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page and the position and size of the search terms within the page.
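The "popularity" factor can be illustrated with the simplified, link-based PageRank iteration from the published description of the algorithm. This sketch covers only that one factor and is not Google's production ranking: the damping value 0.85 is the commonly quoted one, while the iteration count, the page names, and the link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the pages it links to; every link target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: every other page links to "home".
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The page that receives the most links ("home") ends up with the highest score, and the page nobody links to ("orphan") ends up with the lowest.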
Google also uses machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings.
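As a rough illustration of learning alternative spellings from stored data, the sketch below counts word frequencies in a tiny corpus and suggests the most frequent known word within one edit of the input. This is a simple stand-in technique, not Google's spelling system; the corpus and the names edits1 and suggest are hypothetical.

```python
import string
from collections import Counter


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, word_counts):
    """Return the word itself if known, else the most frequent known neighbour."""
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:
        return word  # nothing better to offer
    # Prefer the most frequent candidate; break ties alphabetically.
    return max(candidates, key=lambda w: (word_counts[w], w))


# Hypothetical corpus standing in for the "stored data".
corpus = "the search engine ranks the search results for the query"
word_counts = Counter(corpus.split())

print(suggest("serch", word_counts))
print(suggest("teh", word_counts))
```

The more text the counts are built from, the better the suggestions become, which is the sense in which such a system improves automatically from data.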
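Putting these parts together, a query passes from the search box to the engine, which looks up its terms in the index and ranks the matching documents, and then to the results formatter.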

This article is taken from http://www.googleguide.com
For more information on how Google works, take a look at the following articles.
• Google’s page on Google’s Technology, www.google.com/technology/
• How does Google collect and rank results?, www.google.com/newsletter/librarian/librarian_2005_12/article1.html
• Google’s PageRank Algorithm and How it Works, www.iprcom.com/papers/pagerank/
• Google’s PageRank Explained and How to Make the Most of It, www.webworkshop.net/pagerank.html
