Hi,
This is not closely related to game programming. I wanted to know how search engines crawl web pages and index the contents of those pages. It would be nice if anyone could give me links to the relevant documentation.
Thanks.
Searching and Indexing
Started by roxtar, Sep 27 2005 06:07 PM
4 replies to this topic
#1
Posted 27 September 2005 - 06:07 PM
#2
Posted 27 September 2005 - 11:27 PM
Crawling is pretty simple. Start with a list of URLs, scan each page for more URLs, and add any new ones to the list. Most search engines skip some URLs, such as those disallowed by a site's robots.txt file.
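The loop described above can be sketched roughly as follows. This is only a minimal illustration, not how any particular search engine does it. The names `crawl`, `get_links`, `is_allowed` and `max_pages` are made up for the example: `get_links` stands in for "download the page and pull out its URLs", and `is_allowed` stands in for a robots.txt check.

```python
from collections import deque

def crawl(seeds, get_links, is_allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # every URL ever queued, so nothing is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if not is_allowed(url):      # hypothetical hook, e.g. a robots.txt check
            continue
        visited.append(url)
        for link in get_links(url):  # hypothetical hook: fetch page, extract URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

The `max_pages` limit is there because a real crawl never runs out of new URLs on its own. For a real robots.txt check, Python's standard library has `urllib.robotparser.RobotFileParser`, whose `can_fetch()` method could be wrapped as `is_allowed`.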
#3
Posted 28 September 2005 - 01:43 AM
I know what crawling is. I wanted to know how to make a crawler, and how the crawler indexes the pages it has visited.
#4
Posted 28 September 2005 - 08:27 AM
Indexing is done by simply storing the visited links in the application's own database or audit log. When a new URL needs to be checked, it is tested against the whole database to see whether it has already been visited.
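One possible way to keep such a visited-URL database is sketched below, assuming SQLite is acceptable as the store. The table name `visited` and the helpers `open_store` and `mark_visited` are invented for this example. Making the URL the primary key means the database uses its own index for the lookup, so the check does not have to scan every stored row.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the visited-URL store; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    return conn

def mark_visited(conn, url):
    """Record url; return True if it was new, False if it was already stored."""
    # INSERT OR IGNORE does nothing when the primary key already exists,
    # so rowcount tells us whether a row was actually added.
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1
```

A crawler would call `mark_visited()` before fetching a URL and skip the fetch when it returns `False`. In practice URLs are usually normalized first (for example, lowercasing the host name), otherwise the same page can be stored under several spellings.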
#5
Posted 28 September 2005 - 09:32 AM
To write a crawler, you need an HTTP library. (You probably don't want to write it yourself.) It should be able to download any document from a specified URL. Then you can search the downloaded text for strings such as "<a href=", "http://" or "www." to find where the URLs are.
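As a rough sketch of the download-and-scan step, here is one way it might look with only the Python standard library. Note that this swaps the plain string search for an HTML parser that collects the `href` of every `<a>` tag, which also handles relative links. The names `LinkParser`, `extract_links` and `fetch` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all link targets found in html as absolute URLs."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links

def fetch(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of `extract_links(fetch(url), url)`. A real crawler would also need error handling for timeouts and non-HTML responses, which is left out here.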
Indexing is a bit trickier. Most web search engines keep their technology a well-guarded secret. One naive solution would be to store a dictionary of interesting search strings, together with the URLs where they were found.
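The naive dictionary idea could look something like the sketch below, which maps each word to the set of URLs containing it (often called an inverted index). It treats every word as "interesting" and does no ranking, so it is only an illustration of the data structure. The names `tokenize`, `build_index` and `search` are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages maps URL -> page text; returns word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))  # copy, so the index is untouched
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

For example, with `index = build_index({"u1": "Game programming tips", "u2": "Search engine programming"})`, calling `search(index, "search programming")` returns only `u2`, because the results for each query word are intersected.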
Specify your needs more closely, and we can probably suggest something more useful.
Other than that, Google is your friend.