Difference between a spider, crawler, and robots
Increasingly, sites are modernizing and competing to stay at the top of search results, which requires investing in technology to achieve better positioning. Given the enormous amount of material available on the web, a site must make its existence known in order to remain competitive, and a site that ranks well in search results will certainly benefit.
As a definition, we have:
A web crawler is also known as a robot, bot, or spider. Crawlers are programs used by search engines to explore the Internet and automatically download web content from websites. They capture the text of pages and the links they contain, enabling search engine users to find new pages. Working methodically, a crawler examines the source code of a site, discards irrelevant content, and stores the rest in a database. In short, it is software built to scan the Internet systematically for information relevant to its function. As one of the foundations of search engines, crawlers are responsible for indexing websites and storing them in the search engines' databases.
The process a web crawler executes is called web crawling or spidering. Many sites, search engines in particular, use crawlers to keep their databases up to date. Web crawlers are mainly used to create a copy of every visited page for post-processing by a search engine, which indexes the downloaded pages to provide faster searches. Crawlers can also be used for automated maintenance tasks on a website, such as checking links or validating HTML code, and to harvest specific kinds of information from web pages, such as email addresses (most commonly for spam).
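The crawl step described above — download a page, capture its text and the links it contains — can be sketched with Python's standard-library HTML parser. The URL and page content here are hypothetical examples, and a real crawler would also fetch the pages over the network and loop over the discovered links:

```python
# A minimal sketch of one crawl step: parse a downloaded page,
# collecting its visible text and the links it contains.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets and visible text from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

# Hypothetical page content a crawler might have downloaded.
page = '<html><body><h1>Welcome</h1><a href="/about">About</a></body></html>'
parser = LinkExtractor("http://example.com/")
parser.feed(page)
print(parser.links)       # discovered links, queued for the next crawl round
print(parser.text_parts)  # captured text, handed to the indexer
```

The captured text feeds the index, while the discovered links are added to the queue of pages to visit next — which is how a crawler finds new pages on its own.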
Search engine crawlers generally look for information about permissions on the content. There are two ways to block a well-behaved crawler from indexing a particular page (and the links it contains). The first, and most common, is through the robots.txt file. The other is through the meta robots tag with the values “noindex” or “nofollow”, used to prevent indexing of the page itself and following of the links on the page, respectively. There is also a third, much less used, possibility: adding rel=“nofollow” to an individual link, indicating that that particular link should not be followed.
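The robots.txt check is straightforward to sketch with Python's standard `urllib.robotparser` module. The rules below are a hypothetical example of a file that allows crawling everywhere except a `/private/` section:

```python
# A sketch of how a well-behaved crawler checks robots.txt permissions
# before fetching or indexing a page. The rules are a hypothetical example.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
```

The meta-tag and link-level alternatives are plain HTML: `<meta name="robots" content="noindex, nofollow">` in the page head, or `<a href="..." rel="nofollow">` on a single link. Note that robots.txt only restrains decent crawlers — it is a convention, not an enforcement mechanism.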
The Robots Perform Three Basic Actions:
- First, they find the pages of the site (a process called crawling or spidering) and build a list of the words and phrases found on each page;
- With this list, they create a database that lets them find the exact pages matching a query, organized by the features found on each page. The program that enters the site into the general database is called the indexer;
- After that, the robot is able to find the site whenever an end user types a matching word or phrase. This step is handled by the query processor.
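The last two steps above — indexing and query processing — can be sketched with a toy in-memory inverted index. The page URLs and contents are hypothetical, and a real search engine would add ranking, stemming, and persistent storage on top:

```python
# A sketch of the indexer and query processor using a toy inverted index.
from collections import defaultdict

# Step 1 output: the words the crawler found on each page (hypothetical).
pages = {
    "http://example.com/a": "web crawlers index pages",
    "http://example.com/b": "search engines rank pages",
}

# Step 2: the indexer maps each word to the set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Step 3: the query processor looks the user's words up in the index.
def query(words):
    """Return the pages containing every queried word."""
    results = [index.get(w.lower(), set()) for w in words.split()]
    return set.intersection(*results) if results else set()

print(query("pages"))           # matches both pages
print(query("crawlers pages"))  # matches only the first page
```

Intersecting the per-word page sets is what makes multi-word queries fast: the engine never rescans the pages themselves at query time, only the precomputed index.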
As we can see, behind every search performed on the Internet, a number of mechanisms work together to deliver a satisfactory result to the user. The process may seem somewhat complex, yet none of it is visible to us, mere information seekers.