Crawl statistics

The Yandex indexing robot regularly crawls site pages and loads them into the search database. The robot can fail to download a page if it is unavailable.

Yandex.Webmaster lets you know which pages of your site are crawled by the robot. You can view the URLs of the pages the robot failed to download because the hosting server was unavailable or because of errors in the page content.

For example, you can find out that the search database contains a large number of pages deleted from the site and get their URLs. If the robot constantly requests deleted pages, it slows down the crawl of the useful pages, so useful content may not appear in search results for a long time. We recommend that you prohibit indexing of pages that aren't useful, for example, in the robots.txt file.
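If you manage such rules programmatically, a quick way to double-check them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the Disallow paths and example URLs are hypothetical, so replace them with the sections you actually want to keep out of the crawl.

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt rules: keep the robot away from deleted news
    # archives and internal search pages (both paths are examples only).
    rules = [
        "User-agent: Yandex",
        "Disallow: /old-news/",
        "Disallow: /search/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Pages under the disallowed paths should no longer be requested by the robot.
    print(parser.can_fetch("Yandex", "https://example.com/old-news/2015/post.html"))  # False
    print(parser.can_fetch("Yandex", "https://example.com/catalog/item.html"))        # True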

Information about the pages is available in the Crawl statistics section in Yandex.Webmaster. The information is updated daily within six hours after the robot visits the page.

By default, the service provides the data on the whole site. To view the information about a certain section, choose it from the list in the site URL field. Available sections reflect the site structure as known to Yandex (except for the manually added sections).

You can download the information about the pages in XLS or CSV format. The downloaded file contains the data that matches the filters you set.
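If you work with the exported data outside the service, a short script can summarize it. The sketch below is a rough example that assumes a CSV export named crawl-statistics.csv with a column holding the server response code; the file name, delimiter, and column header are assumptions, so adjust them to match the actual export.

    import csv
    from collections import Counter

    # Hypothetical export file; the real delimiter and headers may differ.
    with open("crawl-statistics.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter=";"))

    # Count pages per server response code (the column name is an assumption).
    codes = Counter(row.get("HTTP code", "unknown") for row in rows)
    for code, count in codes.most_common():
        print(code, count)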

Note. The data is available starting from February 20, 2017.

Page status dynamics

Page information is presented as follows:

  • New and changed — The number of pages the robot crawled for the first time and pages that changed their status after they were crawled by the robot.
  • Crawl statistics — The number of pages crawled by the robot, with the server response code.

Page changes in the search database

Yandex.Webmaster shows the following information about the pages:

  • The date when the page was last visited by the robot (the crawl date).
  • The page path from the root directory of the site.
  • The server response code received during the crawl.

Based on this information, you can find out how often the robot crawls the site pages. You can also see which pages were just added to the database and which ones were re-crawled.

Pages added to the search base

If a page is crawled for the first time, the Was column displays the N/a status, and the Currently column displays the server response (for example, 200 OK).

After the page is successfully loaded into the search database, it can appear in the search results once the database is updated. Information about the page is shown in the Pages in Search section.

Pages reindexed by the robot

If the robot crawled the page before, its status can change when it is re-crawled: the Was column shows the server response received during the previous visit, and the Currently column shows the server response received during the last crawl.

Assume that a page included in the search became unavailable to the robot. In this case, it is excluded from the search. After some time, you can find it in the list of excluded pages in the Pages in Search section.

A page excluded from the search can stay in the search database so that the robot can check its availability. The robot usually keeps requesting the page as long as there are links to it and it isn't prohibited in the robots.txt file.

To view the changes, set the option to Recent changes. Up to 50,000 changes can be displayed.

List of pages crawled by the robot

You can view the list of site pages crawled by the robot and the following information about them:

  • The date when the page was last visited by the robot (the crawl date).
  • The page path from the root directory of the site.
  • The server response code received when the page was last downloaded by the robot.

To view the list of pages, set the option to All pages. The list can contain up to 50,000 pages.

Data filtering

You can filter the information about the pages and changes in the search database by all parameters (the crawl date, the page URL, the server response code) using the filter icon. Here are a few examples:

By the server response

You can create a list of pages that the robot crawled but failed to download because of the 404 Not Found server response.

You can filter only new pages that were unavailable to the robot. To do this, set the radio button to Recent changes.

Also, you can get the full list of pages that were unavailable to the robot. To do this, set the radio button to All pages.
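If you want to re-check whether such pages are still unavailable, you can request them yourself. The sketch below is a minimal example using Python's standard library; the URL list is hypothetical and would normally come from the filtered export.

    import urllib.error
    import urllib.request

    # Hypothetical URLs taken from the filtered list of unavailable pages.
    urls = [
        "https://example.com/deleted-page.html",
        "https://example.com/catalog/item.html",
    ]

    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                print(url, response.status)          # for example, 200
        except urllib.error.HTTPError as err:
            print(url, err.code)                     # for example, 404
        except urllib.error.URLError as err:
            print(url, "unreachable:", err.reason)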

By the URL fragment

You can create a list of pages with the URL containing a certain fragment. To do this, choose Contains from the list and enter the fragment in the field.

By the URL using special characters

You can use special characters to match the beginning of the string or a substring, and set more complex conditions using regular expressions. To do this, choose URL matches from the list and enter the condition in the field. You can add multiple conditions by putting each of them on a new line.

For conditions, the following rules are available:

  • Match any of the conditions (corresponds to the “OR” operator).
  • Match all conditions (corresponds to the “AND” operator).
Characters used for filtering:

  • * — Matches any number of any characters. Example: display data for all pages that start with https://example.com/tariff/, including the specified page: /tariff/*
    The * character can be useful when searching for URLs that contain two or more specific elements. For example, you can find news or announcements for a certain year: /news/*/2017/.
  • @ — The filtered results contain the specified string (but don't necessarily strictly match it). Example: display information for all pages with URLs containing the specified string: @tariff
  • ~ — The condition is a regular expression. Example: display data for pages with URLs that match a regular expression, such as addresses where the fragment table|sofa|bed occurs once or several times: ~table|sofa|bed
  • ! — Negative condition. Example: exclude pages with URLs starting with https://example.com/tariff/: !/tariff/*

Filtering with these characters isn't case sensitive.

The @, !, and ~ characters can be used only at the beginning of the string. The following combinations are available:

  • !@ — Exclude pages with URLs containing tariff: !@tariff
  • !~ — Exclude pages with URLs that match the regular expression
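The sketch below is not how Yandex.Webmaster evaluates these conditions internally; it is only a rough Python approximation of the syntax described above, which can be handy for trying filter ideas against a list of your own URL paths before applying them in the service. The URLs and conditions in the usage lines are made up for illustration.

    import re

    def condition_to_predicate(condition):
        """Build a rough predicate for one filter condition (approximation only)."""
        negate = condition.startswith("!")
        if negate:
            condition = condition[1:]

        if condition.startswith("@"):      # substring match
            fragment = condition[1:].lower()
            matcher = lambda url: fragment in url.lower()
        elif condition.startswith("~"):    # regular expression
            pattern = re.compile(condition[1:], re.IGNORECASE)
            matcher = lambda url: pattern.search(url) is not None
        else:                              # pattern with * wildcards
            pattern = re.compile(
                ".*".join(re.escape(part) for part in condition.split("*")),
                re.IGNORECASE,
            )
            matcher = lambda url: pattern.match(url) is not None

        return (lambda url: not matcher(url)) if negate else matcher

    def filter_urls(urls, conditions, match_all=False):
        """Keep URLs that match all conditions (AND) or any of them (OR)."""
        combine = all if match_all else any
        predicates = [condition_to_predicate(c) for c in conditions]
        return [url for url in urls if combine(p(url) for p in predicates)]

    # Made-up URL paths to illustrate the conditions described above.
    urls = ["/tariff/basic", "/news/shop/2017/sale", "/catalog/sofa-grey"]
    print(filter_urls(urls, ["/tariff/*"]))                          # ['/tariff/basic']
    print(filter_urls(urls, ["~table|sofa|bed"]))                    # ['/catalog/sofa-grey']
    print(filter_urls(urls, ["!@tariff", "@news"], match_all=True))  # ['/news/shop/2017/sale']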