Site indexing

  1. Adding a site to the Yandex search engine.

  2. Sitemap. For the convenience of webmasters and search engines, a special site map format, Sitemap, was developed. It is a list of links to the internal pages of a site, represented in XML. Yandex supports this format as well. You can upload a sitemap for your site in the appropriate section of the Yandex.Webmaster service. A sitemap lets you influence the priority in which the robot crawls the pages of your site. For example, if some pages are updated much more often than others, include this information in the sitemap so that the Yandex robot can plan its work accordingly.
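
To make the sitemap idea concrete, here is a minimal Python sketch that builds a sitemap with the standard library. The page addresses on example.com, the build_sitemap helper, and the chosen changefreq and priority values are illustrative assumptions; the element names follow the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (as a string) for a list of page dicts."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # Optional hints: last update, change frequency, relative priority.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: the news section changes far more often than "about".
pages = [
    {"loc": "https://example.com/news/", "changefreq": "hourly", "priority": 0.9},
    {"loc": "https://example.com/about/", "changefreq": "yearly", "priority": 0.3},
]
print(build_sitemap(pages))
```

The resulting string can be saved as a UTF-8 file (for example, sitemap.xml) and submitted through Yandex.Webmaster.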

  3. Robots.txt is a file intended for search engine robots. In this file, the webmaster can specify site indexing parameters either for all robots or for each search engine individually. Let's look at the three most important parameters that you can specify in this file.

    • Disallow. This directive prohibits indexing of specific sections of your site. Use it to exclude technical pages and pages that are of no interest to users or search engines: site search results, visit statistics, duplicate pages, various logs, service database pages and so on. You can find more information in the help topic on the robots.txt file.

    • Crawl-delay. This directive specifies the minimum interval (in seconds) that the indexing robot must wait between requests to site pages. It is useful for large projects with tens of thousands of pages. While indexing such a site, the Yandex search robot can generate a significant load, causing slowdowns and disruptions in the site's operation, so it may make sense to limit the request rate. For example, the directive Crawl-delay: 2 tells the robot to wait 2 seconds between consecutive server requests.

    • Clean-param. This directive tells search robots which CGI parameters in page addresses should be disregarded. Page addresses may contain session identifiers, for example. Formally, pages with different identifiers are different pages, but their content is identical. If your site has many such pages, the indexing robot may spend its time on them instead of accessing useful content. You can find more information on the Clean-param directive in the respective help topic.

      The Yandex.Webmaster service lets you view the list of indexed URLs from your site. Check it regularly, because even small coding errors may lead to a significant increase in the number of unwanted URLs on the site and overload it.
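
The three directives above can be tried out with a short Python sketch. The robots.txt content, the example.com addresses and the drop_params helper are assumptions made for illustration. The standard urllib.robotparser module understands Disallow and Crawl-delay but ignores Clean-param, so the helper only imitates the effect of that directive (the Clean-param line uses the "parameter, then path prefix" form assumed from the Yandex help) and is not how the Yandex robot is implemented.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site with a search page, stats and a forum.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /search/
Disallow: /stats/
Crawl-delay: 2
Clean-param: sid /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow: search results are closed to the robot, the catalog is not.
print(parser.can_fetch("Yandex", "https://example.com/search/?q=test"))  # False
print(parser.can_fetch("Yandex", "https://example.com/catalog/"))        # True

# Crawl-delay: minimum pause, in seconds, between consecutive requests.
print(parser.crawl_delay("Yandex"))  # 2


def drop_params(url, ignored):
    """Imitate Clean-param: remove query parameters that do not affect content."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two addresses that differ only in the session identifier collapse into one.
a = drop_params("https://example.com/forum/topic?id=7&sid=abc", {"sid"})
b = drop_params("https://example.com/forum/topic?id=7&sid=xyz", {"sid"})
print(a == b, a)  # True https://example.com/forum/topic?id=7
```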

  4. Yandex indexes the most popular document types found on the Internet. There are limitations, however, that affect how a document is indexed and whether it is indexed at all.

    • A large number of CGI parameters in the URL, many levels of repeated nested directories, and an excessively long URL may negatively affect document indexing.

    • Document size matters: documents larger than 10 MB are not indexed.

    • Flash indexing

      1. Flash is indexed if it is not embedded into HTML and the page is transmitted with an HTTP header containing Content-Type: application/x-shockwave-flash;

      2. *.swf files are indexed if they are linked to directly.

    • In PDF documents, only text content is indexed. Text represented as graphic images is not indexed.

    • Yandex correctly indexes documents in the Office Open XML and OpenDocument formats (including, among others, Microsoft Office and OpenOffice documents). Please note that when new software versions appear, support for the new formats may take a while to implement.

  5. If you have overridden the server's behavior for non-existent URLs, make sure that the server still returns the 404 error code. On receiving it, the search engine excludes the document from search. Also make sure that all useful site pages return code 200 (OK).
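
A minimal sketch of this rule, assuming a Python WSGI application: a custom "not found" page is shown to visitors, but the status line still says 404. The PAGES dictionary, the page bodies and the app function are hypothetical names used only for this example.

```python
# Hypothetical site content: request path -> HTML body.
PAGES = {"/": b"<h1>Home</h1>", "/about": b"<h1>About</h1>"}
NOT_FOUND_PAGE = b"<h1>Page not found</h1>"


def app(environ, start_response):
    """Minimal WSGI app whose custom error page still returns status 404."""
    body = PAGES.get(environ.get("PATH_INFO", "/"))
    if body is None:
        # The custom page is for visitors; the status code is for robots.
        status, body = "404 Not Found", NOT_FOUND_PAGE
    else:
        status = "200 OK"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

For a local check, the application can be served with wsgiref.simple_server.make_server from the standard library. The mistake to avoid is returning the custom page with status 200, which leaves non-existent URLs eligible for indexing.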

  6. Make sure your HTTP headers are correct. In particular, it is very important that the server responds correctly to a request with the If-Modified-Since header. The Last-Modified header must contain the correct date of the last document update.
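
The header handling described in item 6 can be sketched in Python as follows. This is a simplified illustration, not a complete implementation of HTTP conditional requests: the conditional_response function is a hypothetical helper, and it assumes the application knows the time of the last document update as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET, honoring If-Modified-Since.

    last_modified: timezone-aware datetime of the last document update.
    if_modified_since: raw header value sent by the client, or None.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable date: ignore the header.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                # Not changed since the robot's last visit: no body needed.
                return 304, headers
    return 200, headers


updated = datetime(2024, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_response(updated))
# (200, {'Last-Modified': 'Tue, 01 Oct 2024 12:00:00 GMT'})
print(conditional_response(updated, "Tue, 01 Oct 2024 12:00:00 GMT")[0])  # 304
```

Answering 304 Not Modified lets the robot skip downloading unchanged documents, which only works if the Last-Modified date reflects real content updates.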

  7. We recommend placing site versions intended for mobile devices, as well as site versions in different languages, on subdomains.


Control the Yandex search robot: use the robots.txt file to prohibit indexing of pages that are not intended for users.