Tools / Crawler / Getting started

Quickstart for the Algolia Crawler

Learn how to use the Crawler to crawl an example site, the Algolia blog at algolia.com/blog, and index the content in an Algolia index named crawler_default_index_name.

Initialization

Create a new crawler in the Crawler Admin by clicking on the Create new crawler button.

Fill in the required fields with the following information:

  • Crawler Name: Algolia Blog
  • App ID: the Algolia application you want to target
  • Start URL: https://www.algolia.com/blog

Advanced options

  • API Key: by default, one is generated for your crawler, but you can use an existing one from your Algolia dashboard
  • Algolia Index Prefix: crawler_

Once your crawler has been created, you can view and edit its configuration from the Editor section.

Configuration file

Each crawler has a configuration file. The configuration file starts by defining a crawler object, which contains top-level parameters, actions, and index settings. For example:

new Crawler({
  appId: "YOUR_APP_ID",
  apiKey: "YOUR_API_KEY",
  indexPrefix: "crawler_",
  rateLimit: 8,
  maxUrls: 500,
  startUrls: ["https://www.algolia.com/blog"],
  ignoreQueryParams: ["utm_medium", "utm_source", "utm_campaign", "utm_term"],
  actions: [
    {
      indexName: "default_index_name",
      pathsToMatch: ["https://www.algolia.com/blog/**"],
      recordExtractor: ({ url, $, contentLength, fileType }) => {
        console.log(`Crawling "${url.href}"`);

        return [
          {
            // URL
            url: url.href,
            hostname: url.hostname,
            path: url.pathname,
            depth: url.pathname.split("/").length - 1,

            // Metadata
            contentLength,
            fileType,
            title: $("head > title").text(),
            keywords: $("meta[name=keywords]").attr("content"),
            description: $("meta[name=description]").attr("content"),
            type: $('meta[property="og:type"]').attr("content"),
            image: $('meta[property="og:image"]').attr("content"),

            // Content
            headers: $("h1,h2")
              .map((i, e) => $(e).text())
              .get()
          }
        ]
      }
    }
  ],
  initialIndexSettings: {
    default_index_name: {
      searchableAttributes: [
        "unordered(keywords)",
        "unordered(title)",
        "unordered(description)",
        "unordered(headers)",
        "url"
      ],
      customRanking: ["asc(depth)"],
      attributesForFaceting: ["fileType", "type"]
    }
  }
});

Top-level parameters

The following parameters apply to all your actions:

  • appId, your Algolia application ID
  • apiKey, your Algolia application’s API key. You can find both in the Algolia dashboard
  • indexPrefix, the prefix added to the Algolia indices your crawler generates
  • rateLimit, the number of concurrent tasks (per second) that can run for this crawler
  • maxUrls, the maximum number of URLs your crawler will crawl
  • startUrls, the URLs that your Crawler starts on
  • ignoreQueryParams, query parameters, such as UTM tracking parameters, that you want the crawler to ignore when they appear in URLs during the crawl (don’t worry about this parameter for now).
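
Pulled out of the full configuration above, the top-level parameters on their own look like this (a sketch using the same values as this quickstart):

new Crawler({
  appId: "YOUR_APP_ID",
  apiKey: "YOUR_API_KEY",
  indexPrefix: "crawler_",
  rateLimit: 8,
  maxUrls: 500,
  startUrls: ["https://www.algolia.com/blog"],
  ignoreQueryParams: ["utm_medium", "utm_source", "utm_campaign", "utm_term"],
  // actions and initialIndexSettings follow
});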

Actions

A crawler can have any number of actions. An action indicates a subset of your targeted URLs that you want to extract records from in a specific way. For example, suppose you want to crawl a site with a blog, a home page, and documentation. You probably need a unique recordExtractor for each section and may want to store their content in different indices: to accomplish this, you can use actions.

Actions include:

  • indexName, the Algolia index where the extracted records are stored (the indexPrefix is added to this name)
  • pathsToMatch, the URL patterns this action applies to
  • recordExtractor, a function that turns each matched page into one or more Algolia records.
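
For example, a crawler that indexes the blog and the documentation into separate indices might use two actions, as in this sketch (the documentation path and both index names are hypothetical):

actions: [
  {
    indexName: "blog",
    pathsToMatch: ["https://www.algolia.com/blog/**"],
    recordExtractor: ({ url, $ }) => [
      { url: url.href, title: $("head > title").text() }
    ]
  },
  {
    indexName: "documentation",
    pathsToMatch: ["https://www.algolia.com/doc/**"],
    recordExtractor: ({ url, $ }) => [
      { url: url.href, title: $("h1").first().text() }
    ]
  }
],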

Index settings

The initialIndexSettings part of your configuration defines the default settings of your crawler-generated Algolia indices.

These settings are only applied the first time you run your crawler. For all subsequent crawls, initialIndexSettings is ignored.
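
Note that the keys under initialIndexSettings match the un-prefixed indexName from your actions; the indexPrefix is only applied to the resulting Algolia index. A trimmed-down sketch:

initialIndexSettings: {
  // Keyed by the action's indexName, not by "crawler_default_index_name"
  default_index_name: {
    searchableAttributes: ["unordered(title)", "unordered(description)"],
    customRanking: ["asc(depth)"]
  }
}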

Start crawling

To start your crawler, go to the Overview tab and click the Start crawling button. After a couple of minutes, the crawl should finish.

When you click the Start crawling button, the Crawler extracts links from your start pages and then extracts content from the pages those links lead to.

Clicking Start crawling launches a crawl that starts from the URLs you provide in the startUrls, sitemaps, and extraUrls parameters of your crawler object. startUrls defaults to the start URL you provided when you created your crawler, but you can add custom starting points by editing your crawler’s configuration.

startUrls: ["https://www.algolia.com/blog/"],
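
If you also want to seed the crawl from a sitemap or from specific pages, you can combine these parameters. A sketch (the sitemap and extra URL below are hypothetical):

startUrls: ["https://www.algolia.com/blog/"],
sitemaps: ["https://www.algolia.com/sitemap.xml"],
extraUrls: ["https://www.algolia.com/blog/a-specific-post/"],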

Your crawler fetches the start pages and extracts every link they contain. However, it doesn’t follow every link: otherwise, you could end up inadvertently crawling a huge chunk of the web.

For example, if your site has a link to a popular Wikipedia page, every link on that Wikipedia page would be crawled, and all the links on those pages too, and so on. As you can imagine, this can quickly become unmanageable. One of the critical ways you can restrict which links your crawler follows is with the pathsToMatch field of your crawler object.

The pathsToMatch field defines the scope of your crawl: only links that match one of its patterns are crawled. It defaults to paths starting with the URL you provided when you created your crawler.

pathsToMatch: ["https://www.algolia.com/blog/**"],
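
With this pattern, only URLs under the blog are followed. For instance (the matched blog post URL is hypothetical):

// Crawled: matches the pattern
//   https://www.algolia.com/blog/algolia-and-the-crawler/
// Not crawled: outside the pattern
//   https://en.wikipedia.org/wiki/Web_crawler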

Content extraction

The following recordExtractor is applied to each page that matches a path in pathsToMatch:

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [{ url: url.href, title: $("head > title").text() }];
}

This recordExtractor creates an array of records for each crawled page and adds those records to the index you defined in your action’s indexName field (prefixed by the value of indexPrefix).
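
Because the extractor returns an array, it can also emit more than one record per page, for example one record per section heading. A sketch (the attribute names are illustrative):

recordExtractor: ({ url, $ }) => {
  // One record per <h2> section of the page
  return $("h2")
    .map((i, el) => ({
      url: url.href,
      title: $("head > title").text(),
      section: $(el).text(),
      position: i
    }))
    .get();
}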

You can verify the process in your Algolia dashboard.

You should find an index with the indexName you defined, prefixed by the indexPrefix. In the preceding example, the index is called crawler_default_index_name.
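
If you prefer to verify from code instead of the dashboard, you can run a quick search against that index. A sketch assuming the JavaScript API client v4 (the query string is arbitrary):

const algoliasearch = require("algoliasearch");

// Use a search-only API key here, not the crawler's key
const client = algoliasearch("YOUR_APP_ID", "YOUR_SEARCH_ONLY_API_KEY");
const index = client.initIndex("crawler_default_index_name");

index.search("crawler").then(({ hits }) => {
  console.log(`Found ${hits.length} records`);
});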

Exclusions

In the Monitoring tab of the Crawler Admin, you may see that the crawler has ignored several pages. This is valuable information: if you can identify a pattern to the ignored pages, you can tell the crawler to exclude these pages on future reindexes.

Excluding isn’t the same as ignoring. An excluded page isn’t crawled. An ignored page is crawled (or a crawl attempt is made) but nothing is extracted.

Excluding pages has two main advantages:

  • It lowers the resource usage of your crawler
  • It speeds up your crawling by only fetching and processing meaningful URLs.

The preceding example doesn’t have any specific ignored URLs. However, there may be some URLs that you might want to exclude, such as password-protected pages. These pages would appear as HTTP Unauthorized (401) or HTTP Forbidden (403) entries in the Monitoring section.

To exclude these URLs:

  1. Go to the Settings tab, and scroll down to the Exclusions section.
  2. Click the Add an Exclusion button, enter your pattern, for example, https://www.algolia.com/blog/private/**, and click Save.
  3. At the bottom of your configuration file, you should find the following:

    exclusionPatterns: ['https://www.algolia.com/blog/private/**'],
    
  4. Go to the Overview tab and click the Restart crawling button.
  5. Once the crawl is finished, reopen the Monitoring tab to confirm that the HTTP Unauthorized (401) errors are gone.

See the documentation for exclusionPatterns.
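
exclusionPatterns accepts a list of patterns, so you can exclude several page families at once. A sketch (the login pattern is hypothetical):

exclusionPatterns: [
  "https://www.algolia.com/blog/private/**",
  "https://www.algolia.com/blog/login/**"
],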

Advanced crawler configuration

You can find the complete configuration reference on the Crawler configuration API page.

For more examples, browse the GitHub repository of sample crawler configuration files.
