Crawler: LinkExtractor

Type: function

Parameter syntax

linkExtractor: ({ $, url, defaultExtractor }) => {
  ...
  // return ['https://...']
}


About this parameter

Override the default logic used to extract URLs from pages.

By default, we queue all URLs that comply with pathsToMatch, fileTypesToMatch, and exclusions. To override this logic, provide a custom function that runs on each crawled page and returns the URLs to queue.

The expected return value is an array of URLs (as strings).

Examples

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc, only queue the first found link
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all found links)
    return defaultExtractor();
  },
}

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}

{
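  linkExtractor: ({ $ }) => {
    // Access the DOM and extract what you specify.
    // attr() reads the first matching element only and returns
    // undefined when nothing matches, so guard against that case.
    const href = $('.my-link').attr('href');
    return href ? [href] : [];
  },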
}
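
The DOM-based example above queues at most one link. To collect every matching link instead, a sketch along the following lines may help. The selector `a.my-link`, the helper name `toAbsoluteUrls`, and the variable name `crawlerConfig` are hypothetical. The sketch also assumes that returning absolute http(s) URLs is desirable; this page only states that the return value must be an array of URL strings.

```javascript
// Hypothetical helper (not part of the Crawler API): resolve hrefs against
// the page URL, keep only http(s) links, and drop duplicates.
function toAbsoluteUrls(hrefs, baseUrl) {
  const unique = new Set();
  for (const href of hrefs) {
    if (!href) continue; // skip elements without an href attribute
    try {
      const resolved = new URL(href, baseUrl);
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        unique.add(resolved.href);
      }
    } catch {
      // Ignore values that cannot be parsed as a URL.
    }
  }
  return [...unique];
}

const crawlerConfig = {
  linkExtractor: ({ $, url }) => {
    // Collect the href of every element matching the assumed selector.
    const hrefs = $('a.my-link')
      .map((i, el) => $(el).attr('href'))
      .get();
    return toAbsoluteUrls(hrefs, url);
  },
};
```

Keeping the URL handling in a separate function makes it possible to check it without crawling a page.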

Parameters

Parameter	Type	Required	Description
`url`	URL	Optional	URL of the resource that was just crawled.
`defaultExtractor`	function	Optional	Default function used internally by the Crawler to discover URLs from a resource’s content. It returns an array of strings containing all URLs found on the current resource (if they match the configuration).
`$`	object (Cheerio instance)	Optional	A Cheerio instance containing the HTML of the crawled page.
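
Because `defaultExtractor` returns a plain array of strings, its result can be filtered before it is returned. The following is a minimal sketch, assuming you want to skip links that carry a query string; the helper `hasQueryString` and the variable name `crawlerConfig` are hypothetical and not part of the Crawler API.

```javascript
// Hypothetical helper: true when the link has a non-empty query string.
// Resolving against `base` tolerates relative entries; unparsable values
// are treated as having no query string.
const hasQueryString = (link, base) => {
  try {
    return new URL(link, base).search !== '';
  } catch {
    return false;
  }
};

const crawlerConfig = {
  linkExtractor: ({ url, defaultExtractor }) => {
    // Start from the links the Crawler would queue by default,
    // then keep only those without a query string.
    return defaultExtractor().filter((link) => !hasQueryString(link, url));
  },
};
```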
