Crawler: LinkExtractor
function
linkExtractor: ({ $, url, defaultExtractor }) => { ... // return ['https://...'] }
About this parameter
Override the default logic used to extract URLs from pages.
By default, we queue all URLs that comply with pathsToMatch, fileTypesToMatch, and exclusions.
You can override this default logic by providing a custom function that runs on each crawled page and returns the URLs to queue.
The expected return value is an array of URLs (as strings).
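As a minimal sketch of that contract, the function below reuses the default results and drops any URL that carries a query string. The filtering rule and the name `withoutQueryStrings` are assumptions chosen for illustration, not Crawler behavior; the only thing taken from this page is that `defaultExtractor()` returns an array of URL strings.

```javascript
// Sketch: reuse the default extraction, then filter its results.
// The "no query string" rule is a hypothetical policy for illustration.
const withoutQueryStrings = ({ defaultExtractor }) => {
  return defaultExtractor().filter((link) => {
    // new URL() throws on strings it cannot parse; skip those
    // instead of failing the whole page.
    try {
      return new URL(link).search === '';
    } catch {
      return false;
    }
  });
};
```

It could then be passed in the configuration as `linkExtractor: withoutQueryStrings`. Whatever the rule, the return value stays a plain array of strings.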
Examples
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc, only queue the first found link
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all found links)
    return defaultExtractor();
  },
}
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}
{
  linkExtractor: ({ $ }) => {
    // Access the DOM and extract what you specify.
    // Note: .attr() reads the href of the first matching element only.
    return [$('.my-link').attr('href')];
  },
}
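When links are read straight from the DOM, the raw `href` values may be relative, missing, or non-web links such as `mailto:`. The sketch below shows one way to clean them up before returning them. It assumes the Crawler expects absolute URLs, which the `['https://...']` return example suggests but this page does not state outright. `toAbsoluteUrls` is a hypothetical helper, not part of the Crawler API.

```javascript
// Hypothetical helper: turn raw href values into absolute,
// de-duplicated http(s) URL strings.
const toAbsoluteUrls = (hrefs, base) => {
  const seen = new Set();
  for (const href of hrefs) {
    // attr('href') can yield undefined when the attribute is absent.
    if (typeof href !== 'string' || href.trim() === '') continue;
    try {
      const resolved = new URL(href, base); // resolves relative paths against base
      // Keep web links only; this skips mailto:, tel:, javascript:, and so on.
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        seen.add(resolved.href);
      }
    } catch {
      // Ignore values that cannot be parsed as a URL.
    }
  }
  return [...seen];
};

// Possible use in the configuration (needs the Crawler's `$` and `url`):
// linkExtractor: ({ $, url }) =>
//   toAbsoluteUrls($('.my-link').map((i, el) => $(el).attr('href')).get(), url),
```

Keeping the clean-up in a separate function means it can be checked without a Cheerio instance, since it only takes an array of strings and a base URL.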
Parameters
Parameter | Description |
---|---|
url | type: URL. Optional. URL of the resource that was just crawled. |
defaultExtractor | type: function. Optional. Default function used internally by the Crawler to discover URLs from a resource’s content. It returns an array of strings containing all URLs found on the current resource (if they match the configuration). |
$ | type: object (Cheerio instance). Optional. A Cheerio instance containing the HTML of the crawled page. |
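Because a custom extractor is an ordinary function of these three parameters, it can be exercised locally with stand-in arguments before it goes into a configuration. The sketch below is one way to do that. `makeArgs` and `firstLinkOnDocPages` are hypothetical names, and the stand-ins only mimic the parameter shapes described above; they are not the Crawler's real objects.

```javascript
// Extractor under test: on pages under /doc/, keep only the first link.
const firstLinkOnDocPages = ({ url, defaultExtractor }) => {
  if (/example\.com\/doc\//.test(url.href)) {
    return defaultExtractor().slice(0, 1);
  }
  return defaultExtractor();
};

// Hypothetical stand-ins for the arguments the Crawler would pass.
const makeArgs = (pageUrl, links) => ({
  url: new URL(pageUrl),        // `url` is documented as a URL object
  defaultExtractor: () => links, // returns an array of URL strings
  $: undefined,                  // unused here; a real Cheerio instance would go here
});

const sampleLinks = ['https://example.com/doc/a', 'https://example.com/doc/b'];

// Page under /doc/: only the first link is kept.
console.log(firstLinkOnDocPages(makeArgs('https://example.com/doc/index.html', sampleLinks)));
// Any other page: all links are kept.
console.log(firstLinkOnDocPages(makeArgs('https://example.com/blog/', sampleLinks)));
```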