Crawler API
The Crawler REST API is only available if you have access to Crawler.
The Crawler REST API lets you interact directly with your crawlers. All API calls go through the https://crawler.algolia.com domain.
Request format
Authentication uses HTTP Basic authentication: provide your Crawler user ID as the username and your Crawler API key as the password.
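Every request in this guide passes these credentials with curl’s `--user` flag, which sends them in an `Authorization` header. For example:

```sh
# List your crawlers, authenticating with HTTP Basic auth.
curl --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers"
```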
Response format
The response format for all requests is a JSON object.
The success or failure of an API call is indicated by its HTTP status code.
A 2xx
status code indicates success, whereas a 4xx
status code indicates failure.
When a request fails, the response body is still JSON, but contains a message
field which you can use for debugging.
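For example, a failed request might return a body of this shape (illustrative only; the exact message varies with the error):

```json
{
  "message": "Invalid authentication credentials"
}
```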
Crawler endpoints
Quick reference
| Verb | Path | Method |
|---|---|---|
| POST | `/api/1/crawlers` | Create a crawler |
| GET | `/api/1/crawlers` | Get available crawlers |
| GET | `/api/1/crawlers/{id}` | Get a crawler |
| PATCH | `/api/1/crawlers/{id}` | Partially update a crawler |
| PATCH | `/api/1/crawlers/{id}/config` | Partially update a crawler’s configuration |
| POST | `/api/1/crawlers/{id}/run` | Run a crawler |
| POST | `/api/1/crawlers/{id}/pause` | Pause a crawler |
| POST | `/api/1/crawlers/{id}/reindex` | Reindex with a crawler |
| GET | `/api/1/crawlers/{id}/stats/urls` | Get statistics on a crawler |
| POST | `/api/1/crawlers/{id}/urls/crawl` | Crawl specific URLs |
| POST | `/api/1/crawlers/{id}/test` | Test a URL on a crawler |
| GET | `/api/1/crawlers/{id}/tasks/{tid}` | Get status of a task |
| POST | `/api/1/crawlers/{id}/tasks/{tid}/cancel` | Cancel a blocking task |
| GET | `/api/1/domains` | Get registered domains |
Create a crawler
Path: /api/1/crawlers
HTTP Verb: POST
Description:
Create a new crawler with the given configuration.
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d @create_crawler.json
```
This is the create_crawler.json
file used in the curl call.
```json
{
"name": "Algolia Website",
"config": {
"appId": "YOUR_APP_ID",
"apiKey": "YOUR_API_KEY",
"indexPrefix": "crawler_",
"rateLimit": 8,
"startUrls": [
"https://www.algolia.com"
],
"actions": [
{
"indexName": "algolia_website",
"pathsToMatch": [
"https://www.algolia.com/**"
],
"selectorsToMatch": [
".products",
"!.featured"
],
"fileTypesToMatch": [
"html",
"pdf"
],
"recordExtractor": {
"__type": "function",
"source": "() => {}"
}
}
]
}
}
```
When the request is successful, the HTTP response is a 200 OK
and returns the ID of the created crawler:
```json
{
"id": "x000x000-x00x-00xx-x000-000000000x00"
}
```
Get available crawlers
Path: /api/1/crawlers
HTTP Verb: GET
Description:
Get a list of your available crawlers and pagination information.
Parameters:

Query:

| Parameter | Description |
|---|---|
| `itemsPerPage` | type: `number`, default: `20`. Optional. The number of items per page. |
| `page` | type: `number`, default: `1`. Optional. The current page to fetch. |
| `name` | type: `string`. Optional. The string to match against the crawler’s name. |
| `appId` | type: `string`. Optional. The Algolia application ID to filter results. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers"
```
When the request is successful, the HTTP response is 200 OK
and returns a list of your available crawlers:
```json
{
"items": [
{
"id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809",
"name": "My Crawler"
}
],
"itemsPerPage": 20,
"page": 1,
"total": 100
}
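```

For example, to fetch the second page with 50 crawlers per page, scoped to a single application (`YOUR_APP_ID` is a placeholder), combine the query parameters above:

```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers?page=2&itemsPerPage=50&appId=YOUR_APP_ID"
```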
Get a crawler
Path: /api/1/crawlers/{id}
HTTP Verb: GET
Description:
Get information about the specified crawler and its settings.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |

Query:

| Parameter | Description |
|---|---|
| `withConfig` | type: `boolean`, default: `false`. Optional. When `true`, the response includes the crawler’s configuration. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}?withConfig=true"
```
When the request is successful, the HTTP response is 200 OK
and returns information on the specified crawler:
```json
{
"name": "Algolia Blog",
"createdAt": "2019-08-22T08:17:09.680Z",
"updatedAt": "2019-09-02T09:57:33.968Z",
"running": false,
"reindexing": false,
"blocked": false,
"lastReindexStartedAt": "2019-08-29T07:56:51.682Z",
"lastReindexEndedAt": "2019-08-29T07:57:35.991Z"
"config": {
// YOUR CRAWLER'S CONFIGURATION IF `withConfig=true`
}
}
```
Partially update a crawler
Path: /api/1/crawlers/{id}
HTTP Verb: PATCH
Description:
Update a crawler’s configuration or change its name.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X PATCH --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/" \
-H "Content-Type: application/json" \
-d @crawler_patch.json
```
The crawler_patch.json
file:
```json
{
"name": "New config name",
"config": {
"appId": "YOUR_APP_ID",
"apiKey": "YOUR_API_KEY",
"indexPrefix": "crawler_",
"rateLimit": 8,
"startUrls": [
"https://www.algolia.com"
],
"actions": [
{
"indexName": "algolia_website",
"pathsToMatch": [
"https://www.algolia.com/**"
],
"selectorsToMatch": [
".products",
"!.featured"
],
"fileTypesToMatch": [
"html",
"pdf"
],
"recordExtractor": {
"__type": "function",
"source": "() => {}"
}
}
]
}
}
```
When the request is successful, the HTTP response is a 200 OK
and returns the taskId
of your request:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
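```

The update is applied asynchronously: to find out when it has completed, pass the returned `taskId` to the Get status of a task endpoint described below. A minimal sketch, assuming `jq` is installed:

```sh
# Prints true while the configuration update is still pending.
curl -s --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}" | jq '.pending'
```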
Partially update a crawler’s configuration
Path: /api/1/crawlers/{id}/config
HTTP Verb: PATCH
Description:
Update parts of a crawler’s configuration.
The partial update endpoint currently only supports top-level configuration properties. For example, to update a recordExtractor
, you need to pass the complete actions
property.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X PATCH --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/config" \
-H "Content-Type: application/json" \
-d @crawler_config_patch.json
```
The crawler_config_patch.json
file:
```json
{
"rateLimit": 8,
"startUrls": [
"https://www.algolia.com",
"https://www.algolia.com/blog"
],
"actions": [
{
"indexName": "algolia_website",
"pathsToMatch": [
"https://www.algolia.com/**"
],
"recordExtractor": {
"__type": "function",
"source": "() => {}"
}
}
]
}
```
When the request is successful, the HTTP response is a 200 OK
and returns the taskId
of your request:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
```
Run a crawler
Path: /api/1/crawlers/{id}/run
HTTP Verb: POST
Description:
Unpause the specified crawler. If a crawl was ongoing when the crawler was paused, it resumes. Otherwise, the crawler becomes active and waits for its next scheduled run.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/run"
```
When the request is successful, the HTTP response is a 200 OK
and returns the task ID of the crawl:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
```
Pause a crawler
Path: /api/1/crawlers/{id}/pause
HTTP Verb: POST
Description:
Request the specified crawler to pause its execution.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/pause"
```
When the request is successful, the HTTP response is a 200 OK
and returns the task ID of your request:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
```
Reindex with a crawler
Path: /api/1/crawlers/{id}/reindex
HTTP Verb: POST
Description:
Request the specified crawler to start (or restart) crawling.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/reindex"
```
When the request is successful, the HTTP response is a 200 OK
and returns the task ID of the crawl:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
```
Get statistics on a crawler
Path: /api/1/crawlers/{id}/stats/urls
HTTP Verb: GET
Description:
Get a summary of the current status of crawled URLs for the specified crawler.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/stats/urls"
```
When the request is successful, the HTTP response is a 200 OK
and returns URL statistics for your specified crawler’s last crawl:
```json
{
"count":129,
"data":[
{
"reason":"success",
"status":"DONE",
"category":"success",
"readable":"Success",
"count":127
},
{
"reason":"http_not_found",
"status":"SKIPPED",
"category":"fetch",
"readable":"HTTP Not Found (404)",
"count":1
},
{
"reason":"network_error",
"status":"FAILED",
"category":"fetch",
"readable":"Network error",
"count":1
}
]
}
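```

To isolate the URLs that didn’t succeed, you can filter the `data` array on its `status` field, for example with `jq` (assuming it’s installed):

```sh
# Keep only the entries whose status isn't DONE (here: SKIPPED and FAILED).
curl -s --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/stats/urls" \
  | jq '.data[] | select(.status != "DONE")'
```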
Crawl specific URLs
Path: /api/1/crawlers/{id}/urls/crawl
HTTP Verb: POST
Description:
Immediately crawl the given URLs. The generated records are pushed to the live index if there’s no ongoing reindex, and to the temporary index otherwise.
There’s a rate limit of 200 calls every 24h for this endpoint.
Parameters:

Path:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |

Body:

| Parameter | Description |
|---|---|
| `urls` | type: `string[]`. Required. The URLs to crawl (maximum 50 per call). |
| `save` | type: `boolean`. Optional. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/urls/crawl" \
-H "Content-Type: application/json" \
-d @crawl_urls.json
```
The crawl_urls.json
file:
```json
{
"urls": [
"https://www.algolia.com/blog/article-42.html",
"https://www.algolia.com/blog/article-101.html"
],
"save": false
}
```
When the request is successful, the HTTP response is a 200 OK
and returns the taskId
of your request:
```json
{
"taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
```
Test a URL on a crawler
Path: /api/1/crawlers/{id}/test
HTTP Verb: POST
Description:
Test a URL against the given crawler’s configuration and see what will be processed. You can also override parts of the configuration to try your changes before updating the configuration.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of your targeted crawler. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/test" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d @test_crawler.json
```
This is the test_crawler.json file used in the curl call.
```json
{
"url": "https://www.algolia.com/blog",
"config": {
"appId": "YOUR_APP_ID",
"apiKey": "YOUR_API_KEY",
"indexPrefix": "crawler_",
"rateLimit": 8,
"startUrls": [
"https://www.algolia.com"
],
"actions": [
{
"indexName": "algolia_website",
"pathsToMatch": [
"https://www.algolia.com/**"
],
"selectorsToMatch": [
".products",
"!.featured"
],
"fileTypesToMatch": [
"html",
"pdf"
],
"recordExtractor": {
"__type": "function",
"source": "() => {}"
}
}
]
}
}
```
When the request is successful, the HTTP response is a 200 OK
and returns the result of processing the URL, including the extracted records, discovered links, and logs:
```json
{
"startDate": "2019-05-21T09:04:33.742Z",
"endDate": "2019-05-21T09:04:33.923Z",
"logs": [
[
"Processing url 'https://www.algolia.com/blog'"
]
],
"records": [
{
"indexName": "testIndex",
"records": [
{
"objectID": "https://www.algolia.com/blog",
"numberOfLinks": 2
}
],
"recordsPerExtractor": [
{
"index": 0,
"type": "custom",
"records": [
{
"objectID": "https://www.algolia.com/blog"
}
]
}
]
}
],
"links": [
"https://blog.algolia.com/engineering/challenging-migration-heroku-google-kubernetes-engine/",
"https://blog.algolia.com/engineering/tale-two-engines-algolia-unity/"
],
"externalData": {
"dataSourceId1": {
"data1": "val1",
"data2": "val2"
},
"dataSourceId2": {
"data1": "val1",
"data2": "val2"
}
},
"error": {}
}
```
Get status of a task
Path: /api/1/crawlers/{id}/tasks/{tid}
HTTP Verb: GET
Description:
Get the status of a specific task.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of the targeted crawler. |
| `tid` | type: `string`. Required. The ID of the targeted task. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}"
```
When the request is successful, the HTTP response is 200 OK
and returns whether the task is pending or not:
```json
{
"pending": false
}
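```

Since tasks complete asynchronously, a common pattern is to poll this endpoint until `pending` becomes `false`. A minimal sketch, assuming `jq` is installed:

```sh
# Poll every 5 seconds until the task is no longer pending.
while [ "$(curl -s --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}" \
  | jq -r '.pending')" = "true" ]; do
  sleep 5
done
```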
Cancel a blocking task
Path: /api/1/crawlers/{id}/tasks/{tid}/cancel
HTTP Verb: POST
Description:
Cancel a specific task that is currently blocking a crawler.
The blocking task ID to use for {tid}
is available in the Get a Crawler response in the blockingTaskId
field. This field is only present if the queried crawler is blocked.
You can’t cancel non-blocking tasks. Using the endpoint on a non-blocking task returns an error.
Parameters:

| Parameter | Description |
|---|---|
| `id` | type: `string`. Required. The ID of the targeted crawler. |
| `tid` | type: `string`. Required. The ID of the blocking task. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing or invalid.
- 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.
Example:
```sh
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}/cancel"
```
When the request is successful, the HTTP response is 200 OK.
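Putting it together: since the blocking task ID comes from the Get a crawler response, you can chain the two calls. A sketch, assuming `jq` is installed and the crawler is actually blocked:

```sh
# Read blockingTaskId from the crawler (the field is only present while it's blocked)...
BLOCKING_TASK_ID=$(curl -s --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}" | jq -r '.blockingTaskId')

# ...then cancel that task.
curl -X POST -H "Content-Type: application/json" --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${BLOCKING_TASK_ID}/cancel"
```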
Get registered domains
Path: /api/1/domains
HTTP Verb: GET
Description:
Get a list of your registered domains.
Parameters:

Query:

| Parameter | Description |
|---|---|
| `appId` | type: `string`. Optional. Only retrieve results for this application. |
| `itemsPerPage` | type: `number`, default: `20`. Optional. The number of items per page. |
| `page` | type: `number`, default: `1`. Optional. The current page to fetch. |
Errors:

- 400: Bad request or request argument.
- 401: Authorization information is missing.
- 403: Authorization information is invalid.
Example:
```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/domains"
```
When the request is successful, the HTTP response is 200 OK
and returns a list of your registered domains:
```json
{
"items": [
{
"appId": "MYAPP00001",
"domain": "*.algolia.com",
"verified": true
},
{
"appId": "MYAPP00002",
"domain": "blog.algolia.com",
"verified": true
}
],
"itemsPerPage": 20,
"page": 1,
"total": 12
}
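```

For example, to list only the domains registered for a single application (`YOUR_APP_ID` is a placeholder):

```sh
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} \
  "https://crawler.algolia.com/api/1/domains?appId=YOUR_APP_ID"
```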