Crawler Settings

The QuantSearch crawler discovers and indexes your website content. Configure it to match your site structure and performance requirements.

Basic Settings

Max Pages

The maximum number of pages to crawl per job. This is limited by your plan:

Plan         Max Pages per Site
Free         50
Pro          2,000
Enterprise   10,000

Max Depth

How many links deep to follow from your start URL:

  • 0 - Only crawl the start URL(s)
  • 1 - Start URL + pages directly linked from it
  • 3 - Recommended for most documentation sites
  • 5+ - For deeply nested content
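To make the depth counting concrete: the start URL is depth 0, pages it links to are depth 1, and so on. The following is a minimal breadth-first sketch of that behaviour, not the crawler's actual implementation; get_links is a hypothetical stand-in for fetching a page and extracting its links.

from collections import deque

def crawl(start_url, max_depth, get_links):
    # Breadth-first walk; the start URL is depth 0.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        print("index", url)              # stand-in for indexing the page
        if depth >= max_depth:
            continue                     # links on this page would exceed Max Depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# max_depth=0 indexes only the start URL; max_depth=1 adds directly linked pages.
crawl("https://example.com/docs/", 1, get_links=lambda url: [])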

Concurrency

Number of parallel workers (1-20). Higher values crawl faster but put more load on your server. Default is 5.
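In effect, concurrency is the size of the worker pool fetching pages in parallel. A rough illustration with Python's standard library (the URLs and fetch function are placeholders, not the crawler's internals):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.status

urls = ["https://example.com/docs/", "https://example.com/blog/"]

# 5 workers mirrors the default setting; more workers finish sooner
# but send more simultaneous requests to your origin.
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)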

Content Filtering

Include Patterns

Only crawl URLs matching these regex patterns. One pattern per line.

# Only crawl /docs/ section
^/docs/

# Only crawl English pages
^/en/

Exclude Patterns

Skip URLs matching these patterns:

# Skip preview/draft pages
\?.*preview=true

# Skip private areas
^/admin/
^/internal/

# Skip generated files
\.pdf$
\.zip$
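To sanity-check include and exclude patterns before a crawl, you can mimic the filtering locally. A minimal sketch, assuming patterns are unanchored regexes tested against the URL path with re.search; verify against your crawl results if they behave differently:

import re

include = [r"^/docs/", r"^/en/"]
exclude = [r"\?.*preview=true", r"^/admin/", r"\.pdf$", r"\.zip$"]

def should_crawl(path):
    # With include patterns set, at least one must match...
    if include and not any(re.search(p, path) for p in include):
        return False
    # ...and no exclude pattern may match.
    return not any(re.search(p, path) for p in exclude)

print(should_crawl("/docs/getting-started"))   # True
print(should_crawl("/docs/spec.pdf"))          # False: excluded by \.pdf$
print(should_crawl("/blog/launch"))            # False: matches no include pattern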

JavaScript Rendering

Enable JavaScript rendering for Single Page Applications (SPAs) or sites with dynamically loaded content. This uses a headless browser to render pages.

Performance Note

JavaScript rendering is slower and more resource-intensive. Only enable it if your content requires JavaScript to display.
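If you are unsure whether a page needs rendering, compare the raw HTML with what a headless browser produces. The sketch below uses Playwright purely to illustrate the difference; it is not how QuantSearch renders pages internally.

# pip install playwright && playwright install chromium
import urllib.request
from playwright.sync_api import sync_playwright

url = "https://example.com/docs/"

raw_html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()   # HTML after client-side JavaScript has run
    browser.close()

# If rendered_html contains body text that raw_html lacks, the site
# probably needs JavaScript rendering enabled.
print(len(raw_html), len(rendered_html))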

Custom Headers

Send custom HTTP headers with each request. Useful for:

  • Basic authentication (staging sites)
  • API keys
  • Custom user agents

{
  "Authorization": "Basic YWRtaW46cGFzc3dvcmQ=",
  "X-Custom-Header": "value"
}
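Before saving, it is worth confirming that the headers actually grant access to your site. A quick check with the requests library; the credentials and URL below are placeholders:

# pip install requests
import base64
import requests

headers = {
    "Authorization": "Basic " + base64.b64encode(b"admin:password").decode(),
    "X-Custom-Header": "value",
}

resp = requests.get("https://staging.example.com/docs/", headers=headers, timeout=10)
print(resp.status_code)   # expect 200 if the credentials are accepted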

Single URL Crawls

You can also crawl specific URLs without following links. This is useful for:

  • Refreshing specific pages
  • Adding new content immediately
  • Testing changes

Enter up to 50 URLs (one per line) in the "Crawl URLs" modal.

Robots.txt

By default, the crawler respects your robots.txt file. It identifies as:

User-agent: QuantBot/1.0 (+https://quantcdn.io/bot)

To allow QuantSearch while blocking other bots:

# robots.txt
User-agent: *
Disallow: /

User-agent: QuantBot
Allow: /
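You can check the effect of these rules with Python's standard robotparser before running a crawl. This assumes the crawler matches on the QuantBot token shown above:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # fetches and parses the live robots.txt

# With the rules above: QuantBot is allowed, other bots are not.
print(rp.can_fetch("QuantBot/1.0", "https://example.com/docs/"))   # True
print(rp.can_fetch("OtherBot/1.0", "https://example.com/docs/"))   # False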

Crawl Frequency

Currently, crawls are triggered manually from the dashboard. Scheduled crawls are coming soon.

Recommended frequency:

  • Documentation - After each release
  • Blog - Weekly
  • Marketing site - When content changes

Troubleshooting

Pages not being indexed

Check:

  • URL matches your include patterns (if any)
  • URL doesn't match exclude patterns
  • Max depth allows reaching the page
  • Page returns 200 status code
  • robots.txt allows crawling
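The status-code and robots.txt checks can be scripted for a quick first pass; the pattern checks follow the same logic as the filtering sketch above. A rough example:

import urllib.error
import urllib.request
from urllib import robotparser
from urllib.parse import urlparse

def diagnose(url):
    parts = urlparse(url)

    # robots.txt: is QuantBot allowed to fetch this URL?
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    print("robots.txt allows QuantBot:", rp.can_fetch("QuantBot/1.0", url))

    # Status code: only pages returning 200 are indexed.
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print("status code:", resp.status)
    except urllib.error.HTTPError as err:
        print("status code:", err.code)

diagnose("https://example.com/docs/getting-started")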

Content not extracted correctly

If content appears wrong in search results:

  • Enable JavaScript rendering for SPAs
  • Check if content is behind login/authentication
  • Ensure content is in standard HTML (not canvas/images)

Crawl too slow

  • Increase concurrency (be careful with your server load)
  • Disable JavaScript rendering if not needed
  • Use include patterns to focus on important sections