Website Crawling

Automatically import and sync content from your website to keep your bot's knowledge always up-to-date.

How Website Crawling Works

  1. Enter Your Website URL

     Provide your main domain (e.g., https://yoursite.com)

  2. Configure Crawl Settings

     Choose which pages to include or exclude and set the crawl depth

  3. Crawler Extracts Content

     Our bot visits each page and extracts its text, ignoring navigation and footers (see the sketch after this list)

  4. Content is Indexed

     AI processes and indexes the content for intelligent retrieval

  5. Auto-Sync (Optional)

     Schedule regular re-crawls to keep content fresh
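
As an illustration of the extraction step, here is a minimal sketch built on the third-party requests and BeautifulSoup libraries. It is not our crawler's actual code, just one way to fetch a page and keep its text while dropping navigation, footer, and script elements:

```python
import requests
from bs4 import BeautifulSoup

def extract_main_text(url: str) -> str:
    """Fetch a page and return its visible text, minus boilerplate elements."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely carry knowledge-base value.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Return the remaining text as clean, non-empty lines.
    return "\n".join(
        line.strip() for line in soup.get_text("\n").splitlines() if line.strip()
    )

print(extract_main_text("https://yoursite.com/docs"))
```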

Crawl Settings

Start URL

The URL where crawling begins

Example: https://yoursite.com or https://yoursite.com/docs

Crawl Depth

How many links deep the crawler follows from the start URL

Example: 0 = start page only, 1 = start page plus the pages it links to, 2 = those pages' links as well
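
To make the depth setting concrete, here is a rough depth-limited, breadth-first crawl sketched with requests and BeautifulSoup. The function and parameter names are illustrative and do not mirror our internal implementation:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 1, max_pages: int = 50) -> list[str]:
    """Breadth-first crawl: depth 0 is the start page, depth 1 adds its links, and so on."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([(start_url, 0)]), []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages.append(url)

        if depth == max_depth:  # do not follow links past the configured depth
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```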

Include Patterns

Only crawl URLs matching these patterns

Example: /docs/*, /blog/*, /help/*

Exclude Patterns

Skip URLs matching these patterns

Example: /admin/*, /login/*, /cart/*
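
If you want to sanity-check your patterns before crawling, a quick approximation can be written with Python's fnmatch module. Our matcher's exact wildcard semantics may differ slightly, so treat this as a rough preview rather than a definitive test:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/blog/*", "/help/*"]
EXCLUDE = ["/admin/*", "/login/*", "/cart/*"]

def should_crawl(url: str) -> bool:
    """Keep a URL only if its path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(path, pattern) for pattern in INCLUDE) if INCLUDE else True

print(should_crawl("https://yoursite.com/docs/getting-started"))  # True
print(should_crawl("https://yoursite.com/admin/settings"))        # False
```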

Max Pages

Maximum number of pages to crawl

Example: 50, 200, 500 (depends on plan)

Respect robots.txt

Honor your site's robots.txt rules

Example: Enabled by default
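
You can check for yourself what a robots.txt-respecting crawler is allowed to fetch using Python's standard-library robotparser. This inspects your own site's rules and is independent of our crawler:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Report whether the site's robots.txt permits fetching the given URL."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # downloads and parses robots.txt
    return robots.can_fetch(user_agent, url)

print(allowed_by_robots("https://yoursite.com/docs/getting-started"))
```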

Auto-Sync Schedule

Keep your knowledge base current by scheduling automatic re-crawls:

Daily

Best for: News sites, frequently updated content

Weekly

Best for: Most websites, documentation

Bi-weekly

Best for: Stable content, product pages

Monthly

Best for: Rarely changing content

Pro Tip: Set up a webhook to trigger a re-crawl whenever you publish new content on your website.
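
For example, a publish hook in your CMS could call a re-crawl endpoint. The host, route, and authentication below are placeholders only; substitute the actual values from your dashboard's API reference:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: use your real API key
KB_ID = "kb_123"          # placeholder: your knowledge base ID

def on_content_published() -> None:
    """Call this from your CMS publish webhook to request a fresh crawl."""
    # Hypothetical endpoint: check the API reference for the real re-crawl route.
    requests.post(
        f"https://api.example.com/v1/knowledge-bases/{KB_ID}/recrawl",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
```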

Best Practices

Do

  • Start with specific sections (e.g., /help)
  • Use include patterns to focus crawling
  • Exclude login/admin pages
  • Set a reasonable crawl depth (1-2)
  • Review crawled pages before enabling the bot
  • Set up auto-sync for dynamic content

Don't

  • Crawl your entire site without filters
  • Include user-generated content pages
  • Crawl pages behind authentication
  • Set crawl depth too high (>3)
  • Forget to exclude duplicate content
  • Crawl competitor websites

Troubleshooting

Pages not being crawled

Check robots.txt rules, ensure pages are publicly accessible, and verify include/exclude patterns.

Wrong content extracted

Our crawler extracts the main content of each page. For single-page applications (e.g., React sites), ensure server-side rendering is enabled so the text is present in the initial HTML rather than injected by JavaScript.
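
A quick way to check this yourself (a rough diagnostic, assuming the requests library is installed): fetch the raw HTML without executing JavaScript and look for text you expect to see on the page.

```python
import requests

def text_visible_without_js(url: str, expected_text: str) -> bool:
    """Crawlers see only the served HTML; if the text is missing here, JavaScript injects it."""
    html = requests.get(url, timeout=10).text
    return expected_text.lower() in html.lower()

print(text_visible_without_js("https://yoursite.com/docs", "Getting Started"))
```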

Crawl taking too long

Reduce crawl depth, narrow include patterns, or split into multiple crawl jobs.

Stale content in responses

Trigger a manual re-crawl or set up auto-sync with appropriate frequency.