Website Crawling

Automatically import and sync content from your website to keep your bot's knowledge always up-to-date.

How Website Crawling Works

  1. Enter Your Website URL

     Provide your main domain (e.g., https://yoursite.com)

  2. Configure Crawl Settings

     Choose which pages to include or exclude and set the crawl depth

  3. Crawler Extracts Content

     Our bot visits each page and extracts its text, ignoring navigation and footers (see the sketch after this list)

  4. Content is Indexed

     AI processes and indexes the content for intelligent retrieval

  5. Auto-Sync (Optional)

     Schedule regular re-crawls to keep content fresh
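
As an illustration of the extraction step, here is a minimal sketch built on the third-party requests and BeautifulSoup libraries. It is not our crawler's actual code, just one way to fetch a page and keep its text while dropping navigation, footer, and script elements:

```python
import requests
from bs4 import BeautifulSoup

def extract_main_text(url: str) -> str:
    """Fetch a page and return its visible text, minus boilerplate elements."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely carry knowledge-base value.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Return the remaining text as clean, non-empty lines.
    return "\n".join(
        line.strip() for line in soup.get_text("\n").splitlines() if line.strip()
    )

print(extract_main_text("https://yoursite.com/docs"))
```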

Crawl Settings

Start URL

The URL where crawling begins

Example: https://yoursite.com or https://yoursite.com/docs

Crawl Depth

How many links deep the crawler follows from the start URL

Example: 0 = start page only, 1 = start page plus the pages it links to, 2 = those pages' links as well
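
To make the depth setting concrete, here is a rough depth-limited, breadth-first crawl sketched with requests and BeautifulSoup. The function and parameter names are illustrative and do not mirror our internal implementation:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 1, max_pages: int = 50) -> list[str]:
    """Breadth-first crawl: depth 0 is the start page, depth 1 adds its links, and so on."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([(start_url, 0)]), []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages.append(url)

        if depth == max_depth:  # do not follow links past the configured depth
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```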

Include Patterns

Only crawl URLs matching these patterns

Example: /docs/*, /blog/*, /help/*

Exclude Patterns

Skip URLs matching these patterns

Example: /admin/*, /login/*, /cart/*
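
If you want to sanity-check your patterns before crawling, a quick approximation can be written with Python's fnmatch module. Our matcher's exact wildcard semantics may differ slightly, so treat this as a rough preview rather than a definitive test:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/blog/*", "/help/*"]
EXCLUDE = ["/admin/*", "/login/*", "/cart/*"]

def should_crawl(url: str) -> bool:
    """Keep a URL only if its path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(path, pattern) for pattern in INCLUDE) if INCLUDE else True

print(should_crawl("https://yoursite.com/docs/getting-started"))  # True
print(should_crawl("https://yoursite.com/admin/settings"))        # False
```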

Max Pages

Maximum number of pages to crawl

Example: 50, 200, 500 (depends on plan)

Respect robots.txt

Honor your site's robots.txt rules

Example: Enabled by default
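
You can check for yourself what a robots.txt-respecting crawler is allowed to fetch using Python's standard-library robotparser. This inspects your own site's rules and is independent of our crawler:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Report whether the site's robots.txt permits fetching the given URL."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # downloads and parses robots.txt
    return robots.can_fetch(user_agent, url)

print(allowed_by_robots("https://yoursite.com/docs/getting-started"))
```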

Auto-Sync Schedule

Keep your knowledge base current by scheduling automatic re-crawls:

Daily

Best for: News sites, frequently updated content

Weekly

Best for: Most websites, documentation

Bi-weekly

Best for: Stable content, product pages

Monthly

Best for: Rarely changing content

Pro Tip: Set up a webhook to trigger a re-crawl whenever you publish new content on your website.
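
For example, a publish hook in your CMS could call a re-crawl endpoint. The host, route, and authentication below are placeholders only; substitute the actual values from your dashboard's API reference:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: use your real API key
KB_ID = "kb_123"          # placeholder: your knowledge base ID

def on_content_published() -> None:
    """Call this from your CMS publish webhook to request a fresh crawl."""
    # Hypothetical endpoint: check the API reference for the real re-crawl route.
    requests.post(
        f"https://api.example.com/v1/knowledge-bases/{KB_ID}/recrawl",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
```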

Best Practices

Do

  • Start with specific sections (e.g., /help)
  • Use include patterns to focus crawling
  • Exclude login/admin pages
  • Set a reasonable crawl depth (1-2)
  • Review crawled pages before enabling the bot
  • Set up auto-sync for dynamic content

Don't

  • Crawl your entire site without filters
  • Include user-generated content pages
  • Crawl pages behind authentication
  • Set crawl depth too high (>3)
  • Forget to exclude duplicate content
  • Crawl competitor websites

Troubleshooting

Pages not being crawled

Check robots.txt rules, ensure pages are publicly accessible, and verify include/exclude patterns.

Wrong content extracted

Our crawler extracts the main content of each page. For single-page applications (e.g., React sites), ensure server-side rendering is enabled so the text is present in the initial HTML rather than injected by JavaScript.
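
A quick way to check this yourself (a rough diagnostic, assuming the requests library is installed): fetch the raw HTML without executing JavaScript and look for text you expect to see on the page.

```python
import requests

def text_visible_without_js(url: str, expected_text: str) -> bool:
    """Crawlers see only the served HTML; if the text is missing here, JavaScript injects it."""
    html = requests.get(url, timeout=10).text
    return expected_text.lower() in html.lower()

print(text_visible_without_js("https://yoursite.com/docs", "Getting Started"))
```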

Crawl taking too long

Reduce crawl depth, narrow include patterns, or split into multiple crawl jobs.

Stale content in responses

Trigger a manual re-crawl or set up auto-sync with appropriate frequency.