Website Crawling
Automatically import and sync content from your website to keep your bot's knowledge up to date.
How Website Crawling Works
1. Enter Your Website URL: Provide your main domain (e.g., https://yoursite.com)
2. Configure Crawl Settings: Choose pages to include/exclude and set the crawl depth
3. Crawler Extracts Content: Our bot visits pages and extracts text, ignoring navigation and footers (a minimal crawl sketch follows these steps)
4. Content is Indexed: AI processes and indexes content for intelligent retrieval
5. Auto-Sync (Optional): Schedule regular re-crawls to keep content fresh
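The flow above boils down to a depth-limited, breadth-first walk of your site. The sketch below is illustrative only (it is not our production crawler): it stays on one domain, stops at a given link depth, and caps the number of pages fetched.

```python
# Minimal sketch of the crawl flow above: breadth-first, depth-limited,
# same-domain. Illustrative only -- not the product's actual crawler.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth=1, max_pages=50):
    """Visit pages breadth-first up to max_depth link-hops from start_url."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([(start_url, 0)]), []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        pages.append((url, html))  # extracted text would be indexed here
        if depth >= max_depth:
            continue  # depth 0 = start page only, 1 = + direct links, ...
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages
```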
Crawl Settings
Start URL
The URL where crawling begins
Example: https://yoursite.com or https://yoursite.com/docs
Crawl Depth
How many links deep to follow from the start URL
Example: 0 = start page only, 1 = start page plus directly linked pages, 2 = two levels of links
Include Patterns
Only crawl URLs matching these patterns
Example: /docs/*, /blog/*, /help/*
Exclude Patterns
Skip URLs matching these patterns
Example: /admin/*, /login/*, /cart/*
Max Pages
Maximum number of pages to crawl
Example: 50, 200, 500 (depends on plan)
Respect robots.txt
Honor your site's robots.txt rules
Example: Enabled by default
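Put together, a crawl configuration might look like the sketch below. The field names are illustrative for this example, not an exact API schema; use the dashboard fields described above.

```python
# Illustrative crawl configuration combining the settings above.
# Field names are assumptions for this sketch, not the product's actual schema.
crawl_config = {
    "start_url": "https://yoursite.com/docs",
    "crawl_depth": 2,            # 0 = start page only, 1 = + direct links, ...
    "include_patterns": ["/docs/*", "/help/*"],
    "exclude_patterns": ["/admin/*", "/login/*", "/cart/*"],
    "max_pages": 200,            # plan-dependent upper bound
    "respect_robots_txt": True,  # enabled by default
}
```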
Auto-Sync Schedule
Keep your knowledge base current by scheduling automatic re-crawls:
Daily
Best for: News sites, frequently updated content
Weekly
Best for: Most websites, documentation
Bi-weekly
Best for: Stable content, product pages
Monthly
Best for: Rarely changing content
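The built-in scheduler handles these frequencies for you. If you prefer to drive re-crawls yourself (for example from your own cron job), the schedules above roughly correspond to these standard cron expressions:

```python
# Rough cron equivalents of the auto-sync frequencies above.
# Only needed if you trigger re-crawls yourself instead of using the
# built-in scheduler.
SYNC_SCHEDULES = {
    "daily": "0 3 * * *",        # every day at 03:00
    "weekly": "0 3 * * 1",       # Mondays at 03:00
    "bi-weekly": "0 3 1,15 * *", # approximately every two weeks (1st and 15th)
    "monthly": "0 3 1 * *",      # first day of each month
}
```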
Pro Tip: Set up a webhook to trigger a re-crawl whenever you publish new content on your website.
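A minimal sketch of that webhook handler is shown below. The endpoint URL, path, and token are hypothetical placeholders; check your dashboard or API reference for the actual re-crawl endpoint.

```python
# Sketch of the webhook idea above: when your CMS fires a "content published"
# event, call the platform's re-crawl endpoint. The URL, path, and token are
# hypothetical placeholders -- not a documented API.
import json
from urllib.request import Request, urlopen


def trigger_recrawl(api_token: str, source_id: str) -> int:
    req = Request(
        # Hypothetical endpoint for illustration only.
        f"https://api.example-botplatform.com/v1/sources/{source_id}/recrawl",
        data=json.dumps({"reason": "content_published"}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return resp.status  # a 2xx status means the re-crawl was queued
```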
Best Practices
Do
- Start with specific sections (e.g., /help)
- Use include patterns to focus crawling (see the pattern-matching sketch after these lists)
- Exclude login/admin pages
- Set a reasonable crawl depth (1-2)
- Review crawled pages before enabling the bot
- Set up auto-sync for dynamic content
Don't
- Crawl your entire site without filters
- Include user-generated content pages
- Crawl pages behind authentication
- Set crawl depth too high (>3)
- Forget to exclude duplicate content
- Crawl competitor websites
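To see why include/exclude patterns matter, here is a small sketch of how such filtering can work, assuming glob-style matching against the URL path (the exact matching rules used by the crawler may differ):

```python
# Sketch of include/exclude filtering, assuming glob-style matching on the
# URL path. Patterns and matching rules here are illustrative.
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/help/*"]
EXCLUDE = ["/admin/*", "/login/*", "/cart/*"]


def should_crawl(url: str) -> bool:
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    # With include patterns set, only matching paths are crawled.
    return any(fnmatch(path, pattern) for pattern in INCLUDE)


print(should_crawl("https://yoursite.com/docs/getting-started"))  # True
print(should_crawl("https://yoursite.com/admin/settings"))        # False
```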
Troubleshooting
Pages not being crawled
Check robots.txt rules, ensure pages are publicly accessible, and verify include/exclude patterns.
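If you suspect robots.txt is the cause, you can check a specific URL with Python's standard robots.txt parser:

```python
# Check whether robots.txt would block a page. The crawler may also apply
# plan limits and include/exclude patterns on top of this.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://yoursite.com/robots.txt")
robots.read()
print(robots.can_fetch("*", "https://yoursite.com/docs/getting-started"))
```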
Wrong content extracted
Our crawler tries to extract the main content of each page. For single-page applications (e.g., React sites), ensure server-side rendering is enabled so the text is present in the initial HTML.
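A quick way to test this is to fetch the raw HTML without running JavaScript and look for a phrase you know appears on the rendered page; if it is missing, the content is injected client-side and the crawler will not see it. The URL and phrase below are placeholders:

```python
# Fetch the raw HTML (no JavaScript execution) and check whether the page's
# text is actually served by the server. URL and phrase are placeholders.
from urllib.request import urlopen

html = urlopen("https://yoursite.com/docs/getting-started", timeout=10).read()
text = html.decode("utf-8", "replace").lower()
print("getting started" in text)  # False suggests client-side rendering only
```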
Crawl taking too long
Reduce crawl depth, narrow include patterns, or split into multiple crawl jobs.
Stale content in responses
Trigger a manual re-crawl or set up auto-sync with appropriate frequency.