AI Messaging Agents can learn directly from your public-facing content to deliver accurate, consistent, and on-brand answers during customer conversations. This is made possible through Knowledge Sources, which allow your AI Messaging Agent to reference trusted information in real time. This article explains what a Knowledge Source is, how it benefits your AI Messaging Agent, and the current limitations to be aware of.
What is a knowledge source?
A knowledge source is a centralised library of information about your company, such as products, services, and frequently asked questions. It helps ensure information is easy to find, easy to reference, and easy for your AI Messaging Agent to understand.
Aircall uses the content you provide, for example public webpages, to build a knowledge source that your AI Messaging Agent can rely on during customer conversations.
How knowledge sources help your AI Messaging Agent
Once your content is added as a Knowledge Source, your AI Messaging Agent can:
- Answer common questions using accurate, brand-approved information
- Maintain consistent messaging across conversations
- Reduce repetitive manual responses
- Reference your content instantly during customer interactions
This ensures customers receive precise and helpful answers based directly on your own published information.
Supported content types
You can add new Knowledge Sources in the following ways:
- Block of content: Paste any plain text you want the agent to learn from.
- Webpage: Add a single public URL.
- Website: Add a main public domain, with optional subpages.
- Existing sources: Reuse or update content you have already added for your AI Voice Agent.
Note: All content added as a Knowledge Source must be publicly available. Sources can be shared across your AI Voice Agent and AI Messaging Agent. If a rule or piece of information applies to one channel only, create a dedicated source for it rather than adding it to a shared one.
Current limitations
To ensure the best results, be aware of the following limitations.
Gated or authentication-required pages
Knowledge Sources cannot ingest content from:
- Login-required pages
- Password-protected areas
- Internal portals or dashboards
- Pages behind paywalls
Only public URLs are supported.
Image-only content
If important information appears only as images, such as text embedded in images, diagrams, or screenshots, it may not be readable or usable by the AI Messaging Agent.
Document uploading not yet supported
You currently cannot upload files such as PDFs, Word documents, or spreadsheets.
Note: Support for document uploads is planned for a future version.
Managing website crawling in your knowledge sources
When you add a website as a knowledge source, Aircall crawls it and processes the pages it finds. This section explains how website crawling works, how content is processed, and what limits apply.
How website crawling works
When you add a website URL to your knowledge sources, Aircall automatically processes the page you provide, the pages it links to, and the pages those linked pages reference. Linked pages are included only if their URLs share the same prefix as the URL you added.
Crawl depth
Crawling covers the page you add and up to two levels deeper, as long as the URLs share the same prefix.
Example
If you add https://website.com/depth1/, the crawler may also process:
- https://website.com/depth1/depth2
- https://website.com/depth1/depth2/depth3
It will not crawl unrelated sections such as:
- https://website.com/blog
- https://website.com/contact
This ensures only relevant sections of your website are included.
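The scoping rule above can be illustrated with a minimal sketch. It is not Aircall's crawler: the function name `crawl_scope`, the in-memory `SITE_LINKS` graph, and the use of a simple string-prefix comparison are all assumptions made for illustration. It follows links up to two levels from the starting page and keeps only URLs that begin with the URL you added.

```python
from collections import deque


def crawl_scope(start_url, links, max_depth=2):
    """Return URLs reachable within max_depth link levels that share start_url's prefix.

    `links` is a hypothetical mapping of page URL -> list of URLs it links to.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            # Assumption: "same prefix" means the URL starts with the added URL.
            if target.startswith(start_url) and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return sorted(seen)


# Hypothetical link graph based on the example URLs in this article.
SITE_LINKS = {
    "https://website.com/depth1/": [
        "https://website.com/depth1/depth2",
        "https://website.com/blog",
    ],
    "https://website.com/depth1/depth2": [
        "https://website.com/depth1/depth2/depth3",
        "https://website.com/contact",
    ],
    "https://website.com/depth1/depth2/depth3": [
        "https://website.com/depth1/depth2/depth3/depth4",
    ],
}

in_scope = crawl_scope("https://website.com/depth1/", SITE_LINKS)
for url in in_scope:
    print(url)
```

In this sketch, the blog and contact pages are skipped because they do not share the prefix, and the `depth4` page is skipped because it sits three levels below the starting page.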
Review and select pages from your website URL
After adding a website URL, you can review the sitemap generated for that source and see which pages are associated with it. From there, you can select or deselect pages to control exactly what is included in your knowledge sources.
As you add pages, you can monitor the character limit indicator to see how much of the available limit your knowledge source is using.
How content is extracted and cleaned
All website content goes through multi-stage processing to ensure high-quality knowledge.
| Category | Details | Purpose |
|---|---|---|
| What is removed | Navigation menus, headers and footers, cookie banners, advertisements, images and videos, scripts and malicious code, formatting noise and redundant HTML | Removes non-essential and potentially unsafe elements so only relevant, clean content is processed. |
| What is kept | Headings, paragraphs, lists, structured article content | Preserves structured and meaningful content that contributes to accurate knowledge retrieval. |
| Why this matters | Improves response accuracy, prevents irrelevant content from affecting answers, reduces unnecessary processing, enhances security | Ensures higher-quality responses and improved reliability of the AI Messaging Agent. |
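
To give a rough sense of what this kind of cleaning involves, here is a simplified sketch using Python's standard `html.parser` module. It is an illustration only and does not reflect Aircall's actual pipeline: the `ContentExtractor` class, the `SKIP_TAGS` and `KEEP_TAGS` sets, and the sample page are all assumptions. It drops text inside navigation, header, footer, and script elements and keeps the text of headings, paragraphs, and list items.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; a real pipeline would cover many more cases.
SKIP_TAGS = {"nav", "header", "footer", "script", "style", "aside"}
KEEP_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}


class ContentExtractor(HTMLParser):
    """Collect text from headings, paragraphs, and list items outside skipped areas."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a removed element
        self.keep_depth = 0   # > 0 while inside a kept element
        self.blocks = []      # cleaned text blocks, in document order
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS and self.skip_depth == 0:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in KEEP_TAGS and self.keep_depth > 0 and self.skip_depth == 0:
            self.keep_depth -= 1
            if self.keep_depth == 0:
                # Collapse whitespace and store the finished block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self.skip_depth == 0 and self.keep_depth > 0:
            self._buffer.append(data)


SAMPLE_HTML = """
<html><body>
<nav><ul><li>Home</li><li>Pricing</li></ul></nav>
<header><p>Accept cookies?</p></header>
<h1>Refund policy</h1>
<p>Refunds are issued within <b>14 days</b>.</p>
<ul><li>Contact support</li><li>Share your order number</li></ul>
<script>trackVisitor();</script>
<footer><p>Copyright</p></footer>
</body></html>
"""

parser = ContentExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
cleaned = parser.blocks
for block in cleaned:
    print(block)
```

The sample output contains only the heading, the paragraph, and the two list items from the article body, which is why structured pages with clear headings tend to make better sources.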
Processing time expectations
Processing time depends on the size of the crawl:
- 1 to 10 pages typically process in under one minute.
- Medium-sized sections may take 5 to 10 minutes.
- Large root-level crawls may take up to 30 minutes.
You can monitor progress using the document status indicator.
Best practices for website ingestion
| Topic | Recommendation | Details |
|---|---|---|
| Start with specific URLs | Add precise, deep-linked pages instead of root domains. | Instead of https://website.com/, use a targeted page such as https://website.com/help/article-name. The deeper the URL path, the more targeted the crawl. |
| Expand gradually | Move up one directory level at a time if broader coverage is needed. | Move from https://website.com/help/article-name to https://website.com/help/. Avoid adding the root URL unless you need content from across the entire site. |
| Avoid over-crawling | Do not start with root-level URLs unless necessary. | Root-level URLs can capture hundreds of pages, increase processing time, trigger summarisation, and introduce irrelevant content. |
| Use structured knowledge pages | Prioritise well-organised, content-focused pages. | Best-performing sources include help centres, documentation hubs, FAQ sections, and structured articles with clear headings. |
| Avoid unsuitable content types | Exclude pages that are dynamic, restricted, or unstructured. | Avoid login-required pages, search result pages, dynamic or form-based content, news feeds, and media-heavy pages. |
| Review after crawling | Validate results once processing is complete. | Check the document preview to confirm the correct pages were captured, no duplicate URLs were added, and content is structured properly. You can refresh website content later if the source page updates. |
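
The "Expand gradually" recommendation in the table above can be sketched as a small helper that lists the URLs you would try, from most specific to broadest. The function names `parent_url` and `expansion_steps` are hypothetical and exist only for this illustration; they use Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_url(url):
    """Return the URL one directory level up, or None if already at the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    parent_path = "/" + "/".join(segments[:-1])
    if not parent_path.endswith("/"):
        parent_path += "/"
    return urlunsplit((parts.scheme, parts.netloc, parent_path, "", ""))


def expansion_steps(url):
    """List candidate URLs from the most targeted page up to the root domain."""
    steps = [url]
    current = parent_url(url)
    while current is not None:
        steps.append(current)
        current = parent_url(current)
    return steps


steps = expansion_steps("https://website.com/help/article-name")
for step in steps:
    print(step)
```

Start with the first URL in the list and move to the next one only if the crawl does not capture enough content. The last entry is the root URL, which is best left until you need content from across the entire site.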
Tip: Consider using manual FAQ or text input instead of website crawling when content changes frequently (such as news or real-time data), pages require authentication, the website is primarily video or image-based, or the content is unstructured.