Sitecore Search is designed to support a wide variety of content ingestion strategies—whether that means crawling HTML pages, consuming structured data from APIs, or indexing multiple locales. In our initial setup, we followed platform best practices: each site and data source had its own content source and crawler. This setup was clean, modular, and easy to manage.
However, due to a licensing restriction that allowed only one content source, we had to consolidate everything into a single source—while preserving the behavior of 17 previously independent sources.
Here’s what we started with:
- 17 separate content sources, each with its own crawler:
  - 16 sitemap-based sources (one per locale/site).
  - 1 API-based source that indexed downloadable documents.
The Problem
The key challenges in consolidating them:
- Sitemap and API crawlers are mutually exclusive within a single source.
- Each sitemap crawler mapped to a specific locale, but the API crawler needed to index documents for all locales.
- The document extractor logic differed between the website and the API-based documents.
The Solution
1. Use an Advanced Web Crawler
I consolidated the 16 sitemaps by creating a single Web Crawler (Advanced) and added each sitemap as a trigger. This immediately reduced the number of content sources and centralized crawling.
2. Refactor the API to Output a Sitemap
Since API crawlers couldn’t coexist with sitemap crawlers in a single source, I took a different approach:
- I modified the API at /api/documents to instead return a sitemap-compatible XML format at /api/documents.xml.
- Each entry in this sitemap pointed to a “fake” detail page for the document.
Example:
/api/download?documentNumber=LS128EN → (original download endpoint)
/api/download?details=true&documentNumber=LS128EN → (HTML page with metadata)
The details=true parameter signals the API to return a simple HTML page containing only the metadata that Sitecore Search needs to index.
This allowed the API-based content to be indexed using the same Advanced Web Crawler, just like the other sitemaps.
Because my project was written with Next.js, it was fairly straightforward to make this conversion.
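For illustration, here is a minimal sketch of what that conversion might look like. The document shape, field names, and domain below are hypothetical placeholders, not the actual project code; the core idea is a pure function that turns the API's document list into sitemap XML, with each <loc> pointing at the details=true version of the download URL.

```javascript
// Minimal sketch: turn an API document list into sitemap XML.
// The document shape and domain are hypothetical placeholders.
function buildDocumentSitemap(documents, baseUrl) {
    const entries = documents.map(function (doc) {
        // Point each entry at the "fake" detail page, not the raw download.
        const loc = baseUrl + '/api/download?details=true&documentNumber=' +
            encodeURIComponent(doc.documentNumber);
        return '  <url>\n    <loc>' + loc + '</loc>\n  </url>';
    });
    return '<?xml version="1.0" encoding="UTF-8"?>\n' +
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
        entries.join('\n') + '\n' +
        '</urlset>';
}
```

In a Next.js pages-router project, a file like pages/api/documents.xml.js could serve this string with res.setHeader('Content-Type', 'application/xml') before sending it.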
3. Solve Locale Detection with a Custom Extractor
The original sitemap crawlers had explicit locales, but now everything was coming from one source. To fix this:
- I created a locale extractor that examines the subdomain of each crawled page URL.
- Each subdomain is mapped to a specific locale.
- For the API document sitemap, I included all locales since the same documents are relevant to every language.
The Locale Extractor I wrote can be found below:
```javascript
function extract(request, response) {
    const url = request.url;
    const localeMap = {
        // North America
        www: 'en-US',
        ca: 'en-CA',
        mx: 'es-MX',
        amer: 'en-BS',
        // Latin America
        br: 'pt-BR',
        latam: 'pt-AO',
        // EMEA
        fr: 'fr-FR',
        de: 'de-DE',
        it: 'it-IT',
        pt: 'pt-PT',
        es: 'es-ES',
        uk: 'en-GB',
        emea: 'de-CH',
        // Asia Pacific
        cn: 'zh-CN',
        kr: 'ko-KR',
        sg: 'en-SG',
        apac: 'en-HK'
    };
    // Change the pattern below to match your website's domains, or replace it altogether to suit your needs.
    const match = url.match(/^https:\/\/([a-z]{2,5})-tst-sc\.mywebsite\.com/);
    const localeKey = match ? match[1] : 'www'; // default to www, the US site
    let fullLocale = localeMap[localeKey] || 'en-US';
    // The Canadian site is multilingual, so do an additional check for French.
    if (fullLocale.toLowerCase() === 'en-ca' && url.toLowerCase().includes('fr-ca')) {
        fullLocale = 'fr-CA';
    }
    return fullLocale;
}
```
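Because the mapping is pure string logic, it is easy to unit-test outside Sitecore Search by lifting it into a plain function. The sketch below mirrors the extractor's logic against the same (test-environment) domain pattern; the map is trimmed to a few entries for brevity.

```javascript
// Standalone version of the locale-mapping logic for unit testing.
// The domain pattern and subdomain map mirror the extractor above.
function detectLocale(url) {
    const localeMap = {
        www: 'en-US', ca: 'en-CA', mx: 'es-MX',
        de: 'de-DE', uk: 'en-GB', cn: 'zh-CN'
    };
    const match = url.match(/^https:\/\/([a-z]{2,5})-tst-sc\.mywebsite\.com/);
    const localeKey = match ? match[1] : 'www'; // default to the US site
    let fullLocale = localeMap[localeKey] || 'en-US';
    // The Canadian site is multilingual: check the path for French.
    if (fullLocale === 'en-CA' && url.toLowerCase().includes('fr-ca')) {
        fullLocale = 'fr-CA';
    }
    return fullLocale;
}
```

Running a handful of representative URLs through this before wiring the extractor into the crawler catches mapping mistakes much faster than a full re-crawl.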
4. Dual Document Extractors with Prioritized Matching
Sitecore Search currently doesn’t support conditionally applying different extractors by logic—so I had to get creative:
- I created two document extractors:
  - One for API documents (from the new fake pages).
  - One for standard sitemap content.
- The API document extractor was ordered first and used a regex URL match (/api/) to catch only those URLs.
  - This extractor had its localized flag set to false, which tells Sitecore Search to index each result in all available locales.
- All other URLs fell back to the standard sitemap extractor.
  - This extractor had its localized flag set to true, which feeds into the locale extractor mentioned above.
This ensured:
- API documents were extracted correctly with their unique metadata format.
- Regular website pages were still indexed as expected.
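Conceptually, the ordered matching behaves like the sketch below. This is not Sitecore's internal code, just an illustration of why the ordering matters: extractors are tried top to bottom, the first whose URL pattern matches wins, and an extractor without a pattern acts as the catch-all fallback.

```javascript
// Illustrative sketch of ordered extractor matching (not Sitecore's code).
// Each extractor has an optional URL regex; the first match wins.
function pickExtractor(url, extractors) {
    for (const extractor of extractors) {
        if (!extractor.urlPattern || extractor.urlPattern.test(url)) {
            return extractor.name;
        }
    }
    return null;
}

// Mirrors the configuration described above: API documents first,
// the standard sitemap extractor as the unmatched-URL fallback.
const extractors = [
    { name: 'api_documents', urlPattern: /\/api\//, localized: false },
    { name: 'standard_sitemap', localized: true } // no pattern: fallback
];
```

If the fallback were ordered first, it would swallow every URL, including the /api/ ones, which is why the more specific extractor has to come before it.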
The document extractors I wrote can be found below:
API document extractor:
- Extractor Type: JS
- URLs to Match:
- Type: Regular Expression
- Value: /api/
- Localized: false
```javascript
// Sample extractor function. Change the function to suit your individual needs.
function extract(request, response) {
    $ = response.body;
    return [{
        'id': $('meta[property="id"]').attr('content'),
        'type': 'onbase_document',
        'name': $('meta[property="documentTitle"]').attr('content'),
        'page_header_title': $('meta[property="documentTitle"]').attr('content'),
        'document_title': $('meta[property="documentTitle"]').attr('content'),
        'url': $('meta[property="documentUrl"]').attr('content'),
        'product_id': $('meta[property="productId"]').attr('content'),
        'document_type': $('meta[property="documentType"]').attr('content'),
        'lot_number': $('meta[property="lotNumber"]').attr('content'),
        'available_languages': $('meta[property="availableLanguages"]').attr('content'),
        'tenant': 'My Website'
    }];
}
```
Standard sitemap extractor:
- Extractor Type: JS
- URLs to Match: N/A (all that don’t match the API document extractor)
- Localized: true
```javascript
// Sample extractor function. Change the function to suit your individual needs.
function extract(request, response) {
    $ = response.body;

    var bodyContent = '';
    $('.rich-text').each(function () {
        bodyContent += $(this).text();
    });

    // Safely read a meta tag and split its comma-separated value into an
    // array. Returns null when the tag is missing or empty. (Checking
    // .attr('content').length directly throws when the tag is absent,
    // because .attr() returns undefined.)
    function splitMeta(property) {
        var content = $('meta[property="' + property + '"]').attr('content');
        return content && content.length > 0 ? content.split(',') : null;
    }

    return [{
        // Language
        'language': $('html').attr('lang'),
        // Titles
        'page_title': $('title').text(),
        'page_header_title': $('meta[property="pageHeaderTitle"]').attr('content'),
        'page_short_title': $('meta[property="pageShortTitle"]').attr('content'),
        'page_subtitle': $('meta[property="pageSubtitle"]').attr('content'),
        // Descriptions, Keywords & Body
        'description': $('meta[name="description"]').attr('content'),
        'keywords': $('meta[name="keywords"]').attr('content'),
        'page_summary': $('meta[property="pageSummary"]').attr('content'),
        'body_content': bodyContent,
        // OpenGraph
        'open_graph_title': $('meta[property="og:title"]').attr('content'),
        'open_graph_description': $('meta[property="og:description"]').attr('content'),
        'open_graph_image': $('meta[property="og:image"]').attr('content'),
        'open_graph_url': $('meta[property="og:url"]').attr('content'),
        // Taxonomy
        'tax_content_type': splitMeta('taxContentType'),
        'tax_content_type_fixed': splitMeta('taxContentTypeFixed'),
        'tax_topic': splitMeta('taxTopic'),
        'tax_resource_type': splitMeta('taxResourceType'),
        'tax_event_type': splitMeta('taxEventType'),
        'tax_product_type': splitMeta('taxProductType'),
        'tax_product_cat': splitMeta('taxProductCat'),
        'tax_products': splitMeta('taxProducts'),
        'tax_service': splitMeta('taxService'),
        'tax_service_cat': splitMeta('taxServiceCat'),
        'tax_application': splitMeta('taxApplication'),
        'tax_industry': splitMeta('taxIndustry'),
        'tax_country': splitMeta('taxCountry'),
        'tax_region': splitMeta('taxRegion'),
        // Events & Downloads
        'event_start_date': $('meta[property="eventStartDate"]').attr('content'),
        'event_end_date': $('meta[property="eventEndDate"]').attr('content'),
        'event_timezone': $('meta[property="eventTimezone"]').attr('content'),
        'download_on_base': $('meta[property="downloadOnBase"]').attr('content'),
        'download_cta': $('meta[property="downloadCta"]').attr('content'),
        'is_gated': $('meta[property="isGated"]').attr('content'),
        // Other
        'tenant': $('meta[property="tenant"]').attr('content'),
        'id': $('meta[property="contentId"]').attr('content'),
        'name': $('title').text(),
        'type': $('meta[property="taxContentTypeFixed"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content'),
        'page_thumbnail': $('meta[property="pageThumbnail"]').attr('content'),
        'page_template': $('meta[property="pageTemplate"]').attr('content')
    }];
}
```
The Result
After these changes, I successfully:
- Consolidated 17 sources down to 1
- Unified all content under a single advanced web crawler
- Indexed multiple locales using dynamic detection via subdomain parsing
- Used fake detail pages to trick the crawler into treating API content like sitemap content
- Maintained separate extraction logic for different content types using extractor order and regex matching
- Achieved compliance with the Sitecore Search license.
This setup is more scalable, easier to manage, and unlocks the full power of Sitecore Search without compromising flexibility.
Key Takeaways
- Sitecore Search’s Advanced Web Crawler can serve as a universal collector with the right tricks.
- Customizing your API output to mimic sitemaps can unlock compatibility where direct integration isn’t supported.
- Locale extractors and regex-based extractor rules are powerful tools for hybrid environments.
Thanks for reading, and if you have any specific questions, feel free to reach out! Until next time, happy Sitecore(Search)-ing!