Using JavaScript in the Sitecore Search Document Extractor : Haramizu.com

Suppose you want to assign a content type to each piece of content on a website: for example, blog for blog posts and products for product information. This article walks through how to handle that assignment with JavaScript.

Adding a source

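This walkthrough uses the content of www.sitecore.com to verify the behavior. When you do this yourself, substitute your own domain and work through the steps with an understanding of your site's data structure.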

Start by creating a new source.

Next, open Web Crawler Settings and specify the domain.

Because the site's own search engine runs under /search on sitecore.com, exclude that path from the crawl. In Exclusion patterns, set the type to Glob Expression and the value to /search.

Also, as described in the previous article on retrieving sitemap.xml, configure a User Agent for the crawl.

For Available Locales, use only en-us for now.

This walkthrough assumes the site provides a sitemap.xml, so set Triggers to use sitemap.xml.

For the Document Extractor, choose JavaScript this time.

For the source code, run the first crawl with only the default code.

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    $ = response.body;

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}

That completes the initial setup. Publish the source once and confirm that the crawl runs correctly. After a while, 1000 content items had been indexed.

Changing the Document Extractor

As far as reading data out of the HTML structure and storing it goes, the JavaScript code above behaves much the same as an XPath extractor. So this time, I rewrote it as follows so that type is set based on the URL.

function extract(request, response) {
    $ = response.body;

    // Decide the content type from the URL of the page being crawled.
    let url = request.url;
    let subtype;

    if (url.includes('/products/')) {
        subtype = 'Products';
    } else if (url.includes('/solutions/')) {
        subtype = 'Solutions';
    } else if (url.includes('/knowledge-center/')) {
        subtype = 'Knowledge Center';
    } else if (url.includes('/partners/')) {
        subtype = 'Partners';
    } else if (url.includes('/company/')) {
        subtype = 'Company';
    } else {
        // No known path matched, so fall back to a generic type.
        subtype = 'website';
    }

    // The remaining attributes still come from meta tags, as in the default code.
    return [{
        'title': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'subtitle': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': subtype,
        'url': $('meta[property="og:url"]').attr('content')
    }];
}
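As the number of URL patterns grows, the if/else chain above can become hard to maintain. One possible alternative, shown here only as a sketch, is to keep the patterns in a table and look up the first match. The names TYPE_RULES and classifyByUrl are hypothetical and are not part of Sitecore Search; the URL fragments and labels are the ones used above.

```javascript
// Hypothetical rule table: each entry maps a URL fragment to a type label.
// The fragments and labels are the ones used in the extractor above.
const TYPE_RULES = [
    ['/products/', 'Products'],
    ['/solutions/', 'Solutions'],
    ['/knowledge-center/', 'Knowledge Center'],
    ['/partners/', 'Partners'],
    ['/company/', 'Company'],
];

// Return the label of the first rule whose fragment appears in the URL,
// or 'website' when no rule matches (same order and default as the if/else chain).
function classifyByUrl(url) {
    const rule = TYPE_RULES.find(([fragment]) => url.includes(fragment));
    return rule ? rule[1] : 'website';
}
```

Inside extract, the type attribute could then be written as 'type': classifyByUrl(request.url), with the table and helper pasted into the same source code field. Adding a new content type then only requires adding one row to the table.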

Update the setting and run the crawl again. After a while, the content items were populated with the new data.

Adding the content type as a filter now shows these values as candidates.

Summary

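This time, the content type was assigned based on the URL. Using the data in og tags is also effective, but older content may not have og tags at all. Where something like the URL makes the type identifiable, it is worth taking full advantage of it.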
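The two approaches can also be combined. The following is a minimal sketch, not something taken from Sitecore Search itself: it prefers a URL-based label, then falls back to og:type, and finally to a default. The helper resolveType and the stand-in fakeSelector are hypothetical names; fakeSelector only imitates the small part of the `$` interface that the helper needs, so the logic can be tried outside the crawler.

```javascript
// Hypothetical helper: prefer a URL-based label, then og:type, then a default.
function resolveType($, url, rules) {
    const rule = rules.find(([fragment]) => url.includes(fragment));
    if (rule) {
        return rule[1];
    }
    return $('meta[property="og:type"]').attr('content') || 'website';
}

// Minimal stand-in for the `$` selector function, only for trying the helper
// outside Sitecore Search. It returns the content registered for a selector,
// or undefined when the page has no such meta tag.
function fakeSelector(metaBySelector) {
    return (selector) => ({
        attr: () => metaBySelector[selector],
    });
}

const rules = [['/products/', 'Products']];
const withOgType = fakeSelector({ 'meta[property="og:type"]': 'article' });
const withoutOgType = fakeSelector({});

// URL rule wins even when og:type is present.
console.log(resolveType(withOgType, 'https://www.sitecore.com/products/x', rules)); // Products
// No URL rule matches, so og:type is used.
console.log(resolveType(withOgType, 'https://www.sitecore.com/blog/x', rules)); // article
// Neither source is available, so the default applies.
console.log(resolveType(withoutOgType, 'https://www.sitecore.com/blog/x', rules)); // website
```

Whether the URL or og:type should take priority depends on how reliable each one is on the site being crawled; swapping the two checks in resolveType reverses the order.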