Doing a significant text aggregation with a custom analyzer

I have an ES index with a mapping like this:

    {'texts-temp': {'aliases': {},
      'mappings': {'properties': {'data': {'properties':
           (...)
           'user_text': {'type': 'text',
                 'fields': {'word_tokenized': {'type': 'text',
                             'analyzer': 'text_analyzer'}}},
           (...)
         },
        'settings': {'index': {'number_of_shards': '1',
              'provided_name': 'texts-temp',
              'creation_date': (...),
        'analysis': {'filter': {'shingen_filter': {'max_shingle_size': '3',
           'min_shingle_size': '2',
           'type': 'shingle'}},
         'analyzer': {'text_analyzer': {'filter': ['lowercase',
            'stop',
            'asciifolding',
            'apostrophe',
            'stemmer',
            'shingen_filter'],
           'type': 'custom',
           'tokenizer': 'standard'}}},

When I do the following search:

    search = Search(index = 'texts-temp')
    q = Q("terms", data=list_urls_text)
    search = search.query(q)
    search = search.extra(track_total_hits=True)
    reply = search.execute()
    [x.data.user_text.word_tokenized for x in reply]

I get an empty list... However, if I look at the elements x.data.user_text, then I get the texts, but not tokenized.

What am I doing wrong that the index doesn't have the field data.user_text.word_tokenized?

Is this why my significant text aggregation returns empty?

{'query':{terms:{'data': list_urls_text}
'aggs':{'sample':{'sampler':{'shard_size':200},
'aggs':{'keywords':{'significant_tex':{'field': 'data.user_text.word_tokenized', 'size'=10, 'filter_duplicate_text':True}}...}

It looks right. I suspect that field setting will help determine the correct analyzer but the aggregation doesn’t know that it should then try use that analyzer on the contents of the ‘data.user_text’ field in the json. To give a helping hand try set the source_fields parameter to ‘data.user_text’.
I’m away from computer so can’t confirm but I suspect this will work.

Hi Mark,
Thanks for the reply. Unfortunatly, your proposed solution didn't work. I still get the same reply:

{'sample': {'doc_count': 200,
  'wordcloud': {'doc_count': 200, 'bg_count': 8357411, 'buckets': []}}}

Hmm. Can you share the JSON for the search request?

Hi Mark, sorry for the late reply. I was under some time constraint, and since I couldn't use it, I decided to do it locally, using a tfidf scoring system. It solved my problem... Maybe in the future I'll return to this. Thanks for the help, either way. :wink:

Thanks for the update. Would be good to find out what went wrong if you return to this.