Doing a significant text aggregation with a custom analyzer

I have an ES index with a mapping like this:

    {'texts-temp': {'aliases': {},
      'mappings': {'properties': {'data': {'properties': {
            (...)
            'user_text': {'type': 'text',
                  'fields': {'word_tokenized': {'type': 'text',
                              'analyzer': 'text_analyzer'}}},
            (...)
          }}}},
      'settings': {'index': {'number_of_shards': '1',
            'provided_name': 'texts-temp',
            'creation_date': (...),
            'analysis': {'filter': {'shingen_filter': {'max_shingle_size': '3',
                  'min_shingle_size': '2',
                  'type': 'shingle'}},
              'analyzer': {'text_analyzer': {'filter': ['lowercase',
                    'stop',
                    'asciifolding',
                    'apostrophe',
                    'stemmer',
                    'shingen_filter'],
                  'type': 'custom',
                  'tokenizer': 'standard'}}}}}}}
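
For reference, an index with this analyzer chain could be created roughly like so; a minimal sketch using the low-level elasticsearch Python client (the client connection is a placeholder):

    from elasticsearch import Elasticsearch

    client = Elasticsearch()  # placeholder connection

    client.indices.create(index='texts-temp', body={
        'settings': {'analysis': {
            'filter': {'shingen_filter': {'type': 'shingle',
                                          'min_shingle_size': 2,
                                          'max_shingle_size': 3}},
            'analyzer': {'text_analyzer': {'type': 'custom',
                                           'tokenizer': 'standard',
                                           'filter': ['lowercase', 'stop',
                                                      'asciifolding', 'apostrophe',
                                                      'stemmer', 'shingen_filter']}}}},
        'mappings': {'properties': {'data': {'properties': {
            'user_text': {'type': 'text',
                          'fields': {'word_tokenized': {
                              'type': 'text',
                              'analyzer': 'text_analyzer'}}}}}}}})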

When I do the following search:

    from elasticsearch_dsl import Search, Q

    search = Search(index='texts-temp')
    q = Q("terms", data=list_urls_text)
    search = search.query(q)
    search = search.extra(track_total_hits=True)
    reply = search.execute()
    [x.data.user_text.word_tokenized for x in reply]

I get an empty list... However, if I look at the elements x.data.user_text, I do get the texts, just not tokenized.

What am I doing wrong, such that the index doesn't have the field data.user_text.word_tokenized?
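
A minimal sketch of checking the sub-field's analyzer output via the _analyze API, assuming the low-level elasticsearch Python client (the connection is a placeholder):

    from elasticsearch import Elasticsearch

    client = Elasticsearch()  # placeholder connection

    # Multi-fields like 'word_tokenized' are indexed but never stored in
    # _source, so they don't appear on returned hits; the analyzer output
    # can still be inspected directly:
    resp = client.indices.analyze(
        index='texts-temp',
        body={'field': 'data.user_text.word_tokenized',
              'text': 'An example sentence to tokenize'})
    print([t['token'] for t in resp['tokens']])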

Is this why my significant text aggregation returns empty?

    {'query': {'terms': {'data': list_urls_text}},
     'aggs': {'sample': {'sampler': {'shard_size': 200},
       'aggs': {'keywords': {'significant_text': {'field': 'data.user_text.word_tokenized',
           'size': 10,
           'filter_duplicate_text': True}}}}}}
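
For reference, the same request built with elasticsearch-dsl would look roughly like this (a sketch, reusing list_urls_text from above):

    from elasticsearch_dsl import Search, Q

    search = Search(index='texts-temp')
    search = search.query(Q('terms', data=list_urls_text))
    # a sampler bucket with a nested significant_text sub-aggregation
    search.aggs.bucket('sample', 'sampler', shard_size=200) \
          .bucket('keywords', 'significant_text',
                  field='data.user_text.word_tokenized',
                  size=10, filter_duplicate_text=True)
    reply = search.execute()
    print(reply.aggregations.sample.keywords.buckets)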

It looks right. I suspect the field setting helps determine the correct analyzer, but the aggregation doesn't know that it should then try to use that analyzer on the contents of the 'data.user_text' field in the JSON. To give it a helping hand, try setting the source_fields parameter to 'data.user_text'.
I'm away from my computer so can't confirm, but I suspect this will work.
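Something like this (untested):

    aggs = {'sample': {'sampler': {'shard_size': 200},
        'aggs': {'keywords': {'significant_text': {
            'field': 'data.user_text.word_tokenized',
            # significant_text re-analyzes text from _source; source_fields
            # (a list) points it at the raw JSON field to read
            'source_fields': ['data.user_text'],
            'size': 10,
            'filter_duplicate_text': True}}}}}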

Hi Mark,
Thanks for the reply. Unfortunately, your proposed solution didn't work. I still get the same reply:

{'sample': {'doc_count': 200,
  'wordcloud': {'doc_count': 200, 'bg_count': 8357411, 'buckets': []}}}

Hmm. Can you share the JSON for the search request?

Hi Mark, sorry for the late reply. I was under time constraints, and since I couldn't get this to work, I decided to do it locally using a TF-IDF scoring system. That solved my problem... Maybe I'll return to this in the future. Thanks for the help either way. :wink:
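
For anyone curious, a minimal sketch of that kind of local TF-IDF keyword scoring, assuming scikit-learn (illustrative only, not the original code):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [...]  # placeholder: the user_text values pulled from the index

    # unigrams through trigrams, mirroring the 2-3 shingle sizes above
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
    tfidf = vectorizer.fit_transform(texts)

    # rank terms by mean TF-IDF weight across the sample, highest first
    scores = np.asarray(tfidf.mean(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())
    print(terms[np.argsort(scores)[::-1][:10]])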

Thanks for the update. Would be good to find out what went wrong if you return to this.
