Doing a significant text aggregation with a custom analyzer

I have an ES index with a mapping like this:

    {'texts-temp': {'aliases': {},
      'mappings': {'properties': {'data': {'properties': {
            (...)
            'user_text': {'type': 'text',
                  'fields': {'word_tokenized': {'type': 'text',
                              'analyzer': 'text_analyzer'}}},
            (...)
          }}}},
      'settings': {'index': {'number_of_shards': '1',
            'provided_name': 'texts-temp',
            'creation_date': (...),
            'analysis': {'filter': {'shingen_filter': {'max_shingle_size': '3',
                  'min_shingle_size': '2',
                  'type': 'shingle'}},
              'analyzer': {'text_analyzer': {'filter': ['lowercase',
                    'stop',
                    'asciifolding',
                    'apostrophe',
                    'stemmer',
                    'shingen_filter'],
                  'type': 'custom',
                  'tokenizer': 'standard'}}}}}}}
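
For reference, an index with this analyzer chain could be created roughly like so; a minimal sketch using the low-level elasticsearch Python client (the client connection is a placeholder):

    from elasticsearch import Elasticsearch

    client = Elasticsearch()  # placeholder connection

    client.indices.create(index='texts-temp', body={
        'settings': {'analysis': {
            'filter': {'shingen_filter': {'type': 'shingle',
                                          'min_shingle_size': 2,
                                          'max_shingle_size': 3}},
            'analyzer': {'text_analyzer': {'type': 'custom',
                                           'tokenizer': 'standard',
                                           'filter': ['lowercase', 'stop',
                                                      'asciifolding', 'apostrophe',
                                                      'stemmer', 'shingen_filter']}}}},
        'mappings': {'properties': {'data': {'properties': {
            'user_text': {'type': 'text',
                          'fields': {'word_tokenized': {
                              'type': 'text',
                              'analyzer': 'text_analyzer'}}}}}}}})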

When I do the following search:

    from elasticsearch_dsl import Search, Q

    search = Search(index='texts-temp')
    q = Q("terms", data=list_urls_text)
    search = search.query(q)
    search = search.extra(track_total_hits=True)
    reply = search.execute()
    [x.data.user_text.word_tokenized for x in reply]

I get an empty list... However, if I look at the elements x.data.user_text, I do get the texts, just not tokenized.

What am I doing wrong, such that the index doesn't have the field data.user_text.word_tokenized?
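
A minimal sketch of checking the sub-field's analyzer output via the _analyze API, assuming the low-level elasticsearch Python client (the connection is a placeholder):

    from elasticsearch import Elasticsearch

    client = Elasticsearch()  # placeholder connection

    # Multi-fields like 'word_tokenized' are indexed but never stored in
    # _source, so they don't appear on returned hits; the analyzer output
    # can still be inspected directly:
    resp = client.indices.analyze(
        index='texts-temp',
        body={'field': 'data.user_text.word_tokenized',
              'text': 'An example sentence to tokenize'})
    print([t['token'] for t in resp['tokens']])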

Is this why my significant text aggregation returns empty?

    {'query': {'terms': {'data': list_urls_text}},
     'aggs': {'sample': {'sampler': {'shard_size': 200},
       'aggs': {'keywords': {'significant_text': {'field': 'data.user_text.word_tokenized',
           'size': 10,
           'filter_duplicate_text': True}}}}}}
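
For reference, the same request built with elasticsearch-dsl would look roughly like this (a sketch, reusing list_urls_text from above):

    from elasticsearch_dsl import Search, Q

    search = Search(index='texts-temp')
    search = search.query(Q('terms', data=list_urls_text))
    # a sampler bucket with a nested significant_text sub-aggregation
    search.aggs.bucket('sample', 'sampler', shard_size=200) \
          .bucket('keywords', 'significant_text',
                  field='data.user_text.word_tokenized',
                  size=10, filter_duplicate_text=True)
    reply = search.execute()
    print(reply.aggregations.sample.keywords.buckets)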

It looks right. I suspect the field setting helps determine the correct analyzer, but the aggregation doesn't know that it should then try to use that analyzer on the contents of the 'data.user_text' field in the JSON. To give it a helping hand, try setting the source_fields parameter to 'data.user_text'.
I'm away from my computer so can't confirm, but I suspect this will work.
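Something like this (untested):

    aggs = {'sample': {'sampler': {'shard_size': 200},
        'aggs': {'keywords': {'significant_text': {
            'field': 'data.user_text.word_tokenized',
            # significant_text re-analyzes text from _source; source_fields
            # (a list) points it at the raw JSON field to read
            'source_fields': ['data.user_text'],
            'size': 10,
            'filter_duplicate_text': True}}}}}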

Hi Mark,
Thanks for the reply. Unfortunately, your proposed solution didn't work. I still get the same reply:

{'sample': {'doc_count': 200,
  'wordcloud': {'doc_count': 200, 'bg_count': 8357411, 'buckets': []}}}

Hmm. Can you share the JSON for the search request?

Hi Mark, sorry for the late reply. I was under time constraints, and since I couldn't get this to work, I decided to do it locally using a TF-IDF scoring system. That solved my problem... Maybe I'll return to this in the future. Thanks for the help either way. :wink:
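
For anyone curious, a minimal sketch of that kind of local TF-IDF keyword scoring, assuming scikit-learn (illustrative only, not the original code):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [...]  # placeholder: the user_text values pulled from the index

    # unigrams through trigrams, mirroring the 2-3 shingle sizes above
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
    tfidf = vectorizer.fit_transform(texts)

    # rank terms by mean TF-IDF weight across the sample, highest first
    scores = np.asarray(tfidf.mean(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())
    print(terms[np.argsort(scores)[::-1][:10]])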

Thanks for the update. Would be good to find out what went wrong if you return to this.
