Significant text aggregation with custom analyzer in Elasticsearch 6

philv · November 18, 2017, 10:43pm

Should significant_text aggs work with a text field that has a custom analysis chain (or the other mapping customisations I outline in the mappings segment below)?
I have a 'text' type field that I can't get buckets for, despite the query finding hits.

Snippet:
{
  "from": 0,
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "y.txt",
      "query": "the"
    }
  },
  "aggs": {
    "my_sample": {
      "sampler": {
        "shard_size": 1000
      },
      "aggs": {
        "keys": {
          "significant_text": {
            "field": "y.txt"
          }
        }
      }
    }
  }
}

Results

{
    "took": 178,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 42578,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "my_sample": {
            "doc_count": 3000,
            "keys": {
                "doc_count": 3000,
                "bg_count": 2107476,
                "buckets": []
            }
        }
    }
}

Mappings segment:

"y.txt": {
  "type": "text",
  "analyzer": "ds_typeextractor",
  "store": true,
  "position_increment_gap": 1024,
  "term_vector": "with_positions_offsets"
}

Thanks,
Phil

Mark_Harwood · November 20, 2017, 12:20pm

Yes.

"query" : "the"

Likely there's nothing remotely significant to be discovered about docs that use the word "the".
It's like asking what's special about the sort of people who like breathing. Try a more discriminating query.

philv · November 21, 2017, 10:48am

Thanks Mark.
I've tried quite a few queries that I'd expect to find niche associations, but I never get any buckets back.
If I load the same data under a different schema, it works even with silly queries like 'the' (albeit with low scores), so I think the problem is related to my underlying schema or something else. I need to investigate more. Thanks again. Significant Text is awesome.

Mark_Harwood · November 21, 2017, 11:03am

Hmm. Would be useful to know why not.
If you can get away with one shard that helps spot low-frequency items.

Another thing to experiment with is upping shard_size and shard_min_doc_count. The default for shard_min_doc_count is "1" and I now worry this might be too low. It means that the shard_size number of terms selected to bring back from each shard may be dominated by one-off terms which never amount to anything sufficiently popular to meet the global min_doc_count threshold.
If you increase shard_min_doc_count to something like 3, then each shard returns higher-confidence terms that stand more chance of amounting to something significant globally. The words may be more boring (choosing recall over precision here), but we don't blow our shard_size-limited budget of candidate terms from each shard by gambling on one-off local terms that could have been very precise choices but ultimately lead to nothing globally.

Distributed data can be the enemy of this sort of analysis!
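
As a rough sketch of that suggestion, the request from the first post could set both thresholds on the significant_text aggregation itself. The values 500 and 3 are illustrative assumptions, not tuned recommendations. Note that the shard_size inside significant_text (candidate terms returned per shard) is separate from the sampler's shard_size (documents sampled per shard).

```json
{
  "size": 0,
  "query": {
    "query_string": { "default_field": "y.txt", "query": "the" }
  },
  "aggs": {
    "my_sample": {
      "sampler": { "shard_size": 1000 },
      "aggs": {
        "keys": {
          "significant_text": {
            "field": "y.txt",
            "shard_size": 500,
            "shard_min_doc_count": 3
          }
        }
      }
    }
  }
}
```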

philv · November 23, 2017, 12:13pm

Hi Mark,

After some tinkering I believe the issue is that the index I'm hitting doesn't have _source enabled.
It does have the relevant field stored, though, and I can retrieve and see the data. Does the Significant Text agg need _source enabled?

Thanks,
Phil

Mark_Harwood · November 23, 2017, 12:44pm

Ah. Yes, that would be an issue.
We nearly decided to disable disabling _source at one point, given that features like reindex rely on access to the stored _source, but we hung onto it only for the rare few who feel there's a need to do so. I expect there will always be a subset of Elasticsearch features that don't work if you choose to disable _source.
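
For anyone checking their own index, here is a minimal sketch of the mapping setting in question, assuming a 6.x index with a single mapping type (the type name `doc` is hypothetical). significant_text re-analyzes the original text on the fly rather than reading terms from the index, so with this setting there is no source text for it to work from, which would explain empty buckets even though the field is stored.

```json
{
  "mappings": {
    "doc": {
      "_source": { "enabled": false }
    }
  }
}
```

If the index mappings contain a block like this, the aggregation in the first post can return hits but no buckets.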

