Significant text aggregation with custom analyzer in Elasticsearch 6

Should significant_text aggregations work on a text field that has a custom analysis chain (or the other mapping customisations outlined in the mappings segment below)?
I have a 'text' type field that returns no buckets even though the query finds hits.

Snippet:
{
    "from": 0,
    "size": 0,
    "query": {
        "query_string": {
            "default_field": "y.txt",
            "query": "the"
        }
    },
    "aggs": {
        "my_sample": {
            "sampler": {
                "shard_size": 1000
            },
            "aggs": {
                "keys": {
                    "significant_text": {
                        "field": "y.txt"
                    }
                }
            }
        }
    }
}

Results:

{
    "took": 178,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 42578,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "my_sample": {
            "doc_count": 3000,
            "keys": {
                "doc_count": 3000,
                "bg_count": 2107476,
                "buckets": []
            }
        }
    }
}

Mappings segment:

"y.txt": {
    "type": "text",
    "analyzer": "ds_typeextractor",
    "store": true,
    "position_increment_gap": 1024,
    "term_vector": "with_positions_offsets"
}

Thanks,
Phil

Yes.

"query" : "the"

Likely there's nothing remotely significant to be discovered about docs that use the word "the".
It's like asking what's special about the sort of people who like breathing. Try a more discriminating query.

Thanks Mark.
I'd tried quite a few queries where I'd expect to find niche associations, but I never get any buckets back.
If I load the same data under a different schema, it works even with silly queries like 'the' (albeit with low scores), so I think the problem is related to my underlying schema or something else. I need to investigate more. Thanks again. Significant Text is awesome.

Hmm. Would be useful to know why not.
If you can get away with a single shard, that helps spot low-frequency items.

Another thing to experiment with is upping shard_size and shard_min_doc_count. The default for shard_min_doc_count is 1, and I now worry this might be too low. It means that the shard_size terms brought back from each shard may be dominated by one-off terms that never become popular enough to meet the global min_doc_count threshold.
If you increase shard_min_doc_count to something like 3, each shard returns higher-confidence terms that stand more chance of amounting to something significant globally. The words may be more boring (choosing recall over precision here), but we don't blow our shard_size-limited budget of candidate terms from each shard by gambling on one-off local terms that could have been very precise choices but ultimately led to nothing globally.
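
A rough sketch of that tuning, reusing the field and aggregation names from the original request. The values 200 and 3 are illustrative assumptions rather than recommendations, and would need adjusting to the data:

```json
{
    "size": 0,
    "query": {
        "query_string": {
            "default_field": "y.txt",
            "query": "the"
        }
    },
    "aggs": {
        "my_sample": {
            "sampler": {
                "shard_size": 1000
            },
            "aggs": {
                "keys": {
                    "significant_text": {
                        "field": "y.txt",
                        "shard_size": 200,
                        "shard_min_doc_count": 3
                    }
                }
            }
        }
    }
}
```

Note that the shard_size inside significant_text limits the candidate terms returned per shard, whereas the sampler's shard_size limits the number of documents sampled per shard.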

Distributed data can be the enemy of this sort of analysis!

Hi Mark,

After some tinkering, I believe the issue is that the index I'm hitting doesn't have _source enabled.
It does have the relevant field stored, though, and I can retrieve and see the data. Does the Significant Text agg need _source enabled?

Thanks,
Phil

Ah. Yes, that would be an issue.
We nearly decided to remove the option to disable _source at one point, given that features like reindex rely on access to the stored _source, but we hung onto it for the rare few who feel they need it. I expect there will always be a subset of Elasticsearch features that don't work if you choose to disable _source.
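
For anyone hitting the same problem, a minimal sketch of an index-creation body that leaves _source enabled might look like the following. The `_doc` type name and the nesting of `y.txt` as an object path are assumptions (the thread shows neither), and the `ds_typeextractor` analyzer would still have to be defined under the index's analysis settings, which the thread does not show. Since _source is enabled by default, removing any `"_source": { "enabled": false }` entry from the mapping should have the same effect. Existing data would likely need to be re-indexed from the original documents.

```json
{
    "mappings": {
        "_doc": {
            "_source": {
                "enabled": true
            },
            "properties": {
                "y": {
                    "properties": {
                        "txt": {
                            "type": "text",
                            "analyzer": "ds_typeextractor",
                            "store": true,
                            "position_increment_gap": 1024,
                            "term_vector": "with_positions_offsets"
                        }
                    }
                }
            }
        }
    }
}
```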
