Average text field length

Hi, is there a way to get the average length of a text field? I got a "Fielddata is disabled on text fields by default" exception when I tried to use an avg aggregation. In the following query, "reason" is a text field.

{
  "size": 0,
  "aggs": {
    "avg_size": {
      "avg": {
        "script": {
          "lang": "painless",
          "source": "doc['reason'].value.length()"
        }
      }
    }
  }
}

I'd compute that at index time with a similar painless script that I'd put in a Script Processor.
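A minimal sketch of that idea, written as Python that only builds the request bodies (it does not talk to a cluster). The pipeline id "reason-length" and the target field "reason_length" are placeholder names of mine, not something from this thread, and the script assumes "reason" holds a single string rather than an array. Once the length is stored as a number, a plain avg aggregation on that field no longer needs fielddata.

```python
import json

# Hypothetical names, not from the thread: the pipeline id and the
# "reason_length" field are placeholders chosen for illustration.
PIPELINE_ID = "reason-length"
LENGTH_FIELD = "reason_length"


def build_pipeline(source_field: str = "reason", target_field: str = LENGTH_FIELD) -> dict:
    """Body for PUT _ingest/pipeline/<PIPELINE_ID>.

    The Painless script stores the string length of source_field in
    target_field, or 0 when the source field is missing.
    """
    script = (
        f"ctx['{target_field}'] = ctx['{source_field}'] == null "
        f"? 0 : ctx['{source_field}'].length();"
    )
    return {
        "description": f"Store the length of '{source_field}' in '{target_field}'",
        "processors": [{"script": {"lang": "painless", "source": script}}],
    }


def build_avg_query(target_field: str = LENGTH_FIELD) -> dict:
    """Body for GET <index>/_search: average the precomputed numeric field."""
    return {"size": 0, "aggs": {"avg_size": {"avg": {"field": target_field}}}}


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
    print(json.dumps(build_avg_query(), indent=2))
```

Documents would then be indexed with ?pipeline=reason-length (or the index's default_pipeline setting) so the script runs on each of them.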

I don't need the length stored in the index. I mainly want it for testing, to get an idea of how long the field is. Is there a way to do it using the Search API?

I mainly want to figure out how and why Elasticsearch stores documents (the original sources) so efficiently in terms of space. I understand that inverted indices are stored efficiently. But how are the original documents compressed so that ES shows much smaller space usage than a MySQL DB? Do duplicate field values in different documents share a single copy of the data?

Thanks!

That's a totally different story. There are many data structures involved depending on what you are doing, e.g. text vs. keyword, text analyzers, stored fields, compression algorithm, number of segments, number of shards...

Comparing MySQL storage (not speaking about MySQL indices) with Elasticsearch storage is, to me, like comparing apples and oranges.

If we just speak of storing _source, which basically means disabling indexing on all fields, then you might be able to compare...
Compression on stored fields is done per Lucene segment. A shard may have multiple segments unless you call the _forcemerge API. An index may have multiple shards.
So if you want to "compare", you may want to create one index with a single shard, disable indexing for all fields in the mapping, and once indexing is done, call the _forcemerge API so that only one segment remains.
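As a rough sketch, the steps above could look like the following sequence of requests, expressed here as Python data so nothing is sent to a cluster. The index name "source-only-test" is a placeholder, "number_of_replicas": 0 is my own addition to keep a single copy of the data, and "dynamic": false is one way (an assumption on my part, using typeless mappings as in Elasticsearch 7+) to keep fields in _source without indexing them.

```python
def comparison_requests(index: str = "source-only-test") -> list:
    """Return (method, path, body) tuples for a _source-only size test.

    The index name is a hypothetical placeholder. Documents are meant to
    be indexed between the two requests.
    """
    create_body = {
        "settings": {
            "number_of_shards": 1,    # one shard, as suggested above
            "number_of_replicas": 0,  # assumption: no replica copies
        },
        # With no properties and dynamic=false, unmapped fields stay in
        # _source but are not indexed.
        "mappings": {"dynamic": False},
    }
    return [
        ("PUT", f"/{index}", create_body),
        # ... index the documents here ...
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
    ]


if __name__ == "__main__":
    for method, path, body in comparison_requests():
        print(method, path, body if body is not None else "")
```

The max_num_segments=1 parameter asks the force merge to reduce the shard to a single segment, so stored-field compression is measured over the whole data set at once.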

Then you can more or less check the size on disk by doing:

du -s data/nodes/0/indices/vj8heygETPqlLrBQUZsR1w/0/index

Note that vj8heygETPqlLrBQUZsR1w is the index UUID, which is different for every index. You can find yours with the _cat/indices API:

GET /_cat/indices/yourindexname?v
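If you prefer to script the size check, here is a small sketch that rebuilds the same directory layout as the du command above and sums the file sizes. The function names are mine, the layout (nodes/0, shard 0) is simply copied from the path shown above, and note that this adds up apparent file sizes in bytes, whereas du -s reports allocated disk blocks, so the two numbers will not match exactly.

```python
import os


def shard_index_dir(data_path: str, index_uuid: str, shard: int = 0, node: int = 0) -> str:
    """Build <data_path>/nodes/<node>/indices/<uuid>/<shard>/index."""
    return os.path.join(
        data_path, "nodes", str(node), "indices", index_uuid, str(shard), "index"
    )


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


if __name__ == "__main__":
    # The UUID is the example value from this thread; replace it with the
    # uuid column reported by _cat/indices for your own index.
    target = shard_index_dir("data", "vj8heygETPqlLrBQUZsR1w")
    print(target, dir_size_bytes(target))
```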

HTH
