Get distinct values from "text field" without remapping

I'm querying ~350 TB of documents.

Re-indexing is not an option.

Performance, within reason, is not a concern.

My documents have a field s3_filename mapped as {"type": "text"}. It doesn't have any subfields.

Setting fielddata: true for s3_filename gives the wrong results: it returns unique "words" from the value, not unique values.

E.g., s3_filename has values similar to ud20220711/long-file-name-20220711.json.gz

I want the aggregation to bucket on the whole value (ud20220711/long-file-name-20220711.json.gz). Instead, it returns buckets for parts of s3_filename, e.g. [long, file, name, 20220711, json, gz], and no bucket for the value ud20220711/long-file-name-20220711.json.gz itself.

I've tried simple aggs and composite aggs, but nothing works...

Which version of Elasticsearch are you using? Creating a runtime field with the keyword type would help you, but this is only available from 7.12+ if I'm not wrong.

Thank you.

I'm running 7.16.1 currently.

That looks promising, but it still complains about the text field :-(

$ cat s3_filenames.json
{
  "runtime_mappings": {
    "s3_fn": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['s3_filename'].value)"
      }
    }
  },
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_fn",
        "size": 10
      }
    }
  }
}

This throws:

    "failed_shards": [
      {
        "shard": 0,
        "index": "cloud-logs-003965",
        "node": "JZGAzVVoQEmmCqJJEu8nTg",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:875)",
            "org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:112)",
            "org.elasticsearch.index.query.SearchExecutionContext.lambda$lookup$2(SearchExecutionContext.java:512)",
            "org.elasticsearch.search.lookup.SearchLookup.getForField(SearchLookup.java:109)",
            "org.elasticsearch.search.lookup.LeafDocLookup$2.run(LeafDocLookup.java:107)",
            "org.elasticsearch.search.lookup.LeafDocLookup$2.run(LeafDocLookup.java:104)",
            "java.base/java.security.AccessController.doPrivileged(AccessController.java:318)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:104)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:28)",
            "emit(doc['s3_filename'].value)",
            "         ^---- HERE"
          ],
          "script": "emit(doc['s3_filename'].value)",
          "lang": "painless",
          "position": {
            "offset": 9,
            "start": 0,
            "end": 30
          },
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [s3_filename] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
          }
        }
      }
    ]

If I add fielddata: true to s3_filename, then emit(doc['s3_filename']) returns a list of words instead of the value.

Try to define the runtime field without a script.

You can use just:

PUT your-index/_mapping
{
  "runtime": {
    "s3_filename": {
      "type": "keyword"
    }
  }
}

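This defines a runtime field named s3_filename of type keyword that shadows the mapped text field at query time. With no script, its value is read from _source, so aggregations see the whole original string rather than the analyzed tokens (this can have some impact on performance).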

If for some reason you need to remove this runtime mapping, just use:

PUT your-index/_mapping
{
  "runtime": {
    "s3_filename": null
  }
}
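
If you would rather not touch the index mapping at all, the same shadowing can probably be done per request. The body below is an untested sketch that assumes a script-less runtime field is accepted under runtime_mappings in a _search request the same way it is in the mapping; the aggregation name "filenames" and the size of 10 are just carried over from the earlier attempt.

```json
{
  "size": 0,
  "runtime_mappings": {
    "s3_filename": {
      "type": "keyword"
    }
  },
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_filename",
        "size": 10
      }
    }
  }
}
```

Here "size": 0 only suppresses the hits so that the response contains just the buckets. Because the runtime field exists only for this request, there is nothing to remove afterwards.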

Thanks! That did it. It takes 18 times as long to run; BUT, it has valid output!!!
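
For anyone who needs every distinct value rather than the top 10, a composite aggregation over the same keyword runtime field is one way to page through them. This is a minimal sketch, not something verified against this cluster: the page size of 1000 and the aggregation name "filenames" are arbitrary choices, and the runtime_mappings section can be dropped if the runtime field is already defined in the index mapping.

```json
{
  "size": 0,
  "runtime_mappings": {
    "s3_filename": {
      "type": "keyword"
    }
  },
  "aggs": {
    "filenames": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "s3_filename": {
              "terms": {
                "field": "s3_filename"
              }
            }
          }
        ]
      }
    }
  }
}
```

Each response includes an after_key object; sending the same body again with that object as "after" inside "composite" returns the next page, and the walk is finished when a response comes back with no buckets. Since the values are read from _source, expect each page to be slow on an index of this size.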