Doc_values for GUID-esque, unique data

Hi,

We have a large amount of transactional data coming in. Each transaction has a transaction identifier (ti) that is unique to it. It looks like this:
ti : "144e647b-3b72-4206-9d72-3fa18ea4896d"
We had to map this as a not_analyzed string, because the analyzer splits the value on the "-" characters, so we added a multi-field, ti.raw. (We keep the original ti, just in case.)

We can now perform a "distinct count" based on ti.raw (this wasn't possible on the analyzed ti itself).
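
For reference, that distinct count maps to a cardinality aggregation, roughly like this (index and aggregation names are illustrative):

POST /transactions-*/_search
{
    "size": 0,
    "aggs": {
        "distinct_ti": {
            "cardinality": { "field": "ti.raw" }
        }
    }
}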

However, ES is now loading ti.raw into fielddata, using a ton of heap memory. We essentially sit at 75% heap usage all day long.
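
(For what it's worth, the per-node heap used by that field is visible with the cat API:)

GET /_cat/fielddata?v&fields=ti.raw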

If we were to enable doc_values on ti.raw in the mappings, I presume this would have a few effects:

  1. Disk usage would grow significantly (every single entry in our logs has a ti.raw)
  2. Memory usage would drop significantly
  3. Disk I/O would rise significantly (but we are using SSDs, so I don't think it will be a huge issue)
  4. Searches using ti.raw for a distinct count would take slightly longer

The current entry looks like this:
"dynamic_templates": [{
"message_field1": {
"match": "ti",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
}

and the proposed change would add "doc_values": true to the raw sub-field, so it becomes:

We utilize ti.raw significantly, so I am wondering:
a) whether this is the proper way to add it
b) how much storage this would use (we are logging around 60m transactions a day)
c) whether search speed would be significantly worse
d) whether this would actually reduce our memory usage

Some background:
-ES 1.5
-6x nodes (4x CPUs, 61 GB RAM, 512 GB SSD drives)
-3x master nodes
-running on AWS
-Current CPU utilization varies between 15% and 80% (depending on what searches are being run).
-Memory is always near 60-78%
-Read IOPS is between 0 and 10, spiking to 300 when a heavy search is run.
-Write IOPS is normally under 50, spiking to 125 when large batches of data are pushed in bulk to the cluster.
-Overall search performance is acceptable at this time, unless dashboards and scheduled searches run simultaneously, which causes too much CPU usage and too many queued search threads (something we are aware of, as the dashboards run many, many searches all at once).

Thanks in advance!!!

-Matt

Basically, the best way to check how this would work is to test it.

But, do you really need the analysed and not analysed fields?
You shouldn't see major performance changes; SSDs and off-heap filesystem caching will help.
And yes, moving to doc values would definitely reduce fielddata use.
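
For the disk-size side, you can compare index size on disk before and after with the cat API:

GET /_cat/indices?v&h=index,docs.count,store.size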

What version are you on?

Thanks, Mark!

We did, in fact, check this, and enabled doc_values on our ti.raw. The next step will be to disable the analyzed ti field.
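
If we go the single-field route Mark suggested, the template's mapping section would presumably shrink to something like this (a sketch only; queries would then hit ti instead of ti.raw, and mapping changes only apply to newly created indices):

"mapping": {
    "type": "string",
    "index": "not_analyzed",
    "ignore_above": 256,
    "doc_values": true
}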

We want to make sure that we are not losing data, as this is a production setup... :wink:

As stated, we are using AWS's managed ES offering, which is currently fixed at 1.5.

I am in the process of migrating the cluster right now to see if it does, in fact, reduce memory consumption. I think it will, but I would like a "clean slate" to verify. I will reply once it is done and has run for at least 24 hours ingesting data.

Thanks again!

That is a pity; there are so many good improvements between there and 2.3.4.

Mark,

I am well aware of the version lag.... We MIGHT in the future set up our own cluster on EC2 (versus their managed ES offering). However, in the meantime, this is what we have....

Thanks again!!!

Don't forget we also have ESaaS, which is always up to date - https://www.elastic.co/cloud
