Doc_values for GUID-esque, unique data

Hi,

We have a large amount of transactional data coming in. Each transaction has a transaction identifier (ti) that is unique to it. It looks like this:
ti : "144e647b-3b72-4206-9d72-3fa18ea4896d"
We had to map this as a not_analyzed string, because the analyzer splits the value on the "-" characters, so we added a multi-field, ti.raw. (We keep the original ti, just in case.)

We can now perform a "distinct count" based on ti.raw (this wasn't possible on the analyzed ti itself).
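
For reference, that distinct count maps to a cardinality aggregation, roughly like this (index and aggregation names are illustrative):

POST /transactions-*/_search
{
    "size": 0,
    "aggs": {
        "distinct_ti": {
            "cardinality": { "field": "ti.raw" }
        }
    }
}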

However, ES is now loading ti.raw into fielddata, using a ton of heap memory. We essentially sit at 75% heap usage all day long.
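
(For what it's worth, the per-node heap used by that field is visible with the cat API:)

GET /_cat/fielddata?v&fields=ti.raw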

If we were to enable doc_values on ti.raw in the mappings, I presume this would have a few effects:

  1. Disk usage would grow significantly (every single entry in our logs has a ti.raw)
  2. Memory usage would drop significantly
  3. Disk I/O would rise significantly (but we are using SSDs, so I don't think it will be a huge issue)
  4. Searches using ti.raw for a distinct count would take slightly longer

The current entry looks like this:
"dynamic_templates": [{
"message_field1": {
"match": "ti",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
}

and the proposed change would add "doc_values": true to the raw sub-field, so it becomes:

We utilize ti.raw significantly, so I am wondering:
a) whether this is the proper way to add it
b) how much storage this would use (we are logging around 60m transactions a day)
c) whether search speed would be significantly worse
d) whether this would actually reduce our memory usage

Some background:
-ES 1.5
-6x nodes (4x CPUs, 61 GB RAM, 512 GB SSD drives)
-3x master nodes
-running on AWS
-Current CPU utilization varies between 15% and 80% (depending on what searches are being run).
-Memory is always near 60-78%
-Read IOPS is between 0 and 10, spiking to 300 when a heavy search is run.
-Write IOPS is normally under 50, spiking to 125 when large batches of data are pushed in bulk to the cluster.
-Overall search performance is acceptable at this time, unless dashboards and scheduled searches run simultaneously, which causes too much CPU usage and too many queued search threads (something we are aware of, as the dashboards run many, many searches all at once).

Thanks in advance!!!

-Matt

Basically, the best way to check how this would work is to test it.

But, do you really need the analysed and not analysed fields?
You shouldn't see major performance changes; SSDs and off-heap filesystem caching will help.
And yes, moving to doc values would definitely reduce fielddata use.
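
For the disk-size side, you can compare index size on disk before and after with the cat API:

GET /_cat/indices?v&h=index,docs.count,store.size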

What version are you on?

Thanks, Mark!

We did, in fact, check this, and enabled doc_values on our ti.raw. The next step will be to disable the analyzed ti field.
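
If we go the single-field route Mark suggested, the template's mapping section would presumably shrink to something like this (a sketch only; queries would then hit ti instead of ti.raw, and mapping changes only apply to newly created indices):

"mapping": {
    "type": "string",
    "index": "not_analyzed",
    "ignore_above": 256,
    "doc_values": true
}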

We want to make sure that we are not losing data, as this is a production setup... :wink:

As stated, we are using AWS's managed ES offering, which is currently fixed at 1.5.

I am in the process of migrating the cluster right now to see if it does, in fact, reduce memory consumption. I think it will, but I would like a "clean slate" to verify. I will reply once it is done and has run for at least 24 hours ingesting data.

Thanks again!

That is a pity; there are so many good improvements between there and 2.3.4.

Mark,

I am well aware of the version lag.... We MIGHT in the future set up our own cluster on EC2 (versus their managed ES offering). However, in the meantime, this is what we have....

Thanks again!!!

Don't forget we also have ESaaS, which is always up to date - https://www.elastic.co/cloud
