Suppose I have a text field called "some_label". I want to map it as keyword as opposed to the default text.
But the readers won't know whether a field is text or keyword. To standardize this, I let the reader search via "some_label.keyword" whenever keyword is required and via "some_label" when keyword is not required (meaning it could be used as a text field).
The key point here is that the reader won't know if the field is restricted to "keyword".
My question is: will this incur any penalty in terms of storage and performance?
Should the first "type" : "keyword", be a text field?
Any additional field adds overhead in ingestion and storage.
That said, if storage is a major pain point for you, synthetic _source might make it a lot easier. IMO this is mostly helpful for large, machine-generated datasets (like logs or security events). For user-generated content, the storage aspect is usually very manageable and the workarounds are often not the best tradeoff.
Should the first "type" : "keyword", be a text field?
No. I want to use this field as "keyword" only. The reason for this redundant dot keyword field is to standardize the API calls.
I don't want the caller to have to remember a list of fields that are keyword. It creates a synchronization issue when new fields are added, etc.
If I don't add this redundant "keyword" field, the call will fail when searching for "label.keyword".
Requiring the caller to know when to use ".keyword" and when not to is just too confusing.
Hence my question regarding potential penalties for this convenience for the caller.
Imagine I have 2 text fields.
"ticket_topic" & "ticket_uuid". Both are text.
"ticket_topic" is freely searchable, so it would simply be "text". I will typically leave the dot keyword sub-field there just in case some people like to copy & paste when searching.
Obviously ticket_uuid shouldn't need to index substrings; therefore, it makes more sense to make it a "keyword".
But when composing the query, the caller then needs to know to use "ticket_uuid", because "ticket_uuid.keyword" won't exist, even though the field is already a "keyword".
If I add this redundant dot "keyword", then "ticket_uuid.keyword" will work.
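For illustration, the mapping being described could look roughly like the sketch below. The index body is hypothetical: the field names come from this thread, while the "ignore_above" value of 256 is only an assumption (it matches what dynamic mapping typically generates). The helper queryable_paths is made up for this example and simply lists every name a caller could put in a query.

```python
import json

# Hypothetical mapping for the two example fields discussed above.
# "ticket_uuid" is a keyword field that also carries a redundant
# "keyword" multi-field, so "ticket_uuid.keyword" resolves as well.
mapping = {
    "mappings": {
        "properties": {
            "ticket_topic": {
                "type": "text",
                # ignore_above=256 is an assumption, not a requirement.
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "ticket_uuid": {
                "type": "keyword",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    }
}


def queryable_paths(mapping):
    """List every field path a caller could reference, including multi-fields."""
    paths = []
    for name, spec in mapping["mappings"]["properties"].items():
        paths.append(name)
        for sub_name in spec.get("fields", {}):
            paths.append(f"{name}.{sub_name}")
    return sorted(paths)


if __name__ == "__main__":
    print(json.dumps(mapping, indent=2))
    print(queryable_paths(mapping))
```

With this layout both "ticket_uuid" and "ticket_uuid.keyword" exist, at the cost of the value being indexed under two field names.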
Hope this makes sense.
Synthetic _source is not what I'm looking for.
Is there any documentation that would help me gauge the overhead for adding additional fields?
Thanks.
Using an alias as the multi-field isn't possible; you'll run into the error "Failed to parse mapping [_doc]: Type [alias] cannot be used in multi field", which didn't completely surprise me.
So I don't think there's any way around indexing and storing it twice. That is definitely the case if the field is defined differently (with and without ignore_above), but even for the same definition I'm not aware of any shortcuts we'd take to avoid the duplication.
The required disk space will depend on a couple of factors (what compression level you are using, how repetitive the data is and thus how well it compresses, how much data you have in the field, ...). I would store a representative data set for your scenario and then check it with the _disk_usage API.
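As a rough sketch of that measurement, the snippet below builds the request path for the _disk_usage API and compares the reported size of a field with its redundant sub-field. The response shape (index name, then "fields", then "total_in_bytes") is assumed from the API's documented output, the sample numbers are invented purely to exercise the code, and the helper names are hypothetical.

```python
def disk_usage_path(index):
    """Request path for the analyze-index-disk-usage API (sent as a POST)."""
    return f"/{index}/_disk_usage?run_expensive_tasks=true"


def field_bytes(response, index, field):
    """Total bytes reported for one field; assumes the documented response shape."""
    return response[index]["fields"][field]["total_in_bytes"]


def overhead_ratio(response, index, base_field, extra_field):
    """Size of the extra field relative to the base field."""
    base = field_bytes(response, index, base_field)
    extra = field_bytes(response, index, extra_field)
    if base == 0:
        raise ValueError(f"no data reported for {base_field}")
    return extra / base


# Invented sample response, trimmed to the keys used above.
sample = {
    "tickets": {
        "fields": {
            "ticket_uuid": {"total_in_bytes": 2000},
            "ticket_uuid.keyword": {"total_in_bytes": 1900},
        }
    }
}

if __name__ == "__main__":
    print(disk_usage_path("tickets"))
    print(overhead_ratio(sample, "tickets", "ticket_uuid", "ticket_uuid.keyword"))
```

Running the real request against an index loaded with representative data, and feeding the JSON response into overhead_ratio, would give a concrete figure for the duplication instead of an estimate.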
The only other idea would be: How are your users consuming this field and could you solve this in the application?
So each extra entry under "fields" causes the value to be indexed again.
Thanks much for the info. That's exactly what I need to make implementation decisions moving forward.
Maybe keeping a list of "keyword" fields is needed after all...
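One way to avoid maintaining such a list by hand would be to derive it from the index mapping inside the application and rewrite field names before the query is sent. The sketch below assumes the "properties" section of a mapping is already available as a Python dict; keyword_fields and resolve_exact_field are hypothetical helper names, not part of any Elasticsearch client.

```python
def keyword_fields(properties, prefix=""):
    """Collect the paths of all fields whose own type is keyword."""
    found = set()
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "keyword":
            found.add(path)
        # Object fields nest their children under "properties".
        if "properties" in spec:
            found |= keyword_fields(spec["properties"], prefix=f"{path}.")
    return found


def resolve_exact_field(requested, keyword_only):
    """Map "x.keyword" to "x" when "x" is itself already a keyword field."""
    suffix = ".keyword"
    if requested.endswith(suffix):
        base = requested[: -len(suffix)]
        if base in keyword_only:
            return base
    return requested


# Hypothetical mapping without the redundant sub-field on ticket_uuid.
properties = {
    "ticket_topic": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}},
    },
    "ticket_uuid": {"type": "keyword"},
}

if __name__ == "__main__":
    known = keyword_fields(properties)
    print(resolve_exact_field("ticket_uuid.keyword", known))
    print(resolve_exact_field("ticket_topic.keyword", known))
```

Callers could then always write "some_label.keyword" for exact matches, and the application would translate the name where the base field is already a keyword, so nothing has to be indexed twice.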