What does a sample document look like? Why have you overridden these parameters?
These types of limits are generally in place for a good reason, so by increasing them dramatically you are asking for trouble. I would recommend you reconsider how you store data in Elasticsearch so you can go back to the default settings.
My use case requires me to increase the field limit. It started off with around 20k fields but has since grown to 100k, and it is estimated to reach around 600k. The retention policy is around 2 months.
Since newer data wasn't being written (fields weren't being added/saved), I increased the various limits until the issue was resolved.
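For reference, the setting usually involved here is `index.mapping.total_fields.limit` (default 1000). Raising it looks something like this (the index name is a placeholder, not from this thread):

```json
PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 100000
}
```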
My current issue is that I require searching my data using wildcards. I can search just fine without them, but I get errors when I use a wildcard (such as the one I mentioned earlier). Even when I test on a search I know returns only a few dozen hits, I get the same error.
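For illustration, the kind of query that fails is a standard wildcard query, roughly like the following (not my actual query; index and field names are placeholders):

```json
GET my-index/_search
{
  "query": {
    "wildcard": {
      "service.name": {
        "value": "web-*"
      }
    }
  }
}
```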
Why is the field count growing so much? What does a sample document look like?
If you are going to work with Elasticsearch I believe you need to change how you use it and change the document structure. Your settings are so far beyond what is recommended that I am not surprised that you are running into problems. I would also not be surprised if you start running into cluster stability and performance issues as the cluster state is going to be quite large with mappings that size.
If you can share some sample documents the community might be able to provide some suggestions on how to best restructure the data to align with how Elasticsearch works.
If you cannot change the structure of the data, Elasticsearch may not be a suitable tool to use. Maybe you should look into using something else?
Fields are being fetched from Icinga. It's not growing fast; it's more that additional hosts and resources are being added. Moreover, many of the hosts can add/remove/modify their individual workloads a couple of hundred times a day if required, all of which is being monitored, with the appropriate perf data saved to Elasticsearch.
I can't reveal confidential sample data, but I can provide a matching pattern. The data below is what Icinga provides, written to Elasticsearch using elasticwriter.
How many of these data sets are generated per day?
Do you have any other data associated with this data set, e.g. timestamp, source id etc?
Elasticsearch works best with few keys and many values, so one way to solve this would be to break this data set up into many documents looking something like this:
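As a concrete sketch of that key-value layout (all field names here are illustrative, not taken from the original data):

```json
{
  "@timestamp": "2024-05-01T12:00:00Z",
  "source_id": "host-01",
  "key": "disk.used_pct",
  "value": 73.4
}
```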
You can then map the key field as keyword (for exact match) and have a wildcard multi-field underneath for efficient wildcard matching.
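Such a mapping could look roughly like this (index and field names are assumptions; the `wildcard` field type requires Elasticsearch 7.9+):

```json
PUT kv-index
{
  "mappings": {
    "properties": {
      "key": {
        "type": "keyword",
        "fields": {
          "wildcard": { "type": "wildcard" }
        }
      },
      "value": { "type": "double" }
    }
  }
}
```

A wildcard search would then target `key.wildcard` instead of `key`.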
If the values are all related and you need to retrieve them all in one query, you can also consider storing the key-value pair documents above as a nested field. This will make updates more expensive, but if your data is immutable it might be an option, although it will complicate the query syntax a bit.
I understand what you're saying, but my scenario is different. As I'm limited to using elasticwriter to grab data from Icinga, my data is being saved in the format below:
Essentially the situation is flipped, where I have fewer documents, but more data per document.
On average, I would have around 3k documents per 5-minute span, where each document could have between 20 and 1000 values (only around 100-150 such documents have over 50 values).
I do not think that approach will work or scale. You are likely to face a lot of different issues down the line with it. If you are going to use Elasticsearch you will need to find a way to transform the data. Logstash would be able to do this, so if you can find a way to direct your data to Logstash that could solve the problem. Alternatively, you might be able to assign a default ingest pipeline to the indices through an index template and perform the transformation that way. As far as I know ingest pipelines are not able to split documents, but it may be possible to transform the data into a nested structure using e.g. a script processor so it looks something like this:
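As an illustration of that nested shape (field names invented for the sketch), a transformed document might be:

```json
{
  "@timestamp": "2024-05-01T12:00:00Z",
  "host": "host-01",
  "metrics": [
    { "key": "load1", "value": 0.42 },
    { "key": "mem.used_pct", "value": 61.8 }
  ]
}
```

with the `metrics` field mapped as type `nested`, so each key-value pair is indexed as its own hidden sub-document.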
This is probably what I would try doing as it still allows your ingest process to write the current data directly to the index in the current format.
I think this should work and also scale and perform quite well. It would also allow you to have a small strict mapping, which avoids a lot of cluster state updates and increases stability.
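A small strict mapping for that structure could look something like this (all names are placeholders; `"dynamic": "strict"` rejects documents containing unmapped fields instead of letting the mapping grow):

```json
PUT metrics-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "host": { "type": "keyword" },
      "metrics": {
        "type": "nested",
        "properties": {
          "key": {
            "type": "keyword",
            "fields": {
              "wildcard": { "type": "wildcard" }
            }
          },
          "value": { "type": "double" }
        }
      }
    }
  }
}
```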
Noted. Thank you! However, that is an issue for a later time. The current issue is that I am unable to use wildcards in my search. The large data shouldn't be a problem.
In fact, I have attempted to run the wildcard query on a document I know only has 4 values, and it still failed.
Moreover, I have set up a similar environment (trying multiple versions) on a separate VM, and it accepts wildcard searches. But it doesn't work in the environment I'm currently developing in.
I would recommend you change the approach immediately. I do not have the time nor energy to help troubleshoot your query issue as it uses an IMHO flawed approach and in my mind is a waste of time. Good luck!