Elasticsearch Query Error

Hi,
I'm facing this error while querying data. Is there any solution for this?

Error
type: query_shard_exception
reason: failed to create query: field expansion for [*] matches too many fields, limit: 1024, got: 20703
index_uuid: kElrsci_TC2hgP-SYXoFKw
index: tsdb-2023.05.04
caused_by type: illegal_argument_exception
caused_by reason: field expansion for [*] matches too many fields, limit: 1024, got: 20703

What is the query that is causing the error? What is the mapping of that index?

Which version of Elasticsearch are you using?

Thank you.
I'm using Elasticsearch 7.17.
The query and index settings are below; the mapping is dynamic.

Query
host.keyword : "openstack-cluster-non-exposed" AND check_command : "check_serversOS" AND check_result.cpu.value:*

Setting
{
  "tsdb-2023.04.14" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "mapping" : {
          "nested_fields" : {
            "limit" : "60000"
          },
          "depth" : {
            "limit" : "60000"
          },
          "total_fields" : {
            "limit" : "6000000"
          },
          "dimension_fields" : {
            "limit" : "6000"
          }
        },
        "number_of_shards" : "5",
        "provided_name" : "tsdb-2023.04.14",
        "creation_date" : "1681412458823",
        "number_of_replicas" : "0",
        "uuid" : "h40EMKtxSDmxAXppDTWHhQ",
        "version" : {
          "created" : "7170999"
        }
      }
    }
  }
}

What does a sample document look like? Why have you overridden these parameters?

These types of limits are generally present for a good reason, so by increasing them dramatically you are asking for trouble. I would recommend you reconsider how you store data in Elasticsearch so you can go back to the default settings.

My use case requires me to increase the field limit. It started off with around 20k fields but has since grown to around 100k, and it's estimated to reach around 600k. The retention policy is around 2 months.

Since newer data wasn't being written (fields weren't being added/saved), I increased the various limits until the issue was resolved.

My current issue is that I need to search my data using wildcards. I can search just fine without them, but I get errors when I use a wildcard (such as in the query I mentioned earlier). Even when I test with a search I know only has a few dozen hits, I get the same error.

Why is the field count growing so much? What does a sample document look like?

If you are going to work with Elasticsearch I believe you need to change how you use it and change the document structure. Your settings are so far beyond what is recommended that I am not surprised that you are running into problems. I would also not be surprised if you start running into cluster stability and performance issues as the cluster state is going to be quite large with mappings that size.

If you can share some sample documents the community might be able to provide some suggestions on how to best restructure the data to align with how Elasticsearch works.

If you cannot change the structure of the data, Elasticsearch may not be a suitable tool to use. Maybe you should look into using something else?

The fields are being fetched from Icinga. It's not growing fast; it's more that additional hosts and resources are being added. Moreover, many of the hosts can add/remove/modify their individual workloads a couple hundred times a day if required, all of which is being monitored, with the corresponding perf data saved to Elasticsearch.

I can't reveal confidential sample data, but I can provide a matching pattern. The data below is the data provided by Icinga and written to Elasticsearch using elasticwriter.

Sample data pattern (provided by icinga):
rgen_11_1=1 rgen_11_2=2 rgen_11_3=3 rgen_11_4=4 rgen_11_5=5 rgen_11_6=6 rgen_11_7=7 rgen_11_8=8 rgen_11_9=9 rgen_11_10=10 rgen_11_11=11 rgen_11_12=12 rgen_11_13=13 rgen_11_14=14 rgen_11_15=15 rgen_11_16=16 rgen_11_17=17 rgen_11_18=18 rgen_11_19=19 rgen_11_20=20 rgen_11_21=21 rgen_11_22=22 rgen_11_23=23 rgen_11_24=24 rgen_11_25=25 rgen_11_26=26 rgen_11_27=27 rgen_11_28=28 rgen_11_29=29 rgen_11_30=30 rgen_11_31=31 rgen_11_32=32 rgen_11_33=33 rgen_11_34=34 rgen_11_35=35 rgen_11_36=36 rgen_11_37=37 rgen_11_38=38 rgen_11_39=39 rgen_11_40=40 rgen_11_41=41 rgen_11_42=42 rgen_11_43=43 rgen_11_44=44 rgen_11_45=45 rgen_11_46=46 rgen_11_47=47 rgen_11_48=48 rgen_11_49=49 rgen_11_50=50 rgen_11_51=51 rgen_11_52=52 rgen_11_53=53 rgen_11_54=54 rgen_11_55=55 rgen_11_56=56 rgen_11_57=57 rgen_11_58=58 rgen_11_59=59 rgen_11_60=60 rgen_11_61=61 rgen_11_62=62 rgen_11_63=63 rgen_11_64=64 rgen_11_65=65 rgen_11_66=66 rgen_11_67=67 rgen_11_68=68 rgen_11_69=69 rgen_11_70=70 rgen_11_71=71 rgen_11_72=72 rgen_11_73=73 rgen_11_74=74 rgen_11_75=75 rgen_11_76=76 rgen_11_77=77 rgen_11_78=78 rgen_11_79=79 rgen_11_80=80 rgen_11_81=81 rgen_11_82=82 rgen_11_83=83 rgen_11_84=84 rgen_11_85=85 rgen_11_86=86 rgen_11_87=87 rgen_11_88=88 rgen_11_89=89 rgen_11_90=90 rgen_11_91=91 rgen_11_92=92 rgen_11_93=93 rgen_11_94=94 rgen_11_95=95 rgen_11_96=96 rgen_11_97=97 rgen_11_98=98 rgen_11_99=99 rgen_11_100=100 rgen_11_101=101 rgen_11_102=102 rgen_11_103=103 rgen_11_104=104 rgen_11_105=105 rgen_11_106=106 rgen_11_107=107 rgen_11_108=108 rgen_11_109=109 rgen_11_110=110 rgen_11_111=111 rgen_11_112=112 rgen_11_113=113 rgen_11_114=114 rgen_11_115=115 rgen_11_116=116 rgen_11_117=117 rgen_11_118=118 rgen_11_119=119 rgen_11_120=120 rgen_11_121=121 rgen_11_122=122 rgen_11_123=123 rgen_11_124=124 rgen_11_125=125 rgen_11_126=126 rgen_11_127=127 rgen_11_128=128 rgen_11_129=129 rgen_11_130=130 rgen_11_131=131 rgen_11_132=132 rgen_11_133=133 rgen_11_134=134 rgen_11_135=135 rgen_11_136=136 rgen_11_137=137

I need to run queries along the following lines:

  1. Get all data where the field name begins with rgen and ends with 5 (so rgen_11_5, rgen_10_5, etc.)
  2. Get all data where the field name contains 11 in the middle (so rgen_11_1, rgen_11_2, and so on)
  3. Get all data where the value is 15 (so rgen_5_11=15, rgen_8_1255=15, and so on)

How many of these data sets are generated per day?

Do you have any other data associated with this data set, e.g. timestamp, source id etc?

Elasticsearch works best with few keys and many values, so one way to solve this would be to break this data set up into many documents looking something like this:

{
  "set_id": "abcdef1234",
  "key": "rgen_11_99",
  "value": 99,
  "@timestamp": "2023-05-04T08:00:00Z",
  ...
}

You can then map the key field as keyword (for exact match) and have a wildcard multi-field underneath for efficient wildcard matching.
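If it helps, a minimal mapping sketch for such documents could look roughly like this (the index name and the double type for value are placeholders; the wildcard field type was only added in 7.9, and if it is not available in your distribution a plain keyword field with wildcard queries would also work, just less efficiently):

Mapping sketch
PUT rgen-kv-example
{
  "mappings": {
    "properties": {
      "set_id":     { "type": "keyword" },
      "@timestamp": { "type": "date" },
      "key": {
        "type": "keyword",
        "fields": {
          "wildcard": { "type": "wildcard" }
        }
      },
      "value": { "type": "double" }
    }
  }
}

A search for "field names starting with rgen and ending in 5" could then be a single wildcard query on key.wildcard, e.g. { "query": { "wildcard": { "key.wildcard": "rgen_*_5" } } }, with no field expansion involved.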

If the values are all related and you do need to get them all in one query you can also consider storing the key-value pair documents above as a nested field. This will make updates more expensive, but if your data is immutable it might be an option although it will complicate query syntax a bit.
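To give a feel for the extra query syntax, assuming the key/value pairs sit under a nested field called data (names are placeholders, this is only a sketch), getting all pairs whose field name ends in 5 and whose value is 15 would look something like this:

Query sketch
GET rgen-kv-example/_search
{
  "query": {
    "nested": {
      "path": "data",
      "query": {
        "bool": {
          "must": [
            { "wildcard": { "data.key": "rgen_*_5" } },
            { "term": { "data.value": 15 } }
          ]
        }
      }
    }
  }
}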

I understand what you're saying, but my scenario is different. As I'm limited to using elasticwriter to grab data from Icinga, my data is being saved in the format below:

Sample format
{
  "host": "host1",
  "server": "server1",
  "service": "service1",
  "rgen_11_99": 1,
  "rgen_11_100": 2,
  "rgen_11_101": 3,
  "rgen_12_45": 4,
  "@timestamp": "2023-05-04T08:00:00Z",
  ...
}

Essentially the situation is flipped, where I have fewer documents, but more data per document.

On average, I would have around 3k documents per 5-minute span, where each document could have between 20 and 1000 values (only around 100-150 such documents have over 50 values).

I do not think that approach will work or scale. You are likely to face a lot of different issues down the line with that approach. If you are going to use Elasticsearch you will need to find a way to transform the data. Logstash would be able to do this, so if you can find a way to direct your data to Logstash that could solve the problem. You might also be able to assign a default ingest pipeline to the indices through an index template and perform the transformation that way. As far as I know ingest pipelines are not able to split documents, but it may be possible to transform the document into a nested structure using e.g. a script processor so it looks something like this:

{
  "host": "host1",
  "server": "server1",
  "service": "service1",
  "data": [
    {"key": "rgen_11_99", "value": 1},
    {"key": "rgen_11_100", "value": 2},
    {"key": "rgen_11_101", "value": 3},
    {"key": "rgen_12_45", "value": 4}
  ],
  "@timestamp": "2023-05-04T08:00:00Z",
  ...
}
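A rough sketch of such a script processor (the pipeline name, the rgen_ prefix check, and the data field name are all assumptions; it only illustrates the idea):

Pipeline sketch
PUT _ingest/pipeline/rgen-to-nested
{
  "description": "Move flat rgen_* fields into a data array of key/value objects (sketch)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "List data = new ArrayList(); List keys = new ArrayList(); for (entry in ctx.entrySet()) { if (entry.getKey().startsWith('rgen_')) { Map kv = new HashMap(); kv.put('key', entry.getKey()); kv.put('value', entry.getValue()); data.add(kv); keys.add(entry.getKey()); } } for (k in keys) { ctx.remove(k); } ctx['data'] = data;"
      }
    }
  ]
}

The pipeline could then be attached to the indices with the index.default_pipeline setting in your index template, so documents are restructured on ingest without changing the writer.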

This is probably what I would try doing as it still allows your ingest process to write the current data directly to the index in the current format.

I think this should work and also scale and perform quite well. It would also allow you to have a small strict mapping, which avoids a lot of cluster state updates and increases stability.
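For reference, a strict mapping for that nested structure might look roughly like this (the field types are assumptions based on your sample document, and the wildcard multi-field is optional but would help with wildcard searches on the key names):

Mapping sketch
{
  "dynamic": "strict",
  "properties": {
    "host":       { "type": "keyword" },
    "server":     { "type": "keyword" },
    "service":    { "type": "keyword" },
    "@timestamp": { "type": "date" },
    "data": {
      "type": "nested",
      "properties": {
        "key": {
          "type": "keyword",
          "fields": { "wildcard": { "type": "wildcard" } }
        },
        "value": { "type": "double" }
      }
    }
  }
}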

Noted, thank you! However, that is an issue for a later time. The current issue is that I am unable to use wildcards in my search. The large data shouldn't be a problem.

In fact, I have attempted to run the wildcard query on a document I know only has 4 values, and it still failed.

Moreover, I have set up a similar environment (trying multiple versions) on a separate VM, and it accepts wildcard searches. But it doesn't work in the environment I'm currently developing in.

It is.

I would recommend you change the approach immediately. I do not have the time nor energy to help troubleshoot your query issue as it uses an IMHO flawed approach and in my mind is a waste of time. Good luck!

