I have an application that writes many documents to Elasticsearch. After a year or two, I figured out that some of these documents are spam, because they contain huge strings (more than 10,000 characters). For example, one document looks like this:
Message: "bnfgbjkcywcbyftetzodbpgipcdoxgedjxqbfmcjiwlkceyehnwpwhlcfpbivaflaphvlplgeqirctmdyyoasqhhgfopvktgeupughwrteqadrlcmeauxktggoopycijrwenoesdtewvkgsdhafptepxqfidgdpjozvqafbkkshoiokaosqypwxpmttgzntpbdnk...[ up to 10000 characters ]"
The `Message` field is mapped both as `text` and as a `keyword` sub-field. When such a document gets into the index, only the `text` field is indexed, because the `keyword` sub-field has `"ignore_above": 256` set by default.
I wanted to find documents whose `Message` value is longer than 1000 characters, but I can't run a script against `Message.keyword`, because for these documents `Message.keyword` does not exist; only `Message` (as `text`) does.
```json
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['Message'].value.length() > 1000",
            "lang": "painless"
          }
        }
      }
    }
  }
}
```
This throws an error saying that text fields are not optimized for operations that require per-document field data, like aggregations and sorting.
If I use `doc['Message.keyword'].value.length() > 1000` instead, I get an error because those documents have no value for `Message.keyword` (the string is longer than 256 characters, so it was never indexed as a keyword).
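The closest workaround I have found is that an `exists` query on `Message.keyword` misses exactly these documents, since no keyword value was indexed for them. Something like this (a sketch; it only catches values longer than 256, not an arbitrary threshold):

```json
GET my-index/_search
{
  "query": {
    "bool": {
      "must": { "exists": { "field": "Message" } },
      "must_not": { "exists": { "field": "Message.keyword" } }
    }
  }
}
```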
Any idea how else I can find these documents that contain a single long string (greater than N characters)? I would like to reindex into a new index without those documents, and apply a proper mapping so that in the future such documents are rejected by Elasticsearch.
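In case it helps, this is my rough plan for the cleanup (a sketch; `my-index-v2` and the pipeline name are placeholders I made up): reindex everything except the documents caught by the filter above, and put a `fail` ingest processor in front of the new index so oversized documents are rejected at ingest time.

```json
POST _reindex
{
  "source": {
    "index": "my-index",
    "query": {
      "bool": {
        "must_not": {
          "bool": {
            "must": { "exists": { "field": "Message" } },
            "must_not": { "exists": { "field": "Message.keyword" } }
          }
        }
      }
    }
  },
  "dest": { "index": "my-index-v2" }
}

PUT _ingest/pipeline/reject-long-messages
{
  "processors": [
    {
      "fail": {
        "if": "ctx.Message instanceof String && ctx.Message.length() > 1000",
        "message": "Message longer than 1000 characters rejected"
      }
    }
  ]
}
```

The pipeline would then be attached to the new index via its `index.default_pipeline` setting. I'm not sure this is the idiomatic way to do it, so corrections are welcome.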
Thanks!