Is there any character limit on term filter?

I'm trying to check whether a given attachment has already been indexed, so that I can avoid reindexing the document. The problem is that the term filter doesn't work for longer field values, and I couldn't find any mention of this limitation in the documentation. For the same document,

GET myIndex/_search
{
  "_source": ["attachmentId", "attachment.title"]
  , "from": 0,
  "size": 100,
  "query": {
    "term": {
      "attachment.title.keyword":"Developers_Handbook"
    }
  }
}

works but

GET myIndex/_search
{
  "_source": ["attachmentId", "attachment.title"]
  , "from": 0,
  "size": 100,
  "query": {
    "term": {
      "attachmentId.keyword":"ANGjdJ8AZZwMqSNtE2blw2GPnXsjn3Zuo8iytsY_G3Lz24iZM6eASgf2iPaBxyfcY7LO_GHjZc8oo20EwAl-9kH_-fhA37ciPoQgzoGob4JdoA3Drhu_OJ9Kz997duCVoz6fj_U5vDs3XMm76wQXXx5X6RguxCkojGwsBG2GZYPWsrbSHT81lxHd1_GSk6J9vh4PLazlhSaU8pZtjt_NyQo7gMuT0FHUmi9MZ63ivxQ6IJk1SL48GPnavBGKSi16FHlb4Kh_9n_Zt4fY6rXsJlxxNHH7QAPpuIT751X37dIEtSJHTDXjNQICd9Y0KUuiLop28FX9P8lZn-n_ZRqojsTnsdL9p4AJQy_EYA_a24yUROmwfRLmPCUVlYsE1_wS60DHi3wRHY8JYjNe8fkZ"
    }
  }
}

this does not. Is this expected behaviour? Also, is there a better way to do this?

What is the mapping?

{
  "myIndex" : {
    "mappings" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "author" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "content" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "date" : {
              "type" : "date"
            },
            "language" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "title" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "attachmentId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "data" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Is the comparison failing because ignore_above is set to 256?

Yes.
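
With ignore_above: 256, values longer than 256 characters are kept in _source but never indexed into the keyword sub-field, so an exact term match against attachmentId.keyword finds nothing. If you did want to keep querying that sub-field directly, a minimal sketch (assuming you can update the mapping and reindex documents ingested before the change) would be to raise the limit above the length of your IDs:

# raise ignore_above on the existing multi-field (512 is just an example value)
PUT myIndex/_mapping
{
  "properties": {
    "attachmentId": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 512
        }
      }
    }
  }
}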

But I'd probably try something else: compute a signature (a hash) for the file and store that, instead of searching for a term value that can be very, very big. I don't think it's a good idea to index the base64 content, or even to store it in Elasticsearch.

In the FSCrawler project, I compute such a signature for every file I send to Elasticsearch, and I only compare the signatures.
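
To make that concrete, here is a minimal sketch of the signature approach (the signature field name and the hash value are placeholders, not something prescribed by FSCrawler): store a short, fixed-length hash, for example a SHA-256 hex digest computed client-side, in a dedicated keyword field, and run a cheap term query on it before reindexing.

# hypothetical dedicated field for the file signature
PUT myIndex/_mapping
{
  "properties": {
    "signature": { "type": "keyword" }
  }
}

# before (re)indexing, check whether this signature is already present
GET myIndex/_search
{
  "size": 0,
  "terminate_after": 1,
  "query": {
    "term": { "signature": "<sha256 hex digest of the file>" }
  }
}

If the total hit count is non-zero, the file is already indexed. A 64-character hex digest also stays comfortably under the default ignore_above of 256.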


Thanks a lot. That base64 value will actually always be 404 characters long, but it's probably a bad idea nevertheless. I'll keep that in mind when implementing this; for now I was just doing a POC.
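
Side note, assuming the ID really is always 404 characters: that is below Elasticsearch's 512-byte limit on _id, so another option would be to use the attachmentId itself as the document _id and test for existence directly:

# returns 200 if the document exists, 404 otherwise
HEAD myIndex/_doc/<the 404-character attachmentId>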
