Searching for an exact string in big fields (Lucene limitation)

I spent a long time trying to figure out a way to use Elasticsearch for exact string searches, like we normally do in MySQL: SELECT * FROM mytable WHERE long_text_field LIKE "%search_text%"

I am talking about fields longer than 32KB, so I cannot use the keyword data type or anything similar that relies on putting the whole text into a single term.

This matters because exact string search is important for our data entry people. If it is not possible with Elasticsearch, I will never be able to use Elasticsearch as the sole data storage system; I will have to keep a MySQL schema and sync it with Elasticsearch, even though my data is really simple and does not need a relational database.

I tried to use scripting to search, but loading a non-indexed field into fielddata or doc_values is not supported, and I didn't find the field in _source either.

Fast exact string search is also important, as we are planning to open the data entry control panel to a very large number of volunteers. But if it is only possible at slow speed, that may still be better than nothing.

Any ideas?

Doing

like "%search_text%"

is not an exact string search.

Could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you want to do. Please try to keep the example as simple as possible.

@dadoonet Thanks for your response.

I am sorry if I used some terms incorrectly, and I have no problem calling this search case something else if you suggest a name. But to me, the name still fits: I want to search for an exact string, without any analysis, in a big field, one bigger than 32KB.

Is that possible, and how? I don't pretend to be an expert, but I have scanned the documentation and the reference several times without being able to figure out a single way to do it.

I want to search for an exact string, without any analysis in a big field, bigger than 32KB.

I think that the key point here is without any analysis.

Exact string means to me that if you indexed "field": "FoObAr", you can search with a term query on "field": "FoObAr", whereas "field": "ObA" won't match, as it's not the exact string.

Anyway, it seems that you want to do full-text search on a non-analyzed field, which means that you should create a mapping like this:

"full_name": {
   "type":  "text",
   "analyzer": "keyword"
}
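
A minimal sketch of what I mean, using the mapping above (the index name test_exact and the document value are just placeholders):

PUT test_exact
{
    "mappings" : {
        "type1" : {
            "properties" : {
                "full_name" : { "type" : "text", "analyzer" : "keyword" }
            }
        }
    }
}
PUT test_exact/type1/1
{
    "full_name" : "FoObAr"
}
GET test_exact/_search
{
    "query" : {
        "term" : { "full_name" : "FoObAr" }
    }
}

The keyword analyzer indexes the whole value as a single token, so the term query above matches, while a term query on "ObA" or "foobar" does not.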

But I'm almost sure it's a bad idea. If you want to search for a phrase or something like that, then it's something you should do in another way.

By this I mean that it would be super useful (if you need help) if you shared some examples of the documents you want to index and the typical text your users are going to search for.

"content": {
   "type":  "text",
   "analyzer": "keyword"
}

Thanks very much for your response.

Your example is exactly my requirement, but when I tried it I hit the 32KB limit for a single keyword. My documents have fields that are bigger than 32KB, and the content of these fields can be XML, HTML, plain text, and even binary, and I need to search for exact strings inside them, with highlighting if possible. All of this is pretty simple in MySQL, but I didn't find a way to do it in Elasticsearch, other than breaking each field into multiple fields with a 100-byte overlap and limiting the search string to a maximum of 100 bytes, which is not acceptable to me.

So I am searching for another way, if possible.

As an example document:

{
    "id": 1,
    "name": "Document title",
    "content": "<html><body><h1>Header</h1></body></html>"
}

The document has a content property whose value is an HTML document.
The search string can be: ">Header<"
It must match the exact string, regardless of whether the letters are upper or lower case.

I need exactly the equivalent of SELECT * FROM table WHERE content LIKE "%>Header<%", where content is bigger than 32KB.

In order to make this fast and scalable, I suspect you will need to be able to rely on an index instead of using brute force on the source. Might you perhaps be able to use an ngram tokeniser where you specify all (or most) character classes to be used for tokenisation, in order to allow exact matches to be found even across word boundaries and special characters?
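
Something along these lines is what I have in mind; a rough sketch only (the index name ngram_test, the analyzer names and the gram size of 3 are arbitrary choices of mine, and the right settings would need testing against your real data). The token_chars list keeps letters, digits, whitespace, punctuation and symbols inside tokens, and the lowercase filter makes matching case-insensitive:

PUT ngram_test
{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "exact_ngram" : {
                    "type" : "ngram",
                    "min_gram" : 3,
                    "max_gram" : 3,
                    "token_chars" : [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
                }
            },
            "analyzer" : {
                "exact_substring" : {
                    "tokenizer" : "exact_ngram",
                    "filter" : [ "lowercase" ]
                }
            }
        }
    },
    "mappings" : {
        "type1" : {
            "properties" : {
                "content" : { "type" : "text", "analyzer" : "exact_substring" }
            }
        }
    }
}
GET ngram_test/_search
{
    "query" : {
        "match_phrase" : { "content" : ">header<" }
    }
}

Because each ngram gets its own position, the match_phrase query only matches when the grams of the search string occur consecutively in the document, which approximates a substring match across word boundaries and special characters.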

Thanks for your suggestion.
I think the ngram tokenizer will generate a huge amount of data. Imagine a field with 1MB of content and fixed-width ngrams of 100 characters each (if we limit the search string to 100 characters): that produces roughly one 100-character term per position, so the index for this field would cost on the order of 100 times the original 1MB. I don't think this is going to be acceptable either.

And by the way, speed is not that important for my case; it is acceptable to get the result whenever it is ready.

I wish I had a tokenizer that splits blindly every 32KB - 300 bytes (with a configurable overlap). Is that available?
If it is, I would use it on a special sub-field, and when I want to search for an exact string I would query that sub-field. I also wish highlighting could be provided for it.

To me it seems like a bit of an odd way to search for data, as it is very sensitive to upper and lower case issues etc., but then I do not know your domain. Is this how the users have said they want to search for data, or is it just what they have gotten used to due to the limitations of the MySQL-based system?

MySQL is limited in full-text search, but for exact string search it is good enough.
Yes, my users need to search for exact strings without any analysis, and without distinguishing capital and small letters.

My domain is a technical one; my users are technical people who need to search for artifacts or patterns alongside full-text search. Because of this, I am currently forced to use MySQL as the single source of truth and Elasticsearch as a caching layer. But if Elasticsearch could give me this feature, I would be able to remove MySQL completely and rely on Elasticsearch as the single source of truth. I don't think I need any other MySQL features for the simple schema I have.

Could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

With the typical data you will have, and the typical text your users will search for?

PUT test
{
    "mappings" : {
        "type1" : {
            "properties" : {
                "field" : { "type" : "keyword" }
            }
        }
    }
}
PUT test/type1/1
{
    "field" : "<html><body>ANY CONTENTS LONGER TO MAKE THE TOTAL LENGTH MORE THAN 32KB which may contains any kind of information, numbers, names, words, sentences, html code, javascript code, xml code, you name it1234567890</body></html>"
}


{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"field\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48]...', original message: bytes can be at most 32766 in length; got 51000"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Document contains at least one immense term in field=\"field\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48]...', original message: bytes can be at most 32766 in length; got 51000",
    "caused_by": {
      "type": "max_bytes_length_exceeded_exception",
      "reason": "bytes can be at most 32766 in length; got 51000"
    }
  },
  "status": 400
}

As you can see, I am hitting the 32KB limit when I use a keyword data type.

The actual content can be any HTML document you like. I don't have permission to submit real document content, but it is general HTML, with a different language per field.

Now, without the ability to store long values in the keyword data type, I think I will not be able to use any partial matching techniques, right?

Now trying "index": false:

PUT test1
{
    "mappings" : {
        "type1" : {
            "properties" : {
                "field" : { "type" : "text", "index": false }
            }
        }
    }
}
PUT test1/type1/1
{
    "field" : "<html><body>ANY CONTENTS LONGER TO MAKE THE TOTAL LENGTH MORE THAN 32KB which may contains any kind of information, numbers, names, words, sentences, html code, javascript code, xml code, you name it1234567890</body></html>"
}

It is successful, but when searching:

GET /test1/_search
{
  "query": {
    "wildcard" : { "field" : "*234*" }
  }
}


{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "failed to create query: {\n  \"wildcard\" : {\n    \"field\" : {\n      \"wildcard\" : \"*234*\",\n      \"boost\" : 1.0\n    }\n  }\n}",
        "index_uuid": "vPNInbApSWCqnJJijHSrGQ",
        "index": "test1"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test1",
        "node": "ajoCRgAOR0icCeAMmR7LYw",
        "reason": {
          "type": "query_shard_exception",
          "reason": "failed to create query: {\n  \"wildcard\" : {\n    \"field\" : {\n      \"wildcard\" : \"*234*\",\n      \"boost\" : 1.0\n    }\n  }\n}",
          "index_uuid": "vPNInbApSWCqnJJijHSrGQ",
          "index": "test1",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Cannot search on field [field] since it is not indexed."
          }
        }
      }
    ]
  },
  "status": 400
}

As you can see, if I don't index the field, it is stored, but I am not allowed to use the wildcard query on it.

No. That's wrong.

It's a matter of analysis. An end user would probably search for 234 and would expect to find anything that contains it.

Using ngram is IMHO the way to go as already proposed.

Using keyword for that is wrong. <html><body>ANY CONTENTS LONGER TO MAKE THE TOTAL LENGTH MORE THAN 32KB... is not a "keyword".

Thanks for your response, but I still see ngram as unacceptable, since it will emit a lot of data, unless I am missing something.

According to my understanding of ngram, it generates a list of terms, and then I can use the wildcard query, right?

But let us imagine that I enforce a limit of 100 characters on the maximum allowed search string length: what should the min/max be for the ngram tokenizer, and how big would the index be?

I don't feel I can pick any reasonable min/max without generating a lot of data. I think it would have to be 100 for both min and max, and then a 1MB field would need about 100MB of index data. If I have 1,000,000 documents like that, I will need an extra 100TB just to index the actual 1TB of data. I think MySQL is a much better solution in this case, even with the sync needed between MySQL and Elasticsearch.

The thing is that if you have a string like

<html><body>ANY CONTENTS LONGER TO MAKE THE TOTAL LENGTH MORE THAN 32KB

A search engine like elasticsearch will index something like:

  • any
  • contents
  • longer
  • to
  • make
  • ....

Then when the user searches for "any contents" or "any make" or "contents any" or "an onten", elasticsearch might or might not give results, depending on how you as a developer implement the query.
Add to this that sorting is done by default on relevancy, so you always return the best results first.
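
For instance (assuming the field were mapped as a regular analyzed text field rather than keyword or "index": false), a match_phrase query only matches when the analyzed terms occur next to each other and in order:

GET test/_search
{
    "query" : {
        "match_phrase" : { "field" : "any contents" }
    }
}

With the standard analyzer this would match your sample document, while "contents any" would not, and a plain match query would match both.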

So I'd really like you to share a real example, i.e. a json document, plus what a "normal" end user would enter in the search box, and for each case which documents you expect to get back and in which order.

From that I'll be happy to help. Sometimes implementing a fuzzy query is enough instead of using ngrams.

In any case:

  • Elasticsearch is not designed to run "like '%foo%'" queries efficiently, and if you are using that you are probably misunderstanding how search engines work. I say that because when I discovered elasticsearch 7 years ago, I was this guy. :slight_smile:
  • Elasticsearch might have to produce more data on disk depending on the use case, but sometimes this is the price to pay to get extremely fast results.
  • I have a hard time believing that you can search with a wildcard like this in MySQL or any other datastore in less than 50ms. My experience (and the reason I switched my search from Oracle / PostgreSQL to elasticsearch 7 years ago) was that a like query on Oracle took more than 30 seconds, while I was under 20ms with elasticsearch.

My 2 cents. I'll be happy to hear from you if you can share a gist on GitHub
