Best Practices for storing and searching text in XML documents

I would like to know what the best practices are for storing a large text blob in an Elasticsearch document field and executing text queries against it. I am particularly concerned with storing the text of an XML file in that field.

Background Context: I work on a Rails app backed by a Postgres database. One of the DB tables stores an entire XML file in a text column called content, among other columns for story metadata. There is a web portal where users can search the records in this table. One of the search fields is a text input box that finds all records whose content contains the entered string. In SQL, this is simply implemented as WHERE content LIKE '%USER_SEARCH_QUERY%'.

Search performance for these records has been historically poor with Postgres, so my team integrated Elasticsearch to help speed up searching. We have imported all of these records into an Elasticsearch index. The Rails models are converted into JSON and inserted into the index, utilizing dynamic mapping. Now when users execute a search, we construct a JSON request body for the query and hit the Search API.
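To make the setup concrete, each record ends up in the index roughly like this (the index name stories, the title field, and the XML snippet are all illustrative; the real content column holds a full XML file):

POST stories/_doc
{
  "title": "Example story",
  "content": "<Story><SomeField>000.111.222</SomeField></Story>"
}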

The Problem: We have had trouble implementing an Elasticsearch query that matches the exact behaviour of the SQL query written above. At first, we were using the match_phrase query. This appeared to work well for the majority of cases, but there were still instances where a search query that returned results against SQL returned no results from Elasticsearch. For example, if the XML content contained the element <SomeField>000.111.222</SomeField> and the user searched for 111, then no results would be found. We concluded that the default analyzer was not splitting this value into separate tokens, so there was no 111 token to match.
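For reference, the match_phrase request we were sending looked roughly like this (index name illustrative, using the example search above):

GET stories/_search
{
  "query": {
    "match_phrase": {
      "content": "111"
    }
  }
}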

The next attempt was to use query_string. Whatever text was entered into the input field would be surrounded by asterisks and then have special characters escaped with backslashes. The JSON for the query would look like this for the above example: { "query_string": { "query": "content:*111*" } }. Now, search queries against XML elements with numeric bodies are working properly. Unfortunately, there are still cases where Elasticsearch querying on XML text is not working. Remaining issues include searching UUIDs (alphanumeric characters separated by dashes) and date strings (e.g. ISO8601 timestamps).
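For example, a search for part of a UUID such as c264e835-6c8c ends up as a request roughly like this after the asterisks and escaping are applied, and it returns no hits:

{ "query_string": { "query": "content:*c264e835\\-6c8c*" } }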

I know that there are still more experiments I can try to improve Elasticsearch text search on XML blobs (I'm considering converting the XML into JSON and storing it as a nested structure under the content key, then running the query_string search across the content.* fields; a rough sketch of that query is below). I also suspect that I cannot rely on dynamic mapping and that I'll need to specify something more specific for content, like a custom analyzer. However, I'm at the point where I would like to engage the community for help. What else can I do to properly store this XML text blob and implement search that exactly matches wildcard searching in SQL?
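For the nested-JSON idea, the query I have in mind would look something like this (untested; the field pattern assumes dynamic mapping creates sub-fields under content):

{
  "query_string": {
    "query": "*111*",
    "fields": ["content.*"]
  }
}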

Thank you for reading this post and helping me out.

Hi Evan,

Your question is not an easy one. Getting to good search results is a process that takes time, and that begins with understanding exactly what the search requirements are. But let me try to give you some pointers (without knowing all your requirements of course :)).

You're right about the analyzer. And I would go the route of defining an analyzer that provides you with the search capabilities that you need. For example, a custom analyzer that uses the Char Group Tokenizer. This tokenizer allows you to break strings into terms on whatever characters you specify. You could for example break on "-", ".", ">", "<", and " " with a custom analyzer like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "-",
            ".",
            ">",
            "<",
            " "
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
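To check what the analyzer produces, you can run your XML through the _analyze API; for your 000.111.222 example it should yield the tokens somefield, 000, 111, 222 and /somefield:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<SomeField>000.111.222</SomeField>"
}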

Now you can index documents like this:

POST my_index/_doc
{
  "my_field": "<SomeField>000.111.222</SomeField>"
}

POST my_index/_doc
{
  "my_field": "<SomeField>c264e835-6c8c-421d-8f0a-99ff3ebc4023</SomeField>"
}

And query the data like this:

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "111"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "c264e835"
    }
  }
}

Personally, I would not go the query_string route with wildcards, as this will be slow on large datasets. But if you do want to go that route, you may want to consider mapping the content field as a keyword field instead of a text field, so no text analysis is applied at all.
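A minimal sketch of that keyword-based approach could look like this (index name illustrative):

PUT my_keyword_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "keyword"
      }
    }
  }
}

GET my_keyword_index/_search
{
  "query": {
    "wildcard": {
      "content": {
        "value": "*111*"
      }
    }
  }
}

Keep in mind that a keyword field indexes each value as a single term, so very large XML blobs may run into term-size limits; you would want to test this against your real data.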
