I would like to know the best practices for storing a large text blob in an Elasticsearch document field and executing text queries against it. I am particularly concerned with storing the text of an XML file in that field.
Background Context: I work on a Rails app backed by a Postgres database. One of the DB tables stores an entire XML file in a text column called `content`, among other columns holding story metadata. There is a web portal where users can search the records of this table. One of the search fields is a text input box for finding all records that contain a specific string. In SQL, this is simply implemented as `WHERE content LIKE '%USER_SEARCH_QUERY%'`.
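To be explicit about the behaviour we need Elasticsearch to replicate: Postgres `LIKE '%...%'` is a plain, case-sensitive substring match over the raw text. In Ruby terms (illustrative only, not our actual code):

```ruby
# Equivalent of WHERE content LIKE '%USER_SEARCH_QUERY%' over in-memory rows.
records = [
  { id: 1, content: "<SomeField>000.111.222</SomeField>" },
  { id: 2, content: "<Other>abc</Other>" }
]

query = "111"
matches = records.select { |r| r[:content].include?(query) }
matches.map { |r| r[:id] }
# => [1]
```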
Search performance for these records has historically been poor with Postgres, so my team integrated Elasticsearch to speed up searching. We imported all of these records into an Elasticsearch index: the Rails models are serialized to JSON and inserted into the index, relying on dynamic mapping. Now when users execute a search, we construct a JSON request body for the query and hit the Search API.
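For context, the indexing step is essentially the following (a minimal sketch; the helper and field names here are illustrative, not our exact production code):

```ruby
require "json"

# Illustrative sketch: build the JSON document indexed for one story.
# The real app serializes the whole Rails model; `content` holds the raw
# XML blob and is mapped dynamically (so it ends up as a `text` field).
def to_indexed_document(story)
  {
    id:      story[:id],
    title:   story[:title],
    content: story[:content]
  }
end

doc = to_indexed_document(
  id: 1,
  title: "Example story",
  content: "<Story><SomeField>000.111.222</SomeField></Story>"
)
puts JSON.generate(doc)
```

This body is then sent to the index API, and each user search POSTs a query body to the Search API.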
The Problem: We have had trouble implementing an Elasticsearch query that matches the exact behaviour of the SQL query written above. At first we used the `match_phrase` query. This appeared to work well for the majority of cases, but there were still instances where a search that returned results in SQL returned nothing in Elasticsearch. For example, if the XML content contained the element `<SomeField>000.111.222</SomeField>` and the user searched for `111`, then no results would be found. We concluded that the default analyzer was not tokenizing this element into multiple tokens.
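For reference, the failing `match_phrase` request body looked roughly like this (simplified):

```json
{
  "query": {
    "match_phrase": {
      "content": "111"
    }
  }
}
```

My understanding is that the standard analyzer keeps a dotted number like `000.111.222` together as a single token, so the phrase `111` never matches it.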
The next attempt was to use `query_string`. Whatever text is entered into the input field is surrounded by asterisks, with special characters escaped by backslashes. The JSON for the query looks like this for the above example: `{ "query_string": { "query": "content:*111*" } }`. Now, search queries against XML elements with numeric bodies work properly. Unfortunately, there are still cases where Elasticsearch querying on XML text fails. Remaining issues include searching for UUIDs (alphanumeric characters separated by dashes) and date strings (e.g. ISO 8601 timestamps), presumably because the analyzer splits those values into multiple tokens at the punctuation.
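The escaping and wildcard wrapping we apply is roughly the following (a sketch; the reserved-character handling and method names are my own here, not our exact production code):

```ruby
# Escape Lucene query_string reserved characters with backslashes,
# then wrap the user's input in wildcards and scope it to `content`.
def escape_reserved(text)
  text.gsub(%r{([\\+\-=><!(){}\[\]^"~*?:/])}) { "\\#{Regexp.last_match(1)}" }
end

def build_query_string_body(user_input)
  {
    query: {
      query_string: {
        query: "content:*#{escape_reserved(user_input)}*"
      }
    }
  }
end

build_query_string_body("111")
# query becomes "content:*111*"
```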
I know that there are still more experiments I can try to improve Elasticsearch text search on XML blobs (I'm considering converting the XML into JSON and storing it as a nested structure on the `content` key, then running the `query_string` search on the `content.*` fields). I'm also predicting that I can't rely on dynamic mapping and will need to specify something more specific for `content`, like a custom analyzer. However, I'm at the point where I would like to engage the community for help. What else can I do to properly store this XML text blob and implement search that exactly matches wildcard searching in SQL?
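Concretely, the nested-JSON idea I'm considering would query something like this (untested sketch; it assumes the XML has already been flattened into sub-fields under `content`):

```json
{
  "query": {
    "query_string": {
      "query": "*111*",
      "fields": ["content.*"]
    }
  }
}
```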
Thank you for reading this post and helping me out.