Searching Elasticsearch - Query help

With so many different types of search options, I'm struggling to know the best way to query my indexes.

For reference, here is a sample document from my index:

{
  "_index": "general",
  "_type": "general",
  "_id": "5904",
  "_version": 18,
  "found": true,
  "_source": {
    "author_name": "John Doe",
    "entry_text": "<p>Lorem ipsum dolor sit amet [...]</p>",
    "comments": [
      {
        "author_name": "John Doe",
        "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean eget ex libero. Interdum et malesuada fames ac ante ipsum primis in faucibus. Nullam sed aliquam ligula. Quisque sit amet elementum velit. Donec in ante eget mauris lobortis tincidunt ac non mi.",
        "author_id": 1,
        "timestamp": 1486417016417
      },
      {
        "author_name": "John Doe",
        "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean eget ex libero.\n\nQuisque sit amet elementum velit.",
        "author_id": 1,
        "timestamp": 1486417159233
      }
    ],
    "created": 1486409437884,
    "tags": [
      "bugs"
    ],
    "products": [
      {
        "model": "IRGASON",
        "id": 4923,
        "title": "Integrated CO2 and H2O Open-Path Gas Analyzer and 3-D Sonic Anemometer",
        "url": "irgason"
      }
    ],
    "entry_title": "Lorem Ipsum dolor sit amet",
    "files": [
      {
        "author_name": "John Doe",
        "extension": "txt",
        "attachment": {
          "content_type": "text/plain; charset=windows-1252",
          "language": "et",
          "content": "Lorem ipsum dolor sit amet [...]",
          "content_length": 2988
        },
        "name": "5904_20170209084729.txt",
        "description": "test",
        "author_id": "1",
        "timestamp": 1486680449324
      }
    ],
    "links": [],
    "author_id": 1,
    "entry_type": "General",
    "updated": 1486655343160,
    "entry_id": 45
  }
}

I want to be able to full-text search the following fields:
entry_title (boost - High)
entry_text (boost - Medium)
files.description (inner_hits)
files.attachment.content (boost - Medium) (inner_hits)
comments.text
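
As a rough sketch of how the boosts in this list could be expressed, here is a simple_query_string request body built in Python. The boost numbers (3 and 2) and the function name build_entry_query are assumptions for illustration, not values from the original post. The files.* fields are left out on purpose: if they are mapped as nested, a root-level query like this one would not be expected to match them.

```python
import json

def build_entry_query(user_input):
    """Build a search body that hands raw user input to simple_query_string."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_input,
                # "field^N" applies a per-field boost; 3 and 2 are
                # illustrative guesses for "High" and "Medium".
                "fields": ["entry_title^3", "entry_text^2", "comments.text"],
            }
        }
    }

# Quoted phrases, + and - are passed through for Elasticsearch to interpret.
body = build_entry_query('"gas analyzer" +sonic -bugs')
print(json.dumps(body, indent=2))
```

The resulting dictionary is only the request body; how it is sent (HTTP client, official client library) is left open.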

The other items (tags, products, authors, etc.) are used only as filters, which I have no problem figuring out.

I really like the simplicity and flexibility of simple_query_string. However, I would also like to use inner_hits to query my files/attachment fields and bring back the matching files for display in search results. As far as I can tell, you cannot use inner_hits with simple_query_string. Is that correct?

I want my end users to be able to search using quoted values, +, -, etc. If I don't use simple_query_string, would I have to parse the search terms myself and map them to phrase matching, must, and must_not clauses in order to reproduce what simple_query_string does while still being able to request inner_hits results?

It seems like it could get very complicated to accurately parse input from a single query field and then map to the appropriate query parts when the simple_query_string already does this.

What approach should I take on this? What is the most common approach to these types of search problems?

Thank you for your assistance!

For anyone looking to do something similar, this is what I ended up doing:

Since I can't use inner_hits with simple_query_string, I decided I needed to solve my problem by changing how I index my documents.

So, instead of indexing files/attachments as nested objects in my main documents, I have separated them out into individual documents. I also included a reference to the parent entry in each one in order to maintain that relationship.
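
A minimal sketch of that split, not the poster's actual code: it takes one entry's _source, removes the files array, and turns each file into its own document carrying a reference to the parent. The function name split_entry and the field name parent_entry_id are hypothetical.

```python
def split_entry(entry_id, source):
    """Return (entry_doc, attachment_docs) built from one entry's _source."""
    # The entry keeps everything except the embedded files array.
    entry_doc = {key: value for key, value in source.items() if key != "files"}

    attachment_docs = []
    for file_obj in source.get("files", []):
        doc = dict(file_obj)                 # shallow copy; the input is not modified
        doc["parent_entry_id"] = entry_id    # hypothetical field name
        attachment_docs.append(doc)
    return entry_doc, attachment_docs

source = {
    "entry_title": "Lorem Ipsum dolor sit amet",
    "files": [{"name": "5904_20170209084729.txt", "description": "test"}],
}
entry_doc, attachment_docs = split_entry("5904", source)
print(entry_doc)
print(attachment_docs)
```

Each returned dictionary would then be indexed as a separate document.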

With this setup, I can now use all of the nice functionality of simple_query_string, search my attachment content, and have the matching attachments show up in results.

Thanks for sharing your solution! :smiley:

One more thing I did to keep things tidied up with this approach.

I wanted to keep the relationship between file attachments and their parent documents easy to use, so I indexed the parent id and parent entry title into each attachment document. This works great and keeps things simple on the query side. The one problem is that parent entry titles may occasionally change.

To handle this, my application logic checks, each time an entry title changes, whether there are any attachments for that entry. If there are, I run a simple script that collects a reference to each attachment and then builds a _bulk API request that updates the parent entry title field in each attachment/file document. This keeps things in sync and works quite well.
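
For anyone wanting a starting point, here is one way such a _bulk payload could be assembled. This is a sketch under assumptions: the function name, the attachment ids, and the field name parent_entry_title are made up for illustration. The _bulk API expects newline-delimited JSON, with an action line followed by a partial-document line for each update, and a trailing newline at the end of the body. Older versions that still use mapping types also expect _type in the action line, so it is included only when given.

```python
import json

def build_title_sync_bulk(index, attachment_ids, new_title, doc_type=None):
    """Build an NDJSON _bulk body that updates the title on each attachment."""
    lines = []
    for attachment_id in attachment_ids:
        meta = {"_index": index, "_id": attachment_id}
        if doc_type is not None:
            meta["_type"] = doc_type  # only for versions with mapping types
        lines.append(json.dumps({"update": meta}))
        # Partial update: only the denormalized title field is touched.
        lines.append(json.dumps({"doc": {"parent_entry_title": new_title}}))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)

payload = build_title_sync_bulk("general", ["att-1", "att-2"], "New entry title")
print(payload)
```

The string would be sent as the body of a POST to the _bulk endpoint with the content type application/x-ndjson.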
