Looking for advice on an Elasticsearch index approach, proximity search on it, and storing additional data to be returned with the search

Preamble:

Thanks in advance for any help and advice on this. I've dug around for anything that feels like a similar use case and not found it, and it's fairly basic, so it makes me wonder...

I've recently picked up Elasticsearch for a project, but I have a use case where I'm not entirely sure my approach is sensible.

I would really like to leverage the speed of Elasticsearch, the Lucene indexing, etc.

The data I am storing in it will very rarely be updated, and I need to be able to do full-text searches on all of it (proximity searches, etc.).

Now, I understand that generally with a full text search, the content would be stored in the index document, and you would do a search on the entire string, and that would return the docs that match.

However, I want to store information along with each word, so that I can quickly identify where it is in a document (by coordinates) at the same time as searching for it.

I already have the data I need for this, but indexing it correctly, to be able to do what I need it to, is where I have a few concerns and questions...

My Approach:

An approach I have tried, which actually works perfectly in many respects, is to index all the words in a separate index, with a document for each word (allowing me to store the additional information I want).

For example:

index: files/ (for clarity will call this index ES-F)

I index a "file" (document, PDF, etc.) there, and I let ES generate an ID.

Then I have looked at this in two ways:

  1. I use the ID to create indices for the file's words (to identify which file they came from), creating a document for each word and all of its additional data:

index: word_GENERATED-FILE-ID (for clarity will call this index ES-W)

  2. I use the ID as part of the data stored in the word index document (to identify which file it came from):

index: word/ (for clarity will call this index ES-W also)

document: source: { file_id: GENERATED-FILE-ID, string: 'word', etc ... }
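To make the second option (file ID stored in each word document) a bit more concrete, here is a minimal Python sketch of how I picture the word documents being built before indexing. Only `file_id` and `string` come from the document shape above; the `page`, `position`, `x` and `y` fields and the `build_word_docs` helper are hypothetical names I've made up for illustration, and the sketch does not talk to ES at all.

```python
from typing import Dict, List, Tuple

# One extracted word: (text, x, y). The coordinate layout is an assumption.
Word = Tuple[str, float, float]


def build_word_docs(file_id: str, pages: List[List[Word]]) -> List[Dict]:
    """Build one ES-W style document per word.

    `pages` is a list of pages, each page a list of words in reading order.
    `page` and `position` are zero-based and hypothetical field names.
    """
    docs = []
    for page_no, words in enumerate(pages):
        for position, (text, x, y) in enumerate(words):
            docs.append(
                {
                    "file_id": file_id,    # ID generated by ES for the file
                    "string": text,        # the word itself
                    "page": page_no,       # which page the word is on
                    "position": position,  # index of the word within the page
                    "x": x,                # additional data (coords)
                    "y": y,
                }
            )
    return docs


example_docs = build_word_docs(
    "GENERATED-FILE-ID",
    [[("Lorem", 10.0, 20.0), ("ipsum", 55.0, 20.0)], [("dolor", 10.0, 20.0)]],
)
```

Each dictionary in `example_docs` would then be indexed as its own document in ES-W.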

This allows me to perform lots of different searches, directly on the ES-W index or on the specific per-file word index I need, but only for a single word.

For many of my use cases this works fine: I search for the word, get the additional data, great!

But if I need to look for phrases etc., this obviously doesn't work.

(Note: This ES cluster will only be used for this sort of data, files and words and nothing else.)

My First Few Questions:

  1. Is this an approach that makes sense / can you understand what I am trying to achieve with it?
  2. Is there a way to join documents in a query (in this case, multiple word strings), run a search on the joined result, and still get back the individual documents that made up the match? Almost like creating a view in SQL (I know this is a bit hopeful).
  3. Can you recommend a different approach?

I'm just not sure how I could retrieve any additional data if I were to store all the text within the ES-F index document itself.

I do know that storing it that way is what would let me run any search I want across the body of the text, though.

A Potential Approach

I have also thought about creating an ES-F index something like the following (or even perhaps an additional index for page_words or similar (ES-P) if I am unable to work out the page index from a search on the below):

file/

{
  source: {
    pages: [
      {
        content: ["Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis et risus ac eros bibendum pretium fringilla non nulla. Aliquam eget pulvinar lectus, a accumsan ex. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nulla facilisi. Etiam in lobortis purus. Nulla ultricies molestie massa, vitae blandit tellus accumsan vitae. Duis accumsan augue ac porttitor finibus. Pellentesque quis finibus enim, eget volutpat purus."]
      },
      {
        content: ["Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis et risus ac eros bibendum pretium fringilla non nulla. Aliquam eget pulvinar lectus, a accumsan ex. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nulla facilisi. Etiam in lobortis purus. Nulla ultricies molestie massa, vitae blandit tellus accumsan vitae. Duis accumsan augue ac porttitor finibus. Pellentesque quis finibus enim, eget volutpat purus."]
      }
    ]
  }
}

Which would allow me to do more complex searches on pages (but obviously not across pages).
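For reference, this is roughly the kind of search body I imagine sending for a per-page proximity search, written as a Python dictionary. It is only a sketch under assumptions: it assumes the `pages` field is mapped with the `nested` type (my mapping above does not say so), it uses the `match_phrase` query with its `slop` parameter for proximity, and `page_proximity_query` is a hypothetical helper name.

```python
import json


def page_proximity_query(phrase: str, slop: int = 0) -> dict:
    """Build a search body matching `phrase` within a single page.

    Assumes `pages` is a nested field whose `content` holds the page text.
    `slop` is the number of position moves allowed between the terms.
    """
    return {
        "query": {
            "nested": {
                "path": "pages",
                "query": {
                    "match_phrase": {
                        "pages.content": {"query": phrase, "slop": slop}
                    }
                },
                # Ask ES to also return which nested page(s) matched.
                "inner_hits": {},
            }
        }
    }


body = page_proximity_query("elastic search", slop=2)
print(json.dumps(body, indent=2))
```

As far as I understand it, each inner hit on a nested field carries a `_nested` entry with an `offset`, which would give me the position of the matching page in the `pages` array, but I have not verified this against my own data.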

Then, from that search, I would return the page number and use it to develop a way to pull the correct words out of the ES-W index...

  1. When first processing the words and their additional data, I would add them to, for example, an ArrayList for each page, determine each word's index in that page's ArrayList, and include this index in the ES-W document's additional data.

  2. Then, after a proximity search on the ES-F index, I could take the retrieved page and its content and do something similar to step 1, giving me a list of indexes for each word searched for:

eg. Search for "Elastic Search"

"Lorem ipsum dolor sit amet elastic search adipiscing elit"

    When processing for **ES-W** we'd get a (zero-based) index of:

- 5 for elastic
- 6 for search

    This index would be stored in **ES-W** along with the additional data.

  3. Then with the page number, and the ArrayList (for explanation purposes) index, I could identify exactly which words the more complex searches such as proximity search have found.
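The steps above could be sketched in Python roughly as follows. This is a simplified illustration, not a finished design: the tokenizer is a naive whitespace/punctuation splitter that would have to agree with whatever analyzer ES actually uses, the matching only handles exact adjacent phrases (no slop), and the `page` and `position` field names and both helper functions are hypothetical.

```python
import string
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: split on whitespace, strip punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            tokens.append(word)
    return tokens


def phrase_positions(content: str, phrase: str) -> List[List[int]]:
    """Return the zero-based word indexes of every exact phrase occurrence."""
    tokens = tokenize(content)
    terms = tokenize(phrase)
    if not terms:
        return []
    hits = []
    for start in range(len(tokens) - len(terms) + 1):
        if tokens[start:start + len(terms)] == terms:
            hits.append(list(range(start, start + len(terms))))
    return hits


def lookup_words(word_docs: List[Dict], file_id: str, page: int,
                 positions: List[int]) -> List[Dict]:
    """Pick the ES-W style documents for the given page and word indexes."""
    wanted = set(positions)
    return [
        doc for doc in word_docs
        if doc["file_id"] == file_id
        and doc["page"] == page
        and doc["position"] in wanted
    ]


page_text = "Lorem ipsum dolor sit amet elastic search adipiscing elit"
found = phrase_positions(page_text, "Elastic Search")
print(found)  # [[5, 6]]
```

In practice the `lookup_words` filter would be a query against ES-W on `file_id`, `page` and `position` rather than a loop over an in-memory list.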

More questions:

Well, not including the questions above, there's really just one...

Does the potential approach I have suggested sound like a good approach, or does it feel too complicated?

Part of me feels it's quite simple in reality and could work really well, but I'm new to Elasticsearch so any advice is massively welcomed, even if you're telling me I'm completely wrong!!!

Oh and apologies for the lengthy and wordy question, I wasn't sure how best to keep this brief but also give all the information I think you need to understand what I'm wanting to do!
