Integrate file and db search


(Mauro) #1

Hi all,
I'm considering Elasticsearch to replace our "db only" search in a PHP/Oracle application.
I have an Oracle db which contains various data types, and also files stored in blob fields. I want to index each row as a single unit (various fields + blob) and run a full-text search over it, including the file content (pdf, doc, xls, etc.). Alternatively, I can move the files out of the db into the file system and keep only a pointer to each file in the db.
Is this possible with Elasticsearch? I have found little information about integrating heterogeneous data...
Thank you,
Mauro


(David Pilato) #2

Yes, this is doable. It is essentially what I did at my former company, except that the data came from PostgreSQL.

You have to separate the concerns first:

  • Index objects
  • Index attachments (blobs)

For objects, you need to understand what you want to search and how. As an example, if you want to search for tweets, then index tweets. If you need to search tweets by Twitter handle, then index the Twitter handle as a field within the tweet documents. Don't try to recreate the relations you have in a relational database.

For blobs, you will have to send the blob to Elasticsearch as a BASE64-encoded string.
Before 5.0.0, you can use the mapper-attachments plugin, but since you are starting a new project, my advice is to develop against the 5.0.0-alpha5 version. In that case you can use the ingest-attachment plugin, which is much more flexible.
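
To make that a bit more concrete, here is a minimal sketch in Python of the two request bodies involved: an ingest pipeline that uses the attachment processor, and a document carrying the blob as BASE64. The field name "data", the helper names and the sample bytes are only assumptions for illustration, and nothing is actually sent to a cluster here.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Body for an ingest pipeline that extracts text from a BASE64 field.

    "data" is an assumed field name; the attachment processor reads the
    BASE64 content from whichever field is named here.
    """
    return {
        "description": "Extract text from a BASE64-encoded file",
        "processors": [
            {"attachment": {"field": source_field}},
        ],
    }


def build_document(row_id, company, file_bytes, source_field="data"):
    """Document for one db row: plain fields plus the blob encoded as BASE64."""
    return {
        "id": row_id,
        "company": company,
        source_field: base64.b64encode(file_bytes).decode("ascii"),
    }


pipeline = build_attachment_pipeline()
doc = build_document("1", "elastic", b"%PDF-1.4 sample bytes")

print(json.dumps(pipeline, indent=2))
print(json.dumps(doc, indent=2))
```

You would register the pipeline once and then index each document through it, so the text extraction happens on the Elasticsearch side.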

Hope this helps.


(Mauro) #3

Thank you for your answer... I'm already considering starting the project with the 5.0.0-alpha5 version.
I only want to understand whether it's possible to search text fields and binary content at the same time. If I understand correctly, this is the procedure:

  1. search in text: return the id of the row containing the term
  2. search in binary: return the id of the row containing the term
  3. merge the two results removing duplicates
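
On the application side, step 3 could look roughly like the sketch below. It is written in Python for brevity, the function name merge_ids and the sample ids are made up, and it assumes each search simply returns a list of row ids.

```python
def merge_ids(text_hits, binary_hits):
    """Merge two lists of row ids, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for row_id in list(text_hits) + list(binary_hits):
        if row_id not in seen:
            seen.add(row_id)
            merged.append(row_id)
    return merged


# Hypothetical ids returned by the text search and by the binary search.
merged = merge_ids(["1", "7", "3"], ["3", "9", "1"])
print(merged)  # ['1', '7', '3', '9']
```

Note that this keeps the ids but loses any relevance ordering across the two result sets.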

Is there a way to achieve the same result with only one query?
Thank you


(David Pilato) #4

You can't JOIN in elasticsearch so you have two choices:

  • run 2 queries and merge the results in your application
  • index attachments as part of your top-level entity object

For example, I was indexing something like:

{
  "id": "1",
  "company": "elastic",
  "attachments": [
    {
      "url": "path/to/myfile.pdf",
      "content": "BASE64 content when using mapper-attachments or extracted text with ingest-attachment"
    },
    {
      "url": "path/to/myfile.docx",
      "content": "BASE64 content when using mapper-attachments or extracted text with ingest-attachment"
    }
  ]
}

Something along those lines.
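
With a document shaped like that, a single query can cover both the plain fields and the extracted file text. As a rough sketch (Python only builds the request body here; the field names "company" and "attachments.content" are assumed to match the example document and the extracted text is assumed to be stored in "content", and "invoice" is just a sample term), a multi_match query would do:

```python
import json


def build_search_body(term, fields=("company", "attachments.content")):
    """Search body matching `term` in the plain field and the attachment text."""
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": list(fields),
            }
        }
    }


body = build_search_body("invoice")
print(json.dumps(body, indent=2))
```

Each hit is then the whole top-level entity, so there is nothing to merge or deduplicate in the application.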

HTH

