Index (ingest attachment) update in Elasticsearch, or storing multiple documents in the same index


(Gaurav Garg) #1

Hi Team,

I am new to Elasticsearch.

I am using the ingest attachment plugin to index PDF, Excel, etc. documents, through the REST client. Is it possible to update an index?

I mean: my first POST request indexes my first PDF document, and a second POST request against the same index should index a second PDF document. That is, it would append the 2nd PDF document to the first one.

Thank you.


(David Pilato) #2

Hmmm, I don't think it's possible.

Actually, the update API does not currently support ingest pipelines.

For now, I believe you need to send the full document again, which means both PDFs.
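For reference, if several PDFs really have to live in one document, the documented pattern is to put them in an array field and run the attachment processor under a `foreach` processor. The sketch below only builds that pipeline body; the field names (`attachments`, `data`) are illustrative, not from this thread.

```java
public class ForeachAttachmentPipeline {

    // Pipeline body that applies the attachment processor to every element
    // of an `attachments` array (field names are illustrative).
    static String buildPipeline() {
        return "{\"description\":\"Extract every attachment in the array\","
             + "\"processors\":[{\"foreach\":{"
             + "\"field\":\"attachments\","
             + "\"processor\":{\"attachment\":{"
             + "\"field\":\"_ingest._value.data\","
             + "\"target_field\":\"_ingest._value.attachment\""
             + "}}}}]}";
    }

    public static void main(String[] args) {
        System.out.println(buildPipeline());
        // PUT /_ingest/pipeline/attachments with this body, then index one
        // document whose `attachments` field is an array of {"data": "<base64>"}.
    }
}
```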


(Gaurav Garg) #3

@dadoonet

Thank you very much for your feedback.

Yes, I need to send the full document again, i.e. both PDFs.

In my first POST request I create the pipeline and then index a Base64 document. Later, per my use case, I want to upload another Base64 PDF document into the same index, so that a user can upload documents at any time, and a search shows how many files contain the search/query keyword.

As of now I am doing POST requests with a different index value for each document, and I search with the respective index value.

I found https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs-update-by-query.html but I am not sure whether I can update an index with it.

Can I do it with re-indexing in ingest attachment?

Is there any other way I can do it?

I am using the Java REST client.

Thank you 🙂

Regards,
Gaurav


(Gaurav Garg) #4

The following code sample creates the ingest pipeline:

```java
boolean create_pipeline() {
    // The REST client talks to the HTTP port (9200), not the transport port (9300).
    RestClient requester = RestClient.builder(new HttpHost("localhost", 9200)).build();
    HttpEntity body = new NStringEntity("{ \"description\" : \"Extract attachment information\",\n" +
            "  \"processors\": [\n" +
            "    {\n" +
            "      \"attachment\": {\n" +
            "        \"field\": \"data\",\n" +
            "        \"indexed_chars\": -1\n" +
            "      }\n" +
            "    }\n" +
            "  ]\n" +
            "}\n", ContentType.APPLICATION_JSON);
    try {
        requester.performRequest("PUT", "/_ingest/pipeline/attachment", Collections.<String, String>emptyMap(), body);
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
    return true;
}
```

The following code sample creates an index using the ingest attachment plugin:

```java
public void IndexComputeController(QueryIndexer computeIndex) {
    create_pipeline();

    // Same here: connect to the HTTP port only.
    RestClient requester = RestClient.builder(new HttpHost("localhost", 9200)).build();
    HttpEntity body = new NStringEntity("{" +
            "\"data\":\"" + encoder(filePath) + "\"" + "}", ContentType.APPLICATION_JSON);
    HashMap<String, String> param = new HashMap<String, String>();
    param.put("pipeline", "attachment");
    Response httpresponse = requester.performRequest("PUT", "/" + index + "/_update_by_query/doc/1", param, body);
    System.out.println("POST response is: " + httpresponse);
}
```
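As a hedged aside: `_update_by_query` is not an indexing endpoint, so a plain `PUT /{index}/doc/{id}?pipeline=attachment` is the usual way to push one Base64 document through the pipeline. The helper below only builds the endpoint and request body (the index and id values are illustrative); the actual `performRequest` call is shown in a comment.

```java
import java.util.Base64;

public class AttachmentRequestBuilder {

    // Body the attachment pipeline expects: {"data": "<base64 of the file>"}
    static String buildBody(byte[] fileBytes) {
        return "{\"data\":\"" + Base64.getEncoder().encodeToString(fileBytes) + "\"}";
    }

    // Plain index endpoint for one document -- not _update_by_query.
    static String endpoint(String index, String id) {
        return "/" + index + "/doc/" + id;
    }

    public static void main(String[] args) {
        byte[] pdfBytes = "fake pdf bytes".getBytes();
        System.out.println(endpoint("userdocs", "1"));
        System.out.println(buildBody(pdfBytes));
        // With the low-level REST client, against a cluster on port 9200:
        // HttpEntity body = new NStringEntity(buildBody(pdfBytes), ContentType.APPLICATION_JSON);
        // requester.performRequest("PUT", endpoint("userdocs", "1"),
        //         Collections.singletonMap("pipeline", "attachment"), body);
    }
}
```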

(Gaurav Garg) #5

@dadoonet

Hi David,

In the above code, if I use

```java
Response httpresponse = requester.performRequest("PUT", "/" + index + "/doc/_update_by_query", param, body);
```

to keep the same index for all POST requests from the REST client (each POST request sends a different PDF document to be indexed), then this update only works sometimes. Sometimes it completely replaces the document indexed by my 2nd POST request, and sometimes the one indexed by my 3rd.

Is this the right way to do it?

Could you suggest the right way to index documents into the same index for all POST requests? I am using the ingest attachment plugin to parse my PDF documents.

Thanks 🙂


(David Pilato) #6

Not sure I follow all the posts here.

In short:

  • You can't currently use the update API with ingest pipelines.
  • You need to provide all the documents again, even if you only want to add a single one.

Do you have the original binary documents when you want to add a new one?


(Gaurav Garg) #7

Hi David,

Yes, I have all the binary documents when I upload a new one, but re-sending them all hurts performance badly.

My use case is built on the ingest attachment plugin: a user can upload a new file (PDF, DOC, etc.) to the cloud at any time, and if I re-encode the binary of every file ever uploaded on each new upload, it results in a performance penalty.

Should I manually parse the files with the Tika parser and use the old API, instead of the ingest attachment plugin, for this index-update case?

Thanks,

Regards,
Gaurav


(David Pilato) #8

Why do you want to store that in a single document then?

Just create a new document like:

PUT index/doc/1
{
  "user": {
    // User details
  },
  "binary": "your BASE64 here for foo.txt",
  "filename": "foo.txt"
}

Then when the user uploads a new document, just index it as:

PUT index/doc/2
{
  "user": {
    // User details
  },
  "binary": "your BASE64 here for bar.txt",
  "filename": "bar.txt"
}
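The pattern above can be sketched as a small helper: each upload becomes its own document carrying the user, the filename, and the Base64 payload. The field names follow the example; everything else (class and method names) is illustrative.

```java
import java.util.Base64;

public class UploadDocBuilder {

    // One document per uploaded file: user id, filename, and Base64 payload.
    static String buildDoc(String userId, String filename, byte[] fileBytes) {
        return "{"
             + "\"user\":\"" + userId + "\","
             + "\"filename\":\"" + filename + "\","
             + "\"data\":\"" + Base64.getEncoder().encodeToString(fileBytes) + "\""
             + "}";
    }

    public static void main(String[] args) {
        // Each new upload would get a fresh document id (a counter or UUID).
        System.out.println(buildDoc("gaurav", "foo.txt", "hello".getBytes()));
    }
}
```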

(Gaurav Garg) #9

Hi David,

In my cloud application I have many different users. I thought of storing all documents from the same user in the same index.

You are right:

I can use a different id, under the same user, for each stored document.

Now, in your suggested example, I can keep the id in a separate variable (only for tracking during search/query operations), and when the user wants to search, I can iterate over all the ids in the same index and return the information for each id whose document matches.


(David Pilato) #10

Is it a question?


(Gaurav Garg) #11

Hi David,

The last part of my previous comment was a question. I am creating a single index per user; when the user uploads a new PDF or any other file, I increment my id and store the Base64 binary document in my index, and the next time the same user stores a document I increment the id again, and so on.

While searching, I iterate in a loop from 1 to the maximum id number for that user and search the documents.

Is the above approach right? Am I missing anything?

These may be silly questions; I am new to Elasticsearch.


(Gaurav Garg) #12

Hi David,

Just now I stored documents with different ids in the same index.

When I performed a search/query operation I did not give an id number; I just gave the index and type, and the JSON response contained the id number along with other information like the file name, etc.

```java
Response indexResponse = requester.performRequest("GET", "/" + index + "/doc/_search", Collections.<String, String>emptyMap(), body);
```

Am I missing anything in the search operation? I am supposed to do it without an id, right?

Thanks,

Regards,
Gaurav


(David Pilato) #13

Not sure I follow.

Basically, if you want to search for documents for a given user, just wrap your query inside a bool query using a must clause, then add a filter clause which filters the results with the user id.
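That suggestion can be sketched as follows. The helper only builds the search body; the field names (`attachment.content` for the extracted text, `user` for the owner) are assumptions, not from this thread.

```java
public class UserScopedQuery {

    // Bool query: full-text match in `must`, user restriction in `filter`.
    static String buildQuery(String keyword, String userId) {
        return "{\"query\":{\"bool\":{"
             + "\"must\":{\"match\":{\"attachment.content\":\"" + keyword + "\"}},"
             + "\"filter\":{\"term\":{\"user\":\"" + userId + "\"}}"
             + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("elasticsearch", "gaurav"));
        // POST /{index}/doc/_search with this body returns only that user's
        // matching documents -- no per-id loop is needed.
    }
}
```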


(Gaurav Garg) #14

I got it David,

Thank you very much for your help 🙂


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.