Searching content doesn't show exact output

Hi elastic,

I'm indexing some text files into Elasticsearch. I've installed the ingest-attachment plugin, created the mapping, and successfully indexed a few text files into ES; the file contents are Base64 encoded.

Now I'm running a search query for the content "Hello World", but the query is not giving the expected output. Could you help me write a search query that returns only the particular matching content out of all the files I've indexed?

    PUT _ingest/pipeline/attachment
    {
      "description": "Extract attachment information from arrays",
      "processors": [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "target_field": "_ingest._value.attachment",
                "field": "_ingest._value.data"
              }
            }
          }
        }
      ]
    }

    PUT company/employee/my_id?pipeline=attachment
    {
      "attachments": [
        {
          "filename": "test.txt",
          "data": "dGVzdCBvbmUgZmlsZQ=="
        },
        {
          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg=="
        },
        {
          "filename": "test2.txt",
          "data": "dGVzdDNmaWxlIHF3ZXJ0eQ=="
        },
        {
          "filename": "test3.txt",
          "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA=="
        }
      ]
    }

    GET company/employee/_search
    {
      "query": {
        "match_phrase": {
          "attachments.attachment.content": "hello world"
        }
      },
      "highlight": {
        "fields": {
          "attachments.attachment.content": {}
        }
      }
    }

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

    PUT _ingest/pipeline/attachment
    {
      "description": "Extract attachment information from arrays",
      "processors": [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "target_field": "_ingest._value.attachment",
                "field": "_ingest._value.data"
              }
            }
          }
        }
      ]
    }

    PUT company/employee/my_id?pipeline=attachment
    {
      "attachments": [
        {
          "filename": "test.txt",
          "data": "dGVzdCBvbmUgZmlsZQ=="
        },
        {
          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg=="
        },
        {
          "filename": "test2.txt",
          "data": "dGVzdDNmaWxlIHF3ZXJ0eQ=="
        },
        {
          "filename": "test3.txt",
          "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA=="
        }
      ]
    }

    GET company/employee/_search
    {
      "query": {
        "match_phrase": { "attachments.attachment.content": "hello world" }
      }
    }

Output:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "my_id",
        "_score": 0.5753642,
        "_source": {
          "attachments": [
            {
              "filename": "test.txt",
              "data": "dGVzdCBvbmUgZmlsZQ==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "et",
                "content": "test one file",
                "content_length": 14
              }
            },
            {
              "filename": "test1.txt",
              "data": "IkhlbGxvIFdvcmxkIg==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "it",
                "content": """"Hello World"""",
                "content_length": 14
              }
            },
            {
              "filename": "test2.txt",
              "data": "dGVzdDNmaWxlIHF3ZXJ0eQ==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "et",
                "content": "test3file qwerty",
                "content_length": 17
              }
            },
            {
              "filename": "test3.txt",
              "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA==",
              "attachment": {
                "content_type": "text/plain; charset=windows-1252",
                "language": "et",
                "content": """
test
test4
test44
test444
""",
                "content_length": 29
              }
            }
          ]
        }
      }
    ]
  }
}

It's very tough to paste code into the </> fields, so I've pasted the output as a blockquote.

I'm trying to query for the particular content "Hello World", and the expected output should give only that file's information and its metadata:

          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg==",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "it",
            "content": """"Hello World"""",
            "content_length": 14


But when I perform this query:

    GET company/employee/_search
    {
      "query": {
        "match_phrase": { "attachments.attachment.content": "hello world" }
      }
    }

I'm seeing the metadata of all the files I've indexed, which I was not expecting.

You indexed a single document containing all of the attachments.

One part of the document matches, which means that the whole document matches.
That's why you are seeing everything.

My advice is to create one document per attachment instead.
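
Something along these lines, for example (just a sketch that reuses the `attachment` pipeline you already have; the document IDs `doc1` and `doc2` are made up):

```
PUT company/employee/doc1?pipeline=attachment
{
  "attachments": [
    { "filename": "test.txt", "data": "dGVzdCBvbmUgZmlsZQ==" }
  ]
}

PUT company/employee/doc2?pipeline=attachment
{
  "attachments": [
    { "filename": "test1.txt", "data": "IkhlbGxvIFdvcmxkIg==" }
  ]
}

GET company/employee/_search
{
  "query": {
    "match_phrase": { "attachments.attachment.content": "hello world" }
  }
}
```

With one document per file, the match_phrase query only brings back the document for test1.txt and its attachment metadata, not the whole array.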

@dadoonet Thank you, it worked.
But I have a huge number of documents, which is why I was indexing all the attachments together in a single document.

Q) If I index one document per attachment, it would consume more time. Could you please tell me the best way of indexing multiple documents at a time so that the search query still shows the exact value?

Q) I have a huge number of text files that need to be indexed into Elasticsearch. At present I'm converting each document into Base64; what would be the best way of converting the files into Base64?

Thank you!

Q) If I index one document per attachment, it would consume more time.

Why?

Could you please tell me the best way of indexing multiple documents at a time so that the search query still shows the exact value?

Bulk API and normal search.
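
For example, something like this (a sketch only; the IDs are made up, and the `pipeline` query parameter runs your existing `attachment` pipeline on every document in the request):

```
POST company/employee/_bulk?pipeline=attachment
{ "index": { "_id": "doc1" } }
{ "attachments": [ { "filename": "test.txt", "data": "dGVzdCBvbmUgZmlsZQ==" } ] }
{ "index": { "_id": "doc2" } }
{ "attachments": [ { "filename": "test1.txt", "data": "IkhlbGxvIFdvcmxkIg==" } ] }
{ "index": { "_id": "doc3" } }
{ "attachments": [ { "filename": "test2.txt", "data": "dGVzdDNmaWxlIHF3ZXJ0eQ==" } ] }
```

Each action/source pair creates one document, so a normal match_phrase search on `attachments.attachment.content` only returns the documents whose content actually contains the phrase.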

Q) I have a huge number of text files that need to be indexed into Elasticsearch. At present I'm converting each document into Base64; what would be the best way of converting the files into Base64?

Depends on your programming language I guess.
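
If you happen to be using Python, for instance, a minimal sketch could look like the one below. The cluster URL, directory path, and document IDs are placeholders; it reads each text file, Base64-encodes it, and sends everything to the `_bulk` endpoint with your `attachment` pipeline:

```python
import base64
import json
from pathlib import Path

import requests  # assumption: a plain HTTP client; any Elasticsearch client works too

ES_URL = "http://localhost:9200"          # placeholder: your cluster address
FILES_DIR = Path("/path/to/text/files")   # placeholder: folder containing the .txt files

# Build the bulk body: one action line plus one source line per file,
# so each file becomes its own document with a single attachment.
lines = []
for i, path in enumerate(sorted(FILES_DIR.glob("*.txt"))):
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    lines.append(json.dumps({"index": {"_id": f"doc{i}"}}))
    lines.append(json.dumps({
        "attachments": [
            {"filename": path.name, "data": encoded}
        ]
    }))

body = "\n".join(lines) + "\n"  # the bulk body must end with a newline

# Send everything through the attachment pipeline in a single bulk request.
resp = requests.post(
    f"{ES_URL}/company/employee/_bulk",
    params={"pipeline": "attachment"},
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json())
```

For a really large number of files, split this into batches of a few hundred or a few thousand documents per bulk request rather than sending them all at once.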

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.