Searching content doesn't show exact output

kunam · February 28, 2018, 3:53pm

Hi elastic,

I'm indexing some text files into Elasticsearch. I've installed the ingest attachment plugin, created a mapping, and indexed a few text files into ES. The files I indexed are Base64 encoded.

Now I'm running a search query for the content "Hello World", but the query is not giving the expected output. Could you help me write a search query that returns only the file with that particular content, out of all the files I've indexed?

    PUT _ingest/pipeline/attachment
    {
      "description": "Extract attachment information from arrays",
      "processors": [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "target_field": "_ingest._value.attachment",
                "field": "_ingest._value.data"
              }
            }
          }
        }
      ]
    }

    PUT company/employee/my_id?pipeline=attachment
    {
      "attachments": [
        {
          "filename": "test.txt",
          "data": "dGVzdCBvbmUgZmlsZQ=="
        },
        {
          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg=="
        },
        {
          "filename": "test2.txt",
          "data": "dGVzdDNmaWxlIHF3ZXJ0eQ=="
        },
        {
          "filename": "test3.txt",
          "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA=="
        }
      ]
    }

    GET company/employee/_search
    {
      "query": {
        "match_phrase": {
          "attachments.attachment.content": "hello world"
        }
      },
      "highlight": {
        "fields": {
          "attachments.attachment.content": {}
        }
      }
    }

dadoonet · February 28, 2018, 8:25pm

Please format your code, logs or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

kunam · February 28, 2018, 9:28pm

    PUT _ingest/pipeline/attachment
    {
      "description": "Extract attachment information from arrays",
      "processors": [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "target_field": "_ingest._value.attachment",
                "field": "_ingest._value.data"
              }
            }
          }
        }
      ]
    }

    PUT company/employee/my_id?pipeline=attachment
    {
      "attachments": [
        {
          "filename": "test.txt",
          "data": "dGVzdCBvbmUgZmlsZQ=="
        },
        {
          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg=="
        },
        {
          "filename": "test2.txt",
          "data": "dGVzdDNmaWxlIHF3ZXJ0eQ=="
        },
        {
          "filename": "test3.txt",
          "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA=="
        }
      ]
    }

    GET company/employee/_search
    {
      "query": {
        "match_phrase": { "attachments.attachment.content": "hello world" }
      }
    }

Output:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "my_id",
        "_score": 0.5753642,
        "_source": {
          "attachments": [
            {
              "filename": "test.txt",
              "data": "dGVzdCBvbmUgZmlsZQ==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "et",
                "content": "test one file",
                "content_length": 14
              }
            },
            {
              "filename": "test1.txt",
              "data": "IkhlbGxvIFdvcmxkIg==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "it",
                "content": """"Hello World"""",
                "content_length": 14
              }
            },
            {
              "filename": "test2.txt",
              "data": "dGVzdDNmaWxlIHF3ZXJ0eQ==",
              "attachment": {
                "content_type": "text/plain; charset=ISO-8859-1",
                "language": "et",
                "content": "test3file qwerty",
                "content_length": 17
              }
            },
            {
              "filename": "test3.txt",
              "data": "dGVzdA0KdGVzdDQNCnRlc3Q0NA0KdGVzdDQ0NA==",
              "attachment": {
                "content_type": "text/plain; charset=windows-1252",
                "language": "et",
                "content": """
test
test4
test44
test444
""",
                "content_length": 29
              }
            }
          ]
        }
      }
    ]
  }
}

kunam · February 28, 2018, 9:35pm

It's very tough to paste the code using the </> button, so I've pasted the output as a blockquote.

kunam · February 28, 2018, 9:48pm

I'm trying to query for the particular content "Hello World", and the expected output should give only that file's information and its metadata:

          "filename": "test1.txt",
          "data": "IkhlbGxvIFdvcmxkIg==",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "it",
            "content": """"Hello World"""",
            "content_length": 14

But when I perform this query:

    GET company/employee/_search
    {
      "query": {
        "match_phrase": { "attachments.attachment.content": "hello world" }
      }
    }

I'm seeing the metadata of all the files I've indexed, which I was not expecting.

dadoonet · February 28, 2018, 10:01pm

You indexed one single document for all attachments.

One part of the document matches, which means that the whole document matches.
That's why you are seeing everything.

My advice is to create one document per attachment instead.
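
As a rough illustration of that advice, the sketch below splits the multi-attachment document from this thread into one small document per file. Python is only an assumption here (the thread does not name a client language), and `split_attachments` and `single_attachment_pipeline` are hypothetical names. With one file per document, the `foreach` wrapper should no longer be needed: a plain `attachment` processor reading `data` would put the extracted text under `attachment.content`, which is then the field to search.

```python
import json


def split_attachments(doc):
    """Return one small document per entry in doc["attachments"]."""
    return [
        {"filename": att["filename"], "data": att["data"]}
        for att in doc.get("attachments", [])
    ]


# Hypothetical simplified pipeline body for single-file documents:
# no foreach, just the attachment processor reading the "data" field.
single_attachment_pipeline = {
    "description": "Extract attachment information",
    "processors": [{"attachment": {"field": "data"}}],
}

# Two of the attachments from the original post.
original = {
    "attachments": [
        {"filename": "test.txt", "data": "dGVzdCBvbmUgZmlsZQ=="},
        {"filename": "test1.txt", "data": "IkhlbGxvIFdvcmxkIg=="},
    ]
}

docs = split_attachments(original)
for d in docs:
    print(json.dumps(d))
```

Each resulting document can then be indexed under its own id, so a `match_phrase` on `attachment.content` returns only the files that actually contain the phrase.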

kunam · February 28, 2018, 10:27pm

@dadoonet Thank you, it worked.
But I have a huge number of documents, so I was doing bulk indexing for a single attachment.

Q) If I index each document separately, it would take more time. Could you please tell me the best way of indexing multiple documents at a time, so that the search query still shows an exact value?

Q) I have a huge number of text files that need to be indexed into Elasticsearch. At present I'm converting each document into base64. What would be the best way of converting files into base64?

Thank you!

dadoonet · February 28, 2018, 10:41pm

Q) If I index each document separately, it would take more time.

Why?

Could you please tell me the best way of indexing multiple documents at a time, so that the search query still shows an exact value?

Bulk API and normal search.
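
A minimal sketch of what a Bulk API request body could look like for one-document-per-file indexing, again assuming Python. `build_bulk_body` is a hypothetical helper; the index and type names are taken from this thread (mapping types such as `employee` were removed in later Elasticsearch versions, where `_type` should be dropped). The body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end.

```python
import json


def build_bulk_body(docs, index="company", doc_type="employee"):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself (filename plus base64 data).
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"filename": "test.txt", "data": "dGVzdCBvbmUgZmlsZQ=="},
    {"filename": "test1.txt", "data": "IkhlbGxvIFdvcmxkIg=="},
]
body = build_bulk_body(docs)
print(body, end="")
```

The resulting string would be sent as the body of `POST _bulk?pipeline=attachment` with the `Content-Type: application/x-ndjson` header, so that every document in the batch passes through the ingest pipeline in a single request.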

Q) I have a huge number of text files that need to be indexed into Elasticsearch. At present I'm converting each document into base64. What would be the best way of converting files into base64?

Depends on your programming language I guess.
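
For example, if the language happens to be Python (an assumption, since the thread does not say), the standard library's `base64` module is enough. `encode_bytes` and `encode_file` below are hypothetical helper names, and the temporary file only stands in for a real text file on disk.

```python
import base64
import tempfile
from pathlib import Path


def encode_bytes(raw):
    """Base64-encode raw bytes and return the result as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def encode_file(path):
    """Read a file as bytes and return its base64 representation."""
    return encode_bytes(Path(path).read_bytes())


# Demo: write a small file to a temporary directory and encode it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test1.txt"
    sample.write_bytes(b'"Hello World"')
    encoded = encode_file(sample)

print(encoded)
```

The encoded string is what goes into the `data` field of each document, matching the `test1.txt` value used earlier in the thread.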

system · March 28, 2018, 10:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
