Problem in fetching text from an attachment

Hi,

I am using elasticsearch-1.7.3 and the mapper attachment plugin 2.7.1.
I have successfully added an attachment to Elasticsearch with PHP code, but when I search for text inside the attachment I do not get the desired result: I get the complete attachment back.
The PHP code for adding the attachment is:
$binary = file_get_contents($target_file); // reads the whole file without leaving a handle open
$base = base64_encode($binary);

$article = array();
$article['index'] = 'test';
$article['type'] = 'person';
$article['body'] = array('my_attachment' => $base,'location2' => $location,'skills2' => $skills);
$result = $es->index($article);
where $target_file is the path of a file stored on my server, and location2 and skills2 are other fields in Elasticsearch.

When I search for a word in this file, I use the following PHP code:

$params = array();

$params['index'] = 'test';
$params['type'] = 'person';
$params['body']['query']['match']['my_attachment'] = $q;

$params['body']['highlight']['fields']['my_attachment'] = array("term_vector" => "with_positions_offsets","store" => true);

$query = $es->search($params);
where $q is just some text that occurs in the file.

My file is a simple MS Word file.
Everything works, but when I echo the result it prints the complete document instead of just the text I searched for.
If anybody can solve this, or point me to a link with help on PHP code for the mapper attachment plugin (searching text in a file), that would be great. Please reply as soon as possible.

I don't exactly understand the PHP code here.
Did you define a mapping?

What does it look like?

BTW, read the doc here: https://github.com/elastic/elasticsearch-mapper-attachments/blob/master/README.md#highlighting-attachments

Yes, a mapping is defined.
Could you please tell me how to print the text I searched for? I think the search itself is working, but I don't know how to print the result on screen.
Here is the mapping:
{
  "test": {
    "mappings": {
      "person": {
        "properties": {
          "location2": {
            "type": "string"
          },
          "my_attachment": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "my_attachment": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "skills2": {
            "type": "string"
          }
        }
      }
    }
  }
}

Again: What does it look like?

This is the mapping (the same one I posted above).

What does this produce in JSON?

$params['body']['highlight']['fields']['my_attachment'] = array("term_vector" => "with_positions_offsets","store" => true);

Can you print the full JSON query so it will be easier to understand?
Did you try to run a highlighting query from curl or Sense?
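One way to see the JSON behind the PHP array is to encode the request body yourself. This is only a sketch using plain PHP's json_encode; it assumes the PHP client serialises the body array in essentially the same way, and $q is a placeholder search term:

```php
<?php
// Placeholder search term; in the thread it comes from the search form.
$q = 'address';

$params = array();
$params['index'] = 'test';
$params['type'] = 'person';
$params['body']['query']['match']['my_attachment'] = $q;
$params['body']['highlight']['fields']['my_attachment'] = array(
    'term_vector' => 'with_positions_offsets',
    'store' => true,
);

// Only the 'body' part is sent as the JSON request body.
$json = json_encode($params['body'], JSON_PRETTY_PRINT);
echo $json, "\n";
```

The printed JSON can then be pasted into curl or Sense to run the same request outside PHP.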

$params['body']['highlight']['fields']['my_attachment'] =
array("term_vector" => "with_positions_offsets","store" => true);
This line is basically used for searching. When I delete it and search again, I still get the same result.
When I add a file (attachment), I get a huge string of characters like this:
"source" : "my_attachment":"fdvmevkjvmvvmvvvf................................................."

I did not try to run a highlighting query. I am using Postman.
It does not matter whether I use $params['body']['highlight']['fields']['my_attachment'] = array("term_vector" => "with_positions_offsets","store" => true); or not.
When I run print_r($query);

I get this:

Array
(
    [took] => 56
    [timed_out] =>
    [_shards] => Array
        (
            [total] => 5
            [successful] => 5
            [failed] => 0
        )

    [hits] => Array
        (
            [total] => 2
            [max_score] => 0.027122213
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_index] => test
                            [_type] => person
                            [_id] => AVCdYEvW0tzmItCH2MC9
                            [_score] => 0.027122213
                            [_source] => Array
                                (
                                    [my_attachment] => dsfgbdjherijfgvejkvefjbedvbejve vjin......................................
                                    [location2] => sdfdsgdfgdfg
                                    [skills2] => dfdsgdsgvdsgv
                                )

                        )

                )

        )

)

Can you tell me how to use a highlight query in PHP code? I think the search works, but I do not know how to print the result on screen.
Should I do echo $some_variable['_source']['my_attachment']; or something else?

IMO this is incorrect. Look at the doc. When we query we don't add term_vector...

Also, with Postman, use POST instead of GET.
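In PHP, the highlight part might look roughly like this: term_vector and store stay in the mapping, and the query only names the field to highlight. An empty PHP array would be encoded as [], so an empty object is used to get {}. The helper name buildHighlightSearch is made up for this sketch and is not part of the PHP client:

```php
<?php
/**
 * Build search parameters that highlight matches in one field.
 * Hypothetical helper, not part of the Elasticsearch PHP client.
 */
function buildHighlightSearch($index, $type, $field, $text)
{
    return array(
        'index' => $index,
        'type'  => $type,
        'body'  => array(
            'query' => array(
                'match' => array($field => $text),
            ),
            'highlight' => array(
                'fields' => array(
                    // stdClass encodes to {} (an empty array would become []).
                    $field => new \stdClass(),
                ),
            ),
        ),
    );
}

$params = buildHighlightSearch('test', 'person', 'my_attachment', 'address');
echo json_encode($params['body']), "\n";
// With a configured client: $query = $es->search($params);
```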

OK, so what query should I use in place of that?
What I am saying is this: there is a word "address" in my file. When I search for something else, no result is shown (print_r($query) is empty), which is correct because that word is not in the file.
When I search for "address", the file is found, which is also correct, but I do not know how to get only the matching word from the file. Please help me with that.

Anything unclear in doc?

GET /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file": {
      }
    }
  }
}
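In PHP the response comes back as a nested array, so the fragments could be read from each hit's highlight key rather than from _source. A sketch, with a hand-written sample response standing in for the result of $es->search(...); the helper name and the sample data are invented for illustration:

```php
<?php
/**
 * Collect highlight fragments for one field from a search response array.
 * Hypothetical helper; returns a map of document id => list of fragments.
 */
function collectHighlights(array $response, $field)
{
    $out = array();
    $hits = isset($response['hits']['hits']) ? $response['hits']['hits'] : array();
    foreach ($hits as $hit) {
        if (isset($hit['highlight'][$field])) {
            $out[$hit['_id']] = $hit['highlight'][$field];
        }
    }
    return $out;
}

// Sample data shaped like a search response (made up for this sketch).
$response = array(
    'hits' => array(
        'total' => 1,
        'hits' => array(
            array(
                '_id' => '1',
                'highlight' => array(
                    'file' => array('"God Save the <em>Queen</em>"'),
                ),
            ),
        ),
    ),
);

foreach (collectHighlights($response, 'file') as $id => $fragments) {
    echo $id, ': ', implode(' ... ', $fragments), "\n";
}
```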

OK, in this example "king queen" is searched and we get this as the result:
"highlight": {
  "file": [
    "\"God Save the Queen\" (alternatively \"God Save the King\"\n"
  ]
}
What I am getting here is my full document instead of only the desired word.
$q="address";
$params['body']['query']['match']['my_attachment'] = $q;
When I print_r($params); I get this:
Array
(
    [index] => test
    [type] => person
    [body] => Array
        (
            [query] => Array
                (
                    [match] => Array
                        (
                            [my_attachment] => address
                        )

                )

        )

)
which is great, so the next step is the search, i.e.
$query = $es->search($params);
So it should give me the desired word from that file, but I get the full document:
"highlight": {
  "my_attachment": [
    "...... my full document ......."
  ]
}

Why do you want to get back a word instead of the document itself?

Basically, I am working on a recruitment framework where users upload their resumes.
So when I search for skills (PHP, Java), I want to search within those uploaded files and get back the matching users along with their skills, addresses, etc.

So the result you want to get is a resume, not really a single "word", right?

Then, you don't really need highlight here.

You have two options IMO:

  • get the result back from elasticsearch (use a simple search), extract the Base64 content from the _source field, decode it on the client and open the file in your browser.
  • ask for field file in the above example:
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "file": {
            "type": "string",
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

Something like:

GET /test/person/_search
{
  "fields": [ "file" ],
  "query": {
    "match": {
      "file": "king queen"
    }
  }
}

If it does not work, try with field file.file.
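Both options could be read in PHP roughly like this. The sketch assumes the Elasticsearch 1.x response shape, where requested stored fields appear under each hit's fields key as arrays of values, and where _source holds the Base64 string that was indexed. The helper names and the sample hit are invented for illustration:

```php
<?php
// Option 1: decode the original file bytes from _source.
function decodeAttachment(array $hit, $field)
{
    if (!isset($hit['_source'][$field])) {
        return null;
    }
    $bytes = base64_decode($hit['_source'][$field], true);
    return $bytes === false ? null : $bytes;
}

// Option 2: read the extracted text from a stored field.
function storedFieldText(array $hit, $field)
{
    if (!isset($hit['fields'][$field])) {
        return null;
    }
    $value = $hit['fields'][$field];
    return is_array($value) ? implode("\n", $value) : $value;
}

// Made-up hit carrying both keys so that both helpers can be shown;
// a real hit usually has only the one the request asked for.
$hit = array(
    '_source' => array('file' => base64_encode('raw file bytes')),
    'fields'  => array('file' => array('extracted text')),
);
echo decodeAttachment($hit, 'file'), "\n";
echo storedFieldText($hit, 'file'), "\n";
```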

Basically, what I want is this: say there are 1000 resumes in Elasticsearch and I want candidates with experience in Java. My search form has several fields, such as a skills field, an experience field, etc. So I will put "java" in the skills search field and, say, "1 year" in the experience search field, and based on that I should get all the information about the matching candidates' experience and skills.

When you say "fields" you mean fields inside JSON document (structured data) or inside the PDF (unstructured)?

Try adding explain: true to the query. Maybe that could help...

By "fields" I mean the fields in the search form.

So I don't understand. If you have structured data and unstructured data, what is the problem then?

Maybe it would be easier to understand if you sent a full example of your JSON document?