Problem in fetching text from an attachment


(Tarneet Singh) #1

Hi,

I am using elasticsearch-1.7.3 and the mapper attachments plugin 2.7.1.
I have successfully added an attachment to Elasticsearch from PHP, but when I search for text inside the attachment I do not get the desired result: I get the complete attachment back.
The PHP code for adding the attachment is:
$binary = fread(fopen($target_file,"r"), filesize($target_file));
$base = base64_encode($binary);

$article = array();
$article['index'] = 'test';
$article['type'] = 'person';
$article['body'] = array('my_attachment' => $base,'location2' => $location,'skills2' => $skills);
$result = $es->index($article);
where $target_file is the location of the file stored in my server folders, and location2 and skills2 are other fields in Elasticsearch.

When I search for a word in this file, I use the following PHP code:

$params = array();

$params['index'] = 'test';
$params['type'] = 'person';
$params['body']['query']['match']['my_attachment'] = $q;

$params['body']['highlight']['fields']['my_attachment'] = array("term_vector" => "with_positions_offsets","store" => true);

$query = $es->search($params);
where $q is just some text (which is present in the file).

My file is a simple MS Word file.
Everything works, but when I echo the result it prints the complete document instead of the text I searched for.
If anybody can solve this, or give me a link to help on the PHP code for the mapper attachments plugin (searching text in a file), that would be great. Please reply ASAP.


(David Pilato) #2

I don't exactly understand the php code here.
Did you define a mapping?

What does it look like?

BTW, read the doc here: https://github.com/elastic/elasticsearch-mapper-attachments/blob/master/README.md#highlighting-attachments


(Tarneet Singh) #3

Yes, the mapping is defined.
Could you please tell me how to print the text I searched for? I think the search is working, but I don't know how to print my result on screen.
Here is the mapping:
{
  "test": {
    "mappings": {
      "person": {
        "properties": {
          "location2": {
            "type": "string"
          },
          "my_attachment": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "my_attachment": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "skills2": {
            "type": "string"
          }
        }
      }
    }
  }
}


(David Pilato) #4

Again: What does it look like?


(Tarneet Singh) #5

This is the mapping:
{
  "test": {
    "mappings": {
      "person": {
        "properties": {
          "location2": {
            "type": "string"
          },
          "my_attachment": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "my_attachment": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "skills2": {
            "type": "string"
          }
        }
      }
    }
  }
}


(David Pilato) #6

What does this produce in JSON?

$params['body']['highlight']['fields']['my_attachment'] = array("term_vector" => "with_positions_offsets","store" => true);

Can you print the full JSON query so it will be easier to understand?
Did you try to run a highlighting query from curl or Sense?


(Tarneet Singh) #7

$params['body']['highlight']['fields']['my_attachment'] =
array("term_vector" => "with_positions_offsets","store" => true);
This is basically used for searching. When I delete it and search again, it still gives me the same result.
When I add a file (attachment), I get a huge string of characters like this:
"source" : "my_attachment":"fdvmevkjvmvvmvvvf................................................."

I did not try to run a highlighting query. I am using Postman.
It does not matter whether I use $params['body']['highlight']['fields']['my_attachment'] =
array("term_vector" => "with_positions_offsets","store" => true); or not.
When I do print_r($query);

I get this:

Array
(
    [took] => 56
    [timed_out] =>
    [_shards] => Array
        (
            [total] => 5
            [successful] => 5
            [failed] => 0
        )

    [hits] => Array
        (
            [total] => 2
            [max_score] => 0.027122213
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_index] => test
                            [_type] => person
                            [_id] => AVCdYEvW0tzmItCH2MC9
                            [_score] => 0.027122213
                            [_source] => Array
                                (
                                    [my_attachment] => dsfgbdjherijfgvejkvefjbedvbejve vjin......................................
                                    [location2] => sdfdsgdfgdfg
                                    [skills2] => dfdsgdsgvdsgv
                                )

                        )

                )

        )

)

Can you tell me how to use a highlight query in PHP code? I think the search works, but I do not know how to print the result on screen.
Should I do echo $some_variable['_source']['my_attachment']; or something else?


(David Pilato) #8

IMO this is incorrect. Look at the doc. When we query we don't add term_vector...


(David Pilato) #9

Also with POSTMAN, use POST instead of GET.


(Tarneet Singh) #10

OK, so what query should I use in place of that?
What I am saying is: my file contains the word "address". When I search for something else, no result is shown (print_r($query) is empty),
which is excellent because that word is not in the file. When I search for "address", it is found in the file, which is correct, but I do not know how to get only the desired word from the file. Please help me with that.


(David Pilato) #11

Anything unclear in doc?

GET /test/person/_search
{
  "fields": [],
  "query": {
    "match": {
      "file": "king queen"
    }
  },
  "highlight": {
    "fields": {
      "file": {
      }
    }
  }
}
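
For anyone driving this from PHP rather than curl, the request above could be built roughly as follows. This is only a sketch: buildHighlightSearch is a hypothetical helper name, and the index, type and field names are the ones from the earlier posts. One PHP detail to watch is that an empty array encodes to [], so the empty highlight options object needs new \stdClass() to come out as {}.

```php
<?php
// Hypothetical helper (not part of any client library): builds the
// parameters for a match query that also asks for highlighting.
function buildHighlightSearch($index, $type, $field, $text)
{
    return array(
        'index' => $index,
        'type'  => $type,
        'body'  => array(
            // Empty list: do not return stored fields, only the highlights.
            'fields'    => array(),
            'query'     => array('match' => array($field => $text)),
            // stdClass is encoded as {} (an empty PHP array would become []).
            'highlight' => array('fields' => array($field => new \stdClass())),
        ),
    );
}

$params = buildHighlightSearch('test', 'person', 'my_attachment', 'address');

// Print the JSON body to compare it with the query shown above.
echo json_encode($params['body']), "\n";
```

The resulting $params array has the same shape as the one passed to $es->search($params) earlier in the thread, with no term_vector or store options in the query itself.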


(Tarneet Singh) #12

OK, in this example "king queen" is searched and we get this as the result:
"highlight": {
  "file": [
    "\"God Save the Queen\" (alternatively \"God Save the King\"\n"
  ]
}
What I am getting here is my full document instead of only the desired word.
$q="address";
$params['body']['query']['match']['my_attachment'] = $q;
When I print_r($params); I get this:
Array
(
    [index] => test
    [type] => person
    [body] => Array
        (
            [query] => Array
                (
                    [match] => Array
                        (
                            [my_attachment] => address
                        )

                )

        )

)
which is great. So the next step is the search, i.e.
$query = $es->search($params);
This should give me the desired word from that file, but I get the full document:
"highlight": {
  "my_attachment": [
    "...... my full document ......."
  ]
}
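
On the question of how to print the result: in a search response each hit carries its own highlight entry next to _source, so the fragments can be read from there instead of echoing _source. A minimal sketch, assuming the response is a plain PHP array as in the print_r output earlier; collectHighlights is a hypothetical helper and the sample response below is made up for illustration.

```php
<?php
// Hypothetical helper: returns the highlight fragments for one field,
// keyed by document id. Hits without a highlight entry are skipped.
function collectHighlights(array $response, $field)
{
    $out  = array();
    $hits = isset($response['hits']['hits']) ? $response['hits']['hits'] : array();
    foreach ($hits as $hit) {
        if (isset($hit['highlight'][$field])) {
            $out[$hit['_id']] = $hit['highlight'][$field];
        }
    }
    return $out;
}

// Made-up response with the same nesting as a real search result.
$response = array('hits' => array('hits' => array(
    array('_id' => 'doc1', 'highlight' => array(
        'my_attachment' => array('my <em>address</em> is'),
    )),
    array('_id' => 'doc2'), // matched, but no highlight returned
)));

foreach (collectHighlights($response, 'my_attachment') as $id => $fragments) {
    echo $id, ': ', implode(' ... ', $fragments), "\n";
}
```

If a fragment still contains the whole document, the problem is more likely in the mapping or the highlight request than in this printing step.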


(David Pilato) #13

Why do you want to get back a word instead of the document itself?


(Tarneet Singh) #14

Basically, I am working on a recruitment framework where users upload their resumes.
So when I search for skills (php, java), I want to search within those uploaded files and from that get the matching users, their skills, their addresses, etc.


(David Pilato) #15

So the result you want to get is a resume, not really a single "word", right?

Then, you don't really need highlight here.

You have two options IMO:

  • get the result back from elasticsearch (use a simple search) and extract from the _source field the BASE64 content, decode it on the client and open the file in your browser.
  • ask for field file in the above example:
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "file": {
            "type": "string",
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

Something like:

GET /test/person/_search
{
  "fields": [ "file" ],
  "query": {
    "match": {
      "file": "king queen"
    }
  }
}

If it does not work, try with field file.file.
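
Both options can be sketched on the PHP side. This is an illustration only: attachmentBytes and storedField are hypothetical helper names, the two sample hits are made up, and it assumes the response is a plain array in which stored fields requested via "fields" come back as lists of values.

```php
<?php
// Option 1 (hypothetical helper): decode the BASE64 attachment kept in _source.
// Returns null when the field is missing or is not valid BASE64.
function attachmentBytes(array $hit, $field)
{
    if (!isset($hit['_source'][$field])) {
        return null;
    }
    $bytes = base64_decode($hit['_source'][$field], true);
    return $bytes === false ? null : $bytes;
}

// Option 2 (hypothetical helper): read a stored field requested via "fields".
// Returns an empty array when the field was not returned.
function storedField(array $hit, $field)
{
    return isset($hit['fields'][$field]) ? $hit['fields'][$field] : array();
}

// Made-up hits, one per option.
$sourceHit = array('_source' => array('my_attachment' => base64_encode('original file bytes')));
$fieldsHit = array('fields'  => array('my_attachment' => array('extracted resume text')));

echo attachmentBytes($sourceHit, 'my_attachment'), "\n";
echo implode("\n", storedField($fieldsHit, 'my_attachment')), "\n";
```

The first helper gives back the original file (for example to send to the browser); the second gives the text extracted from it, provided the field is stored in the mapping.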


(Tarneet Singh) #16

Basically, what I want is this: say there are 1000 resumes in Elasticsearch and I want candidates with experience in Java. I have many fields, such as a skills field, an experience field, etc. So I will put java in the skills search field and, say, experience = 1 year in the experience search field, and based on that I should get all the information about the candidates' experience and skills.
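
One possible way to express that kind of combined search is a bool query with one match clause per form field. This is a sketch under assumptions, not something confirmed in the thread: buildResumeSearch is a hypothetical helper, and it uses only the skills2 and my_attachment fields from the mapping above, since no experience field is defined there.

```php
<?php
// Hypothetical helper: every clause in "must" has to match, so a resume is
// returned only if the skills field AND the attachment text both match.
function buildResumeSearch($skill, $attachmentText)
{
    return array(
        'index' => 'test',
        'type'  => 'person',
        'body'  => array(
            'query' => array(
                'bool' => array(
                    'must' => array(
                        array('match' => array('skills2' => $skill)),
                        array('match' => array('my_attachment' => $attachmentText)),
                    ),
                ),
            ),
        ),
    );
}

$params = buildResumeSearch('java', '1 year');

// Print the JSON body that would be sent with the search request.
echo json_encode($params['body']), "\n";
```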


(David Pilato) #17

When you say "fields", do you mean fields inside the JSON document (structured data) or inside the PDF (unstructured)?

Try adding explain: true to the query. That could help, maybe...


(Tarneet Singh) #18

By fields I mean the fields in the form.


(David Pilato) #19

So I don't understand. If you have structured data and unstructured data, what is the problem then?

Maybe it would be easier to understand if you sent a full example of your JSON document?

