Attachment (PDF/DOC) Indexing and Searching on Elasticsearch in PHP


(Selim Reza) #1

I have installed Elasticsearch on my local PC, along with mapper-attachments and Kibana.

I can index string data into an index and search it. But when I try to search text from a PDF or DOCX file in a folder, it does not work: it returns null or an empty array().

I searched on Google and read the Elastic documentation, but I still have not found a way to resolve it. Would you guide me? Here is my code:

$client = \Elasticsearch\ClientBuilder::create()->build();

$doc_src = public_path()."/uploads/files/bower.pdf";
// Read the whole file in one call; this avoids leaving an open file handle behind
$doc_str = base64_encode(file_get_contents($doc_src));

$data_string =  'Welcome to Dhaka! Anwar is a gentleman!';


//index data and assign by index "my_index"
$params = [
    'index' => 'my_index',
    'type' => 'attachment',
    'id' => 'my_id',
    'body' => [
        'testField' => $data_string,
        'fileName' => $doc_src,
        'file' => $doc_str
    ]

];

$response = $client->index($params);
#print_r($response);

//search data from the index "my_index"
$params = [
    'index' => 'my_index',
    'type' => 'attachment',
    'body' => [
        'query' => [
            'match' => [
                #'testField' => 'dhaka',
                'file' => 'dhaka'
            ]
        ],
    ]
];

$response = $client->search($params);
print_r($response['hits']['hits']);

Bottom line:

1. When I search on `testField` (the `data_string` value), it works nicely.
2. When I search on `file` (the PDF content in `doc_str`), it does not work.

I think I missed something big. Any help is highly appreciated!


(David Pilato) #2

What is your mapping?

BTW prefer using ingest-attachment instead.


(Selim Reza) #3

I did not understand the mapping section of the documentation. Would you please send an example for PHP?

And would you please send an example for ingest-attachment with mapping and searching?


(David Pilato) #4

If you did not define a mapping, that explains why it does not work.
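
For readers following along, here is a minimal sketch of what defining such a mapping might look like with the PHP client. It assumes the (now deprecated) mapper-attachments plugin and reuses the index, type, and field names from post #1. The helper name attachment_mapping_params is hypothetical, and the client call is left commented out because it needs a running cluster.

```php
<?php
// Hypothetical helper: builds a create-index request that maps one field
// with the "attachment" type provided by the mapper-attachments plugin.
function attachment_mapping_params(string $index, string $type, string $field): array
{
    return [
        'index' => $index,
        'body'  => [
            'mappings' => [
                $type => [
                    'properties' => [
                        // The plugin extracts text from base64 content in this field
                        $field => ['type' => 'attachment'],
                    ],
                ],
            ],
        ],
    ];
}

// Names taken from post #1: index "my_index", type "attachment", field "file"
$params = attachment_mapping_params('my_index', 'attachment', 'file');

// With a running cluster, this would be sent once, before indexing documents:
// $client->indices()->create($params);

echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;
```

With a mapping like this in place, the extracted text would normally be queried through a sub-field such as file.content rather than file itself; treat that as an assumption to verify against the plugin documentation.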

Read:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments-helloworld.html

For ingest:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html


(Selim Reza) #5

Thanks again. I have now tried the ingest attachment processor in PHP.

Here is the pipeline (mapping) code:

public function ingest_processor_mapping()
{
    $client = \Elasticsearch\ClientBuilder::create()->build();
    $params = [
        'id' => 'attachment',
        'body' => [
            'description' => 'Extract attachment information',
            'processors' => [
                [
                    'attachment' => [
                        'field' => 'content',
                        'indexed_chars' => -1
                    ]
                ]
            ]
        ],
        
    ];
    return $client->ingest()->putPipeline($params);
}  

Result ::

{
 "acknowledged": true
}

Here is my Indexing Code in PHP

public function ingest_processor_indexing()
{
    $client = \Elasticsearch\ClientBuilder::create()->build();
    $fullfile = public_path().'/uploads/files/bower.pdf';
    $params = [
        'index' => 'ingest_index',
        'type'  => 'attachment',
        'id'    => 'document_id',
        'pipeline' => 'attachment',  // <----- here
        'body'  => [
            'content' => base64_encode(file_get_contents($fullfile)),
            'file_path' =>$fullfile,
        ]
    ];
    return $client->index($params);
}

Result ::

{
  "_index": "ingest_index",
  "_type": "attachment",
  "_id": "document_id",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": false
}

Here is my searching CODE in PHP ::

public function ingest_processor_searching()
{
    $client = \Elasticsearch\ClientBuilder::create()->build();
    $params = [
        'index' => 'ingest_index',
        'type' => 'attachment',
        'body' => [
            'query' => [
                'match' => [
                    'content' => 'dhaka'
                ]
            ],
        ]
    ];

    $response = $client->search($params);
    print_r($response['hits']['hits']);
}

Result ::

Array( )

Apologies, I must have messed something up. Would you please help me, so that everyone can benefit from the answer?


(David Pilato) #6

Can you do:

GET ingest_index/attachment/document_id

(Selim Reza) #7

Yeah Sure ..

I hit the URL ( GET )

http://localhost:9200/ingest_index/attachment/document_id

Result ::

{
  "_index": "ingest_index",
  "_type": "attachment",
  "_id": "document_id",
  "_version": 14,
  "found": true,
  "_source": {
    "file_path": "/Users/selimreza/Sites/haber_dev/public/uploads/files/bower.txt",
    "attachment": {
      "content_type": "text/plain; charset=ISO-8859-1",
      "language": "en",
      "content": "Welcome to Dhaka",
      "content_length": 18
    },
    "content": "V2VsY29tZSB0byBEaGFrYQo="
  }
}

Now I need to search. How do I search with a query, rather than fetching a single document by URL (ingest_index/attachment/document_id)?


(Selim Reza) #8

My search method is not working.

Here is my code in PHP:

public function ingest_processor_searching($query)
{
    $client = $this->client;
    $params = [
        'index' => 'ingest_index',
        'type' => 'attachment',
        'body' => [
            'query' => [
                'match' => [
                    'textField' => $query,
                ]
            ],
        ],
    ];

    $response = $client->search($params);
    return $response;
}

Result ::

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

But the document is there when I GET http://localhost:9200/ingest_index/attachment/2

{
  "_index": "ingest_index",
  "_type": "attachment",
  "_id": "2",
  "_version": 1,
  "found": true,
  "_source": {
    "file_path": "/Users/selimreza/Sites/haber_dev/public/uploads/files/bower.txt",
    "attachment": {
      "content_type": "text/plain; charset=ISO-8859-1",
      "language": "en",
      "content": "Welcome to Dhaka",
      "content_length": 18
    },
    "textField": "V2VsY29tZSB0byBEaGFrYQo="
  }
}

What mistake did I make? Would you please help me out?


(David Pilato) #9

Search in the attachment.content field.
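
To spell that out for later readers: the original field (content or textField in the posts above) only holds the raw base64 string, while the text extracted by the pipeline lands in attachment.content. A minimal sketch of the adjusted search parameters, assuming the index and type names used earlier in the thread; the helper name attachment_search_params is hypothetical and the client call is commented out because it needs a running cluster.

```php
<?php
// Hypothetical helper: builds search params that target the text extracted
// by the ingest attachment processor instead of the base64 source field.
function attachment_search_params(string $query): array
{
    return [
        'index' => 'ingest_index',
        'type'  => 'attachment',
        'body'  => [
            'query' => [
                'match' => [
                    'attachment.content' => $query,
                ],
            ],
        ],
    ];
}

// The source field from post #8 is just base64: decoding it gives back the
// file contents, which is why a match query for "dhaka" on it finds nothing.
$raw = base64_decode('V2VsY29tZSB0byBEaGFrYQo=');
echo $raw;

$params = attachment_search_params('dhaka');

// With a running cluster:
// $response = $client->search($params);
// print_r($response['hits']['hits']);
```

If the base64 copy is not needed after extraction, it could be dropped from the stored document, for example by adding a remove processor to the pipeline; that is a suggestion beyond what was tested in this thread.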


(Selim Reza) #10

Thank you so much. It works!

I am going to upload my complete PHP code so that anyone can test Elasticsearch within a minute.

