Elasticsearch 2.4.3: issue with .docx files when indexing documents


(Amar Srivastava) #1

Thanks in advance

I am new to Elasticsearch and am using Elasticsearch 2.4.3. Indexing .doc, .pdf, .txt and .xls documents works fine, but indexing a .docx file causes a problem.

Below is the response I am getting. The call is reported as successful, but the Response section contains "created" : false.

Successful low level call on PUT: /ilemsdocuments/indexeddocument/F%3A%5CIndexTestSamples%5CIndexTestSamples%5CBaconNarrative.docx?pretty=true

Audit trail of this API call:

[1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1688474

Request:

{
"activityNumber": "2016-001390",
"activityOccurrenceDate": "2016-03-09T00:00:00",
"activityPrepDate": "2016-01-26T00:00:00",
"activityReportDate": "2016-01-08T00:00:00",
"author": "Bard Laabs",
"category": "Interviews",
"dataOwnerDatabase": "main",
"dataOwnerDisk": "main",
"databaseLastUpdate": "2016-03-28T00:00:00",
"documentEntityId": "F:\IndexTestSamples\IndexTestSamples\BaconNarrative.docx",
"file": {
  "_content": "-- Remove the actual content as its size is big",
  "_indexed_chars": -1
},
"fileExtension": "docx",
"fileFullName": "F:\IndexTestSamples\IndexTestSamples\BaconNarrative.docx",
"indexedFileInfo": {
  "creationTime": "2016-09-09T12:50:32+05:30",
  "fullName": "F:\IndexTestSamples\IndexTestSamples\BaconNarrative.docx",
  "lastIndexTime": "2017-01-03T11:55:20.6993801+05:30",
  "lastWriteTime": "2017-01-03T11:52:18.1251173+05:30",
  "length": 14479
},
"fileLastUpdate": "2017-01-03T11:52:18.1251173+05:30",
"fileName": "BaconNarrative.docx",
"indexDate": "2017-01-03T11:55:20.6993801+05:30",
"indexSource": "IndexingAgent.FileWatcher",
"indexedPath": "F:\IndexTestSamples\IndexTestSamples",
"isLocked": false,
"isNarrative": false,
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"title": "My Document: 2017-01-03T11:55:22.3480831+05:30",
"uploadedBy": "testuser"
}

Response:

{
"_index" : "ilemsdocuments",
"_type" : "indexeddocument",
"_id" : "F:\IndexTestSamples\IndexTestSamples\BaconNarrative.docx",
"_version" : 3,
"_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
},
"created" : false
}

Please help if anyone has faced this issue.


(Mark Harwood) #2

"created":false is telling you that the doc with that ID already existed in the elasticsearch index so it was not created but was effectively replaced with the newly supplied version of the doc.
As can be seen in the response this was version 3 of this doc.
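To make the distinction concrete, here is a minimal Python sketch that reads an index response like the one posted above and reports whether the document was newly created or overwritten. The function name `describe_index_result` is hypothetical; the field names (`created`, `_version`) are the ones shown in the response in the first post.

```python
import json

def describe_index_result(response_json: str) -> str:
    """Summarise an index response: new document or overwrite of an existing one."""
    body = json.loads(response_json)
    if body["created"]:
        return f"created new document (version {body['_version']})"
    # created == false: a document with this _id already existed and was replaced
    return f"replaced existing document (now version {body['_version']})"

# Fields copied from the response in the first post (_id and _shards left out for brevity)
sample = '{"_index": "ilemsdocuments", "_type": "indexeddocument", "_version": 3, "created": false}'
print(describe_index_result(sample))
```

A `_version` greater than 1 together with `created: false` is a hint that an earlier copy of the document was still in the index when the request arrived.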


(Amar Srivastava) #3

Thank you, Mark, for your response.

Let me explain the requirement in detail.

First I create an index and then index the documents, which works fine the first time for all extensions, including .docx. After that I delete the indexed document from the index via the DeleteByQuery plugin, which reports that it was deleted successfully.
When I then try to index the same document again, it returns "created": false, but only for .docx.

As you mentioned, "created": false is returned when the document already exists. That means the problem could be either the DeleteByQuery plugin failing to delete the indexed .docx document, or the way I have implemented the delete.

Below is the code I use to delete an indexed document.

var response = this._client.DeleteByQuery("MyIndex", typeof(IndexedDocument), d => d.Query(q => q.QueryString(qs => qs.Query(@"F:\TestFile.docx"))));

Please suggest if I am doing anything wrong.

Thank you


(Mark Harwood) #4

QueryString has a syntax of its own. The query string you passed is saying: find the document with the field called "F" and the value "\TestFile.docx". Colons have meaning in that language [1].
You need to use a query expression, e.g. a term query [2], that does not treat special characters as search operators.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.1/query-dsl-query-string-query.html#_field_names
[2] https://www.elastic.co/guide/en/elasticsearch/reference/5.1/query-dsl-term-query.html
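As a rough illustration of the colon rule, the Python sketch below splits a query at the first unescaped colon, the way a `field:value` clause is read. It is a deliberately simplified stand-in, not the real query_string parser, and the function name `naive_field_split` is hypothetical.

```python
def naive_field_split(query: str):
    """Split 'field:value' at the first colon that is not escaped with a backslash.

    Deliberately simplified: the real query_string parser handles far more
    (operators, quoting, wildcards); this only illustrates the colon rule.
    """
    i = 0
    while i < len(query):
        if query[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if query[i] == ":":
            return query[:i], query[i + 1:]
        i += 1
    return None, query  # no field prefix: the whole string is search text

field, value = naive_field_split(r"F:\TestFile.docx")
print("field:", field)
print("value:", value)
```

With the path from the delete call, the part before the colon ends up as a field name rather than as part of the file path, which is why the query does not target the document you expect.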


(Amar Srivastava) #5

@Mark_Harwood I have tried the same for the other extensions and it works fine, i.e. F:\TestFile.docx, F:\PdfTestFile.pdf, F:\XlsxTestFile.xlsx, F:\TestFiletext.txt, etc.

It only causes a problem for the .docx files.

thanks!


(Mark Harwood) #6

That doesn't change my recommendation. If you're deleting things using query_string queries and you're passing those unescaped strings then you are almost certainly not deleting what you think you are deleting.
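For reference, this is roughly what escaping would involve if you kept using query_string. The sketch is in Python and the helper name `escape_query_string` is hypothetical; the character set is an assumption based on the reserved-character list in the query_string documentation, so check it against the docs for your version. Escaping only stops these characters from being parsed as operators. It does not by itself turn the query into an exact match, because the text still goes through analysis.

```python
# Characters treated as reserved in query_string syntax (assumed set, see lead-in)
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')

def escape_query_string(text: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

print(escape_query_string(r"F:\TestFile.docx"))
```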


(Amar Srivastava) #7

Then what else can we pass to QueryString, apart from the file path that I am currently passing to it?


(Mark Harwood) #8

QueryString is a language designed for power users to write complex query strings, e.g. patent searches involving boolean logic, fields, proximity operators, fuzzy operators, grouping, range queries, etc.

You want an exact match on a filename. This is a very different use case so you need to use a different form of exact-matching query expression with no fancy operators.

That is the fundamental difference between the term query and the query_string query and the docs I linked to explain this in more detail.
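To show the difference in the request itself, here is a small Python sketch that builds the two query bodies side by side. The helper names are hypothetical, and using `documentEntityId` (a field from the request in the first post) as the target is an assumption: a term query only gives an exact match if that field is mapped as not_analyzed.

```python
import json

def term_query_body(field: str, value: str) -> dict:
    """Exact-match query: the value is compared as-is, no operators are parsed."""
    return {"query": {"term": {field: value}}}

def query_string_body(text: str) -> dict:
    """query_string query: the text is parsed, so ':' and '\\' act as syntax."""
    return {"query": {"query_string": {"query": text}}}

path = r"F:\TestFile.docx"
print(json.dumps(term_query_body("documentEntityId", path), indent=2))
print(json.dumps(query_string_body(path), indent=2))
```

The same file path goes into both bodies. Only the term form treats it as one literal value, which is what a delete by file name needs.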

