Hello,
(Questions are at the bottom of this post.)
I'm using:
- ES 0.90.3
- Mapper-attachments plugin 1.8.0
- FSRiver plugin 0.3.0
I create an index:
curl -XPUT 'localhost:9200/documents' -d '{
"document" : {
"properties" : {
"file" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"author" : {
"type" : "string"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string",
"store" : "yes"
}
}
},
"name" : {
"type" : "string",
"analyzer" : "keyword"
},
"pathEncoded" : {
"type" : "string",
"analyzer" : "keyword"
},
"postDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"rootpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"virtualpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"filesize" : {
"type" : "long"
}
}
}
}'
I create a river:
curl -XPUT 'localhost:9200/_river/documentsriver/_meta' -d '{
"type": "fs",
"fs": {
"url": "c:\tmp",
"update_rate": 10000,
"includes": [ ".docx" , ".xlsx", ".pdf", ".pptx" ]
},
"index": {
"index": "documents",
"type": "document",
"bulk_size": 50
}
}'
- I'm using Windows, so the url parameter is "c:\tmp".
- The update rate is set rather high (every 10 seconds) for testing purposes.
- The index used is "documents", as created above.
- The mapping type used in the index is "document", as created above.
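One thing I'm not sure about: JSON treats the backslash as an escape character, so I wonder whether "c:\tmp" in the body actually arrives at the river as the literal path. A quick check in Python (just illustrating standard JSON decoding; the paths are my example values, not anything from the plugin):

```python
import json

# Inside a JSON string, the two characters \t decode to a single TAB,
# so the river may never see the literal path c:\tmp.
decoded = json.loads('"c:\\tmp"')   # raw JSON text sent by curl: "c:\tmp"
print(repr(decoded))                # 'c:\tmp' in Python repr, i.e. c:<TAB>mp
assert '\t' in decoded              # the backslash-t became a TAB character

# Escaping the backslash (or using forward slashes) keeps the path intact:
print(json.loads('"c:\\\\tmp"'))    # raw JSON "c:\\tmp" -> literal c:\tmp
print(json.loads('"c:/tmp"'))       # c:/tmp
```

The initial run did index 13 entries, so the path apparently resolves somehow, but explicitly escaping it ("c:\\tmp" or "c:/tmp" in the river definition) would at least rule this out.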
The c:\tmp folder contains 12 files (max size 4000 KB, most of them around
100 KB) that match the include pattern. When the river is added (after
restarting ES beforehand, to be sure all plugins are recognized and loaded),
my documents index is filled with 13 entries (1 folder, 12 files). So far so
good.
However, if I copy some more files matching the pattern into the folder, they
are not indexed. After waiting ten minutes, in which the river should have
tried 60 times, there is no change in the documents index. That there have
been 60 tries can be verified with a simple REST call:
curl -XGET 'localhost:9200/_river/_search' -d '{}'
With the result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "_river",
        "_type": "documentsriver",
        "_id": "_meta",
        "_score": 1,
        "_source": {
          "type": "fs",
          "fs": {
            "url": "c:\tmp",
            "update_rate": 10000,
            "includes": [ ".docx", ".xlsx", ".pdf", ".pptx" ]
          },
          "index": {
            "index": "documents",
            "type": "document",
            "bulk_size": 50
          }
        }
      },
      {
        "_index": "_river",
        "_type": "documentsriver",
        "_id": "_fsstatus",
        "_score": 1,
        "_source": {
          "fs": { "status": "STARTED" }
        }
      },
      {
        "_index": "_river",
        "_type": "documentsriver",
        "_id": "_status",
        "_score": 1,
        "_source": {
          "ok": true,
          "node": {
            "id": "j1ViClzcQzSg4rqpgTseBQ",
            "name": "Node1",
            "transport_address": "inet[/192.31.142.25:9300]"
          }
        }
      },
      {
        "_index": "_river",
        "_type": "documentsriver",
        "_id": "_lastupdated",
        "_score": 1,
        "_source": {
          "fs": {
            "feedname": "documentsriver",
            "lastdate": "2013-09-18T13:33:03.092Z",
            "docadded": 0,
            "docdeleted": 0
          }
        }
      }
    ]
  }
}
The _fsstatus element shows that the river has the status 'STARTED', and
when I repeat this request, the "lastdate" field in the last JSON object
(_lastupdated) changes every 10 seconds.
The same happens when most files are deleted and just one of the original
set of files is left. In this case, the index still doesn't change: it
still reports 13 documents instead of 2 (1 folder, 1 file). But now,
searching the _river "index", the last JSON object looks like this:
....
{
  "_index": "_river",
  "_type": "documentsriver",
  "_id": "_lastupdated",
  "_score": 1,
  "_source": {
    "fs": {
      "feedname": "documentsriver",
      "lastdate": "2013-09-18T13:38:25.274Z",
      "docadded": 0,
      "docdeleted": 11
    }
  }
}
....
So the river reports 11 deleted documents, but the index does not change;
not even the version of the indexed documents is updated. So, from the
index, you cannot verify whether a file still exists.
In both cases above, the logfiles (loglevel DEBUG) do not give any feedback.
Then the last case: a new subfolder is added to C:\tmp. After adding the
folder, the logfiles say
[2013-09-18 15:45:27,942][WARN
][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Node1]
[fs][documentsriver] Error while indexing content from c:\tmp
After removing the folder, the warnings disappear.
Questions
- My first question is rather obvious: help, what am I doing wrong???
Because the logfiles don't give me any information, it's hard to find out
what's happening. Any suggestions?
- My second question is: does the Filesystem River index files recursively,
i.e. from a root folder, are all files and subfolders indexed? The last
case suggests not. Hopefully it does!
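To make precise what I mean by recursive indexing, here's a small Python sketch (my own illustration, not the plugin's code; the suffix matching mirrors my includes setting):

```python
import os

def matching_files(root, suffixes):
    """Recursively collect files under `root` whose names end with one of `suffixes`."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(name.endswith(s) for s in suffixes):
                matches.append(os.path.join(dirpath, name))
    return matches

# e.g. matching_files(r"c:\tmp", [".docx", ".xlsx", ".pdf", ".pptx"])
# would also pick up files in subfolders of c:\tmp
```

If the river behaves like this walk, files in subfolders of c:\tmp should be indexed too; the warning above makes me doubt that it does.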
Thanks in advance for the feedback.
Regards,
Erwin
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.