Filesystem River - Need some help

Hello,

(Questions are at the bottom of this post.)

I'm using:

  • ES 0.90.3
  • Mapper-attachment plugin 1.8.0
  • FSRiver plugin 0.3.0

I create an index:

curl -XPUT 'localhost:9200/documents' -d '{
"document" : {
"properties" : {
"file" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"author" : {
"type" : "string"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string",
"store" : "yes"
}
}
},
"name" : {
"type" : "string",
"analyzer" : "keyword"
},
"pathEncoded" : {
"type" : "string",
"analyzer" : "keyword"
},
"postDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"rootpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"virtualpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"filesize" : {
"type" : "long"
}
}
}
}'

I create a river:

curl -XPUT 'localhost:9200/_river/documentsriver/_meta' -d '{
"type": "fs",
"fs": {
"url": "c:\tmp",
"update_rate": 10000,
"includes": [ ".docx" , ".xlsx", ".pdf", ".pptx" ]
},
"index": {
"index": "documents",
"type": "document",
"bulk_size": 50
}
}'

  • I'm using Windows, so the url parameter uses "c:\tmp".
  • Update rate is rather high because of testing.
  • The index used is "documents", as created.
  • The mapping type used in the index is "document" as created.

The c:\tmp folder contains 12 files (max size 4000KB, most of them around
100KB) which match the include pattern. When the river is added (and ES has
been restarted before, to be sure all plugins are recognized and loaded) my
documents index is filled with 13 entries (1 folder, 12 files). So far so
good.

However, if I copy some more files matching the pattern to the folder, they
are not indexed. After waiting ten minutes, during which the river should
have polled 60 times, there is no change in the documents index. That the
river keeps polling can be verified with a simple REST call:

curl -XGET 'localhost:9200/_river/_search' -d '{}'

With the result:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_meta",
"_score": 1,
"_source": {
"type": "fs",
"fs": {
"url": "c:\tmp",
"update_rate": 10000,
"includes": [
".docx",
".xlsx",
".pdf",
".pptx"
]
},
"index": {
"index": "documents",
"type": "document",
"bulk_size": 50
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_fsstatus",
"_score": 1,
"_source": {
"fs": {
"status": "STARTED"
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_status",
"_score": 1,
"_source": {
"ok": true,
"node": {
"id": "j1ViClzcQzSg4rqpgTseBQ",
"name": "Node1",
"transport_address": "inet[/192.31.142.25:9300]"
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_lastupdated",
"_score": 1,
"_source": {
"fs": {
"feedname": "documentsriver",
"lastdate": "2013-09-18T13:33:03.092Z",
"docadded": 0,
"docdeleted": 0
}
}
}
]
}
}

The "_fsstatus" element shows that the river has the status 'STARTED', and
when I repeat this request, the "lastdate" value in the last JSON object
changes every 10 seconds.
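Reading these values by eye from the full search response is error-prone, so a small helper can pull out just the status and the counters. The following is a minimal Python sketch, not part of FSRiver: the function name river_summary is hypothetical, and the sample is a trimmed copy of the response shown above.

```python
import json

# Trimmed copy of the _river/_search response from this post.
SAMPLE_RESPONSE = """
{"hits": {"hits": [
  {"_id": "_fsstatus", "_source": {"fs": {"status": "STARTED"}}},
  {"_id": "_lastupdated", "_source": {"fs": {
      "feedname": "documentsriver",
      "lastdate": "2013-09-18T13:33:03.092Z",
      "docadded": 0,
      "docdeleted": 0}}}
]}}
"""

def river_summary(response_text):
    """Return status, lastdate and counters from a _river/_search response."""
    hits = json.loads(response_text)["hits"]["hits"]
    # Index the river's bookkeeping documents by their _id.
    by_id = {hit["_id"]: hit["_source"] for hit in hits}
    status = by_id.get("_fsstatus", {}).get("fs", {}).get("status")
    last = by_id.get("_lastupdated", {}).get("fs", {})
    return {
        "status": status,
        "lastdate": last.get("lastdate"),
        "docadded": last.get("docadded"),
        "docdeleted": last.get("docdeleted"),
    }

summary = river_summary(SAMPLE_RESPONSE)
print(summary)
```

Calling this on two responses taken more than 10 seconds apart and comparing the "lastdate" values shows whether the river is still polling.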

The same happens when most files are deleted and just one of the original
set of files is left. In this case, the index still doesn't change. It
still says there are 13 documents instead of 2 (1 folder, 1 file). But
now, searching the _river "index", we see this in the last JSON object:

....
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_lastupdated",
"_score": 1,
"_source": {
"fs": {
"feedname": "documentsriver",
"lastdate": "2013-09-18T13:38:25.274Z",
"docadded": 0,
"docdeleted": 11
}
}
}
....

So the river says 11 documents are deleted, but the index does not change.
Not even the version of the indexed documents changes. So, from the index
you cannot tell whether a file still exists.
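The mismatch can be made explicit by comparing the river's counter with the actual document count of the index. A minimal Python sketch, assuming the standard _count endpoint on localhost:9200 and the index name "documents" from this post; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

def index_count(base_url="http://localhost:9200", index="documents"):
    """Ask Elasticsearch how many documents the index currently holds."""
    with urlopen(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]

def expected_count(before, docadded, docdeleted):
    """Document count the index should have if the river's counters were applied."""
    return before + docadded - docdeleted

# Numbers from this post: 13 entries before, river reports 0 added, 11 deleted.
expected = expected_count(13, 0, 11)
print(expected)
# With a running node, compare against the real index:
#   print(index_count() == expected)
```

Here the expected value is 2 (1 folder, 1 file), while the index still reports 13.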

In both cases above, the logfiles (loglevel DEBUG) do not give any feedback.

Then the last case. A new subfolder is added to the folder C:\tmp. After
adding the folder, logfiles say

[2013-09-18 15:45:27,942][WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Node1] [fs][documentsriver] Error while indexing content from c:\tmp

After removing the folder, the warnings disappear.

Questions

  • My first question is rather obvious: what am I doing wrong? Because the
    logfiles don't give me any information, it's hard to find out what's
    happening. Any suggestions?
  • My second question is: does Filesystem River index files recursively,
    i.e. are all files and subfolders below the root folder indexed? The
    last case suggests not. Hopefully it does!
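To know what a recursive crawl should produce, the folder can be walked independently of the river and the result compared with the index. This is a rough Python sketch, not FSRiver's own logic: the function name expected_entries is hypothetical, and the glob patterns ("*.docx" and so on) are an assumption about how the include patterns are meant to match.

```python
import fnmatch
import os

def expected_entries(root, includes):
    """Count folders (including root) and matching files below root, recursively."""
    folders = 0
    files = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += 1  # every visited directory counts as one folder entry
        for name in filenames:
            # Compare case-insensitively, since Windows file names are.
            if any(fnmatch.fnmatchcase(name.lower(), pat) for pat in includes):
                files += 1
    return folders, files

# Example usage (path and patterns taken from this post, patterns assumed):
#   expected_entries(r"c:\tmp", ["*.docx", "*.xlsx", "*.pdf", "*.pptx"])
```

For the initial situation described above, this should return 1 folder and 12 files, matching the 13 entries in the index.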

Thanks in advance for the feedback.

Regards,

Erwin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Weird.

Could you open an issue in fsriver project with all that detailed information (I really appreciate that BTW)?
I'm wondering if it's due to some weird Windows effect of \ chars instead of /.
I'm also wondering whether issue #30 in dadoonet/fscrawler on GitHub (StringIndexOutOfBoundsException: -1 in computeVirtualPathName) could have the same "cause".
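To illustrate the backslash concern: in JSON, \t inside a string is the escape sequence for a tab character, so a body containing "c:\tmp" may not decode to the intended path. The Python sketch below only shows how a standard JSON parser reads three spellings of the url value; whether FSRiver is affected in this way is an assumption, not something confirmed here.

```python
import json

# Request bodies as sent on the wire. Raw strings are used so that Python
# itself does not interpret the backslashes.
as_posted = r'{"url": "c:\tmp"}'    # single backslash, as in the river definition
escaped = r'{"url": "c:\\tmp"}'     # backslash escaped for JSON
forward = r'{"url": "c:/tmp"}'      # forward slash instead

posted_url = json.loads(as_posted)["url"]    # contains a TAB, no backslash
escaped_url = json.loads(escaped)["url"]     # contains one literal backslash
forward_url = json.loads(forward)["url"]     # plain c:/tmp

for label, value in [("as posted", posted_url),
                     ("escaped", escaped_url),
                     ("forward slash", forward_url)]:
    print(label, repr(value), "tab inside:", "\t" in value)
```

If the path is the problem, writing the url as "c:\\tmp" or "c:/tmp" in the river definition would be the thing to try.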

Thanks!

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Hello David,

The issue has been created. And by the way: good answers come from detailed questions :)

ER
