Question about FSRiver 0.0.3

FS River seems like a great plugin!
I tried to play around with some data. For one, it was quite fast (y).

But I ran into some issues/quirks, or maybe I'm not using it correctly. I'm
using FS River 0.0.3 with Elasticsearch 0.20.5, crawling the local filesystem.

  1. When I create a river with the following settings:
    curl -XDELETE 127.0.0.1:9200/_river/foo
    curl -XDELETE 127.0.0.1:9200/foo
    curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
      "type": "fs",
      "fs": {
        "name": "Foo Data",
        "url": "/Users/slodha/foo/content",
        "update_rate": 60000,
        "includes": "*.json",
        "json_support": true
      },
      "index": {
        "index": "foo",
        "type": "foo",
        "bulk_size": 50
      }
    }'

and then search with this query:
{
  "query": {
    "query_string": {
      "default_field": "_all",
      "query": "slodha"
    }
  }
}
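
A query like this can be run with the standard _search endpoint, for example against the index name from the river definition above:

curl -XPOST 'localhost:9200/foo/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "_all",
      "query": "slodha"
    }
  }
}'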

I get results like:

hits: [
  {
    _index: foo
    _type: foo
    _id: 18156b6b5a6b3a8e1ec5984f185e18
    _score: 6.9584246
    _source: {
      sunnyVal: slodha
    }
  },
  {
    _index: foo
    _type: foo
    _id: d7b4df4222e0d075d74ffde8aaa04a56
    _score: 6.901722
    _source: {
      fileNameTest: slodha
    }
  }
]

I never get which file each hit came from, which I will definitely need in
order to search the filesystem eventually.

When I do this:

curl -XDELETE 127.0.0.1:9200/_river/foo
curl -XDELETE 127.0.0.1:9200/foo
curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
  "type": "fs",
  "fs": {
    "name": "Foo Data",
    "url": "/Users/slodha/foo/content",
    "update_rate": 60000
  },
  "index": {
    "index": "city",
    "type": "city",
    "bulk_size": 50
  }
}'

Notice that here I do not use any JSON restriction.
I get results like this:

hits: [
  {
    _index: foo
    _type: foo
    _id: 18156b6b5a6b3a8e1ec5984f185e18
    _score: 2.0015228
    _source: {
      name: slodha_1.json
      postDate: 1363384941000
      pathEncoded: 44d22b925f562f4e8d1d847253493336
      rootpath: 948cd64d775db4119962b5a36dd530
      virtualpath: t/sunnyTest
      file: {
        _name: slodha_1.json
        content: ewoic3VubnlWYWwiIDogInNsb2RoYSIgCn0K
      }
    }
  },
  {
    _index: foo
    _type: foo
    _id: d7b4df4222e0d075d74ffde8aaa04a56
    _score: 1.7533717
    _source: {
      name: file1.json
      postDate: 1363388628000
      pathEncoded: 99d79f46f1ce275b6b9152a0de54d5
      rootpath: 948cd64d775db4119962b5a36dd530
      virtualpath: t/sunnyTest/sunnyTest2
      file: {
        _name: file1.json
        content: ewoiZmlsZU5hbWVUZXN0IiA6ICJzbG9kaGEiCn0K
      }
    }
  }
]

Now I do get the exact file paths, but this time the content looks like a
hash and is not readable.

I'm sure there must be a way to see both the content and the file paths in
human-readable form with just one river. Can you suggest what I'm doing wrong?


An update: I tried with the latest Elasticsearch and FS River 0.1.0, but the
issue persists. The only difference is that with 0.1.0, if I use
{"filename_as_id": true} in the "fs" definition, I do get the local file
name as the ID (not the complete URL of the file, or its path relative to
the url given in the definition). I would definitely not use the
"filename_as_id" flag, though, because the chances of ID conflicts are quite
high. I would have expected that flag to give me the absolute/relative file
path from the given url. Can we get that?

Also, I'm still looking for a solution where I can see the full content of
the files in the search results, along with the full/relative path of the
file where the search hit a match.

Thanks!


This is by design.

If you want to inject JSON files (json_support: true), they are injected as-is, without any modification.
When json_support is disabled (the default), files (whatever their format) are sent to the mapper attachment plugin, and the content is extracted with Tika and indexed.

What you see in file.content is the raw file base64 encoded.
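
As a quick check (a sketch assuming a shell with GNU base64; the flag differs on some systems), decoding one of the file.content values from the results above gives back the original file rather than a hash:

# Decode the file.content value shown for slodha_1.json in the results above
echo 'ewoic3VubnlWYWwiIDogInNsb2RoYSIgCn0K' | base64 --decode
# prints:
# {
# "sunnyVal" : "slodha"
# }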

A nice option would be to add the path in a metadata section, as we have for
_index, _type and _id. I should try to do that.
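
Purely to illustrate the idea (the path field below is hypothetical, not something FS River produces today), a hit could then look like:

{
  _index: foo
  _type: foo
  _id: 18156b6b5a6b3a8e1ec5984f185e18
  path: /Users/slodha/foo/content/t/sunnyTest/slodha_1.json
  _source: {
    sunnyVal: slodha
  }
}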

Could you open an issue for that?

Thanks for the feedback.

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


Thanks, I opened an issue for it: "Can we add feature to be able to see both file content and file paths as human readable format (text) with just one river for text files" (dadoonet/fscrawler#11).

I think the idea behind this design was to let you search across files and
see which files changed after a given time, or something like that. But if we
want to use this to search a filesystem of text files (JSON, XML, any custom
text format) instead of grepping around, we should be able to show the
content and the location of that content along with the search results.

Cheers
Sunny


Also, when using JSON support and "filename as ID", is it possible to get the
absolute/relative file path as the ID rather than just the file name? Maybe
another key, "filepath_as_id"? I personally feel that would be quite helpful
as well.
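
Just to illustrate the proposal ("filepath_as_id" is only a suggested key here, not an existing option), the river definition could look like:

curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
  "type": "fs",
  "fs": {
    "name": "Foo Data",
    "url": "/Users/slodha/foo/content",
    "update_rate": 60000,
    "includes": "*.json",
    "json_support": true,
    "filepath_as_id": true
  },
  "index": {
    "index": "foo",
    "type": "foo",
    "bulk_size": 50
  }
}'

so that a hit's _id would be something like t/sunnyTest/slodha_1.json instead of just slodha_1.json.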


No, you can't have an ID containing /, I think. Or it needs to be encoded.
But if I can add the path in the meta, I can add the full path or the
relative one, as needed.
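
For example (a minimal sketch, nothing FS River specific, and assuming the REST layer accepts a percent-encoded slash in the ID), a / inside a document ID has to be sent as %2F in the URL:

curl -XPUT 'localhost:9200/foo/foo/t%2FsunnyTest%2Fslodha_1.json' -d '{
  "sunnyVal": "slodha"
}'

which would store the document under the _id t/sunnyTest/slodha_1.json.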

Perhaps you can modify your issue or add a comment?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs
