How to make the metadata more meaningful using FS River

Hi guys,

I am just trying out the FS River, pointed at a single directory, "/test",
which contains two files (one .doc, one .pdf). When I search for
something, I get the response below:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.037158426,
    "hits": [
      {
        "_index": "mydocs",
        "_type": "doc",
        "_id": "c2ded9e9985cc69e86db1ff38c4babe",
        "_score": 0.037158426,
        "_source": {
          "name": "lance-armstrong.doc",
          "postDate": 1358507306000,
          "pathEncoded": "89a1526c99cccf3ef9235c0d1841e9",
          "rootpath": "89a1526c99cccf3ef9235c0d1841e9",
          "virtualpath": "p",
          "file": {
            "_name": "lance-armstrong.doc",
            "content": "0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/.....like this very long...."
          }
        }
      }
    ]
  }
}

To me, the rootpath, virtualpath and content values make no sense at all.
Are they encoded by default? How do I configure them to be shown properly?

How can I stop the response from returning the entire content of the file
for every hit? For example, with
grep -B 10 -A 10
you only see the 10 lines before and after the match instead of the
content of the whole file.

Shengjie

--

Hi there,

I don't remember whether I answered your question. It sounds like I didn't.

rootpath and pathEncoded are internal fields only. They help me map
virtual paths back to the original ones.

virtualpath is used to display results to the user.

Let's say that you create a river on /tmp.
Its virtual path could be /company/docs, mapped to /tmp.

Let's say you start another river on /tmp2 and you want to use
/company/sales as its virtual path.

When searching for a document that is originally at /tmp/mydir/doc1.pdf, the
virtual path should help to display /company/docs/mydir/doc1.pdf
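
As a rough illustration of the mapping described above, here is a small shell
sketch. The helper name to_virtual_path and its arguments are made up for this
example and are not taken from the FSRiver source; it only shows the idea of
swapping the river's root directory for the virtual root.

```shell
#!/bin/sh
# Hypothetical helper (not part of FSRiver): turn a real file path into the
# virtual path that would be displayed to the user.
#   $1 = real file path
#   $2 = directory the river is watching (real root)
#   $3 = virtual root to display instead
to_virtual_path() {
  real=$1
  root=${2%/}       # drop a trailing slash, if any
  virtual=${3%/}
  case "$real" in
    "$root"/*)
      # Strip the real root and prepend the virtual root.
      printf '%s\n' "$virtual/${real#"$root"/}"
      ;;
    *)
      # The file is not under this river's root.
      return 1
      ;;
  esac
}

# The example from the message above:
to_virtual_path /tmp/mydir/doc1.pdf /tmp /company/docs
```

Note that a file under /tmp2 is not matched by the /tmp river, because the
pattern requires a "/" right after the root.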

rootpath and pathEncoded are Base64-encoded paths that help me link
a real path to the virtual one.

However, I have not looked at the river for a while and I can't remember all
the details (do I really use it?). Do you have a specific use case to describe
here?
If you need a new feature, please ask for it and I will be happy to add
it, or you can submit your own pull request.

About the content: as far as I remember, the only workaround I can see is to
ask for specific fields (and not ask for _source). See
http://www.elasticsearch.org/guide/reference/api/search/fields.html

Does it help?

(In reply to Shengjie Min kelvin.msj@gmail.com, 19 January 2013 at 19:29.)

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Hi David,

Thanks for the reply. Here is my fs-river creation request:

curl -XPUT 'localhost:9200/_river/my_fs_river/_meta' -d '{
  "type": "fs",
  "fs": {
    "name": "My tmp dir",
    "url": "/tmp/mydocs",
    "update_rate": 3600000,
    "includes": "*.doc,*.pdf",
    "excludes": "resume"
  },
  "index": {
    "index": "my_index",
    "type": "my_docs",
    "bulk_size": 20
  }
}'

Once the index is built, querying
http://localhost:9200/my_index/my_docs/_search?pretty&q=*walcott
returns the following:

"hits": {
  "total": 1,
  "max_score": 0.17469281,
  "hits": [
    {
      "_index": "my_index",
      "_type": "my_docs",
      "_id": "fa49669b88a05434b2d876802d791dff",
      "_score": 0.17469281,
      "_source": {
        "name": "walcott.pdf",
        "postDate": 1358768716000,
        "pathEncoded": "c75a42ab98626977ea21178f8daa919",
        "rootpath": "c75a42ab98626977ea21178f8daa919",
        "virtualpath": "s",
        "file": {
          "_name": "walcott.pdf",
          "content": "/wxVpVTAWeYSczc1jyM5mAB7qwPvlfsneN8qsXO3fIPsCLT.........."
        }
      }
    }
  ]
}

My questions are:
1. My virtualpath doesn't make sense. Does it need to be configured as
part of the river request? If it does, would you mind giving me a simple
example?
2. By "specific fields", do you mean specifying which fields to return?

Thanks,
Shengjie

--

I have to check the FSRiver source code. I'm not sure I can do it today. I will
let you know.

Try curl "localhost:9200/_search?fields=name,postDate"
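
A minimal sketch of the same kind of request, scoped to the index and type
used in this thread. It assumes an Elasticsearch version that accepts the
fields parameter on URI searches, as in the fields documentation linked
earlier. The helper name search_url is made up for illustration, and the field
names are copied from the _source shown above (note the camel-case postDate).
Leaving file.content out of the list is what keeps the encoded file body out
of each hit.

```shell
#!/bin/sh
# Hypothetical helper: build a search URL that asks for selected fields
# instead of the whole _source.
#   $1 = query string for the q parameter
#   $2 = comma-separated list of fields to return
# Host, index and type are the ones used in this thread; adjust as needed.
search_url() {
  printf 'http://localhost:9200/my_index/my_docs/_search?pretty&q=%s&fields=%s\n' "$1" "$2"
}

url=$(search_url 'walcott' 'name,postDate,virtualpath')
echo "$url"

# To send it to a running node (not executed here):
# curl -s "$url"
```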

(In reply to Shengjie Min kelvin.msj@gmail.com, 21 January 2013 at 17:15.)

--

Thanks David,

It's just that the response showing virtualpath: "s" really bothers me. I will
have a look at the source code myself as well.

Shengjie

--