Problem with document metadata: document indexed from fs river

Hi all,

My documents are indexed by FSRiver.

I use the following index:

curl -XPUT 'http://localhost:9200/mydocs/' -d '{
"settings" : {"index" : {"analysis" :
{"analyzer" : {
"synfromfile" : {"tokenizer" : "whitespace","filter" : ["synonym"]},
"autocomplete" : {"tokenizer" : "whitespace","filter" :
["lowercase","engram"]}},
"filter" : {"synonym" : {"type" : "synonym", "synonyms_path" :
"C:\\TempElasticSearch\\synonym\\synonyms.txt"},
"engram" : {"type" : "edgeNGram","min_gram" : 3,"max_gram" : 10}}}}}}'

the following mapping:

curl -XPUT 'http://localhost:9200/mydocs/doc/_mapping' -d '
{"doc" : {
"properties" : {
"file" : { "type" : "attachment",
"path" : "full",
"fields" : {"file" : {"type" : "string","store" : "yes","term_vector" :
"with_positions_offsets","analyzer" : "synfromfile"},
"author" : {"type" : "string"},
"title" : {"type" : "string","store" : "yes"},
"date" : {"type" : "date","format" : "dateOptionalTime"},
"keywords" : {"type" : "string"},
"content_type" : {"type" : "string" }}},
"name" : {"type" : "string","analyzer" : "autocomplete"},
"pathEncoded" : {"type" : "string"},
"postDate" : {"type" : "date","format" : "dateOptionalTime"},
"rootpath" : {"type" : "string"},
"virtualpath" : { "type" : "string"}}}}'

and the following river:

curl -XPUT 'localhost:9200/_river/Myfs_river/_meta' -d '{
"type": "fs",
"fs": {
"name": "fs river",
"url": "C:\\TempElasticSearch\\tempDoc",
"update_rate": 180000,
"includes": [ "*.doc" , "*.docx", "*.xls", "*.pdf", "*.txt" ]},
"index": { "index": "mydocs","type": "doc"}
}'

But when I search by keyword (a word that exists in the content of a
document), the metadata of the returned documents does not contain the
author field.

I use the following query:

TermsFacetBuilder fb = FacetBuilders.termsFacet(TERMS_FACET_F).field(TYPE_FIELD);
QueryBuilder queryBuilder = QueryBuilders.queryString(keyword);
SearchResponse searchHits = esClient.prepareSearch()
        .setIndices(INDEX_NAME)
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .addFacet(fb)
        .setFrom(query.getStart())
        .setSize(query.getRow())
        .setQuery(queryBuilder)
        .addHighlightedField(NAME_FIELD)
        .addHighlightedField(FILE_FIELD)
        .setHighlighterOrder(SCORE)
        .execute().actionGet();

return searchHits;

and I get this result when searching for the keyword "application":

{
"took" : 1133,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.078125,
"hits" : [ {
"_index" : "mydocs",
"_type" : "doc",
"_id" : "19943db2a847cbb66daf35446bab775",
"_score" : 0.078125, "_source" :
{"name":"ElasticSearch.docx","postDate":1365068589756,"pathEncoded":"8580d27817ab2e7da7fd9c55a54c81cc","rootpath":"8580d27817ab2e7da7fd9c55a54c81cc","virtualpath":"c","file":{"_name":"ElasticSearch.docx","content":"KLMN......
"highlight" : {
"file" : [ "notre application se fait
dans le fichier application-context comme suit :\n<bean
id="esClient"\n\t\tclass="fr", "comme une application distribuée et répartit la charge entre les
nœuds.\nIndexe :\nAu moment ou en crée un" ]
}
}, {
"_index" : "mydocs",
"_type" : "doc",
"_id" : "33711747ec838172ca777eba396127",
"_score" : 0.078125, "_source" :
{"name":"ElasticSearch1.docx","postDate":1365069178174,"pathEncoded":"8580d27817ab2e7da7fd9c55a54c81cc","rootpath":"8580d27817ab2e7da7fd9c55a54c81cc","virtualpath":"c","file":{"_name":"ElasticSearch1.docx","content":"UEs....
"highlight" : {
"file" : [ "notre application se fait
dans le fichier application-context comme suit :\n<bean
id="esClient"\n\t\tclass="fr", "comme une application distribuée et répartit la charge entre les
nœuds.\nIndexe :\nAu moment ou en crée un" ]
}
} ]
},
"facets" : {
"f" : {
"_type" : "terms",
"missing" : 0,
"total" : 3,
"other" : 0,
"terms" : [ {
"term" : "doc",
"count" : 3
} ]
}
}
}

I do not know where the problem lies.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Note: if you would prefer to write in French, you can use the French mailing list: https://groups.google.com/group/elasticsearch-fr

Elasticsearch returns the _source document exactly as it was submitted by the user (here, the FSRiver is the user).
The mapper-attachment plugin extracts metadata (author, date, title…) and indexes it, but it does not add it to _source.

So you can probably search on the author, but you cannot display it unless you explicitly store it (see the mapping for title, for example).

Try the fields option to retrieve the author (note that the field needs store=yes).
See the Elasticsearch documentation for the search fields option.

That said, I just tried to play a bit with the mapper attachment plugin and saw that some metadata does not seem to be extracted by Tika.
I wrote a gist here: "Testing FSRiver with Mapper attachment and check metadata extracted" (GitHub).

I will check that.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs
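
The advice above (store the metadata sub-fields, then ask for them with the fields option) could be sketched roughly as follows. In the mapping from the original post only the file and title sub-fields carry "store" : "yes", so this sketch adds it to the other metadata sub-fields as well. It is only an illustration under assumptions: the class and method names are hypothetical, the JSON is assembled by plain string concatenation to stay dependency-free, and the full field paths such as file.author are assumed by analogy with file.title.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: builds JSON for a mapping that stores attachment metadata and a search that returns it. */
final class StoredMetadataExample {

    // String metadata sub-fields of the "file" attachment field, as named in the mapping from this thread.
    static final List<String> STRING_METADATA =
            Arrays.asList("author", "title", "keywords", "content_type");

    /** The "fields" object of the attachment mapping, with "store":"yes" on every sub-field. */
    static String attachmentFieldsJson() {
        // Content sub-field, unchanged from the original mapping.
        String content = "\"file\":{\"type\":\"string\",\"store\":\"yes\","
                + "\"term_vector\":\"with_positions_offsets\",\"analyzer\":\"synfromfile\"}";
        // Assumption: storing each metadata sub-field makes it retrievable via "fields".
        String strings = STRING_METADATA.stream()
                .map(name -> "\"" + name + "\":{\"type\":\"string\",\"store\":\"yes\"}")
                .collect(Collectors.joining(","));
        String date = "\"date\":{\"type\":\"date\",\"format\":\"dateOptionalTime\",\"store\":\"yes\"}";
        return "{" + content + "," + strings + "," + date + "}";
    }

    /** Search body that requests stored sub-fields by full path, e.g. "file.author". */
    static String searchBodyJson(String... subFields) {
        String fields = Arrays.stream(subFields)
                .map(name -> "\"file." + name + "\"")
                .collect(Collectors.joining(","));
        return "{\"fields\":[" + fields + "],\"query\":{\"match_all\":{}}}";
    }

    public static void main(String[] args) {
        // Print both bodies so they can be pasted into curl -d '...' requests.
        System.out.println(attachmentFieldsJson());
        System.out.println(searchBodyJson("author", "title", "date"));
    }
}
```

The first printed object would replace the "fields" part of the file mapping, and the second is a body for the _search endpoint.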

Thank you, David, for your answer. I tried the mapping and river declaration
used in this example:
http://www.elasticsearch.org/guide/reference/api/search/fields/
and I use this query:
curl 'http://localhost:9200/mydocs/doc/_search?pretty' -d '

{
"fields" : ["*"],
"query":{
"match_all" : {}
}
}'

and I get the following result:

{
"took" : 20,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "mydocs",
"_type" : "doc",
"_id" : "19943db2a847cbb66daf35446bab775",
"_score" : 1.0,
"fields" : {
"file" : "application\n",
"file.title" : "ElasticSearch"
}
}, {
"_index" : "mydocs",
"_type" : "doc",
"_id" : "97efc9cefa8b3eb5510fdf530caec1e",
"_score" : 1.0,
"fields" : {
"file" : "Application2\n",
"file.title" : "Etude de l'existant"
}
}, {
"_index" : "mydocs",
"_type" : "doc",
"_id" : "5a7a7a3d70861285bf3261bcec092a5",
"_score" : 1.0,
"fields" : {
"file" : "application windows applications apple\n"
}
} ]
}
}

The mapper-attachment plugin does not index all of the document metadata; we
are still missing the author, date, content_type and keywords information.

What is the solution?

Cordially

Sounds like the issue comes from FSRiver.
I checked the mapper attachment and it works as expected.

Let me check if I did something nasty there.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

I made a mistake in the example link, I used this example:

cordially

So David, is there any news? Does the issue come from FSRiver?

I did not find any issue.
See my comments in "Meta data", issue #14 of dadoonet/fscrawler on GitHub.

I will push a full test case tomorrow but everything is fine in my opinion.

That said, I should probably create a default mapping that stores metadata by default and probably does not store source.
I will think about it.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Thank you very much, David, for your help. I was able to extract the document metadata.
Concerning the content_type field: it works for PDF documents, where the value of
content_type is application/pdf, but the value is wrong for other types of document...

Can we also index information about the size of a document? Like this:

"size": {
"type": "string",
"store": "yes"
},

Cordially

--Yahia

Size is not extracted by the mapper-attachment AFAIK.
That said, I think it could make sense to add it to the FS River. Could you open an issue in the FSRiver project?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Yes, this is interesting information: it can be shown when displaying a
document, or used in a facet to group documents by their size...

--
Yahia AMMAR
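
Since the size is not extracted by the mapper-attachment plugin, a crawler would have to read it from the file system itself. The following is a minimal sketch of that idea, not FSRiver's actual code: the class name, the bucket thresholds and the "size" field name are all hypothetical. It emits the size as a number, on the assumption that a numeric type such as long would suit range-style grouping better than the string type in the mapping snippet above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reads a file's size and prepares it for indexing and grouping. */
final class FileSizeExample {

    /** Size in bytes, read from the file system (not from Tika or the attachment mapper). */
    static long sizeInBytes(Path file) throws IOException {
        return Files.size(file);
    }

    /** Illustrative display buckets; the thresholds are arbitrary assumptions. */
    static String sizeBucket(long bytes) {
        if (bytes < 100L * 1024) return "small";          // under 100 KiB
        if (bytes < 10L * 1024 * 1024) return "medium";   // under 10 MiB
        return "large";
    }

    /** JSON fragment that a crawler could merge into the document it sends for indexing. */
    static String sizeJson(long bytes) {
        return "\"size\":" + bytes;
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway 2048-byte file so the example runs anywhere.
        Path tmp = Files.createTempFile("size-example", ".txt");
        Files.write(tmp, new byte[2048]);
        long bytes = sizeInBytes(tmp);
        System.out.println(sizeJson(bytes) + " -> " + sizeBucket(bytes));
        Files.delete(tmp);
    }
}
```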
