How to encode content of web page or file attachment with elasticsearch-river-mongodb


(Zoran Jeremic) #1

Hi,

I'm using elasticsearch-river-mongodb to index data from MongoDB and make
it searchable in Elasticsearch. Documents stored in MongoDB can contain
different types of fields, and one of the fields is an attachment where the
content of web pages or other files should be stored. The mapping I created
looks like this:

{
  "document": {
    "properties": {
      "engine_id": {
        "store": "yes",
        "type": "string"
      },
      "fields": {
        "type": "nested",
        "properties": {
          "text_value": {
            "type": "string",
            "analyzer": "simple"
          },
          "name_value": {
            "type": "string",
            "analyzer": "simple"
          },
          "float_value": {
            "type": "double"
          },
          "key": {
            "index": "not_analyzed",
            "type": "string",
            "index_options": "docs",
            "omit_norms": true
          },
          "file_value": {
            "type": "attachment",
            "file_value": {
              "term_vector": "with_positions_offsets",
              "store": "yes"
            }
          }
        }
      }
    }
  }
}

The "file_value" field stores the content of a web page. I tried to
populate it in several different ways, e.g.:

// Attempt 1: raw bytes read from the input stream
byte[] encodedContent = org.elasticsearch.common.io.Streams.copyToByteArray(inputStream);
// Attempt 2: Base64 string produced from the file
String encodedContent = org.elasticsearch.common.Base64.encodeFromFile("test.html");

However, the encoded value seems to be treated as a regular string in
Elasticsearch, and I can find the document only if I use the encoded value
in the search query. If I search for the actual text, I get no results.
This used to work fine when I inserted directly into Elasticsearch, but
with the MongoDB river it doesn't work, or I'm making some mistake. The
only workarounds I have at the moment are to store the whole web page
content (HTML included) as a string, or to pre-process the web page to
extract the content and store that as a string.

This is a sample of a document stored in MongoDB:

{
  "_id" : ObjectId("5293cf6a2318b3b53ca5694d"),
  "engine_id" : "engineid1234",
  "fields" : [
    {
      "key" : "title",
      "text_value" : "Healthcare in India"
    },
    {
      "key" : "file",
      "file_value" : "em9yYW4gamVyZW1pYyBsb2dpdGVjaCBzZWFyY2ggZWxhc3RpY3NlYXJjaAo="
    }
  ]
}
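
For reference, the "file_value" in the sample above is plain Base64. The following is a minimal sketch of the encode/decode round trip, assuming Java 8+ and the JDK's own java.util.Base64 rather than the org.elasticsearch.common.Base64 helper mentioned in the post. The class and method names are made up for illustration and are not part of Elasticsearch or the river plugin.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Elasticsearch or elasticsearch-river-mongodb.
class AttachmentEncoding {

    // Encode raw bytes (e.g. the bytes of test.html) as a Base64 string.
    static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    // Decode a stored Base64 value back to text, assuming the source was UTF-8.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The file_value from the sample MongoDB document.
        String sample = "em9yYW4gamVyZW1pYyBsb2dpdGVjaCBzZWFyY2ggZWxhc3RpY3NlYXJjaAo=";
        // Prints: zoran jeremic logitech search elasticsearch
        System.out.print(decode(sample));
    }
}
```

Decoding a stored value this way is a quick check that what reached MongoDB is the Base64 form of the text you expect to search for.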

I hope some of you can give me an idea of what's wrong here.

Thanks,
Zoran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

Could you post a document as Elasticsearch has indexed it? I mean the result of http://localhost:9200/yourindex/yourtype/anyid/_source

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



(Zoran Jeremic) #3

Hi David,

Thank you for your quick response. Actually, after checking the index
mapping after inserting a document, I realized that there was a mismatch
between the index name provided for the mapping (documents) and the index
name I set in the river (document), so ES created a default mapping instead
of using the one I created, and that was probably the reason why the
attachment was treated as a regular string. However, after fixing this
issue, I ran into another one. I can't search any of the nested fields. I
tried the following query:

{
  "query": {
    "bool": {
      "should": {
        "match": {"fields.text_value": "Healthcare in India"}
      }
    }
  }
}

The document is stored in ES index as follows:

{
  "_index": "inextweb_documents",
  "_type": "documents",
  "_id": "5294e14023189da221c66101",
  "_version": 1,
  "exists": true,
  "_source": {
    "_id": "5294e14023189da221c66101",
    "engine_id": "engineid1234",
    "fields": [
      {
        "key": "title",
        "text_value": "Healthcare in India"
      },
      {
        "key": "domain",
        "text_value": "wikipedia.org"
      },
      {
        "key": "jobId",
        "text_value": "jobid123"
      },
      {
        "key": "file",
        "file_value": "SG93IHRvIGVuY29kZSBjb250ZW50IG9mIHdlYiBwYWdlIG9yIGZpbGUgYXR0YWNobWVudCB3aXRoIGVsYXN0aWNzZWFyY2gtcml2ZXItbW9uZ29kYgo="
      }
    ]
  }
}

Zoran



(David Pilato) #4

I guess you need to use a nested query: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html

--
David ;)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
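
As an illustration of the nested query suggested above, here is a minimal sketch that builds the request body for the "fields" path and the "fields.text_value" field from this thread. It is written as a plain Java string builder so it does not depend on any particular Elasticsearch client version. The class and method names are hypothetical, and sending the request is left out.

```java
// Hypothetical helper that only builds the JSON request body.
class NestedQuerySketch {

    // Minimal JSON string escaping for backslashes and double quotes.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Wrap a match query in a nested query scoped to the given nested path.
    static String nestedMatch(String path, String field, String text) {
        return "{\"query\":{\"nested\":{\"path\":\"" + escape(path) + "\","
             + "\"query\":{\"match\":{\"" + escape(field) + "\":\""
             + escape(text) + "\"}}}}}";
    }

    public static void main(String[] args) {
        // Prints the body to POST to the index's _search endpoint.
        System.out.println(
            nestedMatch("fields", "fields.text_value", "Healthcare in India"));
    }
}
```

The point of the wrapper is that objects under a "nested" mapping are indexed as separate hidden documents, so a plain match on "fields.text_value" at the top level does not see them, while the same match inside a nested query with "path": "fields" does.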



(Zoran Jeremic) #5

Yes. That's it. It works now :)

One additional question. I see that highlighting is not supported with
nested queries. Is there any workaround for it?

Thanks.
Zoran


