Highlight content from data crawled by ManifoldCF and indexed in Elasticsearch

We are using ManifoldCF to crawl web pages and then index them through
Elasticsearch.

Is there a way to get only the few lines that contain the searched keyword in
the response of an Elasticsearch query, instead of the whole content, like the
snippets in Google search results?
Solutions we are trying (references):
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-highlighting.html

We are trying a mapping like this:

{
  "mappings": {
    "test": {
      "properties": {
        "file": {
          "type": "attachment",
          "path": "full",
          "fields": {
            "_content_type": {
              "type": "string",
              "store": true
            },
            "_name": {
              "type": "string",
              "store": true
            },
            "content": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "store": true,
              "index_analyzer": "fulltext_analyzer"
            }
          },
          "store": true,
          "term_vector": "with_positions_offsets_payloads"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

and then query like:

{
  "query": {
    "match": {
      "file": "CROWLEY"
    }
  },
  "highlight": {
    "fields": {
      "file": {"fragment_size": 150, "number_of_fragments": 3}
    }
  }
}

But we don't get highlights in the response; instead we get the whole
content.

Any help is appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3f59d248-0ddb-4ee0-9e0f-b78844bde48b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On Mon, Dec 1, 2014 at 10:42 PM, N Bijalwan ahcirpma@gmail.com wrote:

But we don't get highlights in the response; instead we get the whole
content.

Any help is appreciated.

You have to specify the full path of what you want to highlight, like
"fields": {"file.content": {}}. You'll also get the whole content back by
default unless you turn it off with "_source": false in the request. You
can filter it to only get the parts that you want with source filtering
(search-request-source-filtering.html in the reference).
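
As a rough sketch of that advice (not a tested configuration), the request body could be assembled as below. The field path file.content and the "_source": false switch come from the reply; the helper name build_highlight_query, its defaults, and the use of Python to produce the JSON are illustrative assumptions. Whether file.content or file is the right path depends on how the attachment sub-fields are mapped.

```python
import json


def build_highlight_query(keyword, field="file.content",
                          fragment_size=150, number_of_fragments=3):
    """Build a search body that returns highlighted fragments only.

    Hypothetical helper: the name and defaults are illustrative.
    """
    return {
        # Suppress the full document source in each hit.
        "_source": False,
        "query": {"match": {field: keyword}},
        "highlight": {
            "fields": {
                # Highlight the full path of the sub-field, not just "file".
                field: {
                    "fragment_size": fragment_size,
                    "number_of_fragments": number_of_fragments,
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_highlight_query("CROWLEY"), indent=2))
```

The printed JSON can be sent as the body of a _search request; replacing False with a list of field paths gives source filtering instead of turning the source off entirely.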

You'll find that using term vectors costs a ton of extra space and therefore
IO on write, merge, and search. You can reduce it by switching to the
postings highlighter if your content is prose. The plain highlighter isn't
an option if your content might get long. Each of those highlighters
requires different stored data and produces differently shaped results. You
could also try the experimental highlighter (if you are using Elasticsearch
1.3.X) - it steals a ton of ideas from the others and is super flexible and
quite stable. We wrote it because we couldn't afford the extra space for
term vectors but didn't like the postings highlighter's segmentation rules.
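
To make the "different stored data" point concrete, here is a small sketch of what each built-in highlighter is understood to need in the field mapping on Elasticsearch 1.x, and how a request can force one with the "type" option. The table reflects my reading of the 1.x reference and should be checked against the version in use; the helper names are hypothetical, and the experimental highlighter (a separate plugin) is left out.

```python
# Mapping options each built-in 1.x highlighter relies on (assumption to verify).
HIGHLIGHTER_REQUIREMENTS = {
    # Re-analyzes the stored text at search time; needs no extra index data.
    "plain": {},
    # Postings highlighter reads offsets from the postings list.
    "postings": {"index_options": "offsets"},
    # Fast vector highlighter ("fvh") reads term vectors.
    "fvh": {"term_vector": "with_positions_offsets"},
}


def field_mapping_for(highlighter, base=None):
    """Return a string-field mapping carrying what the highlighter needs."""
    mapping = {"type": "string"}
    mapping.update(base or {})
    mapping.update(HIGHLIGHTER_REQUIREMENTS[highlighter])
    return mapping


def highlight_clause(field, highlighter, number_of_fragments=3):
    """Return a highlight section that forces a specific highlighter type."""
    return {
        "fields": {
            field: {
                "type": highlighter,
                "number_of_fragments": number_of_fragments,
            }
        }
    }


if __name__ == "__main__":
    print(field_mapping_for("postings"))
    print(highlight_clause("file.content", "postings"))
```

Switching from term vectors to the postings highlighter changes the mapping, so the documents would have to be reindexed before the new highlighter can be used.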

Nik

Thanks, Nik, for the very descriptive solution. I had also made some mapping
mistakes, which is why I was not getting highlighted text in the response for
sample data.

I fixed it by using the following mapping:

http://localhost:9200/cnn/test/_mapping

{
  "test": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "_content_type": {
            "store": "yes"
          },
          "_name": {
            "store": "yes"
          },
          "content": {"term_vector": "with_positions_offsets", "store": "yes"},
          "file": {"term_vector": "with_positions_offsets", "store": "yes"}
        }
      }
    }
  }
}

and then I queried it like this:

http://localhost:9200/cnn/test/_search?pretty=true

{
  "_source": ["file._content_type", "file._name"],
  "query": {
    "query_string": {
      "query": "CROWLEY"
    }
  },
  "highlight": {
    "fields": {
      "file": {"number_of_fragments": 1}
    }
  }
}
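
For reading the result, a minimal sketch of pulling the fragments out of a search response follows. It assumes the usual response shape, where each hit may carry a "highlight" object keyed by field name; the function name and the sample response are made up for illustration.

```python
def extract_snippets(response, field="file"):
    """Return (document id, fragments) pairs for hits that were highlighted."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        # A hit has no "highlight" entry when nothing matched in that field.
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            results.append((hit["_id"], fragments))
    return results


# Made-up response in the usual search-response shape, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"file": ["... said <em>CROWLEY</em> ..."]}},
            {"_id": "2"},  # matched, but nothing was highlighted
        ]
    }
}

if __name__ == "__main__":
    for doc_id, fragments in extract_snippets(sample_response):
        print(doc_id, fragments)
```

With "number_of_fragments": 1 as in the query above, each highlighted hit should carry a single fragment per field.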

Yes. I'll look into version 1.3 for performance.

naveen

On Tuesday, 2 December 2014 19:05:03 UTC+5:30, Nikolas Everett wrote:

You could also try the experimental highlighter (if you are using Elasticsearch
1.3.X) - it steals a ton of ideas from the others and is super flexible and
quite stable.

Setting store to yes isn't actually required. It might increase
performance in some cases at the cost of extra disk space. I leave it
false everywhere and have no trouble.
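
As an illustration of that suggestion, the sketch below strips every "store" setting from a mapping fragment so the fields fall back to the default. The helper drop_store is hypothetical and works on plain Python dicts; it would also drop a property that happens to be named "store", so it only suits mappings like the one in this thread.

```python
def drop_store(mapping):
    """Return a copy of a mapping structure with every "store" key removed."""
    if isinstance(mapping, dict):
        return {key: drop_store(value)
                for key, value in mapping.items() if key != "store"}
    if isinstance(mapping, list):
        return [drop_store(item) for item in mapping]
    return mapping


# Sub-field settings copied from the mapping posted in this thread.
fields = {
    "_content_type": {"store": "yes"},
    "_name": {"store": "yes"},
    "content": {"term_vector": "with_positions_offsets", "store": "yes"},
    "file": {"term_vector": "with_positions_offsets", "store": "yes"},
}

if __name__ == "__main__":
    print(drop_store(fields))
```

The term_vector settings are kept, since the highlighting in this thread relies on them; only the "store" flags go away.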

Nik

On Tue, Dec 2, 2014 at 10:00 AM, N Bijalwan ahcirpma@gmail.com wrote:

Thanks, Nik, for the very descriptive solution. I had also made some mapping
mistakes, which is why I was not getting highlighted text in the response for
sample data.

OK, that's a good suggestion. I'll set store to "no" if "yes" is not
essential.

naveen

On Tuesday, 2 December 2014 20:33:03 UTC+5:30, Nikolas Everett wrote:

Setting store to yes isn't actually required. It might increase
performance in some cases at the cost of extra disk space. I leave it
false everywhere and have no trouble.

Nik
