Fielddata breaker question

On 3/17/14, 2:36 AM, Dunaeth wrote:

Hi,

Due to the insert and search query frequency, it's nearly impossible to
get logs for specific queries. That said, the attached logs are extracts
of the logs since the cluster restart and were most probably generated
during document inserts.

Hi Dunaeth,

Thanks! I'll take a look and let you know if I see anything!

;; Lee


On 3/17/14, 2:36 AM, Dunaeth wrote:

Due to the insert and search query frequency, it's nearly impossible to
get logs for specific queries (elided)

It looks like you have incredibly small segments for this index
(tester); what does the data look like? Can you share your mappings for
the index, as well as example documents?

;; Lee


Actually, tester is a dedicated percolator index with 5 percolation queries
stored and no other data. Percolated documents are web logs and the tester
mapping is:

{
  "tester": {
    "mappings": {
      ".percolator": {
        "_id": {
          "index": "not_analyzed"
        },
        "properties": {
          "query": {
            "type": "object",
            "enabled": false
          }
        }
      },
      "test_hit": {
        "dynamic_templates": [
          {
            "template1": {
              "mapping": {
                "type": "integer"
              },
              "match": "*_id"
            }
          }
        ],
        "_timestamp": {
          "enabled": true,
          "path": "date",
          "format": "date_time"
        },
        "_source": {
          "excludes": [
            "@timestamp"
          ]
        },
        "properties": {
          "@timestamp": {
            "type": "date",
            "index": "no",
            "format": "dateOptionalTime"
          },
          "date": {
            "type": "date",
            "format": "date_time"
          },
          "geoip": {
            "type": "geo_point"
          },
          "host": {
            "type": "string",
            "index": "not_analyzed"
          },
          "ip": {
            "type": "ip"
          },
          "prefered-language": {
            "type": "string"
          },
          "referer": {
            "type": "string",
            "analyzer": "splitter"
          },
          "reverse_ip": {
            "type": "string"
          },
          "session_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "ua_build": {
            "type": "short"
          },
          "ua_device": {
            "type": "string"
          },
          "ua_major": {
            "type": "short"
          },
          "ua_minor": {
            "type": "short"
          },
          "ua_name": {
            "type": "string"
          },
          "ua_os": {
            "type": "string"
          },
          "ua_os_major": {
            "type": "short"
          },
          "ua_os_minor": {
            "type": "short"
          },
          "ua_os_name": {
            "type": "string"
          },
          "ua_patch": {
            "type": "short"
          },
          "unique": {
            "type": "boolean"
          },
          "uri": {
            "type": "string"
          },
          "user-agent": {
            "type": "string",
            "analyzer": "splitter"
          },
          "valid": {
            "type": "boolean"
          }
        }
      }
    }
  }
}
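
For context, the five queries live in the .percolator type, so registering one looks roughly like this with the ES 1.x percolator API (the query body below is a made-up example, not one of our actual five):

curl -XPUT 'localhost:9200/tester/.percolator/valid_hits' -d '{
  "query": {
    "term": { "valid": true }
  }
}'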

On Tuesday, March 18, 2014 at 12:41:19 PM UTC+1, Lee Hinman wrote: (elided)

That said, our stats indices (monthly indices) have almost the same mapping,
but the documents are stored. I do not believe the tester index is
affected by the issue, since the only logs linked with it appear in the
first seconds after the cluster restart (no trace logs after that). To go
further with our data description, each document remains quite small (at
the moment we're talking about an average of 250 B per document in index
size: 2M logs for 500 MB). To be complete, here is the detail of our custom
splitter analyzer (from tester/_settings):

{
  "tester": {
    "settings": {
      "index": {
        "uuid": "HGEPQdWoRLWe0ATNGUElvw",
        "number_of_replicas": "1",
        "analysis": {
          "analyzer": {
            "splitter": {
              "type": "custom",
              "tokenizer": "pattern"
            }
          }
        },
        "number_of_shards": "1",
        "version": {
          "created": "1000099"
        }
      }
    }
  }
}
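
Since no explicit pattern is set, the pattern tokenizer should fall back to its default of splitting on non-word characters. That can be checked with the _analyze API; a sketch (the sample text is arbitrary):

curl -XGET 'localhost:9200/tester/_analyze?analyzer=splitter' -d 'Mozilla/5.0 (X11; Linux x86_64)'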

On Tuesday, March 18, 2014 at 12:58:56 PM UTC+1, Dunaeth wrote: (elided)

On 3/18/14, 5:58 AM, Dunaeth wrote:

Actually, tester is a dedicated percolator index with 5 percolation
queries stored and no other data. Percolated documents are web logs and
the tester mapping is (elided)

Are you doing aggregations when you percolate documents? Are you
percolating existing documents or sending new ones every time?

;; Lee


There's no aggregation in percolation queries and we only percolate new
documents; the flow is as follows:

Step 1: data logging

  1. An event occurs.
  2. Possible ES search queries, with or without facets, on different indices.
  3. ES percolation query with the event data on the tester index (5 queries
     stored, no other data); a sketch of this call follows below.
  4. Event data is logged to a plain text file.

Step 2: deferred data indexing

  1. Logstash detects the new event and sends it to a centralized Redis
     queue, like the centralized example described in the Logstash
     documentation.
  2. Logstash parses the queue and indexes the event in ES.

Meanwhile, there can be aggregation queries on the stats part of the
application, but their rate is minimal compared to the insert rate.
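
The percolate call in step 3, per the single-document percolate API in ES 1.x, looks roughly like this (the field values here are made up for illustration):

curl -XGET 'localhost:9200/tester/test_hit/_percolate' -d '{
  "doc": {
    "date": "2014-03-19T10:15:30.000Z",
    "host": "www.example.com",
    "uri": "/index.html",
    "valid": true
  }
}'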

On Wednesday, March 19, 2014 at 11:01:16 AM UTC+1, Lee Hinman wrote: (elided)

On 3/19/14, 4:25 AM, Dunaeth wrote:

There's no aggregation in percolation queries and we only percolate new
documents (elided)

Okay, so we'll rule out percolation for now as a cause of the circuit
breaker estimations.

More information about your data would be helpful. You sent the mapping;
can you send an example document that is similar to most of the
documents you're indexing? Can you also provide some of the queries,
facets, and aggregations that you're performing? I can try to index some
test data and reproduce it with this information and see if that works.

Also, if this is not sensitive data, taking a snapshot of the index with
the snapshot/restore functionality and sending it to me would allow me
to reproduce the issue with your exact data. If that's an option it
would definitely be useful.
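
For reference, taking such a snapshot would look roughly like this with the snapshot/restore API (the repository name and filesystem location are placeholders; the location must be writable by the ES process):

curl -XPUT 'localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}'

curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'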

;; Lee


Hi,

Here's a small sample of data (at the moment, the March index is near 6M
docs for 1.37 GB in size).

On Thursday, March 20, 2014 at 3:41:27 AM UTC+1, Lee Hinman wrote: (elided)

On 3/24/14, 7:46 AM, Dunaeth wrote:

Here's a small sample of data (elided)

I haven't been able to reproduce the issue locally, but Simon noticed
that this may be caused by LUCENE-5553 ("IndexReader#ReaderClosedListener
is not always called on IndexReader#close()"), which is fixed in the
upcoming Lucene 4.7.1 release (which will be incorporated into ES shortly
after).

;; Lee


According to the graph, the fielddata_breaker_estimated_size increase is
linked to our data inserts.
Since you weren't able to reproduce the issue by inserting the given data,
either the data set is not varied enough or the issue is due to queries
done prior to the inserts.
For the first case, I could probably give you some snapshots of our data.
For the second, I could give you more details about our queries.

On Wednesday, March 26, 2014 at 10:39:35 PM UTC+1, Lee Hinman wrote: (elided)

Hi,

I stopped logstash to check whether the fielddata breaker issue was due to
data inserts or not. As the fielddata breaker estimated size continued to
grow the same way, I'd say the issue is caused by the queries done prior
to the inserts (i.e., possibly a facet or an aggregation query, followed
by a search query (or nothing at all), and finally a percolation query
(this one happens every time), per event created).
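
For the record, the figure we're graphing is the estimated size from the fielddata_breaker section of the node stats API, along these lines (assuming a node listening on localhost:9200):

curl -s 'localhost:9200/_nodes/stats?pretty' | grep -A 6 fielddata_breaker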

On Thursday, March 27, 2014 at 4:53:18 PM UTC+1, Dunaeth wrote: (elided)

On 4/4/14, 8:51 AM, Dunaeth wrote:

I stopped logstash to check whether the fielddata breaker issue was due
to data inserts or not (elided)

Hi Dunaeth,

I think your issue may also be caused by elasticsearch pull request #5588
by martijnvg ("Percolator doesn't reduce CircuitBreaker stats in every
case"), which was recently resolved.

;; Lee


Hi Lee,

This issue could exactly match what we're experiencing; we'll wait for the
next release then and see if it solves our problem. Thanks :)

On Saturday, April 5, 2014 at 12:54:10 AM UTC+2, Lee Hinman wrote: (elided)

Hi,

Some feedback on this subject: the latest ES patches made my day. Using ES
1.0.3 solved the issue. Thanks :)

On Monday, April 7, 2014 at 9:35:00 AM UTC+2, Dunaeth wrote: (elided)

On 4/17/14, 4:32 AM, Dunaeth wrote:

Some feedback on this subject: the latest ES patches made my day (elided)

Great! Glad to hear it worked for you!

;; Lee
