Design advice for market data


(Bobby Richards) #1

I am looking for some advice on how to approach a design. I have currency
market data, roughly 10 million events a week, which I currently store in
Postgres. It ends up being about 10 GB, which I would obviously like to
bring down. The data is seldom queried, but all of my other data is in
Elasticsearch, which I love, so I am trying to determine the best way to
store this.

I would like to query by symbol and time, and to index by month so I can
drop whole months whenever I want. I guess that would mean something like
'month/symbol/(unixtime for minute)'.
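
As a rough illustration of the per-month idea, here is a minimal Python sketch. The index naming pattern (`YYYY_MM`) and both helper names are assumptions made up for this example; nothing in the thread fixes them. Zero-padding the month keeps the names in chronological order when sorted as strings, which makes it easy to pick the months to drop.

```python
from datetime import datetime, timezone

def monthly_index(epoch_ms):
    """Return a hypothetical per-month index name such as '2014_02'."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{ts.year}_{ts.month:02d}"

def expired_indices(index_names, keep_from):
    """Return the index names that sort before the first month to keep."""
    return sorted(name for name in index_names if name < keep_from)

# Sample timestamp in epoch milliseconds (6 February 2014, UTC).
print(monthly_index(1391653001000))
print(expired_indices(["2013_12", "2014_01", "2014_02"], "2014_01"))
```

Dropping a month then amounts to deleting each index returned by `expired_indices`, rather than deleting individual documents.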

I am far from a data guy, so I am looking for direction, thoughts, etc. Is
this even a good use case for Elasticsearch?

Thanks,
Bobby

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/54f02434-37b8-4435-a846-8d20f7e9d723%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

so index: month, type: symbol? That might make sense. You could also use
routing with the symbol as the routing value, which ensures that a query
for one symbol only hits one shard and is therefore faster. This
presentation on data flows might be interesting for you:

http://www.elasticsearch.org/videos/big-data-search-and-analytics/
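
To make the routing idea concrete, here is a small in-memory Python sketch. It is only a model: `zlib.crc32` stands in for whatever hash Elasticsearch really uses, the shard count of 5 is an assumption, and `shard_for`, `index_doc` and `search` are hypothetical helpers, not client API calls.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count for the illustration

def shard_for(routing, num_shards=NUM_SHARDS):
    """Stand-in for the routing step: hash(routing) % num_shards."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards

def index_doc(shards, doc):
    """Store the document on the shard chosen by its symbol."""
    shards[shard_for(doc["symbol"])].append(doc)

def search(shards, routing, symbol=None):
    """Look only at the routed shard; optionally filter by symbol."""
    hits = shards[shard_for(routing)]
    return [d for d in hits if symbol is None or d["symbol"] == symbol]

shards = [[] for _ in range(NUM_SHARDS)]
for sym in ("eurusd", "usdjpy", "gbpusd", "audusd", "usdchf", "usdcad"):
    index_doc(shards, {"symbol": sym, "price": 1.0})

print(search(shards, "eurusd", symbol="eurusd"))
```

Routing only selects the shard to look at; it does not filter by symbol. With six symbols on five shards, at least two symbols must share a shard, so a routed search without a symbol filter can still return documents for other symbols.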

--Alex

In reply to Bobby Richards (Sat, Feb 1, 2014 at 9:27 PM).


(Bobby Richards) #3

Awesome, that is a great resource. Thanks!

In reply to Alexander Reelsen (Monday, February 3, 2014, 3:20:53 AM UTC-6).


(Bobby Richards) #4

So I have decided on using the week of the year as the index and quotes as
my type. I want to clarify a couple of things that I am seeing.

First I create my index:

curl 'http://localhost:9200/2014_6/quotes'

Then I set my mapping:

curl -XPUT 'http://localhost:9200/2014_6/quotes/_mapping' -d '
{
  "quotes" : {
    "properties" : {
      "time_stamp": {"type":"date"},
      "symbol": {"type":"string"},
      "side" : {"type":"string"},
      "price" : {"type":"double"}
    },
    "_routing" : {
      "required": true,
      "path":"symbol"
    },
    "_timestamp" : {
      "enabled" : true,
      "path":  "time_stamp",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'

Because of this mapping, I understand that when I post a new event to be
indexed I do not need to specify quote?routing=. My first question, however,
is that I now have to include symbol in the JSON object I am posting. Is
this costing me more in storage? If I leave routing out of the mapping, I
have no problem adding it to the URI instead, especially if that saves me
space.

Second, I am seeing a couple of weird things. Running this:

curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd'

I get the following, which is good and what I expect:

{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},
 "hits":{"total":3,"max_score":1.0,"hits":[
  {"_index":"2014_5","_type":"quotes","_id":"ZW5u1nCHTGW-xToRy8Yy5g","_score":1.0,
   "_source":{"time_stamp":1391653001000,"symbol":"eurusd","side":"a","price":1.3456}},
  {"_index":"2014_5","_type":"quotes","_id":"ok4FLnrfR4u2CnJ3lVNKkg","_score":1.0,
   "_source":{"time_stamp":1391653001000,"symbol":"eurusd","side":"b","price":1.3457}},
  {"_index":"2014_5","_type":"quotes","_id":"1eG5m0riSoiDEquQ3I-QSA","_score":1.0,
   "_source":{"time_stamp":1391653001100,"symbol":"eurusd","side":"b","price":1.3458}}
 ]}}

However, notice that the first entry has side "a". Running the following, I
get nothing:

curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd' -d '
{"query":{"filtered":{"query":{"match_all":{}},"filter":{"term":{"side":"a"}}}}}'

If I change side to "b", I get 2 hits, as I would expect. Is there some
reserved feature that prevents me from searching for "a", or is there some
text search thing I am not thinking about?

Finally, I have added a few usdjpy quotes, which are routed to a separate
shard. In my query I accidentally typed usejpy and got the two eurusd
events back, although the side filter was still honored. After correcting
the symbol I get what I would expect. Is this another text search 'thing'?
All I can think of is that, with the typo, the "e" matches the "eur" in the
other indexed items.

I just want to fully understand what I have going on here. Thanks.

In reply to Bobby Richards (Saturday, February 1, 2014, 2:27:55 PM UTC-6).


(Alexander Reelsen) #5

Hey,

the side field as defined in your mapping (I assume you use Elasticsearch
0.90.X) uses the standard analyzer, which removes stopwords by default. As
"a" is a stopword, it gets removed during indexing, and that makes it
impossible to search for. A good way to find out more about this is to play
around with the analyze API. If you would like a nice UI on top of that, go
with the inquisitor plugin.

The analyze API basically tells you how a string is tokenized and stored in
the index, and which parts are removed or altered (due to stemming, for
example).
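
The effect can be reproduced outside Elasticsearch with a toy analyzer. The Python sketch below is only an approximation under stated assumptions: the stopword set is a small illustrative subset, the tokenizer is a plain regular expression, and `analyze` and `term_matches` are hypothetical helpers rather than anything Elasticsearch exposes.

```python
import re

# Illustrative subset of English stopwords (an assumption, not the full list
# any particular Elasticsearch version ships with).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is"}

def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

def term_matches(indexed_value, term):
    """A term filter looks for the exact term among the indexed tokens."""
    return term in analyze(indexed_value)

for side in ("a", "b"):
    print(side, analyze(side), term_matches(side, side))
```

In this model the value "a" produces no tokens at all, so a term filter on "a" has nothing to match, while "b" survives analysis and matches. For fields like side that hold exact codes rather than prose, string fields in Elasticsearch versions of that era could be mapped with "index": "not_analyzed" so that no analysis is applied.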

See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

--Alex

In reply to Bobby Richards (Thu, Feb 6, 2014 at 3:38 AM).


(Bobby Richards) #6

Great, thanks. I am not sure I would have found this on my own anytime
soon. I'll look into it.

Bobby

In reply to Alexander Reelsen (Thu, Feb 6, 2014 at 4:33 AM).


(Bobby Richards) #7

I wanted to hit the list for some more advice to finalize my market data
design.

Currently I have about 10 million events per week. I would like to keep
weekly indexes because they provide a nice logical separation of the data
(i.e. markets are closed on the weekend). As of now I am using the default
of 5 shards, which I was thinking of bumping to 10. Right now I am routing
based on symbol, of which there are about 20, and I am wondering if I
should just set the number of shards equal to the number of symbols?

Data is about 1.5 GB per week, so with 10 shards that is 150 MB each, but I
see that GitHub has 120 GB per shard (albeit with much beefier machines).
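
The sizing arithmetic can be spelled out in a few lines of Python. The two constants are the rough figures from this post; the even spread across shards is an assumption that routing by symbol will not generally satisfy, and `mb_per_shard` is just a helper for this example.

```python
WEEKLY_BYTES = 1_500_000_000   # about 1.5 GB per week, from the post
WEEKLY_EVENTS = 10_000_000     # about 10 million events per week

def mb_per_shard(num_shards, weekly_bytes=WEEKLY_BYTES):
    """Average shard size in MB (1 MB = 10**6 bytes), assuming an even spread."""
    return weekly_bytes / num_shards / 1_000_000

for n in (5, 10, 20):
    print(n, "shards ->", mb_per_shard(n), "MB per shard")

print("bytes per event:", WEEKLY_BYTES / WEEKLY_EVENTS)
```

These are averages only. With symbol routing, a busy symbol makes its shard larger than the figure above and a quiet one makes its shard smaller.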

I had also thought about daily indexes, which are appealing because many
queries will typically not span more than a day, and I assume it is best to
design indexes around the most frequent queries? Would I be able to combine
the daily indexes into a weekly one and optimize it over the weekend? Is
this possible?

Also, I am trying to build candle data, which is represented by the open
(first), high, low, and close (last) values of a time period, for which
date histogram aggs are ideal. High and low are easy, but as of now it is a
two-step query. Are there any clever ways to get the first and last element
of a bucket with aggs?
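
For reference, this is the result being asked for, computed client-side in Python rather than with aggregations. It is a sketch of the candle definition only, not an answer to the aggs question; the tick values are made up and `candles` is a hypothetical helper.

```python
def candles(ticks, interval_ms):
    """Group (time_stamp_ms, price) ticks into OHLC candles per time bucket."""
    buckets = {}
    for ts, price in sorted(ticks):
        start = ts - ts % interval_ms  # bucket start, like a date histogram key
        c = buckets.get(start)
        if c is None:
            buckets[start] = {"open": price, "high": price,
                              "low": price, "close": price}
        else:
            c["high"] = max(c["high"], price)
            c["low"] = min(c["low"], price)
            c["close"] = price  # ticks are processed in time order, so last wins
    return buckets

# Made-up ticks: (epoch milliseconds, price)
ticks = [
    (1391653001000, 1.3456),
    (1391653001100, 1.3458),
    (1391653030000, 1.3450),
    (1391653065000, 1.3461),
]
for start, candle in candles(ticks, 60_000).items():
    print(start, candle)
```

Open and close depend on the order of ticks inside a bucket, which is why they are harder to express than high and low. In this sketch, ticks that share a timestamp are ordered by price, which is an arbitrary tie-break.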

Just trying to nail this down and I appreciate any and all advice and
feedback.

In reply to Bobby Richards (Thursday, February 6, 2014, 4:18:54 PM UTC-6).


(David Pilato) #8

> Currently I have about 10 million events per week. I would like to keep weekly indexes because they provide a nice logical separation of data (i.e. markets closed on the weekend). As of now I am using the default number of 5 shards, which I was thinking of bumping to 10. Right now I am routing based on symbol, of which there are about 20, and I am wondering if I should just set the number of shards equal to the number of symbols?

In that case, it could happen that one or more shards receive no documents at all.
So you need to test, for each routing key, which shard its documents land on.
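
A quick way to see why matching the shard count to the symbol count does not give one symbol per shard is to count where keys land under a hash-modulo scheme. The Python sketch below uses `zlib.crc32` purely as a stand-in hash and a hypothetical symbol list, so its numbers say nothing about a real cluster; the real check is to index one document per routing key and see which shard it ends up on.

```python
import zlib
from collections import Counter

def shard_for(routing, num_shards):
    """Stand-in hash; Elasticsearch's real routing hash is different."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards

def shard_usage(routing_keys, num_shards):
    """Count how many routing keys land on each shard."""
    counts = Counter(shard_for(k, num_shards) for k in routing_keys)
    return [counts.get(s, 0) for s in range(num_shards)]

# Hypothetical symbol list; the thread only says there are about 20.
symbols = ["eurusd", "usdjpy", "gbpusd", "audusd", "usdchf", "usdcad",
           "nzdusd", "eurjpy", "eurgbp", "eurchf"]

usage = shard_usage(symbols, num_shards=len(symbols))
print("keys per shard:", usage)
print("empty shards:", usage.count(0))
```

Every shard that receives two or more keys leaves another shard empty, so unless the hash happens to spread the keys perfectly, some shards hold several symbols and others hold none.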

> I had thought about daily indexes, which are appealing because many queries will typically not span more than a day, and I would assume it is best to design indexes around the most frequent queries? Would I be able to combine the daily indexes into a weekly one and optimize over the weekend? Is this possible?

That would basically mean reindexing your data. I am not sure it is worth it. I would probably simply run the optimize API on each daily index once the day is over and the index has gone cold.

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Le 9 mars 2014 à 05:37:58, Bobby Richards (bobby.richards@gmail.com) a écrit:

Wanted to hit the list to get some more advice to finalize my market data design.

Currently I have about 10 million events per week. I would like to keep weekly indexes because they provide a nice logical separation of data (ie markets closed on weekend)
as of now I am using the default number of 5 shards which I was thinking of bumping to 10, right now I am routing based on symbol which there are about 20, and I am wandering if I should just set number of shards = to number of symbols?

Data is about 1.5 gig per week so with 10 shards that 150 m each but I see that github has 120 gigs per shard (all be it with much beefier machines)

I had thought about daily indexes which is appealing because the potential is that many queries will not typically span more than a day and I would assume it is best to design indexes around the most frequent queries? Would I be able to combine the daily indexes into a weekly and optimize over the weekend, is this possible?

Also, I am trying to build candle data which is represented by the open (head) high, low, and close (last) values of the time period for which date histogram aggs are ideal. High and low are easy but as of now its a two step query. Any clever ways to get the first and last element of the bucket with aggs?

Just trying to nail this down and I appreciate any and all advice and feedback.

On Thursday, February 6, 2014 4:18:54 PM UTC-6, Bobby Richards wrote:
great thanks. I am not sure I would have found this on my own anytime soon. Ill look into it.

Bobby

On Thu, Feb 6, 2014 at 4:33 AM, Alexander Reelsen alr@spinscale.de wrote:
Hey,

The side field as defined in your mapping (I assume you use Elasticsearch 0.90.X) uses the standard analyzer, which by default removes stopwords. As "a" is a stopword, it gets removed as part of the indexing process, and that makes it impossible to search for. A good way to find out more about this is to play around with the analyze API. If you like a nice UI on top of that, go with the inquisitor plugin.

The analyze API basically tells you how a string is tokenized and stored in the index, and which parts are removed or altered (due to stemming, for example).

See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
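
The effect described above can be imitated with a toy analyzer. This is only a simulation for illustration, not Elasticsearch's real standard analyzer: the stopword set is a partial, hand-picked list and the tokenizer is a simple regular expression. The `side_mapping` dict shows one common way to avoid the problem on the string fields of that era, which is to leave the field unanalyzed.

```python
import re

# Partial stopword list, for illustration only; the real default list is longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def toy_analyze(text, stopwords=STOPWORDS):
    """Lowercase, split into word tokens, then drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


# Assumed fix for the "side" field: index the value as a single unanalyzed term.
side_mapping = {"type": "string", "index": "not_analyzed"}

if __name__ == "__main__":
    print(toy_analyze("a"))  # nothing is indexed, so a term filter on "a" cannot match
    print(toy_analyze("b"))  # "b" survives and can be found
```

With no token stored for "a", a term filter on side "a" has nothing to match, while "b" is indexed as is.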

--Alex

On Thu, Feb 6, 2014 at 3:38 AM, Bobby Richards bobby.richards@gmail.com wrote:
So I have decided on using the week of year as the index and quotes as my type. I want to clarify a couple of things that I am seeing.
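
A small sketch of how such an index name could be derived from an event's time_stamp. The week-numbering convention used in this thread is not stated, so ISO 8601 weeks in UTC are an assumption here, as is the function name.

```python
from datetime import datetime, timezone


def weekly_index_name(time_stamp_ms):
    """Return '<iso_year>_<iso_week>' for a millisecond epoch timestamp (UTC)."""
    dt = datetime.fromtimestamp(time_stamp_ms / 1000, tz=timezone.utc)
    iso_year, iso_week, _weekday = dt.isocalendar()
    return "{}_{}".format(iso_year, iso_week)


if __name__ == "__main__":
    # Timestamp taken from the sample quotes later in this post.
    print(weekly_index_name(1391653001000))
```

Note that ISO weeks can belong to a different year than the calendar date (for example, late December dates can fall in week 1 of the following year), which is why the ISO year is used in the name rather than the calendar year.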

First I create my index: curl 'http://localhost:9200/2014_6/quotes'

Then I set my mapping:
curl -XPUT 'http://localhost:9200/2014_6/quotes/_mapping' -d '
{
  "quotes" : {
    "properties" : {
      "time_stamp": {"type":"date"},
      "symbol": {"type":"string"},
      "side" : {"type":"string"},
      "price" : {"type":"double"}
    },
    "_routing" : {
      "required": true,
      "path": "symbol"
    },
    "_timestamp" : {
      "enabled" : true,
      "path": "time_stamp",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'

Now, because of this, I understand that when I am posting a new event to be indexed I do not need to specify quote?routing=. However, my first question is that I must now include symbol in the JSON object I am posting; is this costing me more as far as storage? If I do not do this via the mapping, I have no problem adding the routing to the URI, especially if it saves me space.

Second, I am seeing a couple of weird things.
Running this:
curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd'

I get the following, which is good and what I expect:
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":3,"max_score":1.0,"hits":[{"_index":"2014_5","_type":"quotes","_id":"ZW5u1nCHTGW-xToRy8Yy5g","_score":1.0, "_source" :
{ "time_stamp":1391653001000, "symbol":"eurusd", "side":"a", "price":1.3456}},{"_index":"2014_5","_type":"quotes","_id":"ok4FLnrfR4u2CnJ3lVNKkg","_score":1.0, "_source" :
{ "time_stamp":1391653001000, "symbol":"eurusd", "side":"b", "price":1.3457}},{"_index":"2014_5","_type":"quotes","_id":"1eG5m0riSoiDEquQ3I-QSA","_score":1.0, "_source" :
{ "time_stamp":1391653001100, "symbol":"eurusd", "side":"b", "price":1.3458}}]}}

However, if you will notice, the first entry is of side "a". Running the following, I get nothing:
curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd' -d '
{"query":{"filtered":{"query":{"match_all":{}},"filter":{"term":{"side":"a"}}}}}'

However, if I change side to "b" I get 2, as I would expect. Is there some reserved feature that would prevent me from searching for "a", or is there some text search thing I am not thinking about?

Finally, I have added a few usdjpy quotes, which are routed to a separate shard. In my query I accidentally typed usejpy and I got the two eurusd events, even though it honored the side filter.
Correcting the symbol, I get what I would expect. Is this another text search 'thing'? All I can think of is that, by mistyping, the e matches the eur in the other indexed items.

I just want to understand fully what I have going on there, thanks.
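
One plausible explanation for the usejpy result is that the routing parameter only selects which shard is searched; it does not filter by symbol. The routing value is hashed and taken modulo the number of primary shards, so a mistyped value can land on the same shard as eurusd and return its documents. The sketch below illustrates the idea with a stand-in hash (`zlib.crc32`); Elasticsearch uses its own hash function, so the actual shard numbers will differ, and both function names are hypothetical.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a shard as hash(routing) mod number_of_shards.

    Stand-in hash: Elasticsearch uses its own hash function, so real
    shard numbers will differ from the ones computed here.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


def search_with_routing(docs, routing, number_of_shards):
    """Return every doc stored on the shard the routing value points to.

    No filtering on symbol happens here; routing only narrows the search
    to one shard.
    """
    target = shard_for(routing, number_of_shards)
    return [d for d in docs if shard_for(d["symbol"], number_of_shards) == target]


if __name__ == "__main__":
    docs = [{"symbol": "eurusd", "side": "b"}, {"symbol": "usdjpy", "side": "b"}]
    for value in ("eurusd", "usdjpy", "usejpy"):
        print(value, "-> shard", shard_for(value, 5),
              search_with_routing(docs, value, 5))
```

Under this reading, adding an explicit term filter on symbol alongside the routing value would keep a colliding or mistyped routing value from returning other symbols. It also suggests that setting the shard count to the number of symbols would not give one symbol per shard, since several symbols can hash to the same shard.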



(system) #9