Need help architecting an index/schema (migrating data from MySQL)

Hi,

I'm just getting started with elastic search and need some advice. We
have 3 tables in MySQL that we'd like to combine into a simpler schema
with elasticsearch:

+---------------------+
| form +
+---------------------+
| id |
| name |
+---------------------+

+---------------------+
| entry +
+---------------------+
| id |
| form_id |
| timestamp |
+---------------------+

+---------------------+
| entry_data +
+---------------------+
| id |
| entry_id |
| key |
| value |
+---------------------+

Constraints as follows: A form can have many entries, an entry can
have many entry_data's. Each form has it's own schema. Our DB
currently has 250,000 forms and 47,000,000 entries.

A few ways that I think we can accomplish this within elasticsearch:

  1. Keep the same structure we have in MySQL: create an entry index,
    the "form_id" will be a part of the entry document.
  2. Have a entry index, each form becomes a type, entry is the
    document.
  3. Each form is an index.

My main questions:

Is one of the schemes above better than the other? I'm thinking that
#2 will be the simplest. Is there a performance cost to having that
many types (250k)?

Would you architect this differently?

Thanks,

Michael

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

IMHO, the first question should be always user centric and not technical centric.
I mean that you should ask yourself: "what my users want to find?"

I don't think that they want to search for entries or for data entries, do they?
Probably, you have behind this flexible design a real use case that you could share with us.

I would probably try to create real business documents instead of technical ones in my application and then send to Elasticsearch real documents.
That said, I don't have any idea of your use case and I might be wrong.

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Le 11 avr. 2013 à 04:17, mmattax mattax.m@gmail.com a écrit :

Hi,

I'm just getting started with Elasticsearch and need some advice. We
have 3 tables in MySQL that we'd like to combine into a simpler schema
with elasticsearch:

+---------------------+
| form +
+---------------------+
| id |
| name |
+---------------------+

+---------------------+
| entry +
+---------------------+
| id |
| form_id |
| timestamp |
+---------------------+

+---------------------+
| entry_data +
+---------------------+
| id |
| entry_id |
| key |
| value |
+---------------------+

Constraints as follows: A form can have many entries, an entry can
have many entry_data's. Each form has it's own schema. Our DB
currently has 250,000 forms and 47,000,000 entries.

A few ways that I think we can accomplish this within elasticsearch:

  1. Keep the same structure we have in MySQL: create an entry index,
    the "form_id" will be a part of the entry document.
  2. Have a entry index, each form becomes a type, entry is the
    document.
  3. Each form is an index.

My main questions:

Is one of the schemes above better than the other? I'm thinking that
#2 will be the simplest. Is there a performance cost to having that
many types (250k)?

Would you architect this differently?

Thanks,

Michael

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The entry and entry_data tables represent form responses that customers
have built (I work for Formstack http://formstack.com). The data can
represent anything from a survey, event registration forms, order forms,
etc. Ideally everything in entry_data becomes searchable and sortable:

entry_data might have a few rows in MySQL:

id entry_id key value
1 100 name first=Michael&last=Mattax
2 100 email michael@example.com
3 100 sex male
4 100 birthday 1985-08-15
.. there could be hundreds more ...

Using Elasticsearch we want to make all of this data searchable so the
document might look like

curl -XPUT 'http://localhost:9200/entries/<FORM_ID>/100' -d '{

name : { first : 'Michael', last : 'Mattax' },
email : 'michael@example.com',
sex : 'male',
'birthday' : '1985-08-15'

}'

That way we can use Elasticsearch power and not do any joins (which we do
now for complex searches). Note that each form has a different schema
(besides some common shared metadata).

Does that clarify things at all?

On Thursday, April 11, 2013 4:24:24 AM UTC-4, David Pilato wrote:

IMHO, the first question should be always user centric and not technical
centric.
I mean that you should ask yourself: "what my users want to find?"

I don't think that they want to search for entries or for data entries, do
they?
Probably, you have behind this flexible design a real use case that you
could share with us.

I would probably try to create real business documents instead of
technical ones in my application and then send to Elasticsearch real
documents.
That said, I don't have any idea of your use case and I might be wrong.

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr
| @scrutmydocs https://twitter.com/scrutmydocs

Le 11 avr. 2013 à 04:17, mmattax <matt...@gmail.com <javascript:>> a
écrit :

Hi,

I'm just getting started with Elasticsearch and need some advice. We
have 3 tables in MySQL that we'd like to combine into a simpler schema
with elasticsearch:

+---------------------+
| form +
+---------------------+
| id |
| name |
+---------------------+

+---------------------+
| entry +
+---------------------+
| id |
| form_id |
| timestamp |
+---------------------+

+---------------------+
| entry_data +
+---------------------+
| id |
| entry_id |
| key |
| value |
+---------------------+

Constraints as follows: A form can have many entries, an entry can
have many entry_data's. Each form has it's own schema. Our DB
currently has 250,000 forms and 47,000,000 entries.

A few ways that I think we can accomplish this within elasticsearch:

  1. Keep the same structure we have in MySQL: create an entry index,
    the "form_id" will be a part of the entry document.
  2. Have a entry index, each form becomes a type, entry is the
    document.
  3. Each form is an index.

My main questions:

Is one of the schemes above better than the other? I'm thinking that
#2 will be the simplest. Is there a performance cost to having that
many types (250k)?

Would you architect this differently?

Thanks,

Michael

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Definitely!

It's a cool idea.
So you agree that you will have to create a document on your side based on entry_data.

You should be aware that you will probably need to define some things using mapping. Otherwise, the default analysis process could give you undesired results.

For example, for email, you should define an email analyzer.
Be careful of field names and types. I would try to avoid to have a field name for example answer1 and answer1 in type1 is a Number and answer1 in type2 is a String.

Regarding to your first question: I also think that #2 is one of the best option here.
I also like #3 as I can imagine that after some months, you may want to remove old surveys? It's really more efficient to remove a full index than removing documents.
Also, you can think of mixing #2 and #3 and have a fine tuning about the number of shards needed for each index. Let's say that you have a very big Survey with 10 000 000 of answers. You probably want to index it in its own index. Let's say that you have 240 000 small surveys. You probably want to share the same index…

Make sense?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Le 11 avr. 2013 à 15:54, mmattax mattax.m@gmail.com a écrit :

The entry and entry_data tables represent form responses that customers have built (I work for Formstack). The data can represent anything from a survey, event registration forms, order forms, etc. Ideally everything in entry_data becomes searchable and sortable:

entry_data might have a few rows in MySQL:

id entry_id key value
1 100 name first=Michael&last=Mattax
2 100 email michael@example.com
3 100 sex male
4 100 birthday 1985-08-15
.. there could be hundreds more ...

Using Elasticsearch we want to make all of this data searchable so the document might look like

curl -XPUT 'http://localhost:9200/entries/<FORM_ID>/100' -d '{
name : { first : 'Michael', last : 'Mattax' },
email : 'michael@example.com',
sex : 'male',
'birthday' : '1985-08-15'
}'

That way we can use Elasticsearch power and not do any joins (which we do now for complex searches). Note that each form has a different schema (besides some common shared metadata).

Does that clarify things at all?

On Thursday, April 11, 2013 4:24:24 AM UTC-4, David Pilato wrote:
IMHO, the first question should be always user centric and not technical centric.
I mean that you should ask yourself: "what my users want to find?"

I don't think that they want to search for entries or for data entries, do they?
Probably, you have behind this flexible design a real use case that you could share with us.

I would probably try to create real business documents instead of technical ones in my application and then send to Elasticsearch real documents.
That said, I don't have any idea of your use case and I might be wrong.

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Le 11 avr. 2013 à 04:17, mmattax matt...@gmail.com a écrit :

Hi,

I'm just getting started with Elasticsearch and need some advice. We
have 3 tables in MySQL that we'd like to combine into a simpler schema
with elasticsearch:

+---------------------+
| form +
+---------------------+
| id |
| name |
+---------------------+

+---------------------+
| entry +
+---------------------+
| id |
| form_id |
| timestamp |
+---------------------+

+---------------------+
| entry_data +
+---------------------+
| id |
| entry_id |
| key |
| value |
+---------------------+

Constraints as follows: A form can have many entries, an entry can
have many entry_data's. Each form has it's own schema. Our DB
currently has 250,000 forms and 47,000,000 entries.

A few ways that I think we can accomplish this within elasticsearch:

  1. Keep the same structure we have in MySQL: create an entry index,
    the "form_id" will be a part of the entry document.
  2. Have a entry index, each form becomes a type, entry is the
    document.
  3. Each form is an index.

My main questions:

Is one of the schemes above better than the other? I'm thinking that
#2 will be the simplest. Is there a performance cost to having that
many types (250k)?

Would you architect this differently?

Thanks,

Michael

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

A few follow up questions, which might help me pick between #2 and #3.

  • Can we delete (in bulk) a type out of an index (each form being it's
    own type and it's own maping)?
  • Is there any pitfalls with having a large number of types (and
    mapping), for example 250k mapping/types for the index?
  • Is there a point where we're going to need to worry about open file
    descriptors (with 250k indexes)?

On Thursday, April 11, 2013 10:13:57 AM UTC-4, David Pilato wrote:

Definitely!

It's a cool idea.
So you agree that you will have to create a document on your side based on
entry_data.

You should be aware that you will probably need to define some things
using mapping. Otherwise, the default analysis process could give you
undesired results.

For example, for email, you should define an email analyzer.
Be careful of field names and types. I would try to avoid to have a field
name for example answer1 and answer1 in type1 is a Number and answer1 in
type2 is a String.

Regarding to your first question: I also think that #2 is one of the best
option here.
I also like #3 as I can imagine that after some months, you may want to
remove old surveys? It's really more efficient to remove a full index than
removing documents.
Also, you can think of mixing #2 and #3 and have a fine tuning about the
number of shards needed for each index. Let's say that you have a very big
Survey with 10 000 000 of answers. You probably want to index it in its own
index. Let's say that you have 240 000 small surveys. You probably want
to share the same index…

Make sense?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr
| @scrutmydocs https://twitter.com/scrutmydocs

Le 11 avr. 2013 à 15:54, mmattax <matt...@gmail.com <javascript:>> a
écrit :

The entry and entry_data tables represent form responses that customers
have built (I work for Formstack http://formstack.com/). The data can
represent anything from a survey, event registration forms, order forms,
etc. Ideally everything in entry_data becomes searchable and sortable:

entry_data might have a few rows in MySQL:

id entry_id key value
1 100 name first=Michael&last=Mattax
2 100 email mic...@example.com <javascript:>
3 100 sex male
4 100 birthday 1985-08-15
.. there could be hundreds more ...

Using Elasticsearch we want to make all of this data searchable so the
document might look like

curl -XPUT 'http://localhost:9200/entries/<FORM_ID>/100' -d '{

name : { first : 'Michael', last : 'Mattax' },
email : 'mic...@example.com <javascript:>',
sex : 'male',
'birthday' : '1985-08-15'

}'

That way we can use Elasticsearch power and not do any joins (which we do
now for complex searches). Note that each form has a different schema
(besides some common shared metadata).

Does that clarify things at all?

On Thursday, April 11, 2013 4:24:24 AM UTC-4, David Pilato wrote:

IMHO, the first question should be always user centric and not technical
centric.
I mean that you should ask yourself: "what my users want to find?"

I don't think that they want to search for entries or for data entries,
do they?
Probably, you have behind this flexible design a real use case that you
could share with us.

I would probably try to create real business documents instead of
technical ones in my application and then send to Elasticsearch real
documents.
That said, I don't have any idea of your use case and I might be wrong.

My 2 cents

--
David Pilato | Technical Advocate | *Elasticsearch.comhttp://elasticsearch.com/
*
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr
| @scrutmydocs https://twitter.com/scrutmydocs

Le 11 avr. 2013 à 04:17, mmattax matt...@gmail.com a écrit :

Hi,

I'm just getting started with Elasticsearch and need some advice. We
have 3 tables in MySQL that we'd like to combine into a simpler schema
with elasticsearch:

+---------------------+
| form +
+---------------------+
| id |
| name |
+---------------------+

+---------------------+
| entry +
+---------------------+
| id |
| form_id |
| timestamp |
+---------------------+

+---------------------+
| entry_data +
+---------------------+
| id |
| entry_id |
| key |
| value |
+---------------------+

Constraints as follows: A form can have many entries, an entry can
have many entry_data's. Each form has it's own schema. Our DB
currently has 250,000 forms and 47,000,000 entries.

A few ways that I think we can accomplish this within elasticsearch:

  1. Keep the same structure we have in MySQL: create an entry index,
    the "form_id" will be a part of the entry document.
  2. Have a entry index, each form becomes a type, entry is the
    document.
  3. Each form is an index.

My main questions:

Is one of the schemes above better than the other? I'm thinking that
#2 will be the simplest. Is there a performance cost to having that
many types (250k)?

Would you architect this differently?

Thanks,

Michael

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

My 2 cents inline...

A few follow up questions, which might help me pick between #2 and #3.
Can we delete (in bulk) a type out of an index (each form being it's own type and it's own maping)?
Not using bulk but using: Elasticsearch Platform — Find real-time answers at scale | Elastic
Is there any pitfalls with having a large number of types (and mapping), for example 250k mapping/types for the index?
I don't see any. That said, my concern is more about field names collision with different mapping definition.
Is there a point where we're going to need to worry about open file descriptors (with 250k indexes)?
Yes. If you need to update all indices at the same time, you will probably have issues if you run that on a small number of nodes.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I really appreciate the insight. Going to get started with Elasticsearch!

On Thursday, April 11, 2013 10:42:01 AM UTC-4, David Pilato wrote:

My 2 cents inline...

A few follow up questions, which might help me pick between #2 and #3.

  • Can we delete (in bulk) a type out of an index (each form being it's
    own type and it's own maping)?

Not using bulk but using:
Elasticsearch Platform — Find real-time answers at scale | Elastic

  • Is there any pitfalls with having a large number of types (and
    mapping), for example 250k mapping/types for the index?

I don't see any. That said, my concern is more about field names collision
with different mapping definition.

  • Is there a point where we're going to need to worry about open file
    descriptors (with 250k indexes)?

Yes. If you need to update all indices at the same time, you will probably
have issues if you run that on a small number of nodes.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.