Some queries on ElasticSearch

I'm looking at ES to give our application a robust and quick search
service. There are a few questions on my mind; I'm sure someone has
already thought about all of these and found answers too.

Our application has 'Resources' and also 'Events' that services within
our application use. Does it make sense to define them as separate
indexes? This application can easily grow to have a billion+ such
'Resources' and 'Events'.

We have defined our model in RDF. I know that it can easily be
represented as JSON documents to index in ES. How expensive is it
to embed an Atom/XML document within the JSON document? Will I be
able to search, sort and filter by attributes within this Atom/XML?

We also have a notion of tagging these 'Resources'; any user can tag
them. Should these tags be treated as separate indexes?

How should we deal with multiple threads indexing or searching at the
same time? What control does ES provide over this?

How do changes in document size and the number of documents affect
memory and CPU usage?

How quickly do the indices increase in size?

As our documents increase on a daily basis and grow beyond a billion
records, how many nodes would we need?

What is needed to support indexing a billion+ documents
(memory/CPU/disk)?

With best regards,
Afzal Jan

Hi,

unfortunately, your questions are too general to give you any useful
advice; however, see my comments inline.

On Mon, Jan 30, 2012 at 11:49 AM, ajan jan.afzal@gmail.com wrote:

I'm looking at ES to give our application a robust and quick search
service. There are a few questions on my mind; I'm sure someone has
already thought about all of these and found answers too.

Our application has 'Resources' and also 'Events' that services within
our application use. Does it make sense to define them as separate
indexes? This application can easily grow to have a billion+ such
'Resources' and 'Events'.

What is in an index has an impact on relevancy scoring, at least that
is how I understand it, so if you put Resources and Events into the
same index but need to search Resources only, I think your relevancy
might be affected. Moreover, if you expect fast growth then I think
you will end up using index aliasing anyway, which means you will have
more indices behind the same alias to search against.
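
For illustration, here is a minimal Java sketch of that aliasing idea
(the node URL, index names and alias name are assumptions, not from
this thread): it points one alias at two time-based indices through
the _aliases endpoint and then searches the alias over plain HTTP.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AliasSketch {

    // Send a JSON body to an Elasticsearch REST endpoint, print the status.
    static void send(String method, String url, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(json.getBytes("UTF-8"));
        out.close();
        System.out.println(method + " " + url + " -> " + conn.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        String es = "http://localhost:9200";  // assumed local node

        // Point one alias at several time-based indices so clients always
        // query "resources", however many physical indices pile up.
        send("POST", es + "/_aliases",
             "{\"actions\":["
           + "{\"add\":{\"index\":\"resources-2012-01\",\"alias\":\"resources\"}},"
           + "{\"add\":{\"index\":\"resources-2012-02\",\"alias\":\"resources\"}}]}");

        // A search against the alias fans out to every index behind it.
        send("POST", es + "/resources/_search",
             "{\"query\":{\"query_string\":{\"query\":\"camry\"}}}");
    }
}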

We have defined our model in RDF. I know that it can easily be
represented as JSON documents to index in ES. How expensive is it
to embed an Atom/XML document within the JSON document? Will I be
able to search, sort and filter by attributes within this Atom/XML?

You can use the attachment type for this (which uses Tika under the
hood). I think Tika can parse XML documents, but it extracts only a
basic set of attributes from the document, so I would recommend
converting to JSON prior to indexing.
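
For completeness, a rough sketch of what that attachment route could
look like (this assumes the old mapper-attachments plugin is installed;
the index and field names are made up): the XML goes in as one Base64
blob and Tika only extracts its plain text, which is exactly why
per-attribute search, sort and filter are lost.

import java.util.Base64;

public class AttachmentSketch {
    public static void main(String[] args) {
        // Mapping body for PUT /resources/resource/_mapping -- needs the
        // mapper-attachments plugin; "payload" is a made-up field name.
        String mapping =
            "{\"resource\":{\"properties\":{\"payload\":{\"type\":\"attachment\"}}}}";

        // The Atom/XML snippet is indexed as a single Base64 blob; Tika pulls
        // plain text out of it, so individual XML attributes never become fields.
        String xml = "<entry><title>Toyota Camry</title></entry>";
        String doc = "{\"payload\":\""
                   + Base64.getEncoder().encodeToString(xml.getBytes())
                   + "\"}";

        System.out.println(mapping);
        System.out.println(doc);
    }
}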

We also have a notion of tagging these 'Resources'; any user can tag
them. Should these tags be treated as separate indexes?

It depends on what you want to use these tags for. I would put them
into the same index (traditionally you tend to use very denormalized
documents in the Lucene world), but you can have a look at the nested
query/mapping and parent/child features if you have specific
requirements.
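
If the tags ever need to be matched as whole units (for example 'tag
X added by user Y'), a nested mapping is the usual tool. A minimal
sketch, with invented index and field names, just prints the request
bodies you would send:

public class NestedTagsSketch {
    public static void main(String[] args) {
        // Mapping body for PUT /resources: each tag keeps its user and value
        // together as one nested object instead of two flattened arrays.
        String mapping =
            "{\"mappings\":{\"resource\":{\"properties\":{"
          + "\"tags\":{\"type\":\"nested\",\"properties\":{"
          + "\"user\":{\"type\":\"string\",\"index\":\"not_analyzed\"},"
          + "\"value\":{\"type\":\"string\",\"index\":\"not_analyzed\"}}}}}}}";

        // Query body for POST /resources/_search: find resources where the
        // SAME tag object has user "alice" and value "flat-tire".
        String query =
            "{\"query\":{\"nested\":{\"path\":\"tags\",\"query\":{\"bool\":{\"must\":["
          + "{\"term\":{\"tags.user\":\"alice\"}},"
          + "{\"term\":{\"tags.value\":\"flat-tire\"}}]}}}}}";

        System.out.println(mapping);
        System.out.println(query);
    }
}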

How should we deal with multiple threads indexing or searching at the
same time? What control does ES provide over this?

You do not need to care about it (as long as your ES client handles
it correctly, which they normally do).

How do changes in document size and the number of documents affect
memory and CPU usage?

You have to try it. It depends on many factors (type of query, mapping...)

How quickly do the indices increase in size?

What is acceptable for you?
You need to try it with your data. Generally speaking, you can reduce
index size by disabling _source (do not do this unless you are sure
you need to) and by tuning the mapping.
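
As a concrete illustration of that tuning, _source can be switched
off per type in the mapping (worth it only if you will never need to
fetch or reindex the original JSON from ES); the names below are
invented:

public class SourceOffSketch {
    public static void main(String[] args) {
        // Mapping body for PUT /resources/resource/_mapping: disabling
        // _source shrinks the index, but ES can then no longer return or
        // reindex the original document for you.
        String mapping =
            "{\"resource\":{"
          + "\"_source\":{\"enabled\":false},"
          + "\"properties\":{\"model\":{\"type\":\"string\"}}}}";
        System.out.println(mapping);
    }
}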

As our documents increase on a daily basis and grow beyond a billion
records, how many nodes would we need?

Will you need to search across all data/indices or just the recent
ones? If all, then try indexing a portion of the data and do the math.

What is needed to support indexing a billion+ documents
(memory/CPU/disk)?

All you need is enough resources.

With best regards,
Afzal Jan

Regards,
Lukas

Some comments.

David

On 31 Jan 2012, at 06:53, Lukáš Vlček lukas.vlcek@gmail.com wrote:

We have defined our model in RDF. I know that it can easily be
represented as JSON documents to index in ES. How expensive is it
to embed an Atom/XML document within the JSON document? Will I be
able to search, sort and filter by attributes within this Atom/XML?

You can use the attachment type for this (which uses Tika under the hood). I think Tika can parse XML documents, but it extracts only a basic set of attributes from the document, so I would recommend converting to JSON prior to indexing.

I think Tika just extracts text from the XML. It does not produce a JSON document from the XML, so you won't be able to search on specific fields.
But you can use the Jackson library to convert your XML beans to JSON.
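
A minimal sketch of that Jackson route (using the jackson-dataformat-xml
module; the sample XML and field names are made up): read the XML into a
generic tree and write the same tree back out as JSON, which can then be
indexed as a normal ES document with individually searchable fields.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class XmlToJsonSketch {
    public static void main(String[] args) throws Exception {
        // A made-up event document as it might arrive from the monitoring feed.
        String xml = "<event><car>Toyota Camry</car><type>flat tire</type></event>";

        // Parse the XML into a generic tree...
        JsonNode tree = new XmlMapper().readTree(xml);

        // ...and serialize that tree as JSON. Each XML element becomes a real
        // JSON field, so ES can search, sort and filter on it individually.
        String json = new ObjectMapper().writeValueAsString(tree);
        System.out.println(json);   // {"car":"Toyota Camry","type":"flat tire"}
    }
}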

What is needed to support indexing a billion+ documents
(memory/CPU/disk)?

All you need is enough resources.

LOL !

Thank you for your response.

Let me try to explain our application with something similar from
another domain.

Let's say our application is about discovering and monitoring all cars
in Los Angeles. A car is related to another car by manufacturer,
model, seating capacity, engine type, year of manufacture and so on.

The discovery process happens every alternate day.

As an example, a Toyota Camry with 14" wheels is similar to some other
car that uses 14" wheels.

As the application is also monitoring these cars, it receives events
on these cars too, events such as 'low wheel pressure', 'flat tire',
'dead battery' and so on.

Let's say that there is another such deployment that discovers and
monitors cars in Boston.

On top of such a setup there is a SearchService that allows a user to
search across site "LA" and site "Boston" based on model, make, events
and so on.

With the amount of data coming into the system, which is in XML,
transforming it into JSON would be expensive and could slow down the
entire data flow.

How should we deal with multiple threads indexing or searching at the
same time? What control does ES provide over this?

You do not need to care about it (as long as your ES client handles
it correctly, which they normally do).

I need to prove this if I have to present my case for ES. Is there
some documentation on this?

What is needed to support indexing a billion+ documents
(memory/CPU/disk)?

All you need is enough resources.

Well, this is the main concern; I need to define what 'enough
resources' means.

With best regards,
Afzal Jan

On 1 Feb 2012, at 06:27, ajan jan.afzal@gmail.com wrote:

With the amount of data coming into the system, which is in XML,
transforming it into JSON would be expensive and could slow down the
entire data flow.

As ES "speaks" JSON, I don't think you have any other choice.
If you send the XML as-is, you won't be able to search on specific fields.

How should we deal with multiple threads indexing or searching at the
same time ? What control does ES provide over this ?

You do not care about it (as long your ES client handles that correctly,
normally they do).

I need to prove this if I have to present my case for ES. Is there
some documentation on this?

If you want to prove it, test it.
Alternatively, you can search on Google; there are some blogs comparing
Solr and ES. One of the main differences is that ES stays fast while
indexing.

What is needed to support indexing a billion+ documents
(memory/CPU/disk)?

All you need is enough resources.

Well, this is the main concern; I need to define what 'enough
resources' means.

There are no rules. It depends on the complexity of your documents.
If your documents have only one field, you can use a laptop with 2 GB
of RAM for many millions of docs. But if you have a hundred fields it
will not be enough.

So, you have to test with your own docs.
Sharding is also an important point.
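
For instance, the shard count is fixed at index-creation time and can
only be changed by reindexing, so it has to be sized up front from the
kind of indexing test described above (the index name and the numbers
below are placeholders, not recommendations); replicas can be changed
later:

public class ShardSettingsSketch {
    public static void main(String[] args) {
        // Settings body for PUT /cars: pick number_of_shards from your own
        // capacity tests; number_of_replicas can be updated at runtime.
        String settings =
            "{\"settings\":{"
          + "\"number_of_shards\":10,"
          + "\"number_of_replicas\":1}}";
        System.out.println(settings);
    }
}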

A nice idea could be to find the perfect platform on Amazon...

Some advice: use SSD drives and as much memory as you can.

HTH
David