Advice


(boeledi) #1

Hi,

I am currently developing a web site in .NET + SQL Server and in order to
make search much faster, I would like to use Elastic Search.
I am totally new in ES and at the moment, I am only validating the concept.
This is why I would like your expert advise.

Let me first talk about my data structure (in SQL Server):

The database is of course relational and made up of 100+ tables.

Data is linked to 2 main topics: advertisers and products.
An advertiser might have 1 to many business fields and sell many products.
A product might be linked to dates of availability.
The website is multi-lingual (7 languages).

Simplified schema would be (for the purpose of the explanation, I
simplified it):

Table "tbl_advertisers" => records all information linked to an advertiser
for a specific business field
Table "tbl_products" => records all information linked to a product,
for a specific advertiser
Table "tbl_availability" => records list of dates between 2 limits and
for each day, a record mentions whether the product is available or not and
its daily price.
Table "tbl_texts" => records the texts linked to either an
advertiser or a product (1 record per language, advertiser business field
OR 1 record per language and product). Each record might be up to 6000
characters long. 1 record per language.
Table "tbl_addresses" => records the addresses for both advertisers and
products

Question #1 ( SQL Server + Elastic Search : synchronization)

Today, in order to make sure the database is always consistent, when I need
to apply any kind of modifications (insert, update, delete) to an
advertiser or a product, all changes are done within 1 single Stored
Procedure.
Thanks to the global transaction at the stored procedure level, I am always
sure that when a user runs a query against the web site, results which are
returned are always in-sync with the database.

Since I would need to apply the updates in Elastic Search as well, I intend
to complement the stored procedure with HTTP Post requests towards
ElasticSearch (as many as necessary, seeing the 8K limitation for each HTTP
Post).

Therefore I would end up with something like:

Begin Transaction Main
Begin Transaction SQLServer

Commit Transaction SQL Server

-- Do all the HTTP Post requests towards Elastic Search
Commit Transaction Main

This way if something goes wrong with ES, I simply rollback (however this
solution would not prevent from having some lack of synchronization (case
of network failure during one POST, or failure at ES level) except if there
exists any kind of transaction mechanism in ES as well --- but is there
any??).

I read about the River but this will lead to a delay between the SQL Server
database and Elastic Search, plus additional risk of de-synchronization if
something wrong occurs during the update of Elastic Search.

==> What would be your advice as regard all this?

Question #2 (how to structure the information in Elastic Search)

At first, I thought of considering one document to store all information
linked to advertisers, and one document for the products.
Both documents would also contain the information linked to addresses.

(a) But what about the texts? Would you create 1 field per language or
create another document to store the texts?
(b) What about availabilities? How would it be possible to link the
products with the availabilities?

For (a), I thought of sending 1 HTTP Post (see Question #1) to update (or
create) the advertiser or product, then 7 HTTP Posts to update and save the
texts (1 post per language in order not to go beyond the 8K limitation).

Additional information, when the user is querying the website, he might
look for advertisers only, products only but also everything (products and
advertisers) that match the search criteria.

==> What would be your advice as regard all this?

In advance, many thanks for your help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi,

On Mon, Sep 23, 2013 at 2:54 PM, boeledi didier.boelens@gmail.com wrote:

I am currently developing a web site in .NET + SQL Server and in order to
make search much faster, I would like to use Elastic Search.
I am totally new in ES and at the moment, I am only validating the
concept. This is why I would like your expert advise.

Let me first talk about my data structure (in SQL Server):

The database is of course relational and made up of 100+ tables.

Data is linked to 2 main topics: advertisers and products.
An advertiser might have 1 to many business fields and sell many products.
A product might be linked to dates of availability.
The website is multi-lingual (7 languages).

Simplified schema would be (for the purpose of the explanation, I
simplified it):

Table "tbl_advertisers" => records all information linked to an
advertiser for a specific business field
Table "tbl_products" => records all information linked to a product,
for a specific advertiser
Table "tbl_availability" => records list of dates between 2 limits and
for each day, a record mentions whether the product is available or not and
its daily price.
Table "tbl_texts" => records the texts linked to either an
advertiser or a product (1 record per language, advertiser business field
OR 1 record per language and product). Each record might be up to 6000
characters long. 1 record per language.
Table "tbl_addresses" => records the addresses for both advertisers and
products

Question #1 ( SQL Server + Elastic Search : synchronization)

Today, in order to make sure the database is always consistent, when I
need to apply any kind of modifications (insert, update, delete) to an
advertiser or a product, all changes are done within 1 single Stored
Procedure.
Thanks to the global transaction at the stored procedure level, I am
always sure that when a user runs a query against the web site, results
which are returned are always in-sync with the database.

Since I would need to apply the updates in Elastic Search as well, I
intend to complement the stored procedure with HTTP Post requests towards
ElasticSearch (as many as necessary, seeing the 8K limitation for each HTTP
Post).

Therefore I would end up with something like:

Begin Transaction Main
Begin Transaction SQLServer

Commit Transaction SQL Server

-- Do all the HTTP Post requests towards Elastic Search
Commit Transaction Main

This way if something goes wrong with ES, I simply rollback (however this
solution would not prevent from having some lack of synchronization (case
of network failure during one POST, or failure at ES level) except if there
exists any kind of transaction mechanism in ES as well --- but is there
any??).

Indeed, Elasticsearch doesn't have any transaction support.

I read about the River but this will lead to a delay between the SQL
Server database and Elastic Search, plus additional risk of
de-synchronization if something wrong occurs during the update of Elastic
Search.

Note that you will have a short delay anyway: In Elasticsearch, search is
near-realtime, meaning that there is a short delay to wait for (1s by
default) before data becomes searchable. You could force Elasticsearch to
refresh after every update of the index but this would be super slow.

Something else to be aware of is that indexing in Elasticsearch is usually
very fast: it is rather common to see indexing rates of several thousands
of documents per second on conventional hardware when documents are simple
meaning that even a collection containing 10M documents would only require
1 or 2 hours to be reindexed.

Question #2 (how to structure the information in Elastic Search)

At first, I thought of considering one document to store all information
linked to advertisers, and one document for the products.
Both documents would also contain the information linked to addresses.

(a) But what about the texts? Would you create 1 field per language or
create another document to store the texts?

It depends on whether all your documents have text in every language and
how you are going to search on your documents. If all your documents have
text in every language or if you are mostly going to search on all
documents no matter what language they are available in, then it probably
makes sense to have one field per language in your documents. Otherwise,
you could just have a different index per language. However, you should not
create other documents to store the texts: search engines best work with
flat content and the existing solutions for dealing with relational content
(parent/child and nested documents in the case of Elasticsearch) should
only be used when absolutely required.

(b) What about availabilities? How would it be possible to link the
products with the availabilities?

If you are only willing to store availabilities, then storing them as a
nested object in the JSON document will just work. However, if you are
going to search on the availabilities, you may need to explore
Elasticsearch support for parent/child relationships[1] and nested
documents[2].

For (a), I thought of sending 1 HTTP Post (see Question #1) to update (or

create) the advertiser or product, then 7 HTTP Posts to update and save the
texts (1 post per language in order not to go beyond the 8K limitation).

Where is this 8K limit enforced? Is it on client side? Elasticsearch itself
doesn't have such a low HTTP POST limit.

Additional information, when the user is querying the website, he might
look for advertisers only, products only but also everything (products and
advertisers) that match the search criteria.

This is usually what Elasticsearch types[3] are used for. By specifying the
type in the URI, you can restrict search to a specific type.

[1] http://www.elasticsearch.org/guide/reference/mapping/parent-field/
[2] http://www.elasticsearch.org/guide/reference/mapping/nested-type/
[3] http://www.elasticsearch.org/guide/reference/mapping/type-field/

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #3