Index update strategy

I'm looking for advice on an index update strategy.

I'd like to import data from SQL Server to Elasticsearch in batches.

Let's assume we have data in this format in SQL:

ProjectID	ProductID
---------------------------------
1			1
1			2
1			3
1			4
1			5
2			6
2			7
2			8
2			9
2			10

What happens in SQL is that we wipe data from this table and reinsert data from staging tables into it every weekend, on a per-ProjectID basis. I'd like to keep this table in sync with the Elasticsearch indices, but can't think of a strategy.

I was thinking about having index templates and storing each project's data in a separate index, pretty much like this:

  • data_project_1
  • data_project_2
  • etc.
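For reference, something along these lines is what I have in mind — a sketch using the index templates API, where the template name, shard count, and the `product` mapping type are just placeholders I made up around the two-column example above:

```shell
# Hypothetical template; the "template" pattern makes it apply to any
# index created with a name matching data_project_<n>.
curl -XPUT 'http://localhost:9200/_template/data_project' -d '
{
  "template": "data_project_*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "product": {
      "properties": {
        "ProjectID": { "type": "integer" },
        "ProductID": { "type": "integer" }
      }
    }
  }
}'
```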

I could delete the index for a specific project and then just re-import that project's data from SQL Server using Logstash, but then I'd have some data discrepancies, and users wouldn't be able to search that project's data for some time (which, in the end, might be acceptable anyway). But how would I delete an index using the tools that Elastic provides? The index needs to be dropped and recreated on demand (most likely triggered by some flag in SQL Server), so TTL on the index is of no use.
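The drop-and-reload sequence I'm picturing looks roughly like this — a sketch using the plain REST API, assuming a node on the default localhost:9200:

```shell
# Drop the per-project index (the Delete Index API).
curl -XDELETE 'http://localhost:9200/data_project_1'

# Recreate it empty; with a matching index template in place,
# the mapping is applied automatically on creation.
curl -XPUT 'http://localhost:9200/data_project_1'

# ...then re-run the Logstash JDBC import for that project.
```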

It seems that Curator is suited for tasks like this, but it doesn't look like it can read from SQL Server.

Any advice?

Can you say something about the numbers here?

  • Number of projects?
  • Records per project? (median, max)
  • Do all projects change every weekend?
  • What percentage of a project's records changes when updates are applied?

Dropping whole indices is the most efficient form of mass delete, but equally you shouldn't have too many indices, as that is inefficient. The right strategy (essentially patch or rebuild) depends on the answers to some of these questions.

Cheers
Mark

Hi Mark,

Sure, please see these figures:

ProjectID RecordCount MaxCount MinCount AvgCount
--------- ----------- -------- -------- --------
1         4831835     6492750  7753     643131
2         81007       6492750  7753     643131
3         26941       6492750  7753     643131
4         371923      6492750  7753     643131
5         17905       6492750  7753     643131
6         7753        6492750  7753     643131
7         526569      6492750  7753     643131
8         2172419     6492750  7753     643131
9         37008       6492750  7753     643131
10        110930      6492750  7753     643131
11        606660      6492750  7753     643131
12        37978       6492750  7753     643131
13        1701177     6492750  7753     643131
14        223150      6492750  7753     643131
15        292052      6492750  7753     643131
16        350040      6492750  7753     643131
17        619659      6492750  7753     643131
18        297291      6492750  7753     643131
19        156832      6492750  7753     643131
20        358034      6492750  7753     643131
21        212467      6492750  7753     643131
22        86441       6492750  7753     643131
23        6492750     6492750  7753     643131
24        96780       6492750  7753     643131
25        180985      6492750  7753     643131
26        950142      6492750  7753     643131
27        15126       6492750  7753     643131
28        2103449     6492750  7753     643131
29        210237      6492750  7753     643131
30        492430      6492750  7753     643131
31        168670      6492750  7753     643131
32        570010      6492750  7753     643131
33        19285       6492750  7753     643131
34        132484      6492750  7753     643131
35        89195       6492750  7753     643131
36        1777734     6492750  7753     643131
37        96898       6492750  7753     643131
38        11010       6492750  7753     643131
39        1724006     6492750  7753     643131
40        28381       6492750  7753     643131
41        99449       6492750  7753     643131
42        100781      6492750  7753     643131
43        168784      6492750  7753     643131
44        167828      6492750  7753     643131
45        645294      6492750  7753     643131
46        116276      6492750  7753     643131
  • So we have 46 projects in total, with the record counts per project shown above.
  • No, not all projects change every weekend. This number varies from 3 to about 8. Usually it's around 5.
  • There's actually not much that changes in those tables — usually less than 10% of the data, perhaps up to 15%.
  • We run these updates in SQL Server only on weekends. There are scenarios where we do it during the working week, but those are very rare.

Of course, some new publications might happen, but that's pretty much the same as an index rebuild, except that we don't have to delete anything.

Let me know if you need extra details.

Thanks,
Evaldas

OK, so there's a small number of projects, but they vary a lot in size.

Overall, though, there's not a huge amount of data, so you should easily fit it all in one index. The question then is whether to patch or to rebuild the index from the SQL database each weekend. A full rebuild is the easiest to configure but may take some time to pull from the db; you can use an Elasticsearch alias to manage the flip-over between the old and the new index. A patch, on the other hand, is typically achieved by pulling across only the records updated or inserted since the last run, using timestamps. Deletes may require a special table populated by a trigger every time a record is deleted. Check out this project to help manage the syncing process: https://github.com/jprante/elasticsearch-jdbc
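The alias flip-over can be done atomically with the `_aliases` API. A sketch, where the alias name `data` and the versioned index names are hypothetical — the application always searches the alias, never the concrete index:

```shell
# Build the replacement index under a new versioned name.
curl -XPUT 'http://localhost:9200/data_v2'

# ...bulk-load it from SQL Server, then atomically repoint the
# search alias from the old index to the new one. Both actions
# happen in a single request, so searchers never see an empty index.
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "remove": { "index": "data_v1", "alias": "data" } },
    { "add":    { "index": "data_v2", "alias": "data" } }
  ]
}'

# Once you've confirmed searches work against the new index,
# drop the old one.
curl -XDELETE 'http://localhost:9200/data_v1'
```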

Cheers
Mark

Thanks Mark for your suggestions.

I'll check everything out.

Thanks,
Evaldas