Document Processing

Hi,

After several years of using ES to speed up searches on relational
database objects, I've been asked to quote on a project to index a 10TB
store of documents, mostly held in Documentum. By my rough calculations,
if each document is 1 to 2 MB on average, then we have about 5 to 10
million docs.

I'm looking to get some insight into the decisions people have made:

  1. ES Architecture - number of nodes, size of each node in terms of disk
    space, CPUs and RAM.
  2. Indexing
    i) Initial indexing of all docs - if I were doing this my usual way, I
    would probably have a process on each node, pulling a defined set of
    documents from Documentum, transforming them into JSON, and then bulk
    indexing them into ES (a rough sketch of that loop follows this list).
    Does anyone know of a river for this? To be honest, I haven't used
    rivers before; there weren't many when I first used ES, so I wrote my
    own Java processes that read data, convert it to JSON, and then bulk
    index it into ES.
    ii) Do you convert the entire document into JSON and push it to ES, or
    just take the first 10,000 words and assume anything else will be a repeat?
    iii) Indexing new docs as they are added to the store - I guess this is
    where rivers help, but I could always code something up by hand, if
    necessary, to monitor Documentum and file systems for changes.
  3. Partitioning/Routing - I don't think there is much I can do here,
    because the data is not held per user or easily routed by dates, but
    others' experience would be welcome.
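
For reference, here's roughly what my usual Java loader loop looks like,
against the 0.90.x TransportClient API. The Documentum-side helpers
(fetchBatch, toJson), the host name, and the index/type names are just
placeholders for illustration:

import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class DocumentumBulkLoader {

    public static void main(String[] args) {
        // Join the cluster through one known node; the client finds the rest.
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));

        final int batchSize = 500; // tune by measuring throughput
        int offset = 0;
        List<Doc> batch;
        while (!(batch = fetchBatch(offset, batchSize)).isEmpty()) {
            BulkRequestBuilder bulk = client.prepareBulk();
            for (Doc doc : batch) {
                bulk.add(client.prepareIndex("documents", "doc", doc.id)
                        .setSource(toJson(doc)));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
            offset += batch.size();
        }
        client.close();
    }

    // Placeholders for the Documentum (DFC) side.
    static class Doc { String id; }
    static List<Doc> fetchBatch(int offset, int size) { /* DQL query here */ return java.util.Collections.emptyList(); }
    static String toJson(Doc doc) { /* map attributes + content to JSON */ return "{}"; }
}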

thanks,

David.

Bump - just in case someone has some experience they want to share.

David,

Well, I don't have a lot to add, but here are a few things I can think of:

  1. You are writing your own bulk loader in Java. That's great! I've done
    the same thing, and it works very well. I can even add my own statistics
    and show a bounded set of errors (for instance, if a thousand errors
    occur, then showing just the first 100 or so is more than enough noise
    to get my attention!). These are things that tend not to be in the
    rivers. (A sketch of that bounded error reporting follows this list.)

  2. ES has enough back-end threading to keep you from needing to do much
    threading on the bulk-load side. For example, I bulk load two sets of
    data: one with 97 million small documents, and another with 78 million
    medium-sized documents (nothing as large as yours). I load them (and
    subsequent update sets) serially in one thread. ES keeps itself, and the
    CPU and disk, plenty busy (even on a laptop with one relatively slow
    disk doing all the input reading and index writing). Starting with ES
    0.90.0, I got my initial load times down to less than 3 hours for each
    of these sets... on that laptop!

  3. ES version 0.90.3 is the latest stable release, and you should move to
    it. Especially since you use Java: some methods were deprecated even
    from 0.90.0 to 0.90.3, so it's good to keep moving forward to minimize
    breaking changes. Of course, this builds on #1 above: writing my own
    loaders and query tools in Java means that I can rebuild the client side
    to match the ES version. With 3rd-party rivers, you are sometimes locked
    into older versions of ES. Bah humbug to that!
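
To illustrate the bounded error reporting from #1, here is a rough sketch
against the 0.90.x Java API; it would sit in the loader right after each
bulk call, and the cap of 100 is just the example number from above:

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkResponse;

// Print at most the first maxErrors failures, plus a total count, so a
// storm of failures stays readable instead of flooding the log.
static void reportFailures(BulkResponse response, int maxErrors) {
    if (!response.hasFailures()) {
        return;
    }
    int shown = 0;
    int total = 0;
    for (BulkItemResponse item : response) { // BulkResponse is iterable
        if (item.isFailed()) {
            total++;
            if (shown < maxErrors) {
                System.err.println(item.getId() + ": " + item.getFailureMessage());
                shown++;
            }
        }
    }
    System.err.println(total + " bulk item(s) failed; showed the first " + shown + ".");
}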

ES is fast, small, and flexible enough to experiment with. Try loading
increasingly large subsets of your Documentum store and plot the growth in
time and disk space. Tweak and tune, and ask specific questions as you go
along. Then you'll know for sure when you've hit the sweet spot for your
full load and discovered the best strategy. For instance, your thought of
creating the JSON from just the first N (10,000?) words plus a link back to
the full document (implied?) might be best; but if the update rate is
relatively small, then perhaps storing the entire thing won't be too bad.
Just a guess, though. A rough sketch of that truncation follows.
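
Here is that sketch, using XContentBuilder; the field names (body,
fullDocUrl) and the word-based cutoff are made up purely for illustration:

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentBuilder;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

// Index only the first maxWords words; anything past the cutoff stays
// reachable through a link back to the full document in Documentum.
static XContentBuilder truncatedSource(String title, String text,
        String fullDocUrl, int maxWords) throws IOException {
    String[] words = text.split("\\s+");
    int limit = Math.min(words.length, maxWords);
    StringBuilder body = new StringBuilder();
    for (int i = 0; i < limit; i++) {
        if (i > 0) {
            body.append(' ');
        }
        body.append(words[i]);
    }
    return jsonBuilder()
            .startObject()
            .field("title", title)
            .field("body", body.toString())        // searchable, truncated text
            .field("truncated", words.length > maxWords)
            .field("fullDocUrl", fullDocUrl)       // link back to the original
            .endObject();
}

The builder then plugs straight into prepareIndex(...).setSource(builder).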

Brian
