After several years of using ES for speeding up searches on relational
database objects, I've been asked to quote on a project to index a 10TB
store of documents, mostly held in Documentum. By my rough calculations if
each document is, on average, 1 or 2 MB, then we have about 5 or 10 million
docs.
I'm looking to get an insight into the decisions people have made:
ES Architecture - number of nodes, size of each node in terms of disk
space, CPUs and RAM.
Indexing
i) Initial indexing of all docs - if I were doing this in my usual way, I
would probably have a process on each node, pulling a defined set of
documents from Documentum, transforming them into JSON, and then
bulk-indexing them into ES (sketched below, after these indexing
questions). Does anyone know of a river for this? To be honest, I haven't
used rivers before because there weren't many around when I first used ES,
so I created my own Java processes that read data, convert it to JSON, and
then bulk-index it into ES.
ii) Do you convert the entire document into JSON and push it to ES, or
just take the first 10,000 words and assume anything else will be a repeat?
iii) Indexing new docs as they are added to the store - I guess this is
where rivers help, but I could always code something up by hand, if
necessary, to monitor Documentum and file systems for changes (see the
watcher sketch after the next question).
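For what it's worth, below is the shape of loader I have in mind for i)
and ii) - just a minimal sketch, assuming the 0.90-era Java transport
client; the host, the index/type names, the fetchBatch() helper, and the
10,000-word cutoff are all placeholders:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class DocumentumLoader {

    public static void main(String[] args) throws Exception {
        // Placeholder host; 9300 is the default transport port.
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));

        BulkRequestBuilder bulk = client.prepareBulk();
        // fetchBatch() stands in for pulling this node's slice of
        // documents out of Documentum (hypothetical helper, not shown).
        for (Doc doc : fetchBatch()) {
            bulk.add(client.prepareIndex("docs", "document", doc.id)
                    .setSource(jsonBuilder()
                            .startObject()
                            .field("title", doc.title)
                            .field("body", firstWords(doc.body, 10000)) // the ii) idea
                            .endObject()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        client.close();
    }

    // Keep only the first maxWords whitespace-separated words.
    static String firstWords(String text, int maxWords) {
        String[] parts = text.split("\\s+", maxWords + 1);
        if (parts.length <= maxWords) {
            return text;
        }
        return text.substring(0, text.length() - parts[maxWords].length()).trim();
    }

    static class Doc { String id, title, body; }

    static Iterable<Doc> fetchBatch() {
        return java.util.Collections.<Doc>emptyList(); // placeholder
    }
}

In practice I'd flush the bulk request every few thousand docs rather than
build one giant request.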
Partitioning/Routing - I don't think there is much I can do here, because
the data is not held per user or easily routed by date, but others'
experience would be welcome.
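For the file-system half of iii), a hand-rolled monitor could be as simple
as Java 7's WatchService, sketched below; the directory path is just an
example, and for Documentum itself I'd presumably poll on the documents'
modify dates instead:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class ChangeMonitor {

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/incoming"); // example path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = dir.resolve((Path) event.context());
                System.out.println("changed: " + changed);
                // ...convert to JSON and bulk-index, as in the loader sketch...
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}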
Bump - just in case someone has some experience they want to share.
Well, I don't have a lot to add, but here are a few things I can think of:
1. You are writing your own bulk loader in Java. That's great! I've done
the same thing, and it works very well. I can even add my own statistics
and show a bounded set of errors (for instance, if a thousand errors occur,
then showing just the first 100 or so is more than enough noise to get my
attention!). These are things that tend not to be in the rivers. (A sketch
of this appears after these points.)
2. ES has enough back-end threading to keep you from needing to do much
threading on the bulk load side. For example, I bulk load two sets of data:
One with 97 million small documents, and another with 78 million
medium-sized documents (nothing as large as yours). I load them (and load
subsequent update sets) serially in one thread. ES keeps itself, and the
CPU and disk, plenty busy enough (even on a laptop with one relatively slow
disk doing all the input reading and index writing). Starting with ES
0.90.0, I got my initial load times down to less than 3 hours for each of
these sets... on that laptop!
3. ES version 0.90.3 is the latest stable release, and you should move to
it. Especially since you use Java: some methods were deprecated even
between 0.90.0 and 0.90.3, so it's good to keep moving forward to minimize
the breaking changes. Of course, this builds on #1 above: writing my own
loaders and query tools in Java means that I can rebuild the client side to
match the ES version. Using 3rd-party rivers, you are sometimes locked into
older versions of ES. Bah humbug to that!
4. ES is fast, small, and flexible enough to experiment with. Try loading
increasingly large subsets of your Documentum store and plot the increase
in time and disk space. Tweak and tune, and ask specific questions as you
go along. Then you'll know for sure when you've hit the sweet spot for your
full load and discovered the best strategy. For instance, your thoughts on
creating the JSON for just the first N (10,000?) words and then creating a
link to the full document (implied?) might be best, but if the update rate
is relatively small then perhaps storing the entire thing won't be too bad.
Just a guess though.
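To make #1 and #2 concrete, here is a minimal sketch of that loading shape
against the 0.90 Java API; the index/type names, the 1,000-doc chunk size,
and the 100-error cap are just illustrative:

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class SerialBulkLoader {

    private static final int CHUNK_SIZE = 1000;      // docs per bulk request
    private static final int MAX_ERRORS_SHOWN = 100; // bound the noise

    private long indexed = 0;
    private long failed = 0;

    // One thread, one stream of chunked bulk requests; ES keeps itself
    // (and the disks) busy server-side, so no client-side threading.
    void load(Client client, Iterable<String> jsonDocs) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (String json : jsonDocs) {
            bulk.add(client.prepareIndex("docs", "document").setSource(json));
            if (bulk.numberOfActions() >= CHUNK_SIZE) {
                flush(bulk);
                bulk = client.prepareBulk();
            }
        }
        if (bulk.numberOfActions() > 0) {
            flush(bulk); // final partial chunk
        }
        System.out.printf("indexed=%d failed=%d%n", indexed, failed);
    }

    private void flush(BulkRequestBuilder bulk) {
        BulkResponse response = bulk.execute().actionGet();
        for (BulkItemResponse item : response) {
            if (item.isFailed()) {
                if (failed < MAX_ERRORS_SHOWN) { // show only the first few
                    System.err.println(item.getId() + ": " + item.getFailureMessage());
                }
                failed++;
            } else {
                indexed++;
            }
        }
    }
}

And per #4, timing load() over increasingly large subsets is just a matter
of wrapping it with System.currentTimeMillis() calls.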
Brian