Document Processing

Hi,

After several years of using ES to speed up searches on relational
database objects, I've been asked to quote on a project to index a 10TB
store of documents, mostly held in Documentum. By my rough calculations,
if each document is 1 to 2 MB on average, then we have about 5 to 10
million docs.

I'm looking to get some insight into the decisions people have made:

  1. ES Architecture - number of nodes, size of each node in terms of disk
    space, CPUs and RAM.
  2. Indexing
    i) Initial indexing of all docs - if I were doing this my usual way, I
    would probably have a process on each node, pulling a defined set of
    documents from Documentum, transforming them into JSON, and then bulk
    indexing them into ES (a rough sketch of that loop follows this list).
    Does anyone know of a river for this? To be honest, I haven't used
    rivers before; there weren't many when I first used ES, so I wrote my
    own Java processes that read data, convert it to JSON, and then bulk
    index it into ES.
    ii) Do you convert the entire document into JSON and push it to ES, or
    just take the first 10,000 words and assume anything else will be a repeat?
    iii) Indexing new docs as they are added to the store - I guess this is
    where rivers help, but I could always code something up by hand, if
    necessary, to monitor Documentum and file systems for changes.
  3. Partitioning/Routing - I don't think there is much I can do here,
    because the data is not held per user or easily routed by dates, but
    others' experience would be welcome.
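
For reference, here's roughly what my usual Java loader loop looks like,
against the 0.90.x TransportClient API. The Documentum-side helpers
(fetchBatch, toJson), the host name, and the index/type names are just
placeholders for illustration:

import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class DocumentumBulkLoader {

    public static void main(String[] args) {
        // Join the cluster through one known node; the client finds the rest.
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));

        final int batchSize = 500; // tune by measuring throughput
        int offset = 0;
        List<Doc> batch;
        while (!(batch = fetchBatch(offset, batchSize)).isEmpty()) {
            BulkRequestBuilder bulk = client.prepareBulk();
            for (Doc doc : batch) {
                bulk.add(client.prepareIndex("documents", "doc", doc.id)
                        .setSource(toJson(doc)));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
            offset += batch.size();
        }
        client.close();
    }

    // Placeholders for the Documentum (DFC) side.
    static class Doc { String id; }
    static List<Doc> fetchBatch(int offset, int size) { /* DQL query here */ return java.util.Collections.emptyList(); }
    static String toJson(Doc doc) { /* map attributes + content to JSON */ return "{}"; }
}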

thanks,

David.

Bump - just in case someone has some experience they want to share.

David,

Well, I don't have a lot to add, but here are a few things I can think of:

  1. You are writing your own bulk loader in Java. That's great! I've done
    the same thing, and it works very well. I can even add my own statistics
    and show a bounded set of errors (for instance, if a thousand errors
    occur, then showing just the first 100 or so is more than enough noise
    to get my attention!). These are things that tend not to be in the
    rivers. (A sketch of that bounded error reporting follows this list.)

  2. ES has enough back-end threading to keep you from needing to do much
    threading on the bulk-load side. For example, I bulk load two sets of
    data: one with 97 million small documents, and another with 78 million
    medium-sized documents (nothing as large as yours). I load them (and
    subsequent update sets) serially in one thread. ES keeps itself, and the
    CPU and disk, plenty busy (even on a laptop with one relatively slow
    disk doing all the input reading and index writing). Starting with ES
    0.90.0, I got my initial load times down to less than 3 hours for each
    of these sets... on that laptop!

  3. ES version 0.90.3 is the latest stable release, and you should move to
    it. Especially since you use Java: some methods were deprecated even
    from 0.90.0 to 0.90.3, so it's good to keep moving forward to minimize
    breaking changes. Of course, this builds on #1 above: writing my own
    loaders and query tools in Java means that I can rebuild the client side
    to match the ES version. With 3rd-party rivers, you are sometimes locked
    into older versions of ES. Bah humbug to that!
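
To illustrate the bounded error reporting from #1, here is a rough sketch
against the 0.90.x Java API; it would sit in the loader right after each
bulk call, and the cap of 100 is just the example number from above:

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkResponse;

// Print at most the first maxErrors failures, plus a total count, so a
// storm of failures stays readable instead of flooding the log.
static void reportFailures(BulkResponse response, int maxErrors) {
    if (!response.hasFailures()) {
        return;
    }
    int shown = 0;
    int total = 0;
    for (BulkItemResponse item : response) { // BulkResponse is iterable
        if (item.isFailed()) {
            total++;
            if (shown < maxErrors) {
                System.err.println(item.getId() + ": " + item.getFailureMessage());
                shown++;
            }
        }
    }
    System.err.println(total + " bulk item(s) failed; showed the first " + shown + ".");
}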

ES is fast, small, and flexible enough to experiment with. Try loading
increasingly large subsets of your Documentum store and plot the growth in
time and disk space. Tweak and tune, and ask specific questions as you go
along. Then you'll know for sure when you've hit the sweet spot for your
full load and discovered the best strategy. For instance, your thought of
creating the JSON from just the first N (10,000?) words plus a link back to
the full document (implied?) might be best; but if the update rate is
relatively small, then perhaps storing the entire thing won't be too bad.
Just a guess, though. A rough sketch of that truncation follows.
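
Here is that sketch, using XContentBuilder; the field names (body,
fullDocUrl) and the word-based cutoff are made up purely for illustration:

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentBuilder;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

// Index only the first maxWords words; anything past the cutoff stays
// reachable through a link back to the full document in Documentum.
static XContentBuilder truncatedSource(String title, String text,
        String fullDocUrl, int maxWords) throws IOException {
    String[] words = text.split("\\s+");
    int limit = Math.min(words.length, maxWords);
    StringBuilder body = new StringBuilder();
    for (int i = 0; i < limit; i++) {
        if (i > 0) {
            body.append(' ');
        }
        body.append(words[i]);
    }
    return jsonBuilder()
            .startObject()
            .field("title", title)
            .field("body", body.toString())        // searchable, truncated text
            .field("truncated", words.length > maxWords)
            .field("fullDocUrl", fullDocUrl)       // link back to the original
            .endObject();
}

The builder then plugs straight into prepareIndex(...).setSource(builder).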

Brian
