How to approach Indexing for a newbie?


(IronMike) #1

I have a project that used an old search engine and I would like to move
things to Elasticsearch. I have been doing some reading, and I wanted some
perspective on how to approach the problem.

  • I have bundles (folders) of text/HTML/PDF/image documents. Each folder
    holds 50-100 documents on average, and each document is about 100 KB.
  • The number of folders and documents can increase and decrease, mostly
    increasing, but only slightly.

I understand that the txt/html files will need to be turned into JSON, and
that I will somehow have to create an index and add these documents to it.
There are still some things I don't fully understand:
1- How do I know how many indices I need?
2- How do I know how many shards to allocate when creating the index?
3- How do I know how many nodes are needed, and how do I make things scale
up and down? Is there a way to idle things when no indexing is happening?
4- How do I add documents to the index? I always see examples with JSON
snippets, but in reality I have something like
folder1{doc1,doc2,..doc100}, folder2{docA...docN} ...
5- This is probably a dumb question... Is there a preferred language for
the indexing calls? If I were to build an app that calls the REST API,
which language should I use, if it matters at all?

Thanks again for the help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/39e218f3-395c-44b9-bac1-cc2994e26391%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Brian Yoder) #2

This is getting somewhat old, but is a good example based on your
description:

http://www.scrutmydocs.org/

Brian


(IronMike) #3

I will take a look at this in more detail. But is there a simple answer to
this question: let's say I have a folder with 5 JSON documents locally,
doc1...doc5. How do I go about indexing the folder/documents?

On Tuesday, January 14, 2014 2:12:41 PM UTC-5, InquiringMind wrote:

This is getting somewhat old, but is a good example based on your
description:

http://www.scrutmydocs.org/

Brian


(Jörg Prante) #5

  1. Mostly, indexes are the result of a partition design made outside ES,
    for example by time, user, or data origin. The beauty of ES is that it
    can host as many indexes as you wish.

  2. If you know the maximum number of nodes (hosts) you want to dedicate to
    ES, use that number as the number of shards. That way you make sure your
    cluster can scale. If the number is not known, try to estimate the total
    number of documents to be indexed, the total volume of those documents,
    and an estimated index volume per shard. Rule of thumb: a shard should be
    sized so it can fit into the Java heap and so that it can be moved
    between nodes in reasonable time (~1-10 GB).

  3. You can scale up by adding nodes: just start ES on another host. Scaling
    down is also easy: stop ES on a node.

  4. You have to write a program that traverses your folders, picks up each
    document, and extracts fields from the document to get them indexed. With
    scrutmydocs.org you can experiment with how this works, using a file
    traverser that is already prepared to handle quite a lot of file types
    automatically.

  5. You should consider using one of the standard clients. As ES supports
    HTTP REST, and the standard clients are designed to support a comparable
    set of features, it does not matter what language you use. Just pick your
    favorite language. (My personal favorite is Java, where there is no need
    to use HTTP REST; the native transport protocol can be used instead.)
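
The estimate in point 2 is simple arithmetic. Here is a minimal sketch of the case where the node count is not known. All figures are hypothetical, not measurements from a real cluster: 2,000 folders of 75 documents at roughly 100 KB each, an index about as large as the source data, and a 5 GB target per shard (inside the ~1-10 GB range above).

```python
import math

def estimate_primary_shards(num_docs, avg_doc_bytes, index_overhead=1.0,
                            target_shard_bytes=5 * 1024**3):
    """Return (estimated_index_bytes, shard_count) for a planned index.

    index_overhead is an assumed ratio of on-disk index size to raw source
    size; measure it on a sample of your own data before relying on it.
    """
    index_bytes = num_docs * avg_doc_bytes * index_overhead
    # Round up so that no shard exceeds the target size; always at least one.
    shards = max(1, math.ceil(index_bytes / target_shard_bytes))
    return index_bytes, shards

# Hypothetical corpus: 2,000 folders x 75 documents x ~100 KB each.
docs = 2_000 * 75
size, shards = estimate_primary_shards(docs, 100 * 1024)
print(f"{docs} docs, ~{size / 1024**3:.1f} GiB of index, {shards} shard(s)")
```

If the maximum node count is known, the first half of point 2 applies instead and this calculation is not needed.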

Jörg
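
Point 4 can be sketched with nothing but the Python standard library. The code below walks a root directory and builds the newline-delimited body that the Elasticsearch bulk API (`POST /_bulk`) expects: one action line followed by one source line per document. The index name `docs`, the field names (`folder`, `filename`, `content`) and the use of the relative path as document id are assumptions made for illustration. It reads only `.txt`/`.html` files as raw text (PDFs and images would need a separate extraction step), and it stops short of sending the request.

```python
import json
import os
import tempfile

TEXT_EXTENSIONS = {".txt", ".html", ".htm"}  # assumed; PDFs/images need an extractor

def iter_documents(root):
    """Yield (doc_id, source) for every text-like file below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-folders in a stable order
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() not in TEXT_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            # Use the path relative to root as a stable, unique document id.
            doc_id = os.path.relpath(path, root).replace(os.sep, "/")
            folder = doc_id.rpartition("/")[0]
            yield doc_id, {"folder": folder, "filename": name, "content": content}

def build_bulk_body(root, index="docs", doc_type=None):
    """Return a bulk request body: an action line, then a source line, per doc."""
    lines = []
    for doc_id, source in iter_documents(root):
        action = {"_index": index, "_id": doc_id}
        if doc_type is not None:  # only older Elasticsearch releases use types
            action["_type"] = doc_type
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(source))
    # The bulk API requires every line, including the last, to end in a newline.
    return "".join(line + "\n" for line in lines)

# Demo on a throwaway directory with one bundle of two documents.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "folder1"))
    for name in ("doc1.txt", "doc2.html"):
        with open(os.path.join(root, "folder1", name), "w", encoding="utf-8") as fh:
            fh.write("hello from " + name)
    body = build_bulk_body(root)

print(body, end="")
```

The resulting string could be posted to `/_bulk` with any HTTP library, or the same loop could hand each document to one of the standard clients mentioned in point 5.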


(IronMike) #6

Wow, this is exactly what I was looking for. I am a bit curious about #5: I
assume there is a Java API for accessing ES. Is there a link on how to get
started using Java with ES? I would like to know how to import the ES
framework/API into a Java project.

Thanks again, this is a great clarification!


(Jörg Prante) #7

To get an overview of what is possible, look at the Elasticsearch test sources
at
https://github.com/elasticsearch/elasticsearch/tree/master/src/test/java/org/elasticsearch

There are many code snippets that are useful for learning how to use the
Java API.

You can use Elasticsearch by adding the jar as a dependency in your project
(with Maven it is very easy).

Jörg


(IronMike) #8

Thanks. I added the .jar as a dependency in a simple Java project using
Eclipse.
I get this error when I try to run the program. Any clues?

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/util/Version
    at org.elasticsearch.Version.<clinit>(Version.java:42)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:121)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at EntryPoint.main(EntryPoint.java:25)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.util.Version
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 5 more


(IronMike) #9

Never mind, I just had to import more jars from /lib.


(Brian Yoder) #10

Never mind, I just had to import more jars from /lib.

You can import all jars from /some_base_path/lib (for example) by adding
a /* to the end of the path, and then adding that to the -cp / -classpath
option's value, separating multiple paths with semicolons (on Windows;
Unix-like systems use colons). That single * (and not *.jar) is shorthand
that tells Java to include all jar files in the directory. So you do not
need to add them one by one, and you never again have to worry when a
future version of ES adds new jar files or renames existing ones. In fact,
I've only discovered new jar files in ES versions when I read about them
on this newsgroup; that little asterisk is like magic and saves me from
ever worrying or caring about the exact set of jar files that are bundled
with ES.

See the Understanding class path wildcards section at
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html
for the full details.
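
As a concrete illustration of the wildcard, the sketch below assembles a `java` command line in Python. The application directory `build/classes` is a placeholder, and `EntryPoint` is simply the class name that appears in the stack trace earlier in the thread. `os.pathsep` supplies `;` on Windows and `:` elsewhere; the demo passes `:` explicitly so that its output does not depend on the platform. Passing the arguments as a list, rather than through a shell, keeps the `*` from being expanded before Java sees it.

```python
import os

def java_command(main_class, lib_dir, *extra_entries, sep=os.pathsep):
    """Build an argv list that puts every jar in lib_dir on the classpath."""
    # A bare '*' after the directory means "all jar files here" to Java;
    # note that it is 'lib/*', not 'lib/*.jar'.
    wildcard = lib_dir.rstrip("/\\") + "/*"
    classpath = sep.join([*extra_entries, wildcard])
    return ["java", "-cp", classpath, main_class]

cmd = java_command("EntryPoint", "/some_base_path/lib", "build/classes", sep=":")
print(cmd)
# To launch it: subprocess.run(cmd, check=True)  (needs a JVM and the jars)
```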

Hope this helps!

Brian


(Ralf Schmitt) #11

Jörg Prante writes:

  4. You have to write a program that traverses your folders, picks up each
    document, and extracts fields from the document to get them indexed.

Or you might use es-nozzle [1], which traverses your folders and indexes
documents into Elasticsearch. It uses Tika to extract content from
various file formats and will incrementally synchronize the folders'
content to the Elasticsearch index, i.e. it updates your index with new
documents and deletes documents from Elasticsearch if they have been
removed from the folder.
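
The incremental synchronization described here boils down to comparing what is on disk with what is already indexed. The sketch below is not es-nozzle's implementation, only an illustration of the idea under simple assumptions: both sides are represented as dictionaries mapping a relative path to a modification time, and how the indexed side is obtained (for example from a field stored with each document) is left out. The paths and timestamps in the demo are made up.

```python
import os

def scan_folder(root):
    """Map each file's path relative to root to its modification time."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            state[rel] = os.path.getmtime(path)
    return state

def plan_sync(on_disk, indexed):
    """Return (to_index, to_delete) as sorted lists of relative paths."""
    # New files, and files modified since they were last indexed.
    to_index = sorted(p for p, mtime in on_disk.items()
                      if p not in indexed or mtime > indexed[p])
    # Documents whose source file no longer exists.
    to_delete = sorted(p for p in indexed if p not in on_disk)
    return to_index, to_delete

# Hand-written example states (hypothetical paths and timestamps).
indexed = {"folder1/doc1.txt": 100.0, "folder1/doc2.txt": 100.0,
           "folder2/old.html": 90.0}
on_disk = {"folder1/doc1.txt": 100.0, "folder1/doc2.txt": 150.0,
           "folder1/doc3.txt": 160.0}
to_index, to_delete = plan_sync(on_disk, indexed)
print("index:", to_index)
print("delete:", to_delete)
```

In a real run, `scan_folder(root)` would supply the on-disk side, and the two lists would be turned into index and delete requests.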

Please visit http://brainbot.com/es-nozzle/doc/ for detailed
documentation. The code lives on GitHub:
https://github.com/brainbot-com/es-nozzle

Please let me know about any problems you run into if you give it a
try. I'm the author of es-nozzle.

Another option might be fsriver: https://github.com/dadoonet/fsriver

--
Cheers
Ralf

