How to approach Indexing for a newbie?

IronMike · January 14, 2014, 6:50pm

I have a project that used an old search engine, and I would like to move
it to Elasticsearch. I have been doing some reading, and I would like some
perspective on how to approach the problem.

I have bundles (folders) of text/HTML/PDF/image documents. Each folder
holds 50-100 documents on average, and each document is about 100 KB in
size. The number of folders and documents can go up or down; mostly it goes
up, but only slightly.

I understand that the text/HTML content will need to be turned into JSON,
and that I will somehow have to create an index and add these documents to
it. There are a few things I still don't fully understand:
1- How do I know how many indices I need?
2- How do I know how many shards to allocate when creating the index?
3- How do I know how many nodes are needed, and how do I make things scale
up and down? Is there a way to idle things when no indexing is happening?
4- How do I add documents to the index? I always see examples with JSON
snippets, but in reality I have something like
folder1{doc1,doc2,..doc100}, folder2{docA...docN} ...
5- This is probably a dumb question... Is there a preferred language for
the indexing calls? If I were to build an app that calls the REST API,
which language should I use, if it matters at all?

Thanks again for the help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/39e218f3-395c-44b9-bac1-cc2994e26391%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

brian_yoder · January 14, 2014, 7:12pm

This is getting somewhat old, but is a good example based on your
description:

http://www.scrutmydocs.org/

Brian

IronMike · January 14, 2014, 8:12pm

I will take a look at this in more detail. But is there a simple answer to
this question: let's say I have a folder with 5 JSON documents locally,
doc1...doc5. How do I go about indexing the folder/documents?

jprante · January 14, 2014, 9:17pm

1. Mostly, indexes are the result of a partition design outside ES, for
example by time, user, or data origin. The beauty of ES is that it can host
as many indexes as you wish.

2. If you know the maximum number of nodes (hosts) you want to dedicate to
ES, use that number as the number of shards. That way you make sure your
cluster can scale. If the number is not known, try to estimate the total
number of documents to be indexed, the total volume of those documents, and
an estimated index volume per shard. Rule of thumb: a shard should be sized
so that it fits into the Java heap and can be moved between nodes in
reasonable time (~1-10 GB).

3. You can scale up by adding nodes: just start ES on another host. Scaling
down is also easy: stop ES on a node.

4. You have to write a program that traverses your folders, picks up each
document, and extracts fields from the document to get them indexed. With
scrutmydocs.org you can experiment with how this works, using a file
traverser that is already prepared to handle quite a lot of file types
automatically.

5. You should consider using one of the standard clients. As ES supports
HTTP REST, and the standard clients are designed to support a comparable
set of features, it does not matter what language you use. Just pick your
favorite language. (My personal favorite is Java, where there is no need to
use HTTP REST; the native transport protocol can be used instead.)
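
A minimal sketch of the folder traversal described above, in plain Java
with no Elasticsearch dependency. It walks a root directory, treats every
regular file as one document, and builds the newline-delimited body that
the bulk REST endpoint (_bulk) expects: one action line followed by one
source line per document. The index name "docs", the field names "folder",
"path" and "content", and the class name FolderBulkBuilder are assumptions
made up for illustration. Files are read as UTF-8 text, so PDF or image
files would first need a text extractor such as Tika, and older
Elasticsearch versions also expect a _type in the action line or URL.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FolderBulkBuilder {

    // Escape a string so it can sit inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build a bulk request body covering every regular file under root.
    // Each file becomes one document whose _id is its path relative to root.
    static String buildBulkBody(Path root, String indexName) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        StringBuilder body = new StringBuilder();
        for (Path file : files) {
            Path rel = root.relativize(file);
            String relative = rel.toString().replace('\\', '/');
            String folder = rel.getParent() == null
                    ? "" : rel.getParent().toString().replace('\\', '/');
            String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            // Action line: names the target index and the document id.
            body.append("{\"index\":{\"_index\":\"").append(jsonEscape(indexName))
                .append("\",\"_id\":\"").append(jsonEscape(relative)).append("\"}}\n");
            // Source line: the document itself (hypothetical field names).
            body.append("{\"folder\":\"").append(jsonEscape(folder))
                .append("\",\"path\":\"").append(jsonEscape(relative))
                .append("\",\"content\":\"").append(jsonEscape(content)).append("\"}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.print(buildBulkBody(root, "docs"));
    }
}
```

The resulting string could then be POSTed to the _bulk endpoint with any
HTTP client, or the same loop could hand each document to one of the
standard clients instead of building the body by hand.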

Jörg

IronMike · January 14, 2014, 9:32pm

Wow, this is exactly what I was looking for. I am a bit curious about #5: I
assume there is a Java API for accessing ES. Is there a link on how to get
started using Java with ES? I would like to know how to import the ES
framework/API into a Java project.

Thanks again, this is a great clarification!

jprante · January 14, 2014, 10:22pm

To get an overview of what is possible, look at the Elasticsearch test
sources at
https://github.com/elasticsearch/elasticsearch/tree/master/src/test/java/org/elasticsearch

There are many code snippets that are useful for learning how to use the
Java API.

You can use Elasticsearch by adding the jar as a dependency in your project
(with Maven it is very easy).

Jörg

IronMike · January 15, 2014, 1:26am

Thanks. I added the .jar as a dependency in a simple Java project using
Eclipse.
I get this error when I try to run the program. Any clues?

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/util/Version
    at org.elasticsearch.Version.(Version.java:42)
    at org.elasticsearch.node.internal.InternalNode.(InternalNode.java:121)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at EntryPoint.main(EntryPoint.java:25)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.util.Version
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 5 more

IronMike · January 15, 2014, 1:43am

Never mind, I just had to import more jars from /lib.

brian_yoder · January 15, 2014, 4:09am

IronMike wrote:

Never mind, I just had to import more jars from /lib.

You can import all jars from /some_base_path/lib (for example) by adding
a /* to the end of the path, and then adding that to the -cp / -classpath
option's value, separating multiple paths with semicolons (on Windows;
colons on Unix-like systems). That single * (and not *.jar) is a shorthand
that tells Java to include all jar files in the directory. So you do not
need to add them one by one, and you never again need to worry when a
future version of ES adds new jar files or renames existing ones. In fact,
I've only discovered new jar files in ES versions when I read about them on
this newsgroup; that little asterisk is like magic and saves me from ever
worrying or caring about the exact set of jar files that are bundled with
ES.

See the "Understanding class path wildcards" section at
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html
for the full details.
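
A quick way to confirm that the wildcard picked up the jars you expect is
to ask the running JVM. The sketch below lists the entries of the
effective classpath and checks whether a given class can be located.
ClasspathCheck is a made-up name, and org.apache.lucene.util.Version is
used as the default only because it is the class named in the stack trace
earlier in this thread.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ClasspathCheck {

    // Split the JVM's effective classpath into its individual entries.
    static List<String> classpathEntries() {
        String cp = System.getProperty("java.class.path", "");
        return Arrays.asList(cp.split(Pattern.quote(File.pathSeparator)));
    }

    // Return true if the named class can be found by this class's loader.
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String entry : classpathEntries()) {
            System.out.println(entry);
        }
        String name = args.length > 0 ? args[0] : "org.apache.lucene.util.Version";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

Run it with the same -cp value as the real program, for example
java -cp "/some_base_path/lib/*:." ClasspathCheck on a Unix-like system.
Quote the value so that the shell does not expand the asterisk itself.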

Hope this helps!

Brian

Ralf_Schmitt · January 15, 2014, 3:47pm

Jörg Prante writes:

You have to write a program that traverses your folders, picks up each
document, and extracts fields from the document to get them indexed.

Or you might use es-nozzle, which traverses your folders and indexes
documents into Elasticsearch. It uses Tika to extract content from
various file formats and incrementally synchronizes the folders'
content to the Elasticsearch index, i.e. it updates your index with new
documents and deletes documents from Elasticsearch if they have been
removed from the folder.
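
The incremental synchronization described above boils down to comparing
two snapshots of the folder. The following is only a sketch of that idea,
not es-nozzle's actual implementation: it assumes each snapshot is a map
from file path to last-modified time, and the names SyncPlan and plan are
invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SyncPlan {
    // Paths that must be (re)indexed because they are new or changed.
    final Set<String> toIndex = new TreeSet<>();
    // Paths that must be deleted from the index because the file is gone.
    final Set<String> toDelete = new TreeSet<>();

    // Compare the previously indexed snapshot with the current folder state.
    // Both maps go from file path to last-modified time in milliseconds.
    static SyncPlan plan(Map<String, Long> indexed, Map<String, Long> current) {
        SyncPlan p = new SyncPlan();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long previous = indexed.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                p.toIndex.add(e.getKey());
            }
        }
        for (String path : indexed.keySet()) {
            if (!current.containsKey(path)) {
                p.toDelete.add(path);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        Map<String, Long> indexed = new HashMap<>();
        indexed.put("folder1/doc1", 100L);
        indexed.put("folder1/doc2", 100L);

        Map<String, Long> current = new HashMap<>();
        current.put("folder1/doc2", 200L); // modified since the last run
        current.put("folder1/doc3", 100L); // newly added

        SyncPlan p = plan(indexed, current);
        System.out.println("index:  " + p.toIndex);  // [folder1/doc2, folder1/doc3]
        System.out.println("delete: " + p.toDelete); // [folder1/doc1]
    }
}
```

Comparing modification times is the simplest change test; a content hash
would be more robust at the cost of reading every file on each run.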

Please visit http://brainbot.com/es-nozzle/doc/ for detailed
documentation. The code lives on GitHub in the brainbot-com/es-nozzle
repository.

Please let me know about any problems you run into if you give it a
try. I'm the author of es-nozzle.

Another option might be fsriver (GitHub: dadoonet/fscrawler, "Elasticsearch
File System Crawler").

--
Cheers
Ralf
