Video files


(Doug Wolfgram) #1

I am new to ES. I need to index and search hundreds of millions of
video files (think YouTube). I am assuming that I could create the entire
system in ES with one of the keys being the URL of the video file itself.
Pretty straightforward. One thing that confuses me is the persistence of
ES. From the videos, it seems that if I stop all instances I lose my data.
Is that true? Seems rather odd. Sorry to be so infantile in my questions,
but I have very little exposure to ES at this point. I am assuming that I
would simply use ES as I would MongoDB, but that ES has better, faster
searching. Is this the general consensus?

Also, I have been working with GenieDB for really fast multi-datacenter
replication. Is something like this available for ES?

Cheers! Looking forward to a solid relationship with ES!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Zachary Tong) #2

Data is persisted to disk after it has been indexed...you don't lose data
when you restart nodes. If the server goes up in flames, you may lose data
if you don't have replicas spreading the data across the cluster, but
that's a different problem =)
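
To make the replica point a bit more concrete, here is a minimal sketch of the settings body you might send when creating an index. The index name `videos`, the helper `index_settings`, and the shard/replica counts are assumptions picked for illustration; the snippet only builds and prints the JSON and does not talk to a cluster.

```python
import json


def index_settings(shards: int, replicas: int) -> dict:
    """Build a create-index settings body (the values passed in are illustrative)."""
    if shards < 1 or replicas < 0:
        raise ValueError("need at least one shard and a non-negative replica count")
    return {
        "settings": {
            "number_of_shards": shards,
            # Each replica is an extra copy of every shard, kept on a different node.
            "number_of_replicas": replicas,
        }
    }


body = index_settings(shards=5, replicas=1)
print(json.dumps(body, indent=2))
```

A body like this would go in a `PUT /videos` request. With one replica per shard, losing a single server should not lose data, provided the cluster has at least two nodes to hold the copies.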

Elasticsearch is built on top of Lucene, which is arguably the most
advanced information retrieval and search library available (including both
open source and commercial products). Elasticsearch benefits directly from
the search capabilities of Lucene - it's pretty powerful.

There isn't really a good tool for multi-datacenter push replication at the
moment. When 1.0 is released, the Snapshot/Restore feature can fill that
role. For now, you'll have to do it manually with something like rsync.
It should be noted this is for delayed synchronization between two
clusters - it isn't recommended to span a single cluster between two
datacenters. The latency makes distributed systems very difficult to work
with, even under the best circumstances.

-Zach

(In reply to Doug Wolfgram's post of Monday, September 16, 2013 11:35:48 AM UTC-4.)


(Doug Wolfgram) #3

Thanks. That explains a lot. GenieDB has solved the performance issue for
the most part for MySQL with their self-healing methodology, but of course
the two servers are never perfectly in sync. But for most applications, a 1
or 2 second delay is acceptable. The trick is a private VPN between the
servers. For searching across multiple data centers, that could prove to be
problematic.

Thanks for the info. One quick question: do I have to create indexes for
every possible keyword search, or is free-form search reasonably fast? The
primary problem I am trying to solve is this: "Select all the videos whose
description contains the word 'alien%'." Your basic, old-fashioned, slow
text-comparison wildcard search.

(In reply to Zachary Tong's post of Monday, September 16, 2013 11:48:35 AM UTC-7.)


(Zachary Tong) #4

In elasticsearch, indexes are more of a logical namespace. They are
somewhat akin to a database table, except a lot more flexible. If you were
referring to "indexes" as in the type of index you specify in a relational
database (e.g. indexing a column, which builds a B-tree under the covers),
there is no concept of a pre-specified index in elasticsearch. All fields
are searchable by default, because all fields are converted into an
inverted index (http://en.wikipedia.org/wiki/Inverted_index), which enables
fast lookups.
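
As a rough illustration of the idea, here is a toy inverted index in plain Python. It is a sketch only: the `tokenize` function is a crude stand-in for a real analyzer, the sample descriptions are made up, and Lucene's actual data structures are far more sophisticated.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


# Hypothetical video descriptions keyed by document id.
docs = {
    1: "Alien invasion caught on camera",
    2: "Cooking pasta at home",
    3: "Interview with an alien hunter",
}
index = build_inverted_index(docs)

# Finding documents for a token is a single dictionary lookup,
# not a scan over every description.
print(sorted(index["alien"]))  # [1, 3]
```

The point is that the work happens at index time, so a search for "alien" never has to read the descriptions themselves.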

If your search simply requires any field that has the token "alien" inside
it, a simple Match query
(http://www.elasticsearch.org/guide/reference/query-dsl/match-query/) will
work for you. If you don't need scoring, a Term filter
(http://www.elasticsearch.org/guide/reference/query-dsl/term-filter/) will
be even faster. If you need prefix matching, so all documents that have a
term starting with "ali" (which will match "alien", "alison", "alias",
etc.), you can use a Prefix query
(http://www.elasticsearch.org/guide/reference/query-dsl/prefix-query/).
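
For reference, the three request bodies might look roughly like the sketch below. The field name `description` comes from the question above; the helper function names are made up for illustration, and the exact DSL shape can differ between elasticsearch versions, so check the reference for the version you run. The snippet only builds and prints JSON; it does not contact a cluster.

```python
import json


def match_query(field: str, text: str) -> dict:
    """Full-text match: the text is analyzed and results are scored."""
    return {"query": {"match": {field: text}}}


def term_filter(field: str, term: str) -> dict:
    """Exact-term filter with no scoring, wrapped in constant_score."""
    return {"query": {"constant_score": {"filter": {"term": {field: term}}}}}


def prefix_query(field: str, prefix: str) -> dict:
    """Match documents containing any term that starts with the prefix."""
    return {"query": {"prefix": {field: prefix}}}


bodies = [
    match_query("description", "alien"),
    term_filter("description", "alien"),
    prefix_query("description", "ali"),
]
for body in bodies:
    print(json.dumps(body))
```

Each body would be sent to the index's `_search` endpoint. Note that term and prefix lookups are not analyzed, so if the field was indexed with a lowercasing analyzer the term or prefix you pass in should be lowercase too.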

There are plenty of more advanced queries you can use too, such as phrase
matching or partial, fuzzy matches. Take a few minutes and look over the
queries that elasticsearch offers. With the right combination of analyzers
and queries, almost any behavior can be created.

Lastly, all of the above queries are very fast. Even the
prefix/fuzzy/wildcard queries are super fast, thanks to some fairly
advanced finite-state automata that operate under the covers.

-Zach

(In reply to Doug Wolfgram's post of Monday, September 16, 2013 3:58:02 PM UTC-4.)


(Doug Wolfgram) #5

Thanks for taking the time with the newbie. After 20+ years of relational
databases, it is time to get on with things. :)

(In reply to Zachary Tong's post of Monday, September 16, 2013 2:07:23 PM UTC-7.)


(Zachary Tong) #6

No problem =)

I came from a non-Lucene, non-search background originally...so I totally
understand the learning curve associated with getting started in a true
search engine (as opposed to a database with search bolted on). Stay
strong, it gets easier once you've played around with it for a while! =)

-Zach

(In reply to Doug Wolfgram's post of Mon, Sep 16, 2013 at 6:34 PM.)

