Questions TTL/GET/


(ddorian43) #1

How are Get and Multiget queries fetched? What data structure is used to
hold this data?

Are expired documents indexed and deleted on a cron-like schedule (like MongoDB), or are they
deleted when segments are merged (like Cassandra/Bigtable)?

And what should I use if I just want to filter documents (no scoring):
a filter, a filtered query, or something else?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Zachary Tong) #2
  1. I don't fully understand your question. Elasticsearch has a binary
    serialization protocol: the document is retrieved from the shard,
    serialized across the network to the coordinating node, and then sent to
    the client as JSON.

  2. The TTL cycle runs every 60s by default. Expired documents are
    marked as deleted and removed during merges. Lucene has no concept of
    random-access writes: segments are immutable. So the only way to delete a
    document is to mark it as deleted and remove it during a subsequent merge.
    This is also why TTL is generally not recommended for most scenarios: it
    is much slower than simply deleting a time-based index.

  3. Use a filtered query. The top-level filter should only be used when
    working with facets and you require fine-grained filter control of search
    vs. facet results.
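
For illustration, a filter-only request body of the kind point 3 refers to might look like the sketch below, written as a Python dict. The field name `status` and the value `published` are made up for the example, and the helper `filtered_query` is hypothetical; the `filtered` / `match_all` / `term` layout follows the query DSL of the releases that were current when this thread was written.

```python
import json


def filtered_query(field, value):
    """Build a search body that filters without scoring.

    match_all supplies every document; the term filter narrows the set.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        }
    }


# "status" and "published" are placeholder names for this sketch.
body = filtered_query("status", "published")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a search request.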

-Zach

On Tuesday, September 17, 2013 1:05:56 PM UTC-4, ddorian43 wrote:



(ddorian43) #3

What I mean is: how is _uid indexed? Reversed index?

In MongoDB TTL, the timestamp is indexed and the db runs a query every x
seconds to find the expired documents and delete them.
In Cassandra, by contrast, expired cells are only deleted on compaction. There is
no separate index that gets scanned, as in MongoDB.

Which TTL type does Elasticsearch support?

It looks like you can't run a range query on the _id field without indexing it? (Whereas
get and multiget don't require indexing.)

Also, when documents are filtered and no sorting is specified, are documents
sorted by _id?

Thanks



(Zachary Tong) #4

Answers inline:

What i mean is how is _uid indexed? Reversed index?

_uids are indexed into a bloom filter. The process looks like this:

  1. Find the appropriate shard via routing, forward request to a node
    with the shard
  2. Sequentially iterate over each segment in the shard
    1. Check the bloom filter to see whether the DocID exists in this segment. If not,
      move to the next segment (bloom filters never give false negatives)
    2. If yes, search through the segment, as bloom filters have some rate
      of false positives
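
The lookup steps above can be sketched roughly as follows. This is a toy model, not Lucene's or Elasticsearch's actual code: `BloomFilter`, `Segment`, and `get_document` are names invented for the example, routing to the shard is left out, and the filter size and hash scheme are arbitrary assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Segment:
    """Immutable set of documents plus a bloom filter over their uids."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.bloom = BloomFilter()
        for uid in self.docs:
            self.bloom.add(uid)

    def get(self, uid):
        if not self.bloom.might_contain(uid):
            return None  # definite miss: skip this segment entirely
        # Possible hit: the real lookup is still needed (false positives).
        return self.docs.get(uid)


def get_document(segments, uid):
    """Walk the shard's segments in order and return the first match."""
    for segment in segments:
        doc = segment.get(uid)
        if doc is not None:
            return doc
    return None


shard = [
    Segment({"tweet#1": {"user": "alice"}}),
    Segment({"tweet#2": {"user": "bob"}}),
]
print(get_document(shard, "tweet#2"))
print(get_document(shard, "tweet#99"))
```

The point of the filter is that most segments can be ruled out without touching their data at all; only a "maybe" answer triggers the more expensive per-segment lookup.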

In MongoDB TTL, the timestamp is indexed and the db runs a query every x
seconds to find the expired documents and delete them.
In Cassandra, by contrast, expired cells are only deleted on compaction. There
is no separate index that gets scanned, as in MongoDB.

Which TTL type does elasticsearch support?

See my answer in the first email. Elasticsearch's background process runs
every 60s, compares timestamps and then marks documents as deleted. The
physical deletion of data doesn't happen until a merge later removes it.
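
As a rough sketch of that two-phase behaviour: a periodic pass only marks expired documents, and a later merge is what actually drops them. The names `TtlSegment`, `purge_expired`, and `merge` are invented for this example, and timestamps are plain integers; none of this is Elasticsearch's real implementation.

```python
class TtlSegment:
    """Documents are write-once; only the set of delete markers changes."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> expiry timestamp
        self.deleted = set()    # ids marked as deleted (tombstones)

    def live_ids(self):
        return [doc_id for doc_id in self.docs if doc_id not in self.deleted]


def purge_expired(segments, now):
    """Periodic TTL pass: mark expired docs as deleted, remove nothing."""
    marked = 0
    for segment in segments:
        for doc_id, expires_at in segment.docs.items():
            if expires_at <= now and doc_id not in segment.deleted:
                segment.deleted.add(doc_id)
                marked += 1
    return marked


def merge(segments):
    """Merge: write one new segment that contains only the live docs."""
    survivors = {}
    for segment in segments:
        for doc_id in segment.live_ids():
            survivors[doc_id] = segment.docs[doc_id]
    return TtlSegment(survivors)


segments = [TtlSegment({"a": 100, "b": 200}), TtlSegment({"c": 50})]
marked = purge_expired(segments, now=120)
still_stored = sum(len(segment.docs) for segment in segments)
merged = merge(segments)

print(marked)              # 2: "a" and "c" have expired
print(still_stored)        # 3: nothing was physically removed yet
print(sorted(merged.docs)) # ['b']: space is reclaimed only by the merge
```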

It looks like you can't run a range query on the _id field without indexing it? (Whereas
get and multiget don't require indexing.)

Correct. Queries only operate on inverted index data, so they require the
_id to be indexed along with the other fields. Get/Multiget use a
different mechanism (detailed above w/ bloom filter), they aren't queries,
so they don't require an inverted index.
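
To see why a range query depends on the inverted index, here is a minimal sketch of one: a sorted term dictionary with a postings list per term. `InvertedIndex` and the sample ids are made up for the example and are far simpler than Lucene's real structures.

```python
from bisect import bisect_left, bisect_right


class InvertedIndex:
    """Maps each indexed term to the documents that contain it."""

    def __init__(self):
        self.postings = {}      # term -> list of doc numbers
        self.sorted_terms = []  # term dictionary, kept in sorted order

    def index(self, doc_num, term):
        if term not in self.postings:
            self.postings[term] = []
            pos = bisect_left(self.sorted_terms, term)
            self.sorted_terms.insert(pos, term)
        self.postings[term].append(doc_num)

    def range(self, low, high):
        """Return doc numbers whose term lies in [low, high]."""
        start = bisect_left(self.sorted_terms, low)
        end = bisect_right(self.sorted_terms, high)
        hits = []
        for term in self.sorted_terms[start:end]:
            hits.extend(self.postings[term])
        return sorted(hits)


id_index = InvertedIndex()
for doc_num, doc_id in enumerate(["a17", "b02", "a03", "c40"]):
    id_index.index(doc_num, doc_id)

print(id_index.range("a00", "b99"))       # [0, 1, 2]
print(InvertedIndex().range("a00", "b99"))  # []: nothing indexed, nothing found
```

If the ids were never fed into such a structure, the range scan has no terms to walk, which is the situation described above for an unindexed _id.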

Also, when documents are filtered and no sorting is specified, are documents
sorted by _id?

If docs are filtered and have no score, the score is automatically set to 1.
Since all documents have the same score, the resulting order is
effectively random. In actuality it is the order of the documents in the
segments, but the merge process tends to shuffle them enough that you can
consider it random.
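
A small sketch of that effect, assuming hits are collected segment by segment and then ordered by score with a stable sort. The `search` function and the sample documents are invented for the example.

```python
def search(segments, predicate):
    """Collect matching docs with a constant score, then sort by score."""
    hits = []
    for segment in segments:
        for doc in segment:
            if predicate(doc):
                hits.append((1.0, doc["_id"]))  # every filtered hit scores 1
    # Python's sort is stable, so ties keep their collection order.
    hits.sort(key=lambda hit: -hit[0])
    return [doc_id for _, doc_id in hits]


segments = [
    [{"_id": "m", "tag": "x"}, {"_id": "a", "tag": "x"}],
    [{"_id": "z", "tag": "y"}, {"_id": "c", "tag": "x"}],
]
order = search(segments, lambda doc: doc["tag"] == "x")
print(order)  # ['m', 'a', 'c']: segment order, not _id order
```

Because a merge rewrites the segments, the same request can return a different order later, so an explicit sort is needed whenever the order matters.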

Hope that helps!
-Zach



(ddorian43) #5

Everything helps.

  1. If I filter and sort, does only the sort value of the filtered documents
    have to reside in memory, or that of all documents? (I'm guessing all.)
  2. Is the doc_type saved as a string or as some kind of internal id in the
    _uid field (so if I have many small documents I should keep it short)?
  3. If I disable the "_all" field (because I don't want to duplicate
    indexes), how can I simulate it (by searching on all the indexed fields)?
  4. If I want autocomplete on a tags[] field, should I use ngrams or the
    suggester? (When the user starts typing, show a list of tags that match
    the typed characters; that sounds like the suggester.)
  5. It looks like the suggester doesn't return documents?
  6. If I use a numeric field type and won't do range queries (only
    equality), what precision_step should I use for this field (0 or 1)?

On Fri, Sep 20, 2013 at 10:17 PM, Zachary Tong zacharyjtong@gmail.com wrote:


