Strategy For Limiting Search To "Folder Path"

Noob here, just getting started with elasticsearch and search in general.

In the system I am building, users can organize documents for browsing within a folder tree. This will be maintained in a relational database, using a structure something like this:

  • User
    • Document
      • Folder
        • Parent Folder (recursive)

In elasticsearch, each user will have their own index. There will be no searches that span users. As of now there will only be one document type, call it "record." The elasticsearch id for the document will be kept in the Document row. So from a Document row we can resolve the elasticsearch URI like so:

[host]/[User.Username]/record/[Document.SearchId]

This is all working fine. Users can browse for documents within the folder structure and they can search globally across their index. However, I want to be able to constrain searches to a folder node. So let's say I have the folder structure:

  • Financial
    • Tax
      • 2010
      • 2009
      • 2008
  • Legal

I want to be able to constrain searches to, for example, Financial or Tax or 2009. One idea is to add a field to each elasticsearch document, FolderPath. For example, a document might have the FolderPath:

Financial/Tax/2009

For a query on Tax, for example, I would have a condition that FolderPath must start with "Financial/Tax". (Not quite sure yet how to do a strict starts-with condition for a field, but I'm sure it must be possible.)
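As a rough illustration of where such a FolderPath value could come from, here is a minimal Python sketch that walks the recursive parent links of a folder and joins the names root-first. The `Folder` class and `folder_path` function are hypothetical stand-ins for the relational Folder rows described above, not part of any real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    # Hypothetical stand-in for a row of the relational Folder table.
    name: str
    parent: Optional["Folder"] = None


def folder_path(folder: Folder, sep: str = "/") -> str:
    """Walk up the parent links and join the folder names root-first."""
    names = []
    node: Optional[Folder] = folder
    while node is not None:
        names.append(node.name)
        node = node.parent
    return sep.join(reversed(names))


financial = Folder("Financial")
tax = Folder("Tax", parent=financial)
y2009 = Folder("2009", parent=tax)

print(folder_path(y2009))  # Financial/Tax/2009
```

The resulting string would be stored in the FolderPath field of each elasticsearch document filed in that folder.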

How does this sound? Any other ideas? As I said, I'm a noob, so any guidance you can offer is heartily appreciated!

One way to do it is to index a "path" field with the full path, and then do wildcard matching on it.
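A minimal sketch of what such a query clause might look like, written as a Python function that only builds the JSON body. The function name is hypothetical, and the sketch assumes the field is indexed verbatim as a single term and that folder names never contain `*` or `?`. It combines an exact match on the folder with a wildcard on everything below it, so that "Financial/Tax" does not accidentally match a sibling such as "Financial/Taxes".

```python
import json


def folder_scope_clause(folder_path: str, field: str = "FolderPath") -> dict:
    """Build a clause matching the folder itself or anything beneath it."""
    return {
        "bool": {
            "should": [
                # Documents filed directly in the folder.
                {"term": {field: folder_path}},
                # Documents in any subfolder; the trailing "/" keeps
                # "Financial/Tax" from also matching "Financial/Taxes".
                {"wildcard": {field: {"value": folder_path + "/*"}}},
            ]
        }
    }


print(json.dumps(folder_scope_clause("Financial/Tax"), indent=2))
```

The returned clause is meant to be placed inside the "must" list of the surrounding boolean query, next to the user's own search terms.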

Back to the index-per-user decision: how many users are there going to be? Note that each index comes with an overhead (even with a single shard and 1 replica), since each shard is a Lucene index.

-shay.banon
On Sunday, December 12, 2010 at 12:02 AM, timscott wrote:

View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Strategy-For-Limiting-Search-To-Folder-Path-tp2070674p2070674.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

Thanks for the fast reply. I think my comment about querying on "starts-with" amounts to the same thing as wildcard matching.

Regarding indexes, the system could grow to lots of users if successful (say, thousands to tens of thousands or more). Each user would have a relatively small number of documents (5k - 10k maybe). By "user" I mean "tenant." If not with separate indexes, what is the recommended way to handle multi-tenancy when there will be a large number of micro-tenants?

In this case, a single index with a "userid" field in each doc, with queries filtered by that userid, is better. Also, make sure you use the userid as the routing value both when indexing and when searching; this will speed things up. (Routing lets the client control which shard a doc is placed on. When you then search using the same routing value, the search only runs on that shard.)

If it's mainly user-level queries, you can start with a large number of shards (you will need to play with it a bit). But since the number of shards can't change once an index is created, you can, assuming things work really well, have an index per (very large) group of users.
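To make the single-index approach concrete, here is a hedged Python sketch that builds the search path and body for one user. The index name `documents`, the function name, and the example user id are assumptions for illustration only. For simplicity the userid restriction is written as a `term` clause in `must`; where a non-scoring filter clause is available, that would be the more usual choice.

```python
import json
from urllib.parse import quote, urlencode


def user_search_request(index: str, user_id: str, user_query: dict):
    """Return (path, body) for a search restricted and routed to one user."""
    # The routing value sends the search only to the shard holding this user.
    path = "/{}/_search?{}".format(
        quote(index, safe=""), urlencode({"routing": user_id})
    )
    body = {
        "query": {
            "bool": {
                "must": [
                    # Restrict results to this user's documents.
                    {"term": {"userid": user_id}},
                    # The user's actual search.
                    user_query,
                ]
            }
        }
    }
    return path, body


path, body = user_search_request(
    "documents", "tenant-42", {"query_string": {"query": "receipt"}}
)
print(path)
print(json.dumps(body, indent=2))
```

The same routing value would also have to be supplied when each document is indexed, otherwise the routed search may look at the wrong shard.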

-shay.banon
On Sunday, December 12, 2010 at 1:24 AM, timscott wrote:

View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Strategy-For-Limiting-Search-To-Folder-Path-tp2070674p2070986.html

Shay, per your suggestion, I'm trying the wildcard search, but I must be doing it wrong.

There is a document in the index with FolderPath="Tax/2009". As part of a boolean query I have tried each of these:

"must":[{"wildcard":{"FolderPath":{"value":"Tax/2009*"}}}]
"must":[{"wildcard":{"FolderPath":{"value":"Tax*"}}}]
"must":[{"wildcard":{"FolderPath":{"value":"T*"}}}]

The document is not returned in any case. I am quite sure that this document passes all other parts of the boolean query. I proved this by changing the value to pure wildcard, like so:

"must":[{"wildcard":{"FolderPath":{"value":"*"}}}]

And the document is returned. What am I doing wrong?

Hey,

My guess is that you don't have the field marked as "index" : "not_analyzed", and, using the standard analyzer, it breaks on / characters. Here is a gist of some samples I did: gist:739365 · GitHub.
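The effect can be illustrated with a toy model in Python. This is only a crude stand-in: `toy_analyzed_terms` lowercases and splits on non-alphanumeric characters, which is not the real standard analyzer (its tokenization rules differ in detail), and `fnmatchcase` only approximates wildcard matching. The point it tries to show is that a wildcard pattern is compared against the indexed terms as they are, so an analyzed field and a not_analyzed field behave very differently.

```python
import re
from fnmatch import fnmatchcase


def toy_analyzed_terms(value: str) -> list:
    """Crude stand-in for an analyzer: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def not_analyzed_terms(value: str) -> list:
    """A not_analyzed field is indexed as one term, exactly as given."""
    return [value]


def wildcard_hits(pattern: str, terms: list) -> bool:
    """Compare the pattern to each indexed term verbatim (case-sensitive)."""
    return any(fnmatchcase(term, pattern) for term in terms)


value = "Tax/2009"
for pattern in ("Tax/2009*", "Tax*", "T*", "*"):
    print(
        pattern,
        wildcard_hits(pattern, toy_analyzed_terms(value)),
        wildcard_hits(pattern, not_analyzed_terms(value)),
    )
```

In this toy model only the bare `*` pattern matches the analyzed terms, while all four patterns match the single verbatim term.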

As a side note, there are cool faceting results that could be produced on path "types" because of their nested structure. It's not there yet, but if we add a path type, they can be implemented for it.

-shay.banon
On Monday, December 13, 2010 at 2:21 AM, timscott wrote:

View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Strategy-For-Limiting-Search-To-Folder-Path-tp2070674p2075884.html

Shay, still having trouble with this. Based on your nice example I added a mapping. It seemed to work. It returned: { "ok": true, "acknowledged": true }

When I asked the status of the index, the result contained:

"index.IndexDocument.properties.FolderPath.index":"not_analyzed",
"index.IndexDocument.properties.FolderPath.type":"string"

So I put a document into the index with FolderPath="Tax/2008" and ran a boolean query that includes:

"must":[{"wildcard":{"FolderPath":{"value":"T*"}}}]

The document was not returned. When I removed the must clause, the document was returned.

I then put a document into the Index with FolderPath="Tax" and ran the query with the must condition again. Again, it was not returned. That tells me that maybe it's something besides just the analysis, since there's no slash.

Any other thoughts?

Maybe this is a clue. This returns the document:

{query : {"field":{"FolderPath":"Tax/2008"}}}

These do not:

{query : {"wildcard":{"FolderPath":"Tax/200*"}}}
{query : {"term":{"FolderPath":"Tax/2008"}}}
{query : {"field":{"FolderPath":"Tax"}}}
{query : {"field":{"FolderPath":"2008"}}}
{query : {"field":{"FolderPath":"Tax 2008"}}}

Hey Shay,

It would be cool to see more faceting support for nested structures (paths, for example). Is there a ticket open for this that I can vote on?

Regards,
Lukas

(In reply to Shay Banon's message of Mon, Dec 13, 2010 at 7:27 PM.)

Are you sure that you set the mappings? It looks like you passed the mappings to the PUT index API, which expects index-level settings, so they were stored as settings rather than applied as a mapping.
On Tuesday, December 14, 2010 at 7:18 AM, timscott wrote:

View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Strategy-For-Limiting-Search-To-Folder-Path-tp2070674p2083803.html

Yes, that was it. I was making the _mapping call incorrectly. Thanks.
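For reference, a hedged sketch of what a put-mapping request for this field might look like, built in Python without sending anything. The thread does not show the exact call, so the `/{index}/{type}/_mapping` path is an assumption based on the type-based API of that era, and the index name `timscott` is made up. The `"type": "string"` and `"index": "not_analyzed"` values come from the status output quoted earlier. Newer Elasticsearch versions have dropped both mapping types and this string syntax, so the sketch would need adapting there.

```python
import json


def put_mapping_request(index: str, doc_type: str, field: str = "FolderPath"):
    """Return (method, path, body) for a put-mapping call on one type."""
    body = {
        doc_type: {
            "properties": {
                # Index the path as a single verbatim term.
                field: {"type": "string", "index": "not_analyzed"},
            }
        }
    }
    # Assumed endpoint shape; this is distinct from creating the index itself.
    return "PUT", "/{}/{}/_mapping".format(index, doc_type), body


method, path, body = put_mapping_request("timscott", "record")
print(method, path)
print(json.dumps(body, indent=2))
```

Documents indexed before the mapping is in place would presumably need to be reindexed for the wildcard queries to match them.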

Hey,
Have you done it with the path field?
How did you handle cases such as renaming a parent folder or moving a folder under another one?
If the folder has a bunch of child documents, you need to update the path on every one of them, so the path field seems very hard to maintain.
I'm still struggling with searching documents from a parent folder down through its subfolders.
Is there any workaround for this, or a good mapping strategy?
