Elasticsearch Modeling Data

Elasticsearch, like most NoSQL databases, treats the world as though it were flat. An index is a flat collection of independent documents. A single document should contain all of the information that is required to decide whether it matches a search request.

Denormalizing your Data

The way to get the best search performance out of Elasticsearch is to use it as it is intended, by denormalizing your data at index time. Having redundant copies of data in each document that requires access to it removes the need for joins.

If we want to be able to find a blog post by the name of the user who wrote it, include the user’s name in the blog-post document itself.

PUT /users/blogpost/2
{
  "title": "Today Spirit",
  "body": "Let's go!",
  "user": {
    "id": 1,
    "name": "Folau Kaveinga"
  }
}
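Because the user's name lives inside the blog-post document, a single query can match on both the post and its author without a join. The following is a minimal sketch in Python that only builds the search body; the helper name posts_by_author_query is hypothetical, and the body would be sent to a search endpoint such as GET /users/blogpost/_search (assumed from the PUT request above).

```python
import json


def posts_by_author_query(title_text: str, author_name: str) -> dict:
    """Build a search body matching the post title and the denormalized user.name.

    Hypothetical helper for illustration; field names follow the example document.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": title_text}},
                    # user.name is the copy of the user's name stored in each post
                    {"match": {"user.name": author_name}},
                ]
            }
        }
    }


body = posts_by_author_query("spirit", "Kaveinga")
print(json.dumps(body, indent=2))
```

Both clauses are evaluated against the same blog-post document, which is what denormalizing at index time buys us.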

Of course, data denormalization has downsides too. The first disadvantage is that the index will be bigger because the _source document for every blog post is bigger, and there are more indexed fields. This usually isn’t a huge problem. The data written to disk is highly compressed, and disk space is cheap. Elasticsearch can happily cope with the extra data.

The more important issue is that, if the user were to change his name, all of his blog posts would need to be updated too. Fortunately, users don’t often change names. Even if they did, it is unlikely that a user would have written more than a few thousand blog posts, so updating blog posts with the scroll and bulk APIs would take less than a second.
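To make the rename concrete, here is a rough sketch of the bulk half of that update. It assumes the IDs of the user's posts have already been collected (for example, by scrolling over a search on user.id) and only builds the newline-delimited body for the _bulk endpoint. The function name bulk_rename_body and the sample IDs are hypothetical, and older releases that still use mapping types would likely also need a _type entry in each action line.

```python
import json


def bulk_rename_body(index: str, post_ids: list, user_id: int, new_name: str) -> str:
    """Build a newline-delimited _bulk body with one partial update per blog post.

    Hypothetical helper: post_ids are assumed to come from a scroll over the
    user's posts.
    """
    lines = []
    for post_id in post_ids:
        # Action line: which document to update
        lines.append(json.dumps({"update": {"_index": index, "_id": post_id}}))
        # Partial document: only the denormalized user object changes
        lines.append(json.dumps({"doc": {"user": {"id": user_id, "name": new_name}}}))
    # The bulk body must end with a trailing newline
    return "\n".join(lines) + "\n"


body = bulk_rename_body("users", ["2", "7"], 1, "Folau K. Kaveinga")
print(body)
```

Each post contributes two lines, so a few thousand posts still fit comfortably in a handful of bulk requests.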

Nested Objects

Given the fact that creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document. For instance, we could store an order and all of its order lines in one document, or we could store a blog post and all of its comments together, by passing an array of comments.

Note that each nested object is indexed as a separate hidden document, so we can’t query nested objects directly. Instead, we have to use the nested query or nested filter to access them.
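As an illustration, the sketch below builds a mapping that declares a comments field as nested, plus a nested query against it. The comment fields (name, age) are hypothetical, and the exact shape of the mapping body differs between Elasticsearch versions (older releases place the properties under a type name), so treat this as a sketch rather than a definitive request.

```python
import json

# Mapping fragment: "type": "nested" makes each comment a hidden separate document.
# The comment fields below are hypothetical examples.
mapping = {
    "properties": {
        "comments": {
            "type": "nested",
            "properties": {
                "name": {"type": "keyword"},
                "age": {"type": "short"},
            },
        }
    }
}

# Nested query: "path" names the nested field, and the inner query is evaluated
# against one nested object at a time.
query = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.name": "john"}},
                        {"match": {"comments.age": 28}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

Note that the inner clauses use the full field path (comments.name, not just name).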

By indexing each nested object separately, the fields within the object maintain their relationships. We can run queries that will match only if the match occurs within the same nested object.
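The difference is easier to see in a small simulation. The plain-Python sketch below does not talk to Elasticsearch; it only mimics how an array of ordinary inner objects is flattened into per-field value lists, compared with matching one object at a time. The function names and sample data are hypothetical.

```python
def flatten(objects):
    """Mimic flattening of an array of inner objects: one set of values per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, set()).add(value)
    return flat


def matches_flattened(objects, criteria):
    """Each criterion may be satisfied by a different object (relationships are lost)."""
    flat = flatten(objects)
    return all(value in flat.get(key, set()) for key, value in criteria.items())


def matches_nested(objects, criteria):
    """All criteria must be satisfied by the same object."""
    return any(
        all(obj.get(key) == value for key, value in criteria.items())
        for obj in objects
    )


comments = [{"name": "alice", "age": 31}, {"name": "john", "age": 28}]
criteria = {"name": "alice", "age": 28}  # no single comment has both values

print(matches_flattened(comments, criteria))  # True: a cross-object match
print(matches_nested(comments, criteria))     # False: no single object matches
```

The flattened version matches because "alice" and 28 both appear somewhere in the document, even though they belong to different comments.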

In addition, because of the way that nested objects are indexed, joining the nested documents to the root document at query time is fast, almost as fast as if they were a single document.

Because these extra nested documents are hidden, we can’t access them directly: to update, add, or remove a nested object, we have to reindex the whole document. It’s important to note that the result returned by a search request is not the nested object alone; it is the whole document.

When should you use nested objects?

Nested objects are useful when there is one main entity, like our user, with a limited number of closely related but less important entities, such as addresses. It is useful to be able to find addresses based on the content of the street or zipcode, and the nested query and filter provide for fast query-time joins.

Retiring Data

As time-based data ages, it becomes less relevant. It’s possible that we will want to see what happened last week, last month, or even last year, but for the most part, we’re interested in only the here and now. If we store such data in one index per time frame (for example, one index per day or per month), retiring old data is easy: just delete the indices that are no longer relevant.
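As a rough sketch of how this might be automated, the Python function below picks out the indices that have aged past a retention window. It assumes a hypothetical naming scheme of one index per day, such as logs-2014.10.01; the function name, prefix, and retention period are illustrative only. Each returned name could then be removed with a DELETE request on that index.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, keep_days, prefix="logs-"):
    """Return the daily indices whose date is older than the retention window.

    Assumes hypothetical index names of the form <prefix>YYYY.MM.DD.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a time-based index we manage
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if day < cutoff:
            expired.append(name)
    return expired


names = ["logs-2014.09.01", "logs-2014.10.01", "logs-2014.10.05", "users"]
old = expired_indices(names, today=date(2014, 10, 7), keep_days=7)
print(old)  # ['logs-2014.09.01']
```

Deleting a whole index is much cheaper than deleting individual documents from a shared index, which is why the per-time-frame layout pays off here.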

 



