Elasticsearch Data Types

Keyword is used for structured content such as IDs, email addresses, hostnames, status codes, zip codes, or tags. Content of keyword data type does not get broken down like the text data type. Only a search for the exact email “folaukaveinga@gmail.com” would return it as a result.

Consider mapping a numeric identifier as a keyword if:

  • You don’t plan to search for the identifier data using range queries.
  • Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.

If you’re unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.

Text data type is a field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms(broken down) before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field. Text fields are not used for sorting and seldom used for aggregations.

text fields are searchable by default, but by default are not available for aggregations, sorting, or scripting. If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:

Fielddata is disabled on text fields by default. Set fielddata=true on your_field_name in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.

Field data is the only way to access the analyzed tokens from a full text field in aggregations, sorting, or scripting. For example, a full text field like New York would get analyzed as new and york. To aggregate on these tokens requires field data.

PUT doctor/_mapping
{
  "properties": {
    "email": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

Numeric data type can be long, integer, short, byte, double, float, and unsign_float. As far as integer types (byteshortinteger and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

For floating-point types, it is often more efficient to store floating-point data into an integer using a scaling factor, which is what the scaled_float type does under the hood. For instance, a price field could be stored in a scaled_float with a scaling_factor of 100. All APIs would work as if the field was stored as a double, but under the hood Elasticsearch would be working with the number of cents, price*100, which is an integer. This is mostly helpful to save disk space since integers are way easier to compress than floating points. scaled_float is also fine to use in order to trade accuracy for disk space. For instance imagine that you are tracking cpu utilization as a number between 0 and 1. It usually does not matter much whether cpu utilization is 12.7% or 13%, so you could use a scaled_float with a scaling_factor of 100 in order to round cpu utilization to the closest percent in order to save space.

If scaled_float is not a good fit, then you should pick the smallest type that is enough for the use-case among the floating-point types: doublefloat and half_float. Here is a table that compares these types in order to help make a decision.

Date date type in Elasticsearch can either be:

  • strings containing formatted dates, e.g. "2015-01-01" or "2015/01/01 12:10:30"
  • a number representing milliseconds-since-the-epoch.
  • a number representing seconds-since-the-epoch

Dates will always be rendered as strings, even if they were initially supplied as a long in the JSON document. Date formats can be customised, but if no format is specified then it uses the default.

Multiple formats can be specified by separating them with || as a separator. Each format will be tried in turn until a matching format is found. The first format will be used to convert the milliseconds-since-the-epoch value back into a string.

{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

Nested data type is a specialised version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other. 

When ingesting key-value pairs with a large, arbitrary set of keys, you might consider modeling each key-value pair as its own nested document with key and value fields. Instead, consider using the flattened data type, which maps an entire object as a single field and allows for simple searches over its contents. Nested documents and queries are typically expensive, so using the flattened data type for this use case is a better option.

Nested documents can be:

Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.

There is a limit of 50 nested documents which can be changed to a higher value.

 

Flattened data type provides an alternative approach, where the entire object is mapped as a single field. Given an object, the flattened mapping will parse out its leaf values and index them into one field as keywords. The object’s contents can then be searched through simple queries and aggregations.

This data type can be useful for indexing objects with a large or unknown number of unique keys. Only one field mapping is created for the whole JSON object, which can help prevent a mappings explosion from having too many distinct field mappings.

On the other hand, flattened object fields present a trade-off in terms of search functionality. Only basic queries are allowed, with no support for numeric range queries or highlighting.

The flattened mapping type should not be used for indexing all document content, as it treats all values as keywords and does not provide full search functionality. The default approach, where each subfield has its own entry in the mappings, works well in the majority of cases.

Join data type is a special field that creates parent/child relation within documents of the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name.

The join field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.

The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such case is a use case with products and offers for these products. In the case that offers significantly outnumbers the number of products then it makes sense to model the product as parent document and the offer as child document.

Parent-join restrictions:

  • Only one join field mapping is allowed per index.
  • Parent and child documents must be indexed on the same shard. This means that the same routing value needs to be provided when gettingdeleting, or updating a child document.
  • An element can have multiple children but only one parent.
  • It is possible to add a new relation to an existing join field.
  • It is also possible to add a child to an existing element but only if the element is already a parent.

 

Multi Fields

It is often useful to index the same field in different ways for different purposes. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations. Alternatively, you could index a text field with the standard analyzer, the english analyzer, and the french analyzer.

This is the purpose of multi-fields. Most field types support multi-fields via the fields parameter.

 

 




Subscribe To Our Newsletter
You will receive our latest post and tutorial.
Thank you for subscribing!

required
required


Leave a Reply

Your email address will not be published. Required fields are marked *