Elasticsearch Sorting

By default, search results are returned sorted by relevance, with the most relevant documents first.

Relevance Score

The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.

The classic similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF (since version 5.0 the default similarity is BM25, which builds on the same ideas). One of the factors it takes into account is term frequency: how often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
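As a rough illustration of the term-frequency idea, the sketch below counts how often a term occurs in a field and dampens the count with a square root, so that extra mentions help but with diminishing returns. The class and method names are hypothetical, the whitespace tokenization is an assumption, and this is not Elasticsearch's actual scoring code.

```java
import java.util.Locale;

final class TermFrequencySketch {

    // Count the whitespace-separated tokens in the field that equal the term (case-insensitive).
    static int countOccurrences(String fieldText, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    // Dampened term frequency: grows with the count, but more slowly than linearly.
    static double termFrequency(String fieldText, String term) {
        return Math.sqrt(countOccurrences(fieldText, term));
    }

    public static void main(String[] args) {
        String oneMention = "a student wrote this";
        String fiveMentions = "student student student student student";
        System.out.println(termFrequency(oneMention, "student"));
        System.out.println(termFrequency(fiveMentions, "student"));
    }
}
```

The field with five mentions gets the larger value, which matches the "more often, more relevant" rule described above.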

Order

Sorting allows you to add one or more sorts on specific fields. Each sort can also be reversed (ascending or descending). The sort is defined at the field level, with the special field names _score to sort by score and _doc to sort by index order.

The order option can be either asc or desc.

The order defaults to desc when sorting on the _score, and defaults to asc when sorting on anything else.

GET users/_search
{
     "query" : {
            "bool" : {
                "filter" : { "term" : { "id" : 1 }}
            }
     },
     "sort": { "date": { "order": "desc" }}
}

Perhaps we want to combine the _score from a query with the date, and show all matching results sorted first by date, then by relevance.

GET /_search
{
   "query" : {
            "bool" : {
                "must" :   { "match": { "description": "student" }},
                "filter" : { "term" : { "id" : 2 }}
            }
   },
   "sort": [
            { "date":   { "order": "desc" }},
            { "_score": { "order": "desc" }}
   ]
}

Order is important. Results are sorted by the first criterion first. Only results whose first sort value is identical will then be sorted by the second criterion, and so on. Multilevel sorting doesn’t have to involve the _score. You could sort by using several different fields, on geo-distance or on a custom value calculated in a script.
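The rule that the second criterion only breaks ties left by the first can be mimicked in plain Java with a chained comparator. This is a minimal sketch, not client code: the Hit class, its fields, and the sample values are hypothetical stand-ins for search hits.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class MultiLevelSortSketch {

    // Hypothetical stand-in for a search hit; only the two sort values matter here.
    static final class Hit {
        private final String id;
        private final LocalDate date;
        private final double score;

        Hit(String id, LocalDate date, double score) {
            this.id = id;
            this.date = date;
            this.score = score;
        }

        String id() { return id; }
        LocalDate date() { return date; }
        double score() { return score; }
    }

    // First criterion: date descending. Second criterion (ties only): score descending.
    static final Comparator<Hit> DATE_THEN_SCORE =
            Comparator.comparing(Hit::date).reversed()
                    .thenComparing(Comparator.comparingDouble(Hit::score).reversed());

    // Sort three sample hits and return their ids in result order.
    static List<String> demoOrder() {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", LocalDate.of(2014, 9, 24), 0.5));
        hits.add(new Hit("b", LocalDate.of(2014, 9, 22), 0.9));
        hits.add(new Hit("c", LocalDate.of(2014, 9, 24), 0.8));
        hits.sort(DATE_THEN_SCORE);

        List<String> ids = new ArrayList<>();
        for (Hit hit : hits) {
            ids.add(hit.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(demoOrder());
    }
}
```

Hits a and c share the same date, so only those two are ordered by score. Hit b has the highest score but still comes last because its date is older.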

Elasticsearch supports sorting by array or multi-valued fields. The mode option controls which array value is picked to sort the document it belongs to. The mode option can have the following values.

min Pick the lowest value.
max Pick the highest value.
sum Use the sum of all values as sort value. Only applicable for number based array fields.
avg Use the average of all values as sort value. Only applicable for number based array fields.
median Use the median of all values as sort value. Only applicable for number based array fields.

The default sort mode in the ascending sort order is min — the lowest value is picked. The default sort mode in the descending order is max — the highest value is picked.
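To make the modes concrete, the following sketch reduces one document's multi-valued numeric field to the single value that would be compared during sorting. The class, enum, and method names are hypothetical, and averaging the two middle values for an even-length median is an assumption of this sketch rather than documented behavior.

```java
import java.util.Arrays;

final class SortModeSketch {

    enum Mode { MIN, MAX, SUM, AVG, MEDIAN }

    // Reduce a multi-valued numeric field to one sort value according to the mode.
    static double sortValue(double[] values, Mode mode) {
        if (values.length == 0) {
            throw new IllegalArgumentException("field has no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        switch (mode) {
            case MIN:
                return sorted[0];
            case MAX:
                return sorted[sorted.length - 1];
            case SUM:
                return Arrays.stream(sorted).sum();
            case AVG:
                return Arrays.stream(sorted).sum() / sorted.length;
            case MEDIAN:
                int mid = sorted.length / 2;
                // Assumption: for an even number of values, average the two middle ones.
                return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
            default:
                throw new AssertionError(mode);
        }
    }

    // Default mode as described above: min for ascending order, max for descending order.
    static Mode defaultMode(boolean ascending) {
        return ascending ? Mode.MIN : Mode.MAX;
    }

    public static void main(String[] args) {
        double[] prices = {20, 60, 10};
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + sortValue(prices, mode));
        }
    }
}
```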

Note that filters have no bearing on _score, and the missing-but-implied match_all query just sets the _score to a neutral value of 1 for all documents. In other words, all documents are considered to be equally relevant.

 

Sorting Numeric Fields

For numeric fields it is also possible to cast the values from one type to another using the numeric_type option. This option accepts the following values: ["double", "long", "date", "date_nanos"] and can be useful for searches across multiple data streams or indices where the sort field is mapped differently.

Geo Distance Sorting

Sometimes you want to sort by how close a location is to a single point (lat/lon). Elasticsearch supports this with the _geo_distance sort.

GET elasticsearch_learning/_search
{
"sort":[{
  "_geo_distance" : {
    "addresses.location" : [
      {
        "lat" : 40.414897,
        "lon" : -111.881186
      }
    ],
    "unit" : "m",
    "distance_type" : "arc",
    "order" : "desc",
    "nested" : {
      "path" : "addresses",
      "filter" : {
        "geo_distance" : {
          "addresses.location" : [
            -111.881186,
            40.414897
          ],
          "distance" : 1609.344,
          "distance_type" : "arc",
          "validation_method" : "STRICT",
          "ignore_unmapped" : false,
          "boost" : 1.0
        }
      }
    },
    "validation_method" : "STRICT",
    "ignore_unmapped" : false
  }
}]
}

 

/**
 * https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-nested-query.html<br>
 * https://www.elastic.co/guide/en/elasticsearch/reference/7.3/search-request-body.html#geo-sorting<br>
 * Sort results based on how close locations are to a certain point.
 */
@Test
void sortQueryWithGeoLocation() {

    int pageNumber = 0;
    int pageSize = 10;

    SearchRequest searchRequest = new SearchRequest(database);
    searchRequest.allowPartialSearchResults(true);
    searchRequest.indicesOptions(IndicesOptions.lenientExpandOpen());

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(pageNumber * pageSize);
    searchSourceBuilder.size(pageSize);
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    /**
     * fetch only a few fields
     */
    searchSourceBuilder.fetchSource(new String[]{"id", "firstName", "lastName", "rating", "dateOfBirth", "addresses.street", "addresses.zipcode", "addresses.city"}, new String[]{""});

    /**
     * Lehi skate park: 40.414897, -111.881186<br>
     * get locations/addresses close to skate park(from a radius).<br>
     */

    searchSourceBuilder.sort(new GeoDistanceSortBuilder("addresses.location", 40.414897,
            -111.881186).order(SortOrder.DESC)
           .setNestedSort(
                   new NestedSortBuilder("addresses").setFilter(QueryBuilders.geoDistanceQuery("addresses.location").point(40.414897, -111.881186).distance(1, DistanceUnit.MILES))));
    
    log.info("\n{\n\"sort\":{}\n}", searchSourceBuilder.sorts().toString());

    searchRequest.source(searchSourceBuilder);

    searchRequest.preference("nested-address");

    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        log.info("hits={}, isTimedOut={}, totalShards={}, totalHits={}", searchResponse.getHits().getHits().length, searchResponse.isTimedOut(), searchResponse.getTotalShards(),
                searchResponse.getHits().getTotalHits().value);

        List<User> users = getResponseResult(searchResponse.getHits());

        log.info("results={}", ObjectUtils.toJson(users));

    } catch (IOException e) {
        log.warn("IOException, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    } catch (Exception e) {
        log.warn("Exception, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    }

}
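For intuition about what an arc distance sort compares, the sketch below computes the great-circle distance between two points with the haversine formula. It assumes a spherical Earth with a mean radius of 6,371,008.8 m. The class and method names are hypothetical, and this is not necessarily the exact computation Elasticsearch performs internally.

```java
final class GeoDistanceSketch {

    // Assumed mean Earth radius in meters (spherical model).
    static final double EARTH_RADIUS_M = 6_371_008.8;

    // Great-circle distance in meters between two lat/lon points given in degrees.
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Reference point from the query above; the second point is made up for illustration.
        double meters = haversineMeters(40.414897, -111.881186, 40.420000, -111.880000);
        System.out.println("distance in meters: " + meters);
    }
}
```

Sorting by this value in ascending order would put the nearest address first, while the desc order used in the request above puts the farthest one first.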

 

Query with explain

Adding explain produces a lot of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. Don’t worry if it doesn’t all make sense now; you can refer to this section when you need it. We’ll work through the output for one hit bit by bit.

GET users/_search?explain
{
   "query" : { "match" : { "description" : "student" }}
}

Producing the explain output is expensive. It is a debugging tool only. Don’t leave it turned on in production.

Fielddata

To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata. Elasticsearch doesn’t just load the values for the documents that matched a particular query. It loads the values from every document in your index, regardless of the document type.

The reason that Elasticsearch loads all values into memory is that uninverting the index from disk is slow. Even though you may need the values for only a few docs for the current request, you will probably need access to the values for other docs on the next request, so it makes sense to load all the values into memory at once, and to keep them there.
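"Uninverting" can be pictured as flipping a term-to-documents map into a document-to-value array, which is the shape a sort needs. The sketch below is a simplified illustration only: it assumes a single-valued field, uses hypothetical names and sample data, and ignores how Lucene actually stores and loads these structures.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UninvertSketch {

    // Inverted index: term -> ids of the documents containing it.
    // Uninverting flips it into docId -> term so a sort can look up each document's value directly.
    static String[] uninvert(Map<String, List<Integer>> invertedIndex, int docCount) {
        String[] valueByDoc = new String[docCount];
        for (Map.Entry<String, List<Integer>> entry : invertedIndex.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    // Three documents with a single-valued date field (sample data).
    static String[] demo() {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        inverted.put("2014-09-22", List.of(1));
        inverted.put("2014-09-24", List.of(0, 2));
        return uninvert(inverted, 3);
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", demo()));
    }
}
```

Note that the loop visits every term and every document, not just those matching a query, which is why doing this once and keeping the result in memory is cheaper than repeating it per request.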

All you need to know is what fielddata is, and to be aware that it can be memory hungry. We will talk about how to determine the amount of memory that fielddata is using, how to limit the amount of memory that is available to it, and how to preload fielddata to improve the user experience.

Source Code on Github

 



