The goal of filters is to reduce the number of documents that have to be examined by the query. Queries must not only find matching documents but also calculate how relevant each document is, which typically makes queries heavier than filters. Query results are also not cacheable. Filters are quick to calculate and easy to cache in memory, using only 1 bit per document, and these cached filters can be reused efficiently for subsequent requests.
When to use filter vs query?
As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.
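For example, the two can be combined in a single bool query — a sketch using the index and fields from the examples below, where the match clause contributes to the _score while the term clause only filters:

GET elasticsearch_learning/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "student" } }
      ],
      "filter": [
        { "term": { "rating": 5 } }
      ]
    }
  }
}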
There are two ways to filter search results:

filter clause. Search requests apply boolean filters to both search hits and aggregations.

post_filter parameter. Search requests apply post filters only to search hits, not aggregations. You can use a post filter to calculate aggregations based on a broader result set, and then further narrow the results. A post filter has no impact on the aggregation results.

Term filter
The term filter is used to filter by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields. Here we are filtering out all users whose rating is not 5; in other words, we only retrieve users with a rating of 5.
GET elasticsearch_learning/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "rating": 5 } }
      ]
    }
  }
}
/**
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html
 */
@Test
void filterQuery() {
    int pageNumber = 0;
    int pageSize = 5;
    SearchRequest searchRequest = new SearchRequest(database);
    searchRequest.allowPartialSearchResults(true);
    searchRequest.indicesOptions(IndicesOptions.lenientExpandOpen());

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(pageNumber * pageSize);
    searchSourceBuilder.size(pageSize);
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    // fetch only a few fields
    searchSourceBuilder.fetchSource(new String[]{"id", "firstName", "lastName", "rating", "dateOfBirth"}, new String[]{""});

    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.filter(QueryBuilders.termQuery("rating", 5));
    searchSourceBuilder.query(boolQuery);

    searchRequest.source(searchSourceBuilder);
    searchRequest.preference("rating");

    if (searchSourceBuilder.query() != null && searchSourceBuilder.sorts() != null && searchSourceBuilder.sorts().size() > 0) {
        log.info("\n{\n\"query\":{}, \"sort\":{}\n}", searchSourceBuilder.query().toString(), searchSourceBuilder.sorts().toString());
    } else {
        log.info("\n{\n\"query\":{}\n}", searchSourceBuilder.query().toString());
    }

    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        log.info("isTimedOut={}, totalShards={}, totalHits={}", searchResponse.isTimedOut(), searchResponse.getTotalShards(),
                searchResponse.getHits().getTotalHits().value);
        List<User> users = getResponseResult(searchResponse.getHits());
        log.info("results={}", ObjectUtils.toJson(users));
    } catch (IOException e) {
        log.warn("IOException, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    } catch (Exception e) {
        log.warn("Exception, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    }
}
Range filter
The range filter allows you to find numbers or dates that fall into a specified range. Here we only retrieve users whose rating is a 2, 3, or 4.
GET elasticsearch_learning/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "rating": { "from": 2, "to": 4 } } }
      ]
    }
  }
}
/**
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html
 */
@Test
void filterQueryWithRange() {
    int pageNumber = 0;
    int pageSize = 5;
    SearchRequest searchRequest = new SearchRequest(database);
    searchRequest.allowPartialSearchResults(true);
    searchRequest.indicesOptions(IndicesOptions.lenientExpandOpen());

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(pageNumber * pageSize);
    searchSourceBuilder.size(pageSize);
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    // fetch only a few fields
    searchSourceBuilder.fetchSource(new String[]{"id", "firstName", "lastName", "rating", "dateOfBirth"}, new String[]{""});

    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.filter(QueryBuilders.rangeQuery("rating").gte(2).lte(4));
    searchSourceBuilder.query(boolQuery);

    searchRequest.source(searchSourceBuilder);
    searchRequest.preference("rating");

    if (searchSourceBuilder.query() != null && searchSourceBuilder.sorts() != null && searchSourceBuilder.sorts().size() > 0) {
        log.info("\n{\n\"query\":{}, \"sort\":{}\n}", searchSourceBuilder.query().toString(), searchSourceBuilder.sorts().toString());
    } else {
        log.info("\n{\n\"query\":{}\n}", searchSourceBuilder.query().toString());
    }

    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        log.info("isTimedOut={}, totalShards={}, totalHits={}", searchResponse.isTimedOut(), searchResponse.getTotalShards(),
                searchResponse.getHits().getTotalHits().value);
        List<User> users = getResponseResult(searchResponse.getHits());
        log.info("results={}", ObjectUtils.toJson(users));
    } catch (IOException e) {
        log.warn("IOException, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    } catch (Exception e) {
        log.warn("Exception, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    }
}
Exists Filter
The exists and missing filters are used to find documents in which the specified field either has one or more values (exists) or doesn't have any values (missing). It is similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL.
{ "exists": { "field": "name" } }
Here we only retrieve users that have logged into the system.
GET elasticsearch_learning/_search
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "lastLoggedInAt" } }
      ]
    }
  }
}
/**
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html
 */
@Test
void filterQueryWithExists() {
    int pageNumber = 0;
    int pageSize = 5;
    SearchRequest searchRequest = new SearchRequest(database);
    searchRequest.allowPartialSearchResults(true);
    searchRequest.indicesOptions(IndicesOptions.lenientExpandOpen());

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(pageNumber * pageSize);
    searchSourceBuilder.size(pageSize);
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    // fetch only a few fields
    searchSourceBuilder.fetchSource(new String[]{"id", "firstName", "lastName", "rating", "dateOfBirth"}, new String[]{""});

    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.filter(QueryBuilders.existsQuery("lastLoggedInAt"));
    searchSourceBuilder.query(boolQuery);

    searchRequest.source(searchSourceBuilder);
    searchRequest.preference("rating");

    if (searchSourceBuilder.query() != null && searchSourceBuilder.sorts() != null && searchSourceBuilder.sorts().size() > 0) {
        log.info("\n{\n\"query\":{}, \"sort\":{}\n}", searchSourceBuilder.query().toString(), searchSourceBuilder.sorts().toString());
    } else {
        log.info("\n{\n\"query\":{}\n}", searchSourceBuilder.query().toString());
    }

    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        log.info("isTimedOut={}, totalShards={}, totalHits={}", searchResponse.isTimedOut(), searchResponse.getTotalShards(),
                searchResponse.getHits().getTotalHits().value);
        List<User> users = getResponseResult(searchResponse.getHits());
        log.info("results={}", ObjectUtils.toJson(users));
    } catch (IOException e) {
        log.warn("IOException, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    } catch (Exception e) {
        log.warn("Exception, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    }
}
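Note that the standalone missing filter was removed in later Elasticsearch versions; the equivalent today is an exists query wrapped in must_not. A minimal sketch of the inverse of the query above (users that have never logged in):

GET elasticsearch_learning/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "exists": { "field": "lastLoggedInAt" } }
      ]
    }
  }
}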
Imagine having to execute a long-running process and waiting for it to finish. Most likely you are connected via ssh, and you can't do anything else but wait for that process to finish. If the process takes too long, your connection will time out and you will be kicked out. When that happens, your process will stop executing and you will have to rerun it. This is painful, and this is where screen comes into the picture.
For example, say you are downloading a big file or dumping a database with the mysqldump command. This usually takes 10 to 20 minutes.
You can use the Linux screen command to push running terminal applications or processes to the background and pull them forward when you want to see them. It also supports split-screen displays and works over SSH connections, even after you disconnect and reconnect.
The usual operation with screen is to create a new window with a shell in it, run a command, and then push the window to the background (called “detaching”). When you want to see how your process is doing, you can pull the window to the foreground again (“reattach”) and use it again. This is great for long processes you don’t want to accidentally terminate by closing the terminal window.
Once you've got a screen session running, you can create new windows and run other processes in them. You can easily hop between windows to monitor their progress. You can also split your terminal window into vertical or horizontal regions, and display your various screen windows in one window.
You can connect to a remote machine, start a screen session, and launch a process. You can disconnect from the remote host, reconnect, and your process will still be running.
You can share a screen session between two different SSH connections so two people can see the same thing, in real time.
Install screen
# ubuntu
sudo apt update
sudo apt install screen

# centos or fedora
sudo yum install screen
Start a screen session
screen

# start screen session with a name
screen -S loading-database
This will open a screen session, create a new window, and start a shell in that window.
Most used commands for a screen window
Ctrl+a c         Create a new window (with shell).
Ctrl+a "         List all windows.
Ctrl+a 0         Switch to window 0 (by number).
Ctrl+a A         Rename the current window.
Ctrl+a S         Split current region horizontally into two regions.
Ctrl+a |         Split current region vertically into two regions.
Ctrl+a tab       Switch the input focus to the next region.
Ctrl+a Ctrl+a    Toggle between the current and previous windows.
Ctrl+a Q         Close all regions but the current one.
Ctrl+a X         Close the current region.

Detach from a screen session
Ctrl+a Ctrl+d
The program running in the screen session will continue to run after you detach from the session.
Reattach to a screen session
# resume current screen session, assuming one screen session
screen -r
In case you have multiple screen sessions running on your machine, you will need to append the screen session ID after the -r switch.
# list the current running screen sessions
screen -ls

# output
There is a screen on:
        30093.pts-0.ip-172-31-8-213     (09/24/21 16:07:23)     (Attached)
1 Socket in /run/screen/S-ubuntu.

# command
screen -r 30093
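If the session is still marked (Attached) — for example, after a dropped SSH connection — plain screen -r will refuse to connect. Adding the -d switch detaches it from the other terminal first:

# detach the session from the other terminal, then reattach it here
screen -d -r 30093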
Terminate a screen session
Make sure you are on the screen session, then run the exit command.
exit
When running queries in your production environment, you have to be careful not to bring it down or make a mistake that causes your environment to act abnormally.
Here are some things you need to be careful about:
The MySQL READ UNCOMMITTED isolation level is also known as "dirty read" because it is the lowest level of isolation. Setting it for the session overrides the current default isolation level. MySQL's default isolation level is REPEATABLE READ, which means locks will be placed for each operation, but multiple connections can still read data concurrently.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM TABLE_NAME;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Do everything you can to run your query against a dev, qa, or backup database before running it against a production database. And when you do run your query against production, run it first against backup tables and then the actual tables. That way, if you make a mistake on the first try, you will be safe because it ran against the backup and not the actual live data. Once everything looks good in the backup tables, you can run your query against the actual live data.
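A minimal sketch of that workflow; the user table and the TRIM statement here are purely hypothetical examples:

-- create a backup copy of the table
CREATE TABLE user_backup AS SELECT * FROM user;

-- rehearse the risky statement against the backup first
UPDATE user_backup SET first_name = TRIM(first_name);

-- verify the backup rows look right, and only then run the same statement against user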
Make sure that your dev database server is running the same version as your production database server. If they must differ, the production server should be on a higher version than the dev server; in most mismatched setups it is the dev server that is on the higher version, so be careful. The reason for this check is that you may have a query that works in dev but won't work in production because it uses functions the production server does not support. For example, JSON_ARRAYAGG is not available in older MySQL 5.7 releases. My dev server was on MySQL 8 and the production server was on 5.7. Not until release day did I realize my database migration was not going to work in production, and I had to pull it back.
If possible, run insert/update migration scripts during slow or off hours. There is a good chance you will run into issues, especially when updating data based on existing data in the database: you don't want to be reading live data to perform updates while users are actively using and changing that same data.
You can never go wrong with having help and verifying your query does what it is intended to do. In most cases you work on a team, so have a team member or your team lead look at your query to make sure it does what is intended. You can't take the risk of running a query that may change thousands of records without having someone else review it.
MySQL has supported the native JSON data type since version 5.7.8. The native JSON data type allows you to store JSON documents more efficiently than the JSON text format used in previous versions.
MySQL stores JSON documents in an internal format that allows quick read access to document elements. The JSON binary format is structured in a way that permits the server to search for values within the JSON document directly by key or array index, which is very fast.
The storage of a JSON document is approximately the same as the storage of LONGBLOB or LONGTEXT data.
CREATE TABLE events (
  ...
  browser_info JSON,
  ...
);
Insert into a JSON column
INSERT INTO events(browser_info)
VALUES (
  '{ "name": "Safari", "os": "Mac", "resolution": { "x": 1920, "y": 1080 } }'
);
MySQL performs automatic validation of JSON documents stored in JSON columns. Invalid documents produce an error.
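For example, attempting to insert a malformed document is rejected (the exact error text varies by version):

INSERT INTO events(browser_info) VALUES ('not valid json');
-- ERROR 3140 (22032): Invalid JSON text ...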
JSON_OBJECT evaluates a (possibly empty) list of key-value pairs and returns a JSON object containing those pairs. An error occurs if any key name is NULL or the number of arguments is odd.
SELECT JSON_OBJECT('id', u.id, 'firstName', u.first_name, 'lastName', u.last_name) as jsonUser
FROM user as u;
JSON_ARRAY evaluates a (possibly empty) list of values and returns a JSON array containing those values.
SELECT JSON_ARRAY(u.id, u.first_name, u.last_name) as jsonUser
FROM user as u;
JSON_OBJECTAGG
Return result set as a single JSON object
Takes two column names or expressions as arguments, the first of these being used as a key and the second as a value, and returns a JSON object containing key-value pairs. Returns NULL if the result contains no rows, or in the event of an error. An error occurs if any key name is NULL or the number of arguments is not equal to 2.
SELECT JSON_OBJECTAGG(u.id, u.first_name) as jsonData
FROM user as u;

-- output
{"1": "John", "2": "Peter"}
Return result set as a single JSON array
JSON_ARRAYAGG aggregates a result set as a single JSON array whose elements consist of the rows. The order of elements in this array is undefined. The function acts on a column or an expression that evaluates to a single value. Returns NULL if the result contains no rows, or in the event of an error.
SELECT JSON_PRETTY(JSON_OBJECT('userId', u.id, 'cards', cardList)) as jsonData
FROM user as u
LEFT JOIN (
  SELECT c.user_id,
         JSON_ARRAYAGG(
           JSON_OBJECT(
             'cardId', c.id,
             'cardNumber', c.card_number)
         ) as cardList
  FROM card as c
  GROUP BY c.user_id
) as cards ON u.id = cards.user_id;
{ "cards": [ { "cardId": 4, "cardNumber": "2440531" }, { "cardId": 11, "cardNumber": "4061190" } ], "userId": 1 }
How to accomplish JSON_ARRAYAGG in versions that don't support it
SELECT JSON_PRETTY(JSON_OBJECT('userId', u.id, 'cards', cardList)) as jsonData
FROM user as u
LEFT JOIN (
  SELECT c.user_id,
         CONCAT('[',
           GROUP_CONCAT(
             JSON_OBJECT(
               'cardId', c.id,
               'cardNumber', c.card_number)
           ),
         ']') as cardList
  FROM card as c
  GROUP BY c.user_id
) as cards ON u.id = cards.user_id;
JSON_PRETTY provides pretty-printing of JSON values similar to that implemented in PHP and by other languages and database systems. The value supplied must be a JSON value or a valid string representation of a JSON value.
SELECT JSON_PRETTY(JSON_OBJECT('id', u.id, 'firstName', u.first_name, 'lastName', u.last_name)) as jsonUser
FROM user as u;
JSON_EXTRACT(json_doc, path[, path] ...)
Returns data from a JSON document, selected from the parts of the document matched by the path arguments. Returns NULL if any argument is NULL or no paths locate a value in the document. An error occurs if the json_doc argument is not a valid JSON document or any path argument is not a valid path expression.
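For example, against the events table defined earlier (the paths assume the browser_info document inserted above):

SELECT JSON_EXTRACT(e.browser_info, '$.name') as browserName,
       JSON_EXTRACT(e.browser_info, '$.resolution.x') as resolutionX
FROM events as e;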
By default, search results are returned sorted by relevance, with the most relevant docs first.
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.
The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF. One of the main factors it takes into account is term frequency: the more often a term appears in a field, the more relevant the document. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
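As a rough sketch, the classic Lucene similarity behind TF/IDF weighs each term approximately as follows, where freq(t, d) is how often term t appears in document d, N is the number of documents in the index, and df(t) is the number of documents containing t:

tf(t, d) = sqrt(freq(t, d))
idf(t)   = 1 + ln(N / (df(t) + 1))

So a term that appears often in a document but rarely in the index as a whole contributes the most to the _score.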
Sorting allows you to add one or more sorts on specific fields. Each sort can be reversed (ascending or descending) as well. The sort is defined on a per-field level, with the special field name _score to sort by score, and _doc to sort by index order.
The order option can have either asc or desc. The order defaults to desc when sorting on the _score, and defaults to asc when sorting on anything else.
GET users/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "id": 1 } }
    }
  },
  "sort": { "date": { "order": "desc" } }
}
Perhaps we want to combine the _score from a query with the date, and show all matching results sorted first by date, then by relevance.
GET /_search
{
  "query": {
    "bool": {
      "must": { "match": { "description": "student" } },
      "filter": { "term": { "id": 2 } }
    }
  },
  "sort": [
    { "date": { "order": "desc" } },
    { "_score": { "order": "desc" } }
  ]
}
Order is important. Results are sorted by the first criterion first. Only results whose first sort value is identical will then be sorted by the second criterion, and so on. Multilevel sorting doesn’t have to involve the _score. You could sort by using several different fields, on geo-distance or on a custom value calculated in a script.
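A script-based sort might look like this sketch, assuming a numeric rating field; the doubling is purely illustrative:

GET elasticsearch_learning/_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": { "source": "doc['rating'].value * 2" },
      "order": "desc"
    }
  }
}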
Elasticsearch supports sorting by array or multi-valued fields. The mode option controls what array value is picked for sorting the document it belongs to. The mode option can have the following values:
min       Pick the lowest value.
max       Pick the highest value.
sum       Use the sum of all values as sort value. Only applicable for number based array fields.
avg       Use the average of all values as sort value. Only applicable for number based array fields.
median    Use the median of all values as sort value. Only applicable for number based array fields.
The default sort mode in the ascending sort order is min: the lowest value is picked. The default sort mode in the descending order is max: the highest value is picked.
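A sketch of an explicit mode, assuming rating were a multi-valued numeric field: sort each user by the average of its ratings.

GET elasticsearch_learning/_search
{
  "sort": [
    { "rating": { "order": "asc", "mode": "avg" } }
  ]
}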
Note that filters have no bearing on _score, and the missing-but-implied match_all query just sets the _score to a neutral value of 1 for all documents. In other words, all documents are considered to be equally relevant.
For numeric fields it is also possible to cast the values from one type to another using the numeric_type option. This option accepts the following values: ["double", "long", "date", "date_nanos"], and can be useful for searches across multiple data streams or indices where the sort field is mapped differently.
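A sketch, assuming a hypothetical price field mapped as a float in one index and a double in another; casting both to double makes the sort values comparable:

GET /_search
{
  "sort": [
    { "price": { "numeric_type": "double", "order": "asc" } }
  ]
}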
Sometimes you want to sort by how close a location is to a single point (lat/lon). You can do this in Elasticsearch.
GET elasticsearch_learning/_search
{
  "sort": [{
    "_geo_distance": {
      "addresses.location": [
        { "lat": 40.414897, "lon": -111.881186 }
      ],
      "unit": "m",
      "distance_type": "arc",
      "order": "desc",
      "nested": {
        "path": "addresses",
        "filter": {
          "geo_distance": {
            "addresses.location": [ -111.881186, 40.414897 ],
            "distance": 1609.344,
            "distance_type": "arc",
            "validation_method": "STRICT",
            "ignore_unmapped": false,
            "boost": 1.0
          }
        }
      },
      "validation_method": "STRICT",
      "ignore_unmapped": false
    }
  }]
}
/**
 * https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-nested-query.html<br>
 * https://www.elastic.co/guide/en/elasticsearch/reference/7.3/search-request-body.html#geo-sorting<br>
 * Sort results based on how close locations are to a certain point.
 */
@Test
void sortQueryWithGeoLocation() {
    int pageNumber = 0;
    int pageSize = 10;
    SearchRequest searchRequest = new SearchRequest(database);
    searchRequest.allowPartialSearchResults(true);
    searchRequest.indicesOptions(IndicesOptions.lenientExpandOpen());

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(pageNumber * pageSize);
    searchSourceBuilder.size(pageSize);
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    // fetch only a few fields
    searchSourceBuilder.fetchSource(new String[]{"id", "firstName", "lastName", "rating", "dateOfBirth",
            "addresses.street", "addresses.zipcode", "addresses.city"}, new String[]{""});

    // Lehi skate park: 40.414897, -111.881186
    // get locations/addresses close to the skate park (within a radius)
    searchSourceBuilder.sort(new GeoDistanceSortBuilder("addresses.location", 40.414897, -111.881186).order(SortOrder.DESC)
            .setNestedSort(new NestedSortBuilder("addresses")
                    .setFilter(QueryBuilders.geoDistanceQuery("addresses.location")
                            .point(40.414897, -111.881186)
                            .distance(1, DistanceUnit.MILES))));

    log.info("\n{\n\"sort\":{}\n}", searchSourceBuilder.sorts().toString());
    searchRequest.source(searchSourceBuilder);
    searchRequest.preference("nested-address");

    try {
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        log.info("hits={}, isTimedOut={}, totalShards={}, totalHits={}", searchResponse.getHits().getHits().length,
                searchResponse.isTimedOut(), searchResponse.getTotalShards(), searchResponse.getHits().getTotalHits().value);
        List<User> users = getResponseResult(searchResponse.getHits());
        log.info("results={}", ObjectUtils.toJson(users));
    } catch (IOException e) {
        log.warn("IOException, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    } catch (Exception e) {
        log.warn("Exception, msg={}", e.getLocalizedMessage());
        e.printStackTrace();
    }
}
Adding explain produces a lot of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. Don’t worry if it doesn’t all make sense now; you can refer to this section when you need it. We’ll work through the output for one hit bit by bit.
GET users/_search?explain
{
  "query": { "match": { "description": "student" } }
}
Producing the explain output is expensive. It is a debugging tool only. Don’t leave it turned on in production.
To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata. Elasticsearch doesn't just load the values for the documents that matched a particular query. It loads the values from every document in your index, regardless of the document type.
The reason that Elasticsearch loads all values into memory is that uninverting the index from disk is slow. Even though you may need the values for only a few docs for the current request, you will probably need access to the values for other docs on the next request, so it makes sense to load all the values into memory at once, and to keep them there.
All you need to know is what fielddata is, and to be aware that it can be memory hungry. We will talk about how to determine the amount of memory that fielddata is using, how to limit the amount of memory that is available to it, and how to preload fielddata to improve the user experience.
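As a starting point, the cat fielddata API shows how much fielddata memory each node is currently using, broken down per field:

GET _cat/fielddata?v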