Elasticsearch: search for terms with spaces
The query_string query that backs the q parameter first splits the query string on spaces, so you need to replace it with a query type that preserves them; the match query is a good choice here. I would also use a different analyzer at search time, since there is no need to apply the n-gram filter there:
curl -XPUT http://localhost:9200/autocomplete/ -d '
{
  "index": {
    "analysis": {
      "analyzer": {
        "placeNameIndexAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["trim", "lowercase", "asciifolding", "left_ngram"]
        },
        "placeNameSearchAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["trim", "lowercase", "asciifolding"]
        }
      },
      "filter": {
        "left_ngram": {
          "type": "edgeNGram",
          "side": "front",
          "min_gram": 3,
          "max_gram": 12
        }
      }
    }
  }
}'
curl -XPUT http://localhost:9200/autocomplete/geo/_mapping/ -d '
{
  "geo": {
    "properties": {
      "application_id": {
        "type": "string"
      },
      "alias": {
        "type": "string",
        "index_analyzer": "placeNameIndexAnalyzer",
        "search_analyzer": "placeNameSearchAnalyzer"
      },
      "name": {
        "type": "string"
      },
      "object_type": {
        "type": "string"
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/autocomplete/geo?refresh=true" -d '
{
  "application_id": "982",
  "name": "Buenos Aires",
  "alias": ["bue", "buenos aires", "bsas", "bs as", "baires"],
  "object_type": "cities"
}'
curl -XGET 'localhost:9200/autocomplete/geo/_search' -d '
{
  "query": {
    "match": {
      "alias": "bs as"
    }
  }
}'
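To see why the match query succeeds where query_string would first split on the space, here is a rough Python sketch of the two analyzers defined above. This is an illustration only, not Elasticsearch internals, and asciifolding is omitted:

```python
def left_ngrams(token, min_gram=3, max_gram=12):
    # front edge n-grams, like the edgeNGram filter configured above
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def index_analyze(alias):
    # keyword tokenizer: the whole value stays one token; trim + lowercase,
    # then expand into edge n-grams
    return left_ngrams(alias.strip().lower())

def search_analyze(query):
    # the search analyzer has no n-gram filter: one normalized token
    return query.strip().lower()

indexed = set()
for alias in ["bue", "buenos aires", "bsas", "bs as", "baires"]:
    indexed.update(index_analyze(alias))

# the match query preserves the space, so "bs as" hits the "bs as" n-gram
print(search_analyze("bs as") in indexed)  # True
```

In this sketch, a query_string search would first split "bs as" into "bs" and "as", and neither of those two-character tokens appears among the indexed n-grams, so nothing would match.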
Creating a whitespace character filter
You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom analyzer, my_custom_analyzer, that uses it.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {          // name of the custom analyzer
          "type": "custom",
          "tokenizer": "whitespace",     // note this
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer" // note this
      }
    }
  }
}
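As a plain-Python approximation (not the actual Lucene tokenizer), the whitespace tokenizer plus the lowercase filter behave roughly like this:

```python
def my_custom_analyzer(text):
    # whitespace tokenizer: split on whitespace only, so punctuation
    # stays attached to tokens; lowercase filter normalizes each token
    return [token.lower() for token in text.split()]

print(my_custom_analyzer("Quick Brown-Fox jumps!"))
# ['quick', 'brown-fox', 'jumps!']
```

Note that, unlike the standard tokenizer, "brown-fox" and "jumps!" survive as single tokens.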
Search keywords in Elasticsearch 7.6 with whitespace
You need to define a custom analyzer with a char filter that removes whitespace and hyphens (-) so that the generated tokens match your requirements.
Index definition:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_space_char_filter": {
          "type": "mapping",
          "mappings": [
            "\\u0020=>", // whitespace
            "\\u002D=>"  // hyphen (-)
          ]
        }
      },
      "analyzer": {
        "splcharanalyzer": {
          "char_filter": [
            "my_space_char_filter"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "splcharanalyzer"
      }
    }
  }
}
Tokens generated by the custom splcharanalyzer:
POST myindex/_analyze
{
  "analyzer": "splcharanalyzer",
  "text": "toronto, new mexico, paris, lisbona, new york, sedro-woolley"
}
{
  "tokens": [
    {
      "token": "toronto",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "newmexico",
      "start_offset": 9,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "paris",
      "start_offset": 21,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lisbona",
      "start_offset": 28,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "newyork",
      "start_offset": 37,
      "end_offset": 45,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "sedrowoolley",
      "start_offset": 47,
      "end_offset": 60,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
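The char filter's effect can be simulated with a short Python sketch (a rough stand-in for the standard tokenizer, not the real one); it also shows why the later zero-hit search for york behaves as it does:

```python
import re

def splcharanalyzer(text):
    # mapping char filter: delete spaces (\u0020) and hyphens (\u002D)
    # before tokenizing, so multi-word names collapse into single tokens
    filtered = text.replace(" ", "").replace("-", "")
    # rough stand-in for the standard tokenizer: split on non-alphanumerics
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", filtered) if t]

print(splcharanalyzer("toronto, new mexico, paris, lisbona, new york, sedro-woolley"))
# ['toronto', 'newmexico', 'paris', 'lisbona', 'newyork', 'sedrowoolley']

# "york" alone can never match, because only "newyork" was indexed
print("york" in splcharanalyzer("new york"))  # False
```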
Search query:
{
  "query": {
    "match": {
      "title": {
        "query": "sedro-woolley"
      }
    }
  }
}
Search result:
"hits": [
  {
    "_index": "white",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.2876821,
    "_source": {
      "title": "toronto, new mexico, paris, lisbona, new york, sedro-woolley"
    }
  }
]
Searching for new or york will not yield any result:
{
  "query": {
    "match": {
      "title": {
        "query": "york"
      }
    }
  }
}
Result:
"hits": {
  "total": {
    "value": 0,
    "relation": "eq"
  },
  "max_score": null,
  "hits": []
}
Elasticsearch | How to search a list of strings containing whitespace?
OK, here is what I did:
1) created the mappings for the data
2) indexed three documents: one is the same as you posted above, one contains completely irrelevant data, and the third has one matching search field, so it is less relevant than the first document but more relevant than the other
3) ran the search query
When I ran the search, the most relevant document showed up on top with the most matches, followed by the second-best document. Note that I am passing multiple strings, as you expected, using double and single quotes in the search query. You can build an array of strings, or a single string of concatenated terms (with spaces, as you wanted); both should work.
Here are the mappings:
PUT ugi-index2
{
  "mappings": {
    "_doc": {
      "properties": {
        "skills": { "type": "text" },
        "languages": { "type": "keyword" }
      }
    }
  }
}
And the three documents that I indexed:
POST /ugi-index2/_doc/3
{
  "skills": [
    "no skill",
    "Facebook ads",
    "not related"
  ],
  "languages": [
    "ab",
    "cd"
  ]
}
POST /ugi-index2/_doc/2
{
  "skills": [
    "no skill",
    "test skill",
    "not related"
  ],
  "languages": [
    "ab",
    "cd"
  ]
}
POST /ugi-index2/_doc/1
{
  "skills": [
    "Online strategi",
    "Facebook Ads",
    "Google Ads"
  ],
  "languages": [
    "da",
    "en"
  ]
}
And the search query:
GET /ugi-index2/_search
{
  "query": {
    "multi_match": {
      "query": "'Online Strate', 'Facebook'",
      "fields": ["skills"]
    }
  }
}
Look at the query above for multiple strings with spaces (for search). Here is the response:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "ugi-index2",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "skills": [
            "Online strategi",
            "Facebook Ads",
            "Google Ads"
          ],
          "languages": [
            "da",
            "en"
          ]
        }
      },
      {
        "_index": "ugi-index2",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "skills": [
            "no skill",
            "Facebook ads",
            "not related"
          ],
          "languages": [
            "ab",
            "cd"
          ]
        }
      }
    ]
  }
}
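A rough Python sketch of why documents 1 and 3 match and document 2 does not. The simplifying assumption here is that the standard analyzer lowercases and splits both the query and each "skills" value on non-alphanumerics, and a document is a hit if it shares at least one token with the query:

```python
import re

def analyze(text):
    # crude stand-in for the standard analyzer
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

docs = {
    1: ["Online strategi", "Facebook Ads", "Google Ads"],
    2: ["no skill", "test skill", "not related"],
    3: ["no skill", "Facebook ads", "not related"],
}

query_tokens = set(analyze("'Online Strate', 'Facebook'"))

# shared tokens per document: more overlap roughly means a higher score
matches = {
    doc_id: query_tokens & {t for skill in skills for t in analyze(skill)}
    for doc_id, skills in docs.items()
}
print(matches)
```

Note that "strate" does not match "strategi": a match query compares whole tokens, not prefixes, which is why document 1 scores on "online" and "facebook" only.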
Elasticsearch can't search the word before a dot (.) with the whitespace analyzer
You need to use the mapping char filter, with which you can remove the . character; this should solve your issue.
Below is a working example:
GET http://localhost:9200/_analyze
{
  "tokenizer": "whitespace",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ".=>"
      ]
    }
  ],
  "text": "This is test data."
}
This returns the tokens below:
{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "data",
      "start_offset": 13,
      "end_offset": 18,
      "type": "word",
      "position": 3
    }
  ]
}
Or you can modify your current pattern_replace character filter as:
"char_filter": {
  "my_char_filter": {
    "type": "pattern_replace",
    "pattern": "\\.", // note this
    "replacement": ""
  }
}
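Both variants do the same thing conceptually: strip literal dots before the whitespace tokenizer splits the text. A quick Python approximation using re.sub:

```python
import re

def remove_dots(text):
    # like the pattern_replace char filter: "\\." replaced with ""
    return re.sub(r"\.", "", text)

def whitespace_analyze(text):
    # whitespace tokenizer applied after the char filter
    return remove_dots(text).split()

print(whitespace_analyze("This is test data."))
# ['This', 'is', 'test', 'data']
```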