Using Elasticsearch to Filter Through Tags with Whitespace

Elasticsearch: searching for terms with spaces

The query_string query backing the q parameter first parses the query string by splitting it on spaces. You need to replace it with something that preserves spaces; the match query is a good choice here. I would also use a different analyzer for searching, since you don't need to apply the n-gram filter there:

curl -XPUT http://localhost:9200/autocomplete/ -d '
{
    "index": {
        "analysis": {
            "analyzer": {
                "placeNameIndexAnalyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["trim", "lowercase", "asciifolding", "left_ngram"]
                },
                "placeNameSearchAnalyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["trim", "lowercase", "asciifolding"]
                }
            },
            "filter": {
                "left_ngram": {
                    "type": "edgeNGram",
                    "side": "front",
                    "min_gram": 3,
                    "max_gram": 12
                }
            }
        }
    }
}'
curl -XPUT http://localhost:9200/autocomplete/geo/_mapping/ -d '
{
    "geo": {
        "properties": {
            "application_id": {
                "type": "string"
            },
            "alias": {
                "type": "string",
                "index_analyzer": "placeNameIndexAnalyzer",
                "search_analyzer": "placeNameSearchAnalyzer"
            },
            "name": {
                "type": "string"
            },
            "object_type": {
                "type": "string"
            }
        }
    }
}'
curl -XPOST "http://localhost:9200/autocomplete/geo?refresh=true" -d '
{
    "application_id": "982",
    "name": "Buenos Aires",
    "alias": ["bue", "buenos aires", "bsas", "bs as", "baires"],
    "object_type": "cities"
}'

curl -XGET 'localhost:9200/autocomplete/geo/_search' -d '{
    "query": {
        "match": {
            "alias": "bs as"
        }
    }
}'
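
To see why this works, you can run an alias through the index analyzer with the _analyze API. This is only a quick check, and the exact _analyze syntax depends on your Elasticsearch version (older releases accept the analyzer as a query parameter and the text as the request body):

curl -XGET 'localhost:9200/autocomplete/_analyze?analyzer=placeNameIndexAnalyzer' -d 'bs as'

The keyword tokenizer keeps "bs as" as a single token and left_ngram then emits its front edge n-grams ("bs ", "bs a", "bs as", ...), while placeNameSearchAnalyzer leaves the query "bs as" whole, so the match query above finds the document.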

Creating a whitespace character filter

You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom analyzer that uses it:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {            --> name of the custom analyzer
                    "type": "custom",
                    "tokenizer": "whitespace",     --> note this
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_custom_analyzer"   --> note this
            }
        }
    }
}
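
To see the effect, you can compare the whitespace tokenizer with the default standard tokenizer through the _analyze API. This is only an illustrative sketch with made-up sample text:

POST _analyze
{
    "tokenizer": "whitespace",
    "filter": ["lowercase"],
    "text": "Facebook-Ads new york"
}

The whitespace tokenizer splits only on spaces and keeps the hyphenated term together (facebook-ads, new, york), whereas the standard tokenizer would also break it into facebook and ads.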

Search keywords in Elasticsearch 7.6 with whitespace

You need to define a custom analyzer with a char filter that removes whitespace and hyphens (-) so that the generated tokens match your requirements.

Index definition

{
    "settings": {
        "analysis": {
            "char_filter": {
                "my_space_char_filter": {
                    "type": "mapping",
                    "mappings": [
                        "\\u0020=>",   --> whitespace
                        "\\u002D=>"    --> hyphen (-)
                    ]
                }
            },
            "analyzer": {
                "splcharanalyzer": {
                    "char_filter": [
                        "my_space_char_filter"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "splcharanalyzer"
            }
        }
    }
}

Tokens generated by the custom splcharanalyzer

POST myindex/_analyze
{
    "analyzer": "splcharanalyzer",
    "text": "toronto, new mexico, paris, lisbona, new york, sedro-woolley"
}

{
    "tokens": [
        {
            "token": "toronto",
            "start_offset": 0,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "newmexico",
            "start_offset": 9,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "paris",
            "start_offset": 21,
            "end_offset": 26,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "lisbona",
            "start_offset": 28,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "newyork",
            "start_offset": 37,
            "end_offset": 45,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "sedrowoolley",
            "start_offset": 47,
            "end_offset": 60,
            "type": "<ALPHANUM>",
            "position": 5
        }
    ]
}

Search query

{
    "query": {
        "match": {
            "title": {
                "query": "sedro-woolley"
            }
        }
    }
}

Search result

 "hits": [
{
"_index": "white",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "toronto, new mexico, paris, lisbona, new york, sedro-woolley"
}
}
]

Searching for new or york on its own will not yield any result, because those words were merged into the single token newyork at index time.

{
    "query": {
        "match": {
            "title": {
                "query": "york"
            }
        }
    }
}

Result

 "hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
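
Searching for the full phrase does work, however, because the match query analyzes the query text with the field's analyzer (no separate search_analyzer is defined here), so "new york" is collapsed to the same token newyork at search time. A quick sketch against the same index:

{
    "query": {
        "match": {
            "title": {
                "query": "new york"
            }
        }
    }
}

This returns the indexed document shown above.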

Elasticsearch | How to search a list of strings containing whitespace?

OK, here is what I did:

1) Created the mapping for the data.
2) Indexed three documents: one is the same as the one you posted above, one contains completely unrelated data, and the third has one matching search field, so it is less relevant than the first document but more relevant than the other one.
3) Ran the search query.

When I ran the search, the most relevant document (the one with the most matches) showed up on top, followed by the partially matching one.

Please also note that I am passing multiple strings, as you expected, using double quotes and single quotes in the search query. You can build an array of strings or a single string of concatenated values (with spaces, as you wanted); either should work.

Here is the mapping:

PUT ugi-index2
{
    "mappings": {
        "_doc": {
            "properties": {
                "skills": {"type": "text"},
                "languages": {"type": "keyword"}
            }
        }
    }
}

and the three documents that I indexed

POST /ugi-index2/_doc/3
{
    "skills": [
        "no skill",
        "Facebook ads",
        "not related"
    ],
    "languages": [
        "ab",
        "cd"
    ]
}

POST /ugi-index2/_doc/2
{
    "skills": [
        "no skill",
        "test skill",
        "not related"
    ],
    "languages": [
        "ab",
        "cd"
    ]
}

POST /ugi-index2/_doc/1
{
    "skills": [
        "Online strategi",
        "Facebook Ads",
        "Google Ads"
    ],
    "languages": [
        "da",
        "en"
    ]
}

And the search query

GET /ugi-index2/_search
{
    "query": {
        "multi_match": {
            "query": "'Online Strate', 'Facebook'",
            "fields": ["skills"]
        }
    }
}

Note how the query above passes multiple strings containing spaces in a single search.

and here is the response

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "ugi-index2",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "skills": [
                        "Online strategi",
                        "Facebook Ads",
                        "Google Ads"
                    ],
                    "languages": [
                        "da",
                        "en"
                    ]
                }
            },
            {
                "_index": "ugi-index2",
                "_type": "_doc",
                "_id": "3",
                "_score": 0.2876821,
                "_source": {
                    "skills": [
                        "no skill",
                        "Facebook ads",
                        "not related"
                    ],
                    "languages": [
                        "ab",
                        "cd"
                    ]
                }
            }
        ]
    }
}
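
If you want a whitespace-separated value such as "Online strategi" to match only as an exact phrase rather than as loose terms, a match_phrase query on the same field is an option. This is just a sketch and assumes the same ugi-index2 index as above:

GET /ugi-index2/_search
{
    "query": {
        "match_phrase": {
            "skills": "Online strategi"
        }
    }
}

It only matches documents where online and strategi occur next to each other, in that order.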

Elasticsearch can't search the word before dot(.) with whitespace analyzer

You need to use the mapping char filter, which lets you remove the . character; this should solve your issue.

Below is a working example:

GET http://localhost:9200/_analyze
{
    "tokenizer": "whitespace",
    "char_filter": [
        {
            "type": "mapping",
            "mappings": [
                ".=>"
            ]
        }
    ],
    "text": "This is test data."
}

which returns the following tokens:

{
    "tokens": [
        {
            "token": "This",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "test",
            "start_offset": 8,
            "end_offset": 12,
            "type": "word",
            "position": 2
        },
        {
            "token": "data",
            "start_offset": 13,
            "end_offset": 18,
            "type": "word",
            "position": 3
        }
    ]
}

Or you can modify your current pattern_replace character filter as follows:

"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "\\.", // note this
"replacement": ""
}
}
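
To confirm the pattern_replace variant behaves the same way, you can drop it into an ad-hoc _analyze request as well (this is just the snippet above wrapped in a test call):

GET http://localhost:9200/_analyze
{
    "tokenizer": "whitespace",
    "char_filter": [
        {
            "type": "pattern_replace",
            "pattern": "\\.",
            "replacement": ""
        }
    ],
    "text": "This is test data."
}

It produces the same four tokens (This, is, test, data) as the mapping char filter.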

