How to Use Elasticsearch to Get Join Functionality as in SQL

How to use elasticsearch to get JOIN functionality as in SQL?

Elasticsearch does not support JOINs, that's the whole purpose of NoSQL technologies in the first place. There are ways to create some relationships between your data using parent/child relationships (as you've noticed), nested objects and also using a special terms lookup optimization, but that's about it.

However, in order to take the most out of Elasticsearch, the main idea is to denormalize your data as much as possible and store self-contained documents. This means that you can take a single document and it contains all the info it needs. You don't care about the fact that data is going to be duplicated.

For instance, suppose you have the following JOIN query in your SQL database for retrieving all people with their address, city and country (i.e. 4 tables):

SELECT p.first_name, p.last_name, 
       a.street_name, a.street_num,
       c.name, c2.name
  FROM person p
  JOIN address a ON a.id = p.addr_id
  JOIN city c ON c.id = p.city_id
  JOIN country c2 ON c2.id = p.country_id

In Elasticsearch, you would create a document containing exactly the fields that are returned by the above query, i.e.

 {
     "first_name": "John",
     "last_name": "Doe",
     "street_num": 34,
     "street_name": "Main Street",
     "city": "San Francisco",
     "country": "United States"
 }

So, one way to see this, is that in Elasticsearch you're going to store the same (or very similar) set of fields that you would get as a result of running your query in your relational database.

Taking the step from RDBMS to Elasticsearch is a paradigm shift. If you are ever going to take that step, you need to think different in terms of your data.

Join query in ElasticSearch

It depends what you intend when you say JOIN. Elasticsearch is not like regular database that supports JOIN between tables. It is a text search engine that manages documents within indexes.

On the other hand you can search within the same index over multiple types using a fields that are common to every type.

For example taking your data I can create an index with 2 types and their data like follows:

curl -XPOST localhost:9200/product -d '{
    "settings" : {
        "number_of_shards" : 5
    }
}'

curl -XPOST localhost:9200/product/type1/_mapping -d '{
        "type1" : {
            "properties" : {
                "product_id" : { "type" : "string" },
                "price" : { "type" : "integer" },
                "stock" : { "type" : "integer" }
            }
        }   
}'              

curl -XPOST localhost:9200/product/type2/_mapping -d '{
        "type2" : {
            "properties" : {
                "product_id" : { "type" : "string" },
                "category" : { "type" : "string" },
                "manufacturer" : { "type" : "string" }
            }
        }
}'  

curl -XPOST localhost:9200/product/type1/1 -d '{
        product_id: "1111", 
        price: "23",
        stock: "100"
}'

curl -XPOST localhost:9200/product/type2/1 -d '{
        product_id: "1111",
        category: "iPhone case",
        manufacturer: "Belkin"
}'

I effectively created one index called product with 2 type type1 and type2.
Now I can do the following query and it will return both documents:

curl -XGET 'http://localhost:9200/product/_search?pretty=1' -d '{
    "query": {
        "query_string" : {
            "query" : "product_id:1111"
        }
    }
}'

{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.5945348,
    "hits" : [ {
      "_index" : "product",
      "_type" : "type1",
      "_id" : "1",
      "_score" : 0.5945348, "_source" : {
    product_id: "1111",
    price: "23",
    stock: "100"
}
    }, {
      "_index" : "product",
      "_type" : "type2",
      "_id" : "1",
      "_score" : 0.5945348, "_source" : {
    product_id: "1111",
    category: "iPhone case",
    manufacturer: "Belkin"
}
    } ]
  }
}

The reason is because Elasticsearch will search over all documents within that index regardless of their type. This is still different than a JOIN in the sense Elasticsearch is not going to do a Cartesian product of the documents that belong to each type.

Hope that helps

ElasticSearch join data within the same index

What I understood is that you would want to collate list of documents having same app_id along with the status as either STARTED or FINISHED.

I do not think Elasticsearch is not meant to perform JOIN operations. I mean you can but then you have to design your documents as mentioned in this link.

What you would need is an Aggregation query.

Below is the sample mapping, documents, the aggregation query and the response as how it appears, which would actually help you get the desired result.

Mapping:

PUT mystatusindex
{
  "mappings": {
    "properties": {
      "username":{
        "type": "keyword"
      },
      "app_id":{
        "type": "keyword"
      },
      "event_type":{
        "type":"keyword"
      },
      "ts":{
        "type": "date"
      }
    }
  }
}

Sample Documents

POST mystatusindex/_doc/1
{
    "username" : "mapred",
    "app_id" : "application_1569623930006_490200",
    "event_type" : "STARTED",
    "ts" : "2019-10-02T08:11:53Z"
}

POST mystatusindex/_doc/2
{
    "username" : "mapred",
    "app_id" : "application_1569623930006_490200",
    "event_type" : "FINISHED",
    "ts" : "2019-10-02T08:12:53Z"
}

POST mystatusindex/_doc/3
{
    "username" : "mapred",
    "app_id" : "application_1569623930006_490201",
    "event_type" : "STARTED",
    "ts" : "2019-10-02T09:30:53Z"
}

POST mystatusindex/_doc/4
{
    "username" : "mapred",
    "app_id" : "application_1569623930006_490202",
    "event_type" : "STARTED",
    "ts" : "2019-10-02T09:45:53Z"
}

POST mystatusindex/_doc/5
{
    "username" : "mapred",
    "app_id" : "application_1569623930006_490202",
    "event_type" : "FINISHED",
    "ts" : "2019-10-02T09:45:53Z"
}

POST mystatusindex/_doc/6
{
  "username" : "mapred",
  "app_id" : "application_1569623930006_490203",
  "event_type" : "STARTED",
  "ts" : "2019-10-03T09:30:53Z"
}

POST mystatusindex/_doc/7
{
  "username" : "mapred",
  "app_id" : "application_1569623930006_490203",
  "event_type" : "FINISHED",
  "ts" : "2019-10-03T09:45:53Z"
}

Query:

POST mystatusindex/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "ts": {
              "gte": "2019-10-02T00:00:00Z",
              "lte": "2019-10-02T23:59:59Z"
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "event_type": "STARTED"
          }
        },
        {
          "match": {
            "event_type": "FINISHED"
          }
        }
      ]
    }
  },
  "aggs": {
    "application_IDs": {
      "terms": {
        "field": "app_id"
      },
      "aggs": {
        "ids": {
          "top_hits": {
            "size": 10,
            "_source": ["event_type", "app_id"],
            "sort": [
              { "event_type": { "order": "desc"}}
              ]
          }
        }
      }
    }
  }
}

Notice that for filtering I've made use of Range Query as you only want to filter documents for that date and also added a bool should logic to filter based on STARTED and FINISHED.

Once I have the documents, I've made use of Terms Aggregation and Top Hits Aggregation to get the desired result.

Result

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "application_IDs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "application_1569623930006_490200",       <----- APP ID
          "doc_count" : 2,
          "ids" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "mystatusindex",
                  "_type" : "_doc",
                  "_id" : "1",                     <--- Document with STARTED status
                  "_score" : null,
                  "_source" : {
                    "event_type" : "STARTED",     
                    "app_id" : "application_1569623930006_490200"
                  },
                  "sort" : [
                    "STARTED"
                  ]
                },
                {
                  "_index" : "mystatusindex",
                  "_type" : "_doc",
                  "_id" : "2",                    <--- Document with FINISHED status
                  "_score" : null,
                  "_source" : {
                    "event_type" : "FINISHED",     
                    "app_id" : "application_1569623930006_490200"
                  },
                  "sort" : [
                    "FINISHED"
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "application_1569623930006_490202",
          "doc_count" : 2,
          "ids" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "mystatusindex",
                  "_type" : "_doc",
                  "_id" : "4",
                  "_score" : null,
                  "_source" : {
                    "event_type" : "STARTED",
                    "app_id" : "application_1569623930006_490202"
                  },
                  "sort" : [
                    "STARTED"
                  ]
                },
                {
                  "_index" : "mystatusindex",
                  "_type" : "_doc",
                  "_id" : "5",
                  "_score" : null,
                  "_source" : {
                    "event_type" : "FINISHED",
                    "app_id" : "application_1569623930006_490202"
                  },
                  "sort" : [
                    "FINISHED"
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "application_1569623930006_490201",
          "doc_count" : 1,
          "ids" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "mystatusindex",
                  "_type" : "_doc",
                  "_id" : "3",
                  "_score" : null,
                  "_source" : {
                    "event_type" : "STARTED",
                    "app_id" : "application_1569623930006_490201"
                  },
                  "sort" : [
                    "STARTED"
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Note that the last document with only STARTED appears in the aggregation result as well.

Updated Answer

{ 
   "size":0,
   "query":{ 
      "bool":{ 
         "must":[ 
            { 
               "range":{ 
                  "ts":{ 
                     "gte":"2019-10-02T00:00:00Z",
                     "lte":"2019-10-02T23:59:59Z"
                  }
               }
            }
         ],
         "should":[ 
            { 
               "term":{ 
                  "event_type.keyword":"STARTED"   <----- Changed this 
               }
            },
            { 
               "term":{ 
                  "event_type.keyword":"FINISHED"  <----- Changed this 
               }
            }
         ]
      }
   },
   "aggs":{ 
      "application_IDs":{ 
         "terms":{ 
            "field":"app_id.keyword"               <----- Changed this 
         },
         "aggs":{ 
            "ids":{ 
               "top_hits":{ 
                  "size":10,
                  "_source":[ 
                     "event_type",
                     "app_id"
                  ],
                  "sort":[ 
                     { 
                        "event_type.keyword":{    <----- Changed this 
                           "order":"desc"
                        }
                     }
                  ]
               }
            }
         }
      }
   }
}

Note the changes I've made. Whenever you would need exact matches or want to make use of aggregation, you would need to make use of keyword type.

In the mapping you've shared, there is no username field but two event_type fields. I'm assuming its just a human err and that one of the field should be username.

Now if you notice carefully, the field event_type has a text and its sibling keyword field. I've just modified the query to make use of the keyword field and when I am doing that, I'm use Term Query.

Try this out and let me know if it helps!

Inner Join operations between two Elastic Types inside a single Index

If this is main usecase for your data retrieval, you may consider using parent-child relation as well.

join in elastic search

If you can't combine your search term into a single query, you can use dis_max function to create a union of subqueries.

{
    "dis_max" : {
        "tie_breaker" : 0.7,
        "boost" : 1.2,
        "queries" : [
            {
                "term" : { "name" : "john" }
            },
            {
                "term" : { "name" : "mary" }
            }
        ]
    }
}

Kibana - Joining data from multiple types in visualization

There is an elasticsearch plugin for logstash. This will allow a search of elasticsearch, and give you the ability to create a 'join', by editing the config file.

searching Mysql table with Elasticsearch

So my API doesn't even query the DB and return an array of documents
directly from ES?

Yes, As you are doing query to elasticsearch, you will get result only from Elasticsearch. Another way is, just get id from Elasticsearch and use id to retrive documeents from MySQL, but this might impact response time.

Is this considered good practice? replicating the whole table to
elastic? What about table relations... What If I want to filter by
nested table relation?...

It is not about good practice or bad practice, it is all about what type of functionality and use case you want to implement and based on that technology stack can be used and data can be duplicated. There is lots of company using Elasticsearch as secondary data source where they have duplicated data just because there usecase is best fit with Elasticsearh or other NoSQL db.

Elasticsearch is NoSQL DB and it is not mantain any relationship between data. Hence, you need to denormalize your data before indexing to the Elasticsearch. You can read this article for more about denormalizetion and why it is required.

ElasticSearch provide Nested and Join data type for parent child relationship but both have some limitation and performance impact.

Below is what they have mentioned for join field type:

The join field shouldn’t be used like joins in a relation database. In
Elasticsearch the key to good performance is to de-normalize your data
into documents. Each join field, has_child or has_parent query adds a
significant tax to your query performance. It can also trigger global
ordinals to be built.

Below is what they have mentioned for nested field type:

When ingesting key-value pairs with a large, arbitrary set of keys,
you might consider modeling each key-value pair as its own nested
document with key and value fields. Instead, consider using the
flattened data type, which maps an entire object as a single field and
allows for simple searches over its contents. Nested documents and
queries are typically expensive, so using the flattened data type for
this use case is a better option.

most articles I read suggest replicating the table records (using
logstash pipe or other methods) to elastic index.

Yes, You can use logstash or any language client like java, python etc, to sync data from DB to Elasticsearch. You can check this SO answer for more information on this.

Your Search Requirements

If you go ahead with Elasticsearch then you can use N-Gram Tokenizer or Regex Query and achieve your search requirements.

How to Use Elasticsearch to Get Join Functionality as in SQL