What Is an Index in Elasticsearch

Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.

Indices for Relations

The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.

MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties

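An Elasticsearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns). (Mapping types were deprecated in Elasticsearch 6.x and removed in later releases, so this layout applies to older versions.)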

So in your car manufacturing scenario, you may have a subaru_factory index (index names must be lowercase). Within this index, you have three different types:

People
Cars
Spare_Parts

Each type then contains documents that correspond to that type (e.g. a Subaru Impreza doc lives inside the Cars type, and this doc contains all the details about that particular car).

Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]

So to retrieve the Subaru document, I may do this:

  $ curl -XGET localhost:9200/subaru_factory/Cars/SubaruImpreza

Indices for Logging

Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in an RDBMS. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.

To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:

logs-2013-02-22
logs-2013-02-21
logs-2013-02-20

ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:

  $ curl -G 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search' --data-urlencode 'q="Error Message"'

This searches the logs from the last two days at the same time. This format has advantages due to the nature of logs: most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and offers better performance for searching.
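
As a rough sketch of how the daily naming scheme above might be generated in client code: the helper names and the default "logs-" prefix below are assumptions made for illustration, not part of any Elasticsearch client library.

```python
from datetime import date, timedelta
from typing import List


def daily_index_names(today: date, days: int, prefix: str = "logs-") -> List[str]:
    """Return one index name per day, newest first (hypothetical helper)."""
    return [prefix + (today - timedelta(days=offset)).isoformat()
            for offset in range(days)]


def multi_index_search_path(indices: List[str], doc_type: str) -> str:
    """Join index names with commas to form a multi-index search path."""
    return "/" + ",".join(indices) + "/" + doc_type + "/_search"


if __name__ == "__main__":
    names = daily_index_names(date(2013, 2, 22), days=2)
    print(multi_index_search_path(names, "Errors"))
```

Widening the search window is then just a matter of asking for more days, which is one reason the per-day layout is convenient for logs.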

Indices for Users

Another radically different approach is to create an index per user. Imagine you have some social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:

Zach's Index
- Hobbies Type
- Friends Type
- Pictures Type
Fred's Index
- Hobbies Type
- Friends Type
- Pictures Type

Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. "Users" Index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.

Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.

Do we need to create an index specifically in Elasticsearch?

You are right that an index in Elasticsearch is internally mapped to shards and replicas, but you are confusing it with what an index means in an RDBMS (where it is created to increase search efficiency).

In Elasticsearch, an index contains a mapping (where you define the various fields, their data types, and analyzers); for more info, refer to Mapping in Elasticsearch.

Now, when you define a field in Elasticsearch, it accepts a few parameters, one of which is also called index. According to the same official doc:

The index option controls whether field values are indexed. It accepts
true or false and defaults to true. Fields that are not indexed are
not queryable.

This means the field will be part of the Elasticsearch inverted index, which is used to query/search the data. Without this option you will not be able to search this particular field, and because the inverted index is a data structure built for fast lookups, you implicitly get better search performance :)

As mentioned in the official doc, the default value of the index param is true, so to answer your question: you don't need to define the index specifically in Elasticsearch.
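
To make the index parameter concrete, here is a minimal sketch of a mapping body written as a Python dict. The field names (title, internal_note) are invented for illustration, and the layout assumes the typeless mapping format of recent Elasticsearch versions (older versions nest properties under a type name). The helper queryable_fields is hypothetical too; it only mirrors the rule quoted above.

```python
import json

# Hypothetical mapping body; the field names are made up for illustration.
mapping = {
    "mappings": {
        "properties": {
            # No "index" option given, so it defaults to true.
            "title": {"type": "text"},
            # Explicitly not indexed: stored in _source but not queryable.
            "internal_note": {"type": "keyword", "index": False},
        }
    }
}


def queryable_fields(mapping_body: dict) -> list:
    """List fields whose index option is true, explicitly or by default."""
    properties = mapping_body["mappings"]["properties"]
    return sorted(name for name, options in properties.items()
                  if options.get("index", True))


if __name__ == "__main__":
    print(json.dumps(mapping, indent=2))
    print(queryable_fields(mapping))
```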

What is the difference between an Elasticsearch index and an index in a relational database?

There is an unfortunate usage of the word "index", which means slightly (edit: VERY) different things in ES and relational databases, as they are optimized for different use cases.

An "index" in a database is a secondary data structure which makes WHERE queries and JOINs fast, and it typically stores values exactly as they appear in the table. You can still have columns which aren't indexed, but then WHEREs require a full table scan, which is slow on large tables.

An "index" in ES is actually a schematic collection of documents, similar to a database in the relational world. You can have different "types" of documents in ES, quite similar to tables in dbs. ES gives you the flexibility of defining, for each field of a document, whether you want to be able to retrieve it, search by it, or both. Some details on these options can be found here; they are also related to the _source field (the original JSON which was submitted to ES).

ES uses an inverted index to efficiently find matching documents, but most importantly it typically "normalizes" strings into tokens so that accurate free-text search can be performed. For example, sentences might be split into individual words, words are normalized to lower case, etc., so that searching for "holland" would match the text "Vacation at Holland 2015".

If a field does not have an inverted index, you cannot perform any searching on it (unlike a db's full table scan). Interestingly, you can also define fields so that you can use them for searching but cannot retrieve them back; this is mainly beneficial when minimizing disk and RAM usage is important.
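
The normalization step can be sketched with a toy analyzer. This is not Elasticsearch's actual analysis chain, just an assumed simplification (split on non-alphanumeric characters, then lowercase) that reproduces the "holland" example; the function names are hypothetical.

```python
import re
from typing import List


def analyze(text: str) -> List[str]:
    """Toy analyzer: split on non-alphanumeric characters and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def matches(query: str, document: str) -> bool:
    """True if every analyzed query token appears among the document tokens."""
    doc_tokens = set(analyze(document))
    return all(token in doc_tokens for token in analyze(query))


if __name__ == "__main__":
    text = "Vacation at Holland 2015"
    print(analyze(text))
    print(matches("holland", text))  # matches after normalization
    print("holland" in text)         # a raw substring test does not match
```

Because both the document and the query go through the same analyzer, a lowercase query finds the capitalized word, which an exact string comparison would miss.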

What does the Type mean in Elasticsearch?

Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.

At the very basic level, the type is just another field in Elasticsearch. When you do GET /my_index/my_type/_search, ES runs a pre-filter on the _type field for the value my_type; it's like an automatic filter.

Don't think about indices and types as databases and tables in SQL world, because they are not that.

If you have type1 with fields f1 and f2, and type2 with fields f1 and f3, the index will hold documents with fields f1, f2, and f3. This matters because when a score is calculated for a query on field f1, the term frequencies for f1 are global (across both type1 and type2). So if you search for some value in f1 of type1, the score you get back is also slightly influenced by the values of f1 in type2.
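
A small numeric sketch of that effect, under loud assumptions: the documents are invented, sharding is ignored, and the inverse document frequency uses a BM25-style formula as a stand-in (older Elasticsearch versions scored with classic TF/IDF instead). The point is only that the term statistics come from the whole index, not from the filtered type.

```python
import math
from typing import Dict, List

# Hypothetical documents sharing one index; _type behaves like any other field.
DOCS: List[Dict[str, str]] = [
    {"_type": "type1", "f1": "red car", "f2": "sedan"},
    {"_type": "type1", "f1": "blue car", "f2": "coupe"},
    {"_type": "type2", "f1": "red paint", "f3": "gloss"},
    {"_type": "type2", "f1": "red primer", "f3": "matte"},
]


def doc_freq(term: str, docs: List[Dict[str, str]]) -> int:
    """Number of documents whose f1 field contains the term."""
    return sum(1 for doc in docs if term in doc.get("f1", "").split())


def idf(term: str, docs: List[Dict[str, str]]) -> float:
    """BM25-style inverse document frequency over the given documents."""
    n = doc_freq(term, docs)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


# The automatic _type filter only narrows which documents can be returned.
type1_only = [doc for doc in DOCS if doc["_type"] == "type1"]

idf_index_wide = idf("red", DOCS)        # statistics over the whole index
idf_type_local = idf("red", type1_only)  # what per-type statistics would give

if __name__ == "__main__":
    print(idf_index_wide, idf_type_local)
```

Here "red" looks rare within type1 but common across the index, so the index-wide weight is lower than a per-type weight would be.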

Also, please, don't translate a set of SQL tables to ES by simply following the primary key/foreign key approach to define parent/child relationships in ES.

Elasticsearch - When to use another index?

You can think about it as a Schema in a SQL database.

A Schema contains the data for a given use case. An index holds the data for the use case.

The cool thing is that search can be done on multiple indices in one single request.

It's hard to tell you more without any information about the use case.
It depends on many factors: do you need to remove some data after a period (let's say every year)? How many documents will you index and what is the size of a document?

For example, let's say that you want to index logs and keep 3 months of logs online. You would basically create one index per month and one alias on top of the 3 current months.

When a month is over, create a new index for the new month, modify the alias, and remove the old index. Removing an index is efficient, both performance-wise and disk-space-wise!
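
The monthly rotation could be sketched as the request body for the _aliases endpoint, which accepts a list of add and remove actions. The alias and index names below (logs-current, logs-2013-03, logs-2012-12) and the helper name are assumptions chosen for the example.

```python
import json
from typing import Dict, List


def rotate_alias_actions(alias: str, new_index: str, old_index: str) -> Dict[str, List[dict]]:
    """Build an _aliases request body that swaps the old index for the new one."""
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove": {"index": old_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = rotate_alias_actions("logs-current", "logs-2013-03", "logs-2012-12")
    # This body would be sent with POST /_aliases; afterwards the old index
    # can be dropped with DELETE /logs-2012-12.
    print(json.dumps(body, indent=2))
```

Searches keep targeting the alias name throughout, so client code does not change when the underlying indices do.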

So basically in that case I would recommend using more than one index.

Imagine another situation. Let's say you are launching a game and you don't know exactly if you will be successful or not. So start with an index1 with only one shard and create an alias index on top of it. You launch the game and you find that you will need more power (more machines) as your response time is increasing dramatically. Create a new index index2 with two shards and add it to your alias index.

This way you can scale out easily.

The key point here is IMHO aliases. Use aliases for search from the start of your project. It will help you a lot in the future.

Another use case could be that you are working for different customers. Customers don't want to have their data mixed with other customers' data. So maybe in that case you need to create one index per customer?

The fact is that elasticsearch is very flexible and helps you to design your architecture as you need.

Hope this helps.

What is the use case for index closing in ElasticSearch?

If your index is closed, you obviously cannot read/search from it. Some operations, like changing index analyzers, require you to close the index before doing so and reopen it afterwards.
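
For instance, an analyzer change might follow a close, update, reopen sequence like the sketch below. It only builds the list of requests rather than sending them; the index name and the analyzer definition in the usage are placeholders, and the helper itself is hypothetical.

```python
from typing import List, Optional, Tuple

# (HTTP method, path, optional JSON body)
Request = Tuple[str, str, Optional[dict]]


def analyzer_change_requests(index: str, analysis: dict) -> List[Request]:
    """Return the close / update settings / open sequence for one index."""
    return [
        ("POST", f"/{index}/_close", None),                      # stop reads and writes
        ("PUT", f"/{index}/_settings", {"analysis": analysis}),  # change analyzers
        ("POST", f"/{index}/_open", None),                       # make it usable again
    ]


if __name__ == "__main__":
    # Placeholder analyzer definition, for illustration only.
    analysis = {"analyzer": {"content": {"type": "custom", "tokenizer": "whitespace"}}}
    for method, path, body in analyzer_change_requests("logs-2013-02", analysis):
        print(method, path, body)
```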

Other than that, if you know you'll need to read/search from your old indexes, then simply keep them open. It makes no sense to close/reopen them every time you need to read from them.

If you really want to optimize for writes, what you can do is implement hot/warm architecture and move your old indexes to warm nodes, while keeping the new one you're writing to on hot nodes.

You have a handful of other best practices you can implement if you want to optimize your indexing speed.

Elasticsearch - What is the indexing process?

Indexing is a huge process with a lot of steps involved. I will try to provide a brief intro to the major steps in the indexing process.

Making Text Searchable

Every word in a text field needs to be searchable.

The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
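
A minimal sketch of that structure, assuming whitespace tokenization and lowercasing (real analysis is richer) and two made-up documents:

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each unique term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms, and sort each postings list, as described above.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


if __name__ == "__main__":
    docs = {1: "The quick brown fox", 2: "The lazy brown dog"}
    for term, doc_ids in build_inverted_index(docs).items():
        print(term, doc_ids)
```

Looking up a term is then a single dictionary access that returns every matching document, instead of a scan over all document texts.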

Updating the Index

First of all, please note that a Lucene index is immutable.

Hence, for any create, update, or delete operation, instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect more recent changes.

Indexing Process

New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
- A new segment—a supplementary inverted index—is written to disk.
- A new commit point is written to disk, which includes the name of the new segment.
- The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
- The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
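
The steps above can be sketched with a toy model. ToyShard is a hypothetical class that keeps everything in memory, so the disk write and fsync are only represented by appending to a list; it illustrates the visibility rule, not how Lucene is implemented.

```python
from typing import List, Tuple


class ToyShard:
    """Very simplified model of the buffer-then-commit cycle (illustrative only)."""

    def __init__(self) -> None:
        self.buffer: List[str] = []                 # in-memory indexing buffer
        self.segments: List[Tuple[str, ...]] = []   # immutable segments "on disk"
        self.commit_point: List[int] = []           # segment numbers in the last commit

    def index(self, document: str) -> None:
        """Collect a new document in the buffer; it is not searchable yet."""
        self.buffer.append(document)

    def commit(self) -> None:
        """Write the buffer out as a new segment and record it in the commit point."""
        if not self.buffer:
            return
        self.segments.append(tuple(self.buffer))          # new segment
        self.commit_point.append(len(self.segments) - 1)  # commit point names it
        self.buffer.clear()                               # ready for new documents

    def search(self, word: str) -> List[str]:
        """Search only documents that live in opened segments."""
        return [doc for segment in self.segments
                for doc in segment if word in doc.split()]


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("hello world")
    print(shard.search("hello"))  # nothing visible before the commit
    shard.commit()
    print(shard.search("hello"))  # visible once the segment is opened
```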

What happens in case of Delete

Segments are immutable, so documents cannot be removed from older segments.

When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.

When is it actually removed

In Segment Merging, deleted documents are purged from the filesystem.
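
Deletion and merging can be sketched the same way. In this assumed model a segment is a plain dict of document id to text and the .del file is a set of ids; the function names are hypothetical.

```python
from typing import Dict, List, Set

Segment = Dict[int, str]


def search(segments: List[Segment], deleted: Set[int], word: str) -> List[int]:
    """Match against every segment, then drop ids that are marked as deleted."""
    hits = [doc_id for segment in segments
            for doc_id, text in segment.items() if word in text.split()]
    return [doc_id for doc_id in hits if doc_id not in deleted]


def merge(segments: List[Segment], deleted: Set[int]) -> Segment:
    """Copy live documents into one new segment; deleted ones are left behind."""
    merged: Segment = {}
    for segment in segments:
        for doc_id, text in segment.items():
            if doc_id not in deleted:
                merged[doc_id] = text
    return merged


if __name__ == "__main__":
    segments = [{1: "red car", 2: "blue car"}, {3: "red bike"}]
    deleted = {1}                            # document 1 is only marked, not removed
    print(search(segments, deleted, "red"))  # document 1 is filtered out of the results
    print(merge(segments, deleted))          # document 1 is physically gone after the merge
```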

What are elasticsearch indices?

Indices is the plural of index. If you have more than one index you call them indices.
http://www.thefreedictionary.com/index
