Amazon Elasticsearch Domain Size Calculation

Introduction

In this article, I describe the Elasticsearch (ES) terms needed to understand the calculation. After introducing Amazon ES, I show how to calculate the minimum storage size and the number of shards an Elasticsearch domain requires, with worked examples.

Elasticsearch

Elasticsearch is an open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics.

Node

In Amazon ES, a node is an EC2 instance that runs Elasticsearch.

Elasticsearch Cluster

An Elasticsearch cluster is a collection of one or more nodes (servers) that together hold the entire data set and provide federated indexing and search capabilities across all nodes.

Amazon Elasticsearch

Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS cloud.

What Amazon Elasticsearch manages

  • Amazon ES provisions all the resources for the Elasticsearch cluster and launches it. 
  • It automatically detects and replaces any failed Elasticsearch node, reducing the overhead associated with self-managed infrastructures.
  • It provides an easy way to scale your Elasticsearch cluster with a single API call or a few clicks in the console.

Instance

An instance is simply an EC2 virtual machine in the AWS cloud.

Amazon Elasticsearch Domain

An Amazon ES domain is synonymous with an Elasticsearch cluster.
It is an Elasticsearch cluster together with the settings, instance types, instance counts, and storage resources that you specify.

Source Data

Source data is the raw data that ES receives over a given period of time.

Replica

In Elasticsearch, each replica is a full copy of an index and needs the same amount of disk space. By default, each Elasticsearch index has one replica. It is recommended to have at least one replica to prevent data loss. Replicas also improve search performance, so you might want more if you have a read-heavy workload.

Document

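A document in Elasticsearch is the basic unit of indexing and search, stored as a JSON object. In database terminology, a document roughly corresponds to a table row and a field corresponds to a column.
For example (the field names are illustrative),
{"name": "John", "age": 34, "city": "New York"}
is a document.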

Shard

A shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster. Data in Elasticsearch is organized into indices and each index is made up of one or more shards. 

Index

An index is a collection of documents. It is the logical namespace that maps to one or more primary shards and can have zero or more replica shards. An index is like a database in a relational database system.
An index is split into shards, which are distributed evenly across the nodes in the Elasticsearch cluster.

Index Overhead

In Elasticsearch, index overhead is the extra disk space an index takes beyond the size of its source data. The on-disk size of an index varies but is often about 10% larger than the source data.

Operating System Overhead

It is the disk space reserved by the operating system. By default, Linux reserves 5% of the file system for the root user for critical processes, system recovery, and to safeguard against disk fragmentation problems.

Amazon Elasticsearch Overhead

It is the disk space reserved by Amazon ES. Amazon ES reserves 20% of the storage space of each instance (up to 20 GiB) for segment merges, logs, and other internal operations.
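
To make the cap concrete, here is a minimal Python sketch of the reservation rule described above. The function name `amazon_es_reserved_gib` is hypothetical and used only for illustration; the 20% and 20 GiB figures come from the text.

```python
def amazon_es_reserved_gib(instance_storage_gib: float) -> float:
    """Space reserved by Amazon ES on one instance: 20% of storage, capped at 20 GiB."""
    return min(0.20 * instance_storage_gib, 20.0)


# Hypothetical instance storage sizes, in GiB.
for storage_gib in (50, 100, 500):
    reserved = amazon_es_reserved_gib(storage_gib)
    print(f"{storage_gib} GiB instance: {reserved:.1f} GiB reserved")
```

Note that once an instance has more than 100 GiB of storage, the reservation stays at 20 GiB, so the effective share drops below 20%.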

Source Data Size Calculation

It is the size of the source data held for the given retention period.
For example, if ES receives 200 MiB of data every hour, that is about 4.7 GiB per day, and with a retention period of two weeks the total size is about 66 GiB at any given time.
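
The arithmetic can be checked with a short Python sketch. It assumes an ingest rate of 200 MiB per hour, which is the rate that produces the 4.7 GiB per day and roughly 66 GiB figures used in the rest of this article; the function name `source_data_gib` is hypothetical.

```python
MIB_PER_GIB = 1024


def source_data_gib(mib_per_hour: float, retention_days: int) -> float:
    """Total source data held at any time, in GiB, for a steady hourly ingest rate."""
    gib_per_day = mib_per_hour * 24 / MIB_PER_GIB
    return gib_per_day * retention_days


per_day = source_data_gib(200, 1)     # one day of data
two_weeks = source_data_gib(200, 14)  # 14-day retention period
print(f"per day: {per_day:.2f} GiB, two weeks: {two_weeks:.2f} GiB")
```

Rounded, these values are the 4.7 GiB per day and 66 GiB used in the examples.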

Factors that Amazon ES sizing depends on

  • Size of the source data for a given retention period
  • Number of replicas (each the same size as the source data for the given retention period)
  • ES index overhead (about 10% more than the size of the source data for a given retention period)
  • Operating system overhead (5% of the file system)
  • Amazon ES overhead (20% of the storage space of each instance, up to 20 GiB)


Minimum Storage Size Calculation

Formula:
Let's say
Size of source data for a given retention period as SD
Number of Replicas as NOR
Indexing Overhead as IO
Linux Reserved Space as LRS
Amazon ES Overhead as AEO
Minimum Storage Size as MSS

So,
MSS = (SD)*(1 + NOR)*(1 + IO)/(1 - LRS)/(1 - AEO)

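Substituting IO = 0.1 (10%), LRS = 0.05 (5%) and AEO = 0.2 (20%) gives 1.1/0.95/0.8 ≈ 1.45, so the formula can be reduced to:
MSS = (SD)*(1 + NOR)*1.45
Because the Amazon ES overhead is capped at 20 GiB per instance, this simplified form overestimates the requirement when instances have more than 100 GiB of storage each.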

Example:
If you hold 66 GiB of source data at any given time for your retention period and want one replica, then according to the formula:
SD = 66 GiB
NOR = 1

MSS = 66*(1+1)*1.45
MSS = 191 GiB approx
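
As a rough check, the formula can be written as a small Python function. This is only a sketch: the function name `minimum_storage_gib` and its parameter names are hypothetical, and the default values are the 10%, 5% and 20% overheads described earlier.

```python
def minimum_storage_gib(
    source_data_gib: float,
    replicas: int,
    index_overhead: float = 0.10,      # IO
    linux_reserved: float = 0.05,      # LRS
    amazon_es_overhead: float = 0.20,  # AEO (ignores the 20 GiB per-instance cap)
) -> float:
    """MSS = SD * (1 + NOR) * (1 + IO) / (1 - LRS) / (1 - AEO)."""
    return (
        source_data_gib
        * (1 + replicas)
        * (1 + index_overhead)
        / (1 - linux_reserved)
        / (1 - amazon_es_overhead)
    )


full = minimum_storage_gib(66, 1)  # full formula
simplified = 66 * (1 + 1) * 1.45   # simplified formula
print(f"full: {full:.1f} GiB, simplified: {simplified:.1f} GiB")
```

Both forms come out at about 191 GiB for this example; the small difference is due to rounding the combined factor to 1.45.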

Number of Shards Calculation

Formula:
Let's say
Size of source data for a given retention period as SD
Extra data or Room to grow data as ED
Indexing Overhead as IO
The approximate number of shards as ANS
Desired shard size as DSS

So,
ANS = (SD+ED)*(1+IO)/DSS

Example:
Suppose you have 66 GiB of data. You don't expect that number to increase over time, and you want to keep your shards around 30 GiB each.
Here:
SD = 66 GiB
ED = 0 (because we do not expect the data to grow)
IO = 0.1 (10/100, because it is 10% of SD)
DSS = 30 GiB

ANS = (66 + 0)*(1 + 0.1)/30 = 2.42
ANS ≈ 3 (rounded up to the next whole shard)
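
The same calculation as a minimal Python sketch. The function name `approximate_shard_count` is hypothetical, and rounding up with `math.ceil` is an assumption made here so that no shard exceeds the desired size; the formula itself only gives an approximate number.

```python
import math


def approximate_shard_count(
    source_data_gib: float,
    extra_data_gib: float,
    desired_shard_size_gib: float,
    index_overhead: float = 0.10,
) -> int:
    """ANS = (SD + ED) * (1 + IO) / DSS, rounded up to a whole number of shards."""
    raw = (source_data_gib + extra_data_gib) * (1 + index_overhead) / desired_shard_size_gib
    return math.ceil(raw)


shards = approximate_shard_count(66, 0, 30)
size_per_shard = 66 * 1.1 / shards
print(f"{shards} shards of about {size_per_shard:.1f} GiB each")
```

With three primary shards, each shard holds roughly 24 GiB, which is below the 30 GiB target.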

While calculating the approximate number of shards, keep the following in mind:
  • It is difficult to change the number of shards after indexing, so decide on the shard count carefully before indexing the first document.
  • Shards should not be too large (more than 50 GiB), because large shards can make it difficult for Elasticsearch to recover from failure.
  • There should not be too many small shards, because they can hurt performance and cause out-of-memory errors.
  • A good rule of thumb is to keep shard size between 10 and 50 GiB.
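
Since the shard count has to be fixed before the first document is indexed, it is normally supplied in the index settings when the index is created. The sketch below builds such a settings body in Python. `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings; the helper names and the shard and replica counts (taken from the examples above) are only illustrative.

```python
import json


def shard_size_ok(shard_size_gib: float) -> bool:
    """Rule of thumb from the list above: keep shards between 10 and 50 GiB."""
    return 10 <= shard_size_gib <= 50


def index_settings(primary_shards: int, replicas: int) -> dict:
    """Settings body to send when creating the index."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


# Values from the worked examples: 3 primary shards of about 24.2 GiB, 1 replica.
assert shard_size_ok(24.2)
print(json.dumps(index_settings(3, 1), indent=2))
```

The printed JSON can be used as the request body when creating the index, for example with a PUT request to the index name on your domain endpoint.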

Please comment if you have any questions, and share this article if you liked it.








