Query S3 bucket using AWS Athena service




Before going into details of how to use Athena service to query S3 bucket, let me introduce you all about what AWS Athena is actually.

What is AWS Athena?
Athena is query service which uses standard SQL to fetch data from s3 bucket.

Use case
Suppose there are 3 files stored in s3 bucket (athena-sample-bucket)
  • samplefile1.log.gz(1gb)
  • samplefile2.log.gz(700mb)
  • samplefile3.log.gz600mb

Each log.gz file mentioned above has log data in CSV structure with comma-separated fields.

Now, what will we do if we want to get data record which belongs to one of the files in s3 bucket?

Approach 1
Read each file in s3 bucket and write logic to search the record which we want to fetch.
This approach is slow and complex.

Approach 2
Use AWS Athena service to query required record from s3 bucket.
Athena uses standard SQL.

Let's use Athena step by step
  • Goto services section of AWS website and search for Athena. Select Athena service from the search result.
  • This is how it looks like

  • In the above image, you can see Athena console. There is a default database, 0 tables, 0 views and query editor.
  • Now let's create a new database.

  • Above image has a query CREATE DATABASE IF NOT EXISTS sampledb in query editor which creates a new database sampledb by clicking Run query button on the console.
  • After creating a database, a table needs to be created.
  • To create table run following query in the query editor
    • CREATE EXTERNAL TABLE IF NOT EXISTS `sampledb.sampletable
    • Below is the image of the query
  • CREATE query creates a schema and locates s3 bucket where all the data is stored.
  • Now any record can be fetched using simple SQL, for e.g. suppose, to get the record for occupation equal to software engineer run following query
           SELECT * FROM "sampledb"."sampletable" WHERE occupation = 'Software Engineer';
  • The result can be downloaded in CSV format by clicking the first icon on the top right corner of Results section
Cost
Click here for detailed cost information

Concepts
  • Athena uses schema-read approach.
  • In schema-read approach schema is used at the time data is read, i.e. when data is actually queried.
  • Athena uses Apache Hive for data query.
  • Apache Hive is a data warehouse software project for data query and analysis. It uses a SQL-like interface to query data stored in databases and file systems that integrate by Hadoop.
  • When database and table are created in Athena, a schema and location of s3 get described for read-query in future.
Note 
  • The query can be done only on the bucket and not on any particular file. While creating table Athena query accepts only bucket path.
  • The query string length must not be greater than 256 KB.

Hope you got some understanding of how Athena can be used and how it works.
Please share if you liked it.
Do comment for any queries.


Comments

Popular posts from this blog

Rclone: Sync files from ftp server to AWS S3 bucket

Diagnostic Interrupt - A way to debug and perform root cause analysis of unresponsive or unreachable AWS EC2 instance

Amazon Elasticsearch Domain Size Calculation