Turn your AWS S3 data into the insights you always wanted

Rakesh Gupta
Aug 27, 2019


We are living in an era where data is more precious than gold. Every other day we hear about another data science or AI algorithm that people and companies are using to make their lives easier.

Keeping data in one place is not going to help us until we choose to create insights from it and then act on them. Data is growing at a rapid pace all around us, and a whole ecosystem of tools has grown up to analyze and visualize it.

To analyze or visualize data, we traditionally had to learn several different tools and stitch them together. That is a long, time-consuming, and expensive process; startups often cannot even consider such expensive tools.

That was the past. Thanks to cloud computing, this is getting easier by the day. With the advent of #Serverless, we pay only for compute time (the time when our code is actually doing something, not sitting idle), and that is the power of serverless.

A few years back, #AWS launched Athena, which promised to query data directly from S3. Was it a new beast?

What makes AWS Athena powerful is the ability to write interactive SQL queries the way we all already know (being a tech guy, this is my humble assumption).

I am sure the Excel folks will be more than happy too, as they can bring their data together in one place from different departments.

SQL queries give us a way to find any insight we want, don't they? We just have to frame our questions as SQL queries (plain English, I say).

In technical and functional terms, AWS Athena talks to S3 data directly and gives us the desired insights. The diagram below might help a bit; the full picture is on the way.

Simple AWS Athena to S3 Architecture

What is AWS Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, we can point Athena at data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Athena is serverless, so there is no infrastructure to set up or manage, and we pay only for the queries we run. Athena scales automatically, executing queries in parallel, so results come back fast even with large datasets and complex queries.

One of the greatest advantages of AWS Athena is cost. AWS charges $5 per TB of data scanned by your SQL queries, which is a really inexpensive and affordable way to get insights from massive data. DDL statements (CREATE, ALTER, DROP, and so on) don't cost a cent; only the scan is billed. This pricing strategy from AWS is awesome.

Under the hood, Athena uses HiveQL data definition language (DDL) statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE, while the queries themselves run on a Presto-based engine.
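For example, the sampledb database that the rest of this walkthrough uses can be created with a single (free) DDL statement; the database name is just this demo's convention:

-- Create the database that will hold our demo table
CREATE DATABASE IF NOT EXISTS sampledb;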

How do we do it?

That's the fun part. We just have to upload the file into an S3 bucket and we are good to go. All the supported file formats can be found at https://docs.aws.amazon.com/athena/latest/ug/supported-format.html.

For the sake of demonstration, we are going to analyze a sample insurance portfolio dataset, which can be found at http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip.

Unzip it to find a file called "FL_insurance_sample.csv". Following a naming convention for your S3 keys is a really good idea. For example, this file has been uploaded under insurance-portfolio -> florida, so the object lives at s3://insurance-portfolio/florida/FL_insurance_sample.csv.

File upload in S3 — Convention

Next, we go to the Athena console and create a table called "insurance_portfolio", making sure to choose to create the table "from S3 bucket data".

Create table — Athena

The next step asks for the database, the table name, and the location of the CSV file. Make sure not to include the CSV file name in the location textbox; for example, s3://insurance-portfolio/florida/.

Create table — insurance-portfolio

The next screen asks about the type of file we want to analyze. In this case, we choose CSV. Make a note of the other file formats too; we can in fact query JSON files as well. Isn't that amazing? I know what you are thinking :-)

S3 file format — Athena

We create the schema next. For the sake of this demonstration, the suggested way is to copy the column names from the CSV file and assign a data type to each.

Add schema — Athena

Athena also supports "Bulk add columns", which we can use instead if we choose to.

Bulk add columns — Athena

For a seasoned DBA (they love writing SQL scripts, don't they :-]), the following SQL script can be executed directly from the query window instead. Make a note of the last table property, 'skip.header.line.count'='1'. It tells Athena to ignore the first row of the CSV, since it contains only the column names.

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.insurance_portfolio (
  `policyID` int,
  `statecode` string,
  `county` string,
  `eq_site_limit` int,
  `hu_site_limit` int,
  `fl_site_limit` int,
  `fr_site_limit` int,
  `tiv_2011` bigint,
  `tiv_2012` bigint,
  `eq_site_deductible` int,
  `hu_site_deductible` int,
  `fl_site_deductible` int,
  `fr_site_deductible` int,
  `point_latitude` float,
  `point_longitude` float,
  `line` string,
  `point_granularity` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://insurance-portfolio/florida/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

We are now ready to execute the CREATE TABLE script. Athena simply creates a mapping over the S3 location, and we are ready to query and analyze the data.

Let's analyze the data for the county "SUWANNEE COUNTY".

SELECT * FROM "sampledb"."insurance_portfolio" WHERE county = 'SUWANNEE COUNTY';

This gives us all the rows for "SUWANNEE COUNTY". We can form much more complex SQL queries if we choose to, and get the analyzed data back. And we don't have to wait hours, only a few seconds.
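As a sketch of what "more complex" can look like (it uses only columns from the table we created above), here is an aggregate query that totals the 2012 insured value per county and lists the ten largest:

-- Total 2012 insured value per county, largest first
SELECT county,
       COUNT(*) AS policies,
       SUM(tiv_2012) AS total_tiv_2012
FROM "sampledb"."insurance_portfolio"
GROUP BY county
ORDER BY total_tiv_2012 DESC
LIMIT 10;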

Data against SUWANNEE COUNTY county

Next, we want to know how many records are in insurance_portfolio and here we go.

SELECT COUNT(*) AS TotalRows FROM "sampledb"."insurance_portfolio";

I hope it has started to make sense by this time. It is evident what we can achieve using AWS Athena and how powerful it is. On top of everything, it is completely serverless: we don't have to manage any servers and there is no need to worry about a server's idle time. And remember the cost, $5 per TB scanned. The IT team will have their peace of mind and be happy too (cost-wise, oh yes).

Athena offers many other things, like creating a view, and you guessed it right; yep, the same SQL jargon.
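As a minimal sketch (the view name is my own; the columns come from the table defined earlier), a view that pins down the SUWANNEE COUNTY slice could look like this:

-- A named, reusable slice of the table
CREATE OR REPLACE VIEW sampledb.suwannee_policies AS
SELECT policyid, county, line, tiv_2012
FROM sampledb.insurance_portfolio
WHERE county = 'SUWANNEE COUNTY';

Once created, it can be queried like any other table, for example SELECT * FROM sampledb.suwannee_policies;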

Now we know how to analyze data in S3 using Athena. Visualization is the next step.

AWS has a product for this called QuickSight; however, we can go beyond that product and build our own tools or use existing ones on the market.

Overall AWS Glue Architecture

We can in fact expose the data via AWS API endpoints (I love them), and those can be consumed from a variety of visualization tools.

Data consumption from various tools

Power BI and Tableau are a couple of data visualization tools that come to mind. API endpoints let us connect from many other places too: if we want, we can build our own web visualization using Angular, React, and the like (remember D3.js), mobile apps can use these endpoints, and we can even ask Alexa all kinds of questions, which she will happily answer, provided the Alexa skills have been developed to handle them.

A bit of advice

I am sure you are a bit enthusiastic by now and want to get your hands dirty. However, as with other tools and technologies, there are caveats to consider. With a small dataset it is easy to start, and it is cost effective too. As your data grows (Athena is designed for massive data; I would suggest reading more about the AWS S3 offerings), you will want your SQL queries to take less time and cost less.

Follow the AWS Athena best practices and use partitions to make your SQL queries run faster (and ultimately cheaper). Also see if you can shrink your data size: CSV data can be converted to Parquet (https://en.wikipedia.org/wiki/Apache_Parquet), a columnar format that can itself be compressed with gzip or another codec. For example, a 1.35 TB dataset can come down to about 125 GB in Parquet. That is a massive win for running Athena's interactive SQL queries and a significant cost saving: at $5 per TB, a full scan of 1 TB costs $5, while a full scan of 125 GB costs about $0.61, and because Parquet is columnar, queries often scan only the columns they need, which costs less still. Partitions reduce the cost further, and you can let an AWS Glue Crawler work out the partitions for you if you are not comfortable defining them yourself.
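As a minimal sketch of that conversion (assuming an empty s3://insurance-portfolio/parquet/ prefix; the new table name and the trimmed column list are my own choices), Athena's CREATE TABLE AS SELECT can rewrite the CSV-backed table as compressed, partitioned Parquet:

-- Rewrite the CSV table as Snappy-compressed Parquet, partitioned by county
-- (in CTAS, partition columns must come last in the SELECT list)
CREATE TABLE sampledb.insurance_portfolio_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://insurance-portfolio/parquet/',
  partitioned_by = ARRAY['county']
) AS
SELECT policyid, statecode, tiv_2011, tiv_2012, line, county
FROM sampledb.insurance_portfolio;

Queries that filter on county will then read only the matching partition's Parquet files instead of scanning the whole dataset.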

AWS Athena opens up massive opportunities to analyze your data, and since it is serverless, you don't manage any servers (so you are not paying for hardware and resources sitting idle).

With all this, it is up to us what we do with Athena. The possibilities are limitless, and I will leave the rest to your curiosity.

By the way, there is a great case study from A Cloud Guru at https://read.acloud.guru/how-we-built-a-big-data-analytics-platform-on-aws-for-100-large-users-for-under-2-a-month-b37425b6cc4.

Happy data visualization!!!

Please feel free to reach me at rakesh@sketchview.com or my twitter handle https://twitter.com/gkrakesh.
