Using Amazon S3 Select to retrieve data from an AWS S3 bucket, with a Java example
Many existing data lakes are rooted in an S3 bucket. Sometimes your company doesn’t yet have the infrastructure that lets applications search this data properly (indexed, performant, and scalable), and you still need to retrieve it quickly while that application structure is being prepared. In that situation, you can use S3 Select to fetch the data.
Amazon S3 Select is a feature that lets you run simple SQL operations against raw data stored in S3. One requirement is that the data is in a structured format: CSV, JSON, or Parquet. It also works on compressed files (GZIP or BZIP2 for CSV and JSON), so you don’t need to decompress them before reading.
S3 Select is a completely serverless solution. You don’t provision any servers or databases to make this run.
If you plan to adopt S3 Select to read data from an S3 bucket, keep in mind that it has a big limitation around advanced SQL queries: it only supports the SELECT statement. That means no joins, no groupings, and no other sophisticated SQL operations. Within a SELECT statement, only the following other clauses are supported:
- FROM
- WHERE
- LIMIT
Nested JSON fields can also be accessed, so if you have deeply nested objects you should still be good to go.
The other, and most important, constraint of S3 Select is that you can only query one object at a time. So S3 Select is not the best choice for use cases that need to parse entire bucket paths or collections of objects.
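Because each request targets a single object, covering several objects means issuing one request per key and merging the results yourself. A minimal sketch of that loop, assuming a hypothetical `selectOne` function that stands in for the real S3 Select call (it is not part of the AWS SDK):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PerObjectSelect {

    // Runs the per-object query once for every key and concatenates the records.
    // selectOne is a placeholder for whatever code performs the S3 Select request.
    static List<String> selectAcrossObjects(List<String> keys,
                                            Function<String, List<String>> selectOne) {
        List<String> allRecords = new ArrayList<>();
        for (String key : keys) {
            allRecords.addAll(selectOne.apply(key));
        }
        return allRecords;
    }

    public static void main(String[] args) {
        // Fake per-object results, used only to show the call shape.
        Map<String, List<String>> fake = Map.of(
                "2018.json", List.of("{\"firstname\":\"Frances\"}"),
                "2019.json", List.of("{\"firstname\":\"John\"}"));
        List<String> records = selectAcrossObjects(List.of("2018.json", "2019.json"), fake::get);
        System.out.println(records.size() + " records");
    }
}
```

Every key costs one request, so the request count (and the bill) grows with the number of objects.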
S3 Select is priced along a few dimensions, outlined below:
- Number of SELECT requests ($0.0004 per 1,000 requests)
- Data scanned ($0.0002 per GB)
- Data returned ($0.00007 per GB)
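To get a feel for how these three dimensions add up, here is a small sketch of the arithmetic. The rates are simply the ones listed above; the class and method names are made up for illustration, and actual prices depend on region and may have changed.

```java
import java.math.BigDecimal;

class S3SelectCostEstimate {

    // Rates copied from the list above (USD); verify them against the current price list.
    static final BigDecimal USD_PER_1000_REQUESTS = new BigDecimal("0.0004");
    static final BigDecimal USD_PER_GB_SCANNED = new BigDecimal("0.0002");
    static final BigDecimal USD_PER_GB_RETURNED = new BigDecimal("0.00007");

    // Total = request cost + scan cost + return cost.
    static BigDecimal estimateUsd(long requests, double gbScanned, double gbReturned) {
        BigDecimal requestCost = USD_PER_1000_REQUESTS
                .multiply(BigDecimal.valueOf(requests))
                .divide(BigDecimal.valueOf(1000));
        BigDecimal scanCost = USD_PER_GB_SCANNED.multiply(BigDecimal.valueOf(gbScanned));
        BigDecimal returnCost = USD_PER_GB_RETURNED.multiply(BigDecimal.valueOf(gbReturned));
        return requestCost.add(scanCost).add(returnCost);
    }

    public static void main(String[] args) {
        // 1,000,000 requests that scan 500 GB and return 10 GB in total.
        BigDecimal total = estimateUsd(1_000_000L, 500.0, 10.0);
        System.out.println(total.stripTrailingZeros().toPlainString()); // prints 0.5007
    }
}
```

With these rates the request count dominates the bill: the million requests account for $0.40 of the $0.5007 total.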
For more details about prices, see the Amazon S3 pricing page.
So, in some cases I would consider S3 Select an easy way to extract specific portions of your data stored in S3 using SQL, without having to retrieve the whole object. It can be integrated into your applications at runtime.
But let’s talk about how to use it in a Java application. I’ve built a simple Java + Spring Boot application that uses the AWS SDK for Java to access a public S3 bucket in my account, which contains every Nobel Prize awarded throughout history.
JSON structure containing the data of all Nobel Prizes in the bucket:
We can search on any value these fields contain. If the data you want sits inside an array, you can also use the [*] wildcard in the expression of your request to flatten the array elements into individual JSON objects:
For example, if you want to search data about the laureates, you can set this expression:
The app I’ve built lets you search based on query parameters, so you can look up data about the laureates. The request is prepared like this:
And here’s an example of the request:
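A minimal sketch of how the SQL expression for such a search might be assembled from the two query parameters. It assumes the JSON document has a top-level `prizes` array whose items each hold a `laureates` array; that layout, and the class and method names, are assumptions for illustration and are not taken from the repository. The resulting string is what would be passed as the expression of the S3 Select request.

```java
import java.util.regex.Pattern;

class LaureateQueryBuilder {

    // Assumed document layout: {"prizes":[{"laureates":[{"firstname":"..."}]}]}
    // [*] flattens each array so every laureate becomes one row.
    private static final String FROM_PATH = "S3Object[*].prizes[*].laureates[*]";

    // Only plain identifiers are accepted as field names, to keep user input out of the SQL structure.
    private static final Pattern SAFE_FIELD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String buildExpression(String field, String value) {
        if (field == null || !SAFE_FIELD.matcher(field).matches()) {
            throw new IllegalArgumentException("Unsupported field name: " + field);
        }
        if (value == null) {
            throw new IllegalArgumentException("value must not be null");
        }
        // Escape single quotes by doubling them, as in standard SQL string literals.
        String escaped = value.replace("'", "''");
        // Double quotes around the field keep it case-sensitive and avoid clashes with reserved words.
        return "SELECT * FROM " + FROM_PATH + " l WHERE l.\"" + field + "\" = '" + escaped + "'";
    }

    public static void main(String[] args) {
        System.out.println(buildExpression("firstname", "Benjamin"));
    }
}
```

Validating the field name and escaping the value matters here because both come straight from the HTTP query string.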
curl "http://localhost:8080/api/items?field=firstname&value=Benjamin"
The response looks like this:
You can find the code on my GitHub:
Thank you!