Getting started with Amazon DynamoDB and Java

Konstantin Tsykulenko · Published in Universal Language · Aug 5, 2015

Preface

In 2015, there is no lack of NoSQL technologies and most of them claim to be “scalable”, “highly available” and whatever else seems to be trendy at the moment.

So why consider Amazon DynamoDB? It’s not the newest technology on the market and has been around for a while. In fact, the classic Dynamo whitepaper (which is in itself a pretty good read on the fundamentals of distributed databases, even today) was released as early as 2007, and the principles described there have been used in many modern distributed databases and clustering frameworks like Akka Cluster. It only opened for public use as a service in 2012, however. Is it no longer relevant? Of course not! It is constantly updated, and just recently new features like streams and cross-region replication were released. In our opinion, you should consider using DynamoDB when:

  • You’re using AWS. If you’re not using AWS, you can stop here.
  • You want a fully managed database that integrates well with the other parts of the AWS stack, like Elastic MapReduce, Lambda, etc.
  • You want a highly scalable, schemaless database with high availability and eventual consistency. DynamoDB does support strongly consistent reads, but not in all cases.

If that sounds like something you want to know more about, let’s get our hands dirty and do some coding.

Setting up the environment

DynamoDB ships with a local distribution, which comes in handy for development and testing purposes. Alternatively, you can use your Amazon account and the real database. We’ll be using Spring Boot and Gradle for our test application. The full example code is available on GitHub. First of all, we need to grab Spring Boot and the AWS DynamoDB SDK as our dependencies:
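
A minimal sketch of the relevant build.gradle dependencies (the Spring Boot starter and its version are illustrative; the SDK version is the one discussed below):

    dependencies {
        compile 'org.springframework.boot:spring-boot-starter-web:1.2.5.RELEASE'
        compile 'com.amazonaws:aws-java-sdk-dynamodb:1.10.8'
    }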

We’re using version 1.10.8 of the DynamoDB SDK, which includes the newest features, such as streams.

Let’s set up the DynamoDB client next:
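
A minimal sketch of such a configuration, assuming the local distribution listens on its default port 8000 and the endpoint URL comes from a property we define below:

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;

    @Configuration
    public class DynamoDBConfig {

        @Value("${amazon.dynamodb.endpoint}")
        private String endpoint;

        @Bean
        public AmazonDynamoDB amazonDynamoDB() {
            // The local distribution accepts any non-empty credentials
            AmazonDynamoDBClient client = new AmazonDynamoDBClient(
                    new BasicAWSCredentials("local", "local"));
            client.setEndpoint(endpoint);
            return client;
        }
    }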

and finally the Spring Boot application.properties file:
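
A sketch of the properties file; the property name is our own choice and just has to match the @Value placeholder above:

    amazon.dynamodb.endpoint=http://localhost:8000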

That’s it! Let’s move on to creating tables for our example application.

Low level table API

Tables for DynamoDB are often created programmatically. It is also possible to do this using the JSON API provided by DynamoDB or the AWS Console UI, but we’ll focus on the programmatic case. DynamoDB is a schemaless database, so we only need to define the key attributes, as well as some common table configuration like the provisioned throughput. Let’s take a look at how it’s done:
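
Here’s a sketch of the low-level calls, assuming a table named customers and illustrative throughput values; the model classes come from com.amazonaws.services.dynamodbv2.model, and the numbered comments correspond to the steps below:

    AttributeDefinition idAttribute = new AttributeDefinition()       // 1
            .withAttributeName("id")
            .withAttributeType(ScalarAttributeType.S);

    KeySchemaElement hashKey = new KeySchemaElement()                 // 2
            .withAttributeName("id")
            .withKeyType(KeyType.HASH);

    CreateTableRequest request = new CreateTableRequest()             // 3
            .withTableName("customers")
            .withAttributeDefinitions(idAttribute)
            .withKeySchema(hashKey)
            .withProvisionedThroughput(new ProvisionedThroughput(10L, 10L));

    amazonDynamoDB.createTable(request);                              // 4

    while (!"ACTIVE".equals(amazonDynamoDB.describeTable("customers") // 5
            .getTable().getTableStatus())) {
        Thread.sleep(1000);
    }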

  1. We create an attribute definition of a string attribute called id.
  2. We create a hash key definition with the name id. It will use the attribute defined above.
  3. We construct a CreateTableRequest using the key and attribute definitions above. We also set the ProvisionedThroughput for the table, which defines the number of read (1st argument) and write (2nd argument) units provisioned for the table, i.e. the predicted load. If these limits are exceeded, a ProvisionedThroughputExceededException will be thrown.
  4. We issue a request to create the table.
  5. We wait for the table to be created. Tables are not created instantaneously in DynamoDB and we need to be sure it’s in the “ACTIVE” state before using it.

Note that we do not define any fields that would contain the actual data as each record can have an arbitrary set of fields.

DynamoDBMapper

A slightly higher-level approach to table creation is to use the DynamoDBMapper. It’s a mapping utility which allows you to convert DynamoDB items to POJOs, but it also lets you generate table definitions. Let’s first take a look at what our entity might look like:
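
A sketch of the entity; the firstName and lastName attributes are our choice, and the annotations come from com.amazonaws.services.dynamodbv2.datamodeling:

    @DynamoDBTable(tableName = "customers")
    public class Customer {

        private String id;
        private String firstName;
        private String lastName;

        @DynamoDBHashKey
        @DynamoDBAutoGeneratedKey // string keys only; a UUID is assigned on save
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        @DynamoDBAttribute
        public String getFirstName() { return firstName; }
        public void setFirstName(String firstName) { this.firstName = firstName; }

        @DynamoDBAttribute
        public String getLastName() { return lastName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
    }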

This is a very simple entity that only defines a HashKey (which is autogenerated — this is only supported for string types and it will get a UUID value) and two attributes. @DynamoDBTable tells us that this entity is a table and provides the table name.

Given this entity definition, the code from our previous example would turn into:
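
A sketch of the mapper-based version (throughput values are again illustrative; the numbered comments correspond to the steps below):

    DynamoDBMapper mapper = new DynamoDBMapper(amazonDynamoDB);

    CreateTableRequest request = mapper.generateCreateTableRequest(Customer.class);
    request.setProvisionedThroughput(new ProvisionedThroughput(10L, 10L));  // 1

    amazonDynamoDB.createTable(request);                                    // 2

    while (!"ACTIVE".equals(amazonDynamoDB.describeTable("customers")       // 3
            .getTable().getTableStatus())) {
        Thread.sleep(1000);
    }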

We construct the CreateTableRequest using the mapper and our Customer class. This will populate all the key and index information provided by the passed entity class. Note that the ProvisionedThroughput (which is required) is not set by the mapper, because no throughput information is provided by the entity mapping. One might consider an annotation that would specify it, but this is probably a bad idea, since throughput belongs to configuration and should not be coupled with the entity mapping. You may implement throughput configuration lookup based on the table name or using any other mechanism you deem suitable.

  1. We set the ProvisionedThroughput
  2. We create the table
  3. We wait for the table to be active

More on DynamoDB keys

All DynamoDB tables must have a unique primary key defined. These come in two types:

  • Hash key — a single attribute which uniquely identifies the item. You can retrieve the item using its hash key. Item distribution across the nodes mostly depends on the hash key values, so if some of the hash keys are accessed far more often than others, this results in a non-uniform load, which in turn won’t allow you to efficiently utilize your provisioned read/write capacity.
  • Hash and range key — a composite unique key consisting of the hash and the range attributes. You can retrieve individual items by using both the hash and range keys, or you can run queries against the sorted range index, like (pseudocode): hashKey=hashKeyVal and rangeKey > x.

For more info on the keys and related best practices please refer to the official documentation. We’ll return to this a bit later.

Scans and Queries — Low level API

Scans read all items in a table or an index. Because this number can be pretty large, scans only return paginated results. Let’s write a scan that will retrieve our customers by their first name:
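
A sketch of such a scan; the searched value is illustrative, and the numbered comments correspond to the steps below:

    Map<String, AttributeValue> values = new HashMap<>();
    values.put(":val", new AttributeValue().withS("John"));

    ScanRequest scanRequest = new ScanRequest()
            .withTableName("customers")
            .withExpressionAttributeValues(values)     // 1
            .withFilterExpression("firstName = :val")  // 2
            .withProjectionExpression("id");           // 3

    ScanResult result = amazonDynamoDB.scan(scanRequest);        // 4

    for (Map<String, AttributeValue> item : result.getItems()) { // 5
        System.out.println(item);
    }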

We create a map with the query parameters. :val can later be used in a filter expression.

  1. For our ScanRequest, we set the map with the query params we’ve just created.
  2. We set a filter expression that would limit the returned items to the ones with the given firstName.
  3. We set the scan projection, which specifies the attributes we want to retrieve. Here, we only retrieve ids.
  4. We execute the scan.
  5. We print the results.

To read all pages of the Scan, we need to use the lastEvaluatedKey property of the ScanResult:
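
A sketch of the paging loop, reusing the scanRequest from above:

    Map<String, AttributeValue> lastKey = null;
    do {
        scanRequest.setExclusiveStartKey(lastKey);
        ScanResult page = amazonDynamoDB.scan(scanRequest);
        for (Map<String, AttributeValue> item : page.getItems()) {
            System.out.println(item);
        }
        lastKey = page.getLastEvaluatedKey(); // null once the last page is read
    } while (lastKey != null);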

and keep reading until we run out of pages. It is possible to set the page size on the scan request:
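
For example (the page size is illustrative):

    scanRequest.setLimit(10); // at most 10 items per page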

Unlike scans, queries only allow you to use the table’s keys or indexes; you cannot put restrictions on arbitrary item attributes. However, they are much more efficient. Let’s add another Customer attribute, premium. We’d like a query that efficiently retrieves all premium customers in our database. For this, we’ll introduce a Global Secondary Index. Our new Customer definition would look like this:
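
A sketch of the updated entity; the index name and the choice of the table’s id as the index range key are our assumptions, and the remaining accessors stay as before:

    @DynamoDBTable(tableName = "customers")
    public class Customer {

        private String id;
        private String firstName;
        private String lastName;
        private String premium; // null for regular customers, keeping the index sparse

        @DynamoDBHashKey
        @DynamoDBAutoGeneratedKey
        @DynamoDBIndexRangeKey(globalSecondaryIndexName = "premium-index")
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        @DynamoDBIndexHashKey(globalSecondaryIndexName = "premium-index")
        public String getPremium() { return premium; }
        public void setPremium(String premium) { this.premium = premium; }

        // firstName and lastName accessors unchanged
    }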

Note that our index now consists of a hash and a range key. We can query the index using just the hash key and return all the records regardless of the range key. It’s also important that we’re using a nullable ‘premium’ value here. A null value won’t actually be saved as an attribute of an item; the item will lack the attribute completely. Since it’s also our hash key, only items that actually have it will be added to the index. This is called a ‘sparse’ index, and it’s quite efficient for retrieving items. This will work nicely if a relatively small number of users are premium.

Let’s take a look at what a query to retrieve all premium users would look like:
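
A sketch of the query, using the names assumed in the entity above:

    Map<String, AttributeValue> values = new HashMap<>();
    values.put(":val", new AttributeValue().withS("true"));

    QueryRequest queryRequest = new QueryRequest()
            .withTableName("customers")
            .withIndexName("premium-index")
            .withKeyConditionExpression("premium = :val")
            .withExpressionAttributeValues(values);

    QueryResult result = amazonDynamoDB.query(queryRequest);
    for (Map<String, AttributeValue> item : result.getItems()) {
        System.out.println(item);
    }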

Overall, it’s fairly similar to the scan we did earlier, but note that we perform it on the index rather than the table.

We would also need to modify our table definition:
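
A sketch of the change, building on the mapper-generated request (the throughput values and the KEYS_ONLY projection are illustrative):

    CreateTableRequest request = mapper.generateCreateTableRequest(Customer.class);
    request.setProvisionedThroughput(new ProvisionedThroughput(10L, 10L));

    // The mapper fills in the index key schema from the annotations,
    // but the index throughput and projection need to be set by hand:
    for (GlobalSecondaryIndex index : request.getGlobalSecondaryIndexes()) {
        index.setProvisionedThroughput(new ProvisionedThroughput(10L, 10L));
        index.setProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY));
    }

    amazonDynamoDB.createTable(request);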

Global indexes have their own provisioned read/write throughput that we need to specify (in an actual app we would probably put a little more thought into this rather than setting everything to the same value). For indexes, you can also optionally specify a projection — the attributes that get copied to the index and are thus returned by queries on that index.

Short intro into Local and Global Indexes

DynamoDB indexes come in two flavours — Local and Global. We’ve used a Global Index in our previous example. It’s worth understanding some basic differences between the two.

  • Local index — Basically an alternative range key for the table hash key. Useful when you need range queries for additional attributes in the table.
  • Global index — Contains a selection of attributes from the table, but they are organized using a different primary key than the ones in the table. This can be a hash key or a hash + range key. Global indexes only support eventually consistent reads.

For more info on global and local indexes and best practices please refer to the official documentation.

Working with items — low level API

One way to do CRUD operations on items is to use the low-level API. Here’s an example of inserting an item into the customers table:
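
A sketch of a low-level put; the attribute values are illustrative:

    Map<String, AttributeValue> item = new HashMap<>();
    item.put("id", new AttributeValue().withS(UUID.randomUUID().toString()));
    item.put("firstName", new AttributeValue().withS("John"));
    item.put("lastName", new AttributeValue().withS("Doe"));

    amazonDynamoDB.putItem(new PutItemRequest()
            .withTableName("customers")
            .withItem(item));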

This obviously has a disadvantage of dealing with raw map-like entities.

As with creating tables, you can also use the DynamoDBMapper with our entity class:
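
A sketch of the same insert through the mapper:

    Customer customer = new Customer();
    customer.setFirstName("John");
    customer.setLastName("Doe");

    mapper.save(customer); // the auto-generated id is populated during save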

There’s an even simpler way of doing CRUD, queries and scans — Spring Data.

Spring Data DynamoDB

Spring Data DynamoDB is a community project, as Pivotal has no DynamoDB implementation. You can find it here on GitHub. Using it is pretty simple, just as with all Spring Data projects.

First of all, we need to put an @EnableDynamoDBRepositories annotation on our context class:
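
A sketch of the annotated configuration; the base package is our assumption:

    @Configuration
    @EnableDynamoDBRepositories(basePackages = "com.example.repositories")
    public class DynamoDBConfig {
        // the AmazonDynamoDB bean from the client setup above goes here
    }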

And then create the repository interface:
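
A sketch of the repository; the finder method assumes the firstName attribute from our entity:

    @EnableScan
    public interface CustomerRepository extends CrudRepository<Customer, String> {
        List<Customer> findByFirstName(String firstName);
    }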

Note the @EnableScan annotation. All methods of the repository are implemented as queries by default; however, if there is no index to query, they will fail, so you need to explicitly enable scan operations with @EnableScan. This looks much easier (as expected from Spring Data) than the other APIs we’ve seen. Keep in mind, however, that this is not an official Pivotal project, so it might not contain the latest AWS DynamoDB features and bugs might not be fixed quickly, so proceed with caution.

Integration Testing

For integration testing, it’s possible to just use the same local DynamoDB distribution we’ve used for development. If you’re using Maven, there’s a plugin for DynamoDB startup and shutdown during tests, but with Gradle there’s really no need for that. If we copy the local distribution to the db folder in our project root, our test configuration would look like this:
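
One possible sketch is to launch the distribution from the test class itself; the flags are DynamoDB Local’s documented options, and the port has to match the endpoint property:

    private static Process dynamoDB;

    @BeforeClass
    public static void startDynamoDB() throws IOException {
        dynamoDB = new ProcessBuilder(
                "java", "-Djava.library.path=db/DynamoDBLocal_lib",
                "-jar", "db/DynamoDBLocal.jar",
                "-inMemory", "-port", "8000")
                .inheritIO()
                .start();
    }

    @AfterClass
    public static void stopDynamoDB() {
        dynamoDB.destroy();
    }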

We’re taking advantage of the inMemory option, although we could alternatively point the db to store the data in a temporary test folder. We can also set up a script to download the distribution prior to the tests if we do not want to commit it to the repository.

A simple integration test for our Spring Data repository would then look very straightforward:
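
A sketch under those assumptions; the application class name and the test data are illustrative:

    @RunWith(SpringJUnit4ClassRunner.class)
    @SpringApplicationConfiguration(classes = Application.class)
    public class CustomerRepositoryIntegrationTest {

        @Autowired
        private CustomerRepository repository;

        @Test
        public void findsSavedCustomerByFirstName() {
            Customer customer = new Customer();
            customer.setFirstName("John");
            customer.setLastName("Doe");
            repository.save(customer);

            List<Customer> found = repository.findByFirstName("John");
            assertEquals(1, found.size());
        }
    }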

Conclusion

We’ve walked through some of the available Java APIs and tools for working with DynamoDB as well as discussed some of its features and limitations. However, there’s still much to learn if you want to build a solid application using DynamoDB. We advise you to start from the official developer’s guide we’ve linked throughout this article as it contains plenty of useful information on features, best practices and limitations. We hope that the information provided here will help you get on track faster, knowing the basic approaches and tooling you can use.

Originally published at tech.smartling.com on August 5, 2015.
