
Getting acquainted with DynamoDB

DynamoDB is not designed to be as flexible as typical SQL databases when it comes to joins, selecting arbitrary columns, or creating arbitrary indexes: it is designed to handle big data (hundreds of gigabytes or terabytes), so it is natural that operations like "select all the movies from 1993, 1995 and 1998" are discouraged - they would simply be too costly. You can still do them, but it involves scanning the whole table and filtering the results. With this in mind, DynamoDB is really only worthwhile if you are working with big data; if not, you'll probably be better off with a more conventional database.
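To make this concrete, here is roughly what that "movies from 1993, 1995 and 1998" search looks like as a Scan with the AWS SDK for JavaScript (just a sketch - the region is an assumption, and the filter only discards items after they have already been read, so you still pay for reading the whole table):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" });

const params = {
    TableName: "Movies",
    // The filter runs after the items are read: the whole table is still scanned.
    FilterExpression: "#ye in (:a, :b, :c)",
    ExpressionAttributeNames: { "#ye": "year" }, // "year" is a reserved word
    ExpressionAttributeValues: { ":a": 1993, ":b": 1995, ":c": 1998 }
};

// A single call returns at most 1 MB of data; a real implementation would
// keep calling scan with ExclusiveStartKey until LastEvaluatedKey is empty.
docClient.scan(params, (err, data) => {
    if (err) console.error(err);
    else console.log(data.Items);
});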

So, what is the deal with queries and secondary indexes, exactly (I mentioned them in my previous post)? To explain this, it helps to understand how indexes work in DynamoDB; that way we can see why they are so important.

Suppose we have this table, where id is a primary key:

id (PK) title year category
1 The Godfather 1972 1
2 GoldenEye 1995 1
3 Pirates of Silicon Valley 1999 2
4 The Imitation Game 2014 2

In this case, we could search for "movies whose id is 3", but not for movies whose id is less than 3, greater than 3, different from 3, or between 1 and 3 - this is because the primary key must always be a hash. Even though it is a number, the id is indexed by hashing it (to decide which partition the item lives in), not by sorting it, so it cannot be searched with criteria that require ordering; it can only be matched against an exact value.
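In code, a query on the primary key could look like this (a sketch using the AWS SDK for JavaScript; the #id alias is just a precaution, since some attribute names are reserved words):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" });

const params = {
    TableName: "Movies",
    KeyConditionExpression: "#id = :v", // only "=" works on a hash key
    ExpressionAttributeNames: { "#id": "id" },
    ExpressionAttributeValues: { ":v": 3 }
};

// Using "#id < :v" or "#id between :a and :b" here would be rejected with a
// ValidationException: a hash key only supports exact matches.
docClient.query(params, (err, data) => {
    if (err) console.error(err);
    else console.log(data.Items); // the single movie with id 3
});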

Now, I already explained that in order to make queries, we always need to use the primary key. This is true, but not entirely: you can create "secondary primary keys" (global secondary indexes) and query based on them - and unlike the primary key, they do not have to be unique. I will explain what "local secondary indexes" are later; for now I'll focus on global indexes. We could create a global secondary index on the category of the movie:

id (PK) title year category (GSIH)
1 The Godfather 1972 1
2 GoldenEye 1995 1
3 Pirates of Silicon Valley 1999 2
4 The Imitation Game 2014 2

Where GSIH = Global secondary index, hash. Indexes need a name, so I will call this one "CategoryIndex".
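Creating such an index does not require touching the data: it can be added to an existing table. A rough sketch with the AWS SDK for JavaScript (the throughput values are placeholders, and the call only starts the creation - DynamoDB backfills the index in the background):

const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });

const params = {
    TableName: "Movies",
    // Any attribute used as an index key must be declared here.
    AttributeDefinitions: [
        { AttributeName: "category", AttributeType: "N" }
    ],
    GlobalSecondaryIndexUpdates: [{
        Create: {
            IndexName: "CategoryIndex",
            KeySchema: [
                { AttributeName: "category", KeyType: "HASH" }
            ],
            Projection: { ProjectionType: "ALL" }, // copy every attribute into the index
            ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
        }
    }]
};

dynamodb.updateTable(params, (err) => {
    if (err) console.error(err);
    else console.log("CategoryIndex is being created...");
});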

Now that we have a secondary index, we can use it to make queries ("year" happens to be a reserved word in DynamoDB, so it has to be aliased through ExpressionAttributeNames, even inside the projection):

{
    TableName: "Movies",
    IndexName: "CategoryIndex",
    ProjectionExpression: "id, title, #ye",
    KeyConditionExpression: "#cat = :v",
    ExpressionAttributeNames: { "#cat": "category", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 2 }
}

This will get us the movies Pirates of Silicon Valley and The Imitation Game (the two in category 2). The attribute "category", however, is still a hash, and this means we can only search it by exact value.

Not very intuitively, an index can actually have two key attributes, the second one being optional: a hash (in the examples I showed, id and category) and a range. Range keys are stored sorted, meaning that we can perform searches with operators such as greater than, less than, between, and so on - but you still need to specify the hash in the query. For instance, if we wanted to get the movies from category 2 released between 1995 and 2005, we could turn the year attribute into the range key of CategoryIndex:

id (PK) title year (GSIR) category (GSIH)
1 The Godfather 1972 1
2 GoldenEye 1995 1
3 Pirates of Silicon Valley 1999 2
4 The Imitation Game 2014 2

Where GSIH = Global secondary index hash, and GSIR = Global secondary index range.

{
    TableName: "Movies",
    IndexName: "CategoryIndex",
    ProjectionExpression: "id, title, #ye",
    KeyConditionExpression: "#cat = :v and #ye between :y and :z",
    ExpressionAttributeNames: { "#cat": "category", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 2, ":y": 1995, ":z": 2005 }
}

This would give us the movie Pirates of Silicon Valley. Global secondary indexes can be created and deleted whenever you want; at the time of writing, you can have up to 5 of them per table.
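One caveat: the key schema of an existing global secondary index cannot be changed, so adding year as a range key to CategoryIndex really means deleting the old index and creating a new one (you can only create or delete one index per UpdateTable call). The creation could look roughly like this (same assumptions as the earlier sketch):

const params = {
    TableName: "Movies",
    AttributeDefinitions: [
        { AttributeName: "category", AttributeType: "N" },
        { AttributeName: "year", AttributeType: "N" }
    ],
    GlobalSecondaryIndexUpdates: [{
        Create: {
            IndexName: "CategoryIndex",
            KeySchema: [
                { AttributeName: "category", KeyType: "HASH" },
                { AttributeName: "year", KeyType: "RANGE" } // sorted within each category
            ],
            Projection: { ProjectionType: "ALL" },
            ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
        }
    }]
};
// Passed to dynamodb.updateTable(params, callback) exactly as before.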

Local secondary indexes are almost the same; the differences are: instead of choosing your own hash, the table's hash key is reused as the index's hash (so it still has to appear in the query), and you only pick a different range attribute. They also live in the same partitions as the table itself, which is why they have to be defined when the table is created and cannot be added or removed afterwards.
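For illustration, suppose the Movies table had instead been designed with category as its hash key and id as its range key; a local secondary index could then offer an alternative sort order by year within each category. A sketch of that createTable call (the table name, index name and throughput values are all hypothetical):

const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });

const params = {
    TableName: "MoviesByCategory", // hypothetical variant of the Movies table
    AttributeDefinitions: [
        { AttributeName: "category", AttributeType: "N" },
        { AttributeName: "id", AttributeType: "N" },
        { AttributeName: "year", AttributeType: "N" }
    ],
    KeySchema: [
        { AttributeName: "category", KeyType: "HASH" },
        { AttributeName: "id", KeyType: "RANGE" }
    ],
    LocalSecondaryIndexes: [{
        IndexName: "YearIndex",
        KeySchema: [
            { AttributeName: "category", KeyType: "HASH" }, // same hash key as the table
            { AttributeName: "year", KeyType: "RANGE" }     // alternative sort order
        ],
        Projection: { ProjectionType: "ALL" }
    }],
    ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
};

dynamodb.createTable(params, (err) => {
    if (err) console.error(err);
    else console.log("Table and local index are being created...");
});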

But after all, why do we need to divide our data into smaller categories to search? Because if you are working with big data, you have to split it into smaller pieces somehow, otherwise it will simply be too expensive to search. How can you divide it? Find something the items have in common that separates the data nicely into homogeneous groups.

Remember my other example, where I wanted to search only the movies from 1992 to 1999, but without scanning the whole table? How could we do this? Let's think a bit about this example: why would you run this query? If it is because your website offers a list of "all movies released from year X to year Y in decade Z", you could make use of this common ground, create an attribute for it, and index it like this (I'll call it DecadeIndex):

id (PK) title decade (GSIH) year (GSIR) category
1 The Godfather 70 1972 1
2 GoldenEye 90 1995 1
3 Pirates of Silicon Valley 90 1999 2
4 The Imitation Game 00 2014 2

Now look: we have a hash key (decade) that covers all the possible results that we want, and we also have a range key (year). We can search it with:

{
    TableName: "Movies",
    IndexName: "DecadeIndex",
    ProjectionExpression: "id, title, #ye",
    KeyConditionExpression: "#dec = :v and #ye between :y and :z",
    ExpressionAttributeNames: { "#dec": "decade", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 90, ":y": 1992, ":z": 1999 }
}

If I didn't type anything wrong, we would get the movies GoldenEye and Pirates of Silicon Valley.

If you are like me, you are probably thinking: "Ok, but what if I wanted movies from 1992 to 2005? That spans more than one decade". This is also simple to solve: if that is a possibility, you could create another index that groups the data differently, or simply query once per decade and merge the results - it sounds costly, but since the entries are indexed, the operation will still be far faster than doing a scan (and probably faster than the same operation in an SQL database).
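A sketch of the "one query per decade" approach (a hypothetical helper using the AWS SDK for JavaScript; it keeps the two-digit decade values from the table above and ignores result pagination for brevity):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" });

async function moviesBetween(from, to) {
    const movies = [];
    // Decades touched by the interval, e.g. 1992-2005 -> 1990 and 2000.
    for (let decade = Math.floor(from / 10) * 10; decade <= to; decade += 10) {
        const data = await docClient.query({
            TableName: "Movies",
            IndexName: "DecadeIndex",
            ProjectionExpression: "id, title, #ye",
            KeyConditionExpression: "#dec = :v and #ye between :y and :z",
            ExpressionAttributeNames: { "#dec": "decade", "#ye": "year" },
            // The example table stores decades as two digits (90, 0), hence the modulo.
            ExpressionAttributeValues: { ":v": decade % 100, ":y": from, ":z": to }
        }).promise();
        movies.push(...data.Items);
    }
    return movies;
}

moviesBetween(1992, 2005).then(console.log);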

In conclusion, DynamoDB can be extremely efficient for operations on tables with enormous amounts of data, but it comes at a price: you must plan the structure of your database carefully and create indexes wisely, keeping in mind which searches you will be doing.