hcoelho.com

my blog


Why DynamoDB is awesome and why it is not

Update: I made a new post about DynamoDB and how to solve its limitations: Getting acquainted with DynamoDB

We still don't know for sure which technologies we are going to use for our API, including the database; the two main candidates we are focusing on right now are DynamoDB and PostgreSQL. Most developers are already familiar with PostgreSQL: it is a free, open-source SQL database, similar to MySQL; DynamoDB, however, is a proprietary NoSQL database that belongs to Amazon.

We did our research and tried both; these are our impressions and the main differences:

                     | DynamoDB                  | PostgreSQL
---------------------+---------------------------+------------------------
 Structure           | NoSQL                     | SQL
 Documentation       | Misleading and confusing  | Good
 Price on AWS [1]    | Cheap and flexible        | Fair, but not flexible
 Syntax              | Has its own syntax        | SQL
 Ease of use [2]     | Fair                      | Very easy
 Scalability on AWS  | Excellent                 | Good
 Performance on AWS  | Excellent                 | Good
  • [1] AWS = Amazon Web Services.
  • [2] May be misleading, since we come from an SQL background, so there is not much to learn in order to use PostgreSQL. In fairness, DynamoDB does a good job of being intuitive.

It seems that DynamoDB is a fair competitor; however, it may have a dealbreaker: the way it handles indexes and queries. To explain this, let's suppose we have the following table called Movies (NoSQL doesn't have tables, I know, can you stop being pedantic pls? Besides, "table" is actually the correct name for this structure in DynamoDB):

 id (PK) | title                     | year | category
---------+---------------------------+------+----------
 1       | The Godfather             | 1972 | 1
 2       | GoldenEye                 | 1995 | 1
 3       | Pirates of Silicon Valley | 1999 | 1
 4       | The Imitation Game        | 2014 | 1

Just a disclaimer before I start explaining the differences: the documentation for DynamoDB is very obscure, so it is possible that I am missing some pieces of information or simply misunderstood them. So, in DynamoDB, your primary key must be a hash field - it is unique, but cannot be searched as a range (you can't search for "id between 1 and 5", for instance). You can, however, specify another column to be a range (year could be a range). For this example, there is only one index: id.
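
To make the hash/range idea more concrete, here is a rough sketch (untested, based on our reading of the docs) of how this Movies table could be created with the AWS SDK for JavaScript - the region and the capacity numbers are just placeholders:

const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });

dynamodb.createTable({
  TableName: "Movies",
  AttributeDefinitions: [
    { AttributeName: "id", AttributeType: "N" } // only key attributes are declared here
  ],
  KeySchema: [
    { AttributeName: "id", KeyType: "HASH" } // the hash (primary) key; a RANGE key could also go here
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
}, (err, data) => {
  if (err) console.error(err);
  else console.log("Table created:", data.TableDescription.TableName);
});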

In order to select all the data from the table, this is how we could do it in SQL:

SELECT id, title, year FROM Movies;

This is how we could do it with DynamoDB (it may have an error somewhere, since I can't test it right now - just bear with me, ok?):

{
  TableName: "Movies",
  ProjectionExpression: "id, title, #yr",
  ExpressionAttributeNames: { "#yr": "year" }
}

Nothing incredible, right? ProjectionExpression lists the fields we are looking for (#yr is a placeholder for year, which happens to be a reserved word in DynamoDB - more on placeholders in a moment). This kind of operation is called a scan - it scans the whole table and returns all the results. So how would we search for a specific ID, say, ID 3? In SQL:

SELECT id, title, year FROM Movies WHERE id=3;

In DynamoDB:

{
  TableName: "Movies",
  ProjectionExpression: "#k, title, #yr",
  KeyConditionExpression: "#k = :v",
  ExpressionAttributeNames: { "#k": "id", "#yr": "year" },
  ExpressionAttributeValues: { ":v": 3 }
}

Weird, right? But the idea is actually simple: #k and :v are placeholders - #k is 'id' and :v is '3', just like variables and their values. KeyConditionExpression is the condition, ExpressionAttributeNames is the "map" for the keys (we also need #yr there, since year is a reserved word), and ExpressionAttributeValues is the "map" for the values.
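
For context, this is roughly how those parameters would be sent to DynamoDB with the AWS SDK for JavaScript - a sketch, not tested, and the region is a placeholder:

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" });

const params = {
  TableName: "Movies",
  ProjectionExpression: "#k, title, #yr",
  KeyConditionExpression: "#k = :v",
  ExpressionAttributeNames: { "#k": "id", "#yr": "year" },
  ExpressionAttributeValues: { ":v": 3 }
};

// query() uses the key condition; the scan example earlier would go through scan() instead
docClient.query(params, (err, data) => {
  if (err) console.error(err);
  else console.log(data.Items); // the movie with id 3
});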

So far so good, but here is the catch: when you create a table in DynamoDB, you have to specify a primary key, which is also the index, and you cannot make a query that doesn't use that key in the condition. What I mean by this is: say you want to "find the movies made in the 90s" by putting the condition in the query... Well, in principle, you can't, simple as that - because you would not be using the primary key in the condition. There are, however, workarounds: doing a scan and filtering it, or using secondary indexes.

The first alternative is doing a scan in the database (getting all the data) and then filtering it like this:

{
  TableName: "Movies",
  ProjectionExpression: "id, title, #yr",
  FilterExpression: "#yr between :start_yr and :end_yr",
  ExpressionAttributeNames: { "#yr": "year" },
  ExpressionAttributeValues: { ":start_yr": 1990, ":end_yr": 1999 }
}

Seems simple, but it has a big drawback: the scan still reads ALL the data in the table (and you pay for those reads) - the filter is only applied after the data has been fetched. This is often unacceptable if you have large quantities of data.
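
To make the drawback clearer, here is a hedged sketch of what scan-and-filter looks like in practice with the AWS SDK for JavaScript (the function and variable names are ours): DynamoDB returns scan results in pages, so the client keeps scanning until the whole table has been read, even if only a handful of items survive the filter.

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" });

async function moviesFromTheNineties() {
  const params = {
    TableName: "Movies",
    ProjectionExpression: "id, title, #yr",
    FilterExpression: "#yr between :start_yr and :end_yr",
    ExpressionAttributeNames: { "#yr": "year" },
    ExpressionAttributeValues: { ":start_yr": 1990, ":end_yr": 1999 }
  };

  let items = [];
  let result;
  do {
    // each call reads another page of the table; the filter only trims what gets returned
    result = await docClient.scan(params).promise();
    items = items.concat(result.Items);
    params.ExclusiveStartKey = result.LastEvaluatedKey; // present until the whole table was read
  } while (result.LastEvaluatedKey);
  return items;
}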

The other alternative is using what they call "secondary indexes", and this is where things get complicated: secondary indexes can be local or global. Local indexes can be queried, but are still tied to the original hash key (the primary key); global indexes can be queried, HOWEVER, they must rely on another hash, one that is not the primary key. If we made a global secondary index for year that used category as the hash, we could query for the "movies made between 1990 and 1999 which belong to category 1" (with category as the hash and year as the range) like this:

{
  TableName: "Movies",
  IndexName: "CategoryYearIndex",
  ProjectionExpression: "id, title, #yr",
  KeyConditionExpression: "category = :cat and #yr between :ys and :ye",
  ExpressionAttributeNames: { "#yr": "year" },
  ExpressionAttributeValues: { ":ys": 1990, ":ye": 1999, ":cat": 1 }
}

Which is reasonable; HOWEVER, global secondary indexes also have problems: you are still tied to a hash, and they have their own provisioned throughput, so you pay extra to use them.
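
For reference, here is a sketch (based on our reading of the documentation, not tested) of how such an index could be declared; a global secondary index can apparently be added to an existing table with updateTable, and the capacity numbers below are placeholders - they are billed on top of the table's own throughput, which is the extra cost mentioned above:

const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });

dynamodb.updateTable({
  TableName: "Movies",
  AttributeDefinitions: [ // the key attributes used by the new index
    { AttributeName: "category", AttributeType: "N" },
    { AttributeName: "year", AttributeType: "N" }
  ],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: "CategoryYearIndex",
      KeySchema: [
        { AttributeName: "category", KeyType: "HASH" }, // the "other hash" the index relies on
        { AttributeName: "year", KeyType: "RANGE" }     // the attribute we want to range over
      ],
      Projection: { ProjectionType: "ALL" },            // make id and title visible through the index
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
    }
  }]
}, (err) => {
  if (err) console.error(err);
});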

Alright... But that does not really answer the question: how can I use it to "select movies from 1990 to 1999" without using another hash? Well, as far as we understood from the documentation, the only way around this is scanning your whole table and filtering it. Not ideal; HOWEVER, local secondary indexes kind of solve this: I read in another blog post that if you scan and filter on attributes backed by secondary indexes, it is still very performant and won't be as costly as fetching all the data. HOWEVER, local secondary indexes can only be created at the moment the table is created: you cannot change or add them later, which is not exactly scalable.
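
To illustrate that last limitation, here is a compact sketch using a hypothetical variant of the Movies table (keyed by id and year): the LocalSecondaryIndexes block has to be part of the original createTable call - there is no call for adding one to an existing table later (the table and index names below are made up):

const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });

dynamodb.createTable({
  TableName: "MoviesWithLocalIndex",
  AttributeDefinitions: [
    { AttributeName: "id", AttributeType: "N" },
    { AttributeName: "year", AttributeType: "N" },
    { AttributeName: "category", AttributeType: "N" }
  ],
  KeySchema: [
    { AttributeName: "id", KeyType: "HASH" },
    { AttributeName: "year", KeyType: "RANGE" }
  ],
  LocalSecondaryIndexes: [{ // only possible here, at creation time
    IndexName: "CategoryIndex",
    KeySchema: [
      { AttributeName: "id", KeyType: "HASH" }, // must reuse the table's hash key
      { AttributeName: "category", KeyType: "RANGE" }
    ],
    Projection: { ProjectionType: "ALL" }
  }],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
}, (err) => {
  if (err) console.error(err);
});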

It seems that DynamoDB is really powerful and easy to use if you want to make simple, fast queries to only retrieve values without too many conditions - it will really shine in these situations. But if your system requires more unusual queries and your indexes may change over time, I don't think DynamoDB is a good choice; you can work around these limitations, but I feel like you will just be swimming against the current.

cdot dynamo 

Planning and technology limbo


Over the last few days we spent a lot of time planning the system: iterating over the database schema, the design of the APIs, the client modules, and how all these pieces fit together. One part often ends up influencing the others, until we finally find an arrangement that fits together and works.

I usually enjoy this part of the project: planning involves a lot of thinking and good strategy, like solving a puzzle; but it can be very stressful at times. What I don't like about planning is that it takes time, and during this time, you end up floating in limbo: you can't make concrete plans because you don't know if they will hold up in the long term. The technologies we are considering right now for the project are MySQL, AWS Lambda + Gateway, and AWS Elastic Search.

The capabilities of PostgreSQL that I described in the previous post seem to be supported in MySQL 5.7, which makes it a suitable candidate for a database; however, we need to make sure it is capable of enduring the traffic. For the past few days, I've tried n times and failed n-1 times (for now) to create a suitable testing scenario for MySQL. The scenario is simple: set up a VM with MySQL, bombard it with hundreds of millions of rows (with numbers and JSON data), and see what happens - if it behaves as it should, we query it until it breaks. Seems simple, but the universe has been trying to thwart my plans (and succeeding) over the past few days:

  • 1st try: The internet was strangely slow the day I started downloading the VM. One hour later, it finished: the download was corrupted and I had to start over.
  • 2nd try: VM installed, but the version of MySQL was wrong and I had to update it - naturally, I broke the installation beyond repair and just built a new VM.
  • 3rd try: VM installed and MySQL updated. I also made a neat little script that inserts batches of 1,000 random records into the database (a sketch of the idea is shown after this list) and let it run for a while. The result: 55,000,000 rows inserted. "Great! Now I can start testing" - I messaged myself mentally. After some informal tests, it was time to go home and I still had to insert more records; we decided to let the script run overnight, but first - "does it stop when we lock the computer?" - we wondered, and decided to try. Any sensible person would back up the database before doing this, but 55 million rows really takes a while to download; besides, we are beyond sensible, so we locked it anyway: that's how we corrupted the third VM.
  • 4th try: We quickly set up the database again (just made a new table) and left the script running overnight. During the commute, we were betting on the results: I bet 80% on the VM being corrupted, 15% on the script breaking somehow, 10% on someone turning the computer off, and 5% on it working - the fact that it adds up to 110 does not matter, what matters is the general idea. The database was corrupted.
  • 5th try: New VM made, we left the script running for a few hours (reaching around 80 million rows) until the VM completely ran out of space; with a few adjustments, we increased the space and let it run a little more. Tomorrow we will run it again to insert more records.
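
For the curious, the insert script was something along these lines - a rough sketch, not the actual script, assuming the "mysql" npm package and a hypothetical table with a numeric column and a JSON column:

const mysql = require("mysql");
const connection = mysql.createConnection({
  host: "localhost",
  user: "tester",
  password: "secret",
  database: "loadtest"
});

function randomRow() {
  return [
    Math.floor(Math.random() * 1000000),                  // some random number
    JSON.stringify({ value: Math.random(), tag: "test" }) // some random JSON payload
  ];
}

function insertBatch() {
  const rows = Array.from({ length: 1000 }, randomRow);   // 1,000 rows per batch
  connection.query("INSERT INTO registers (num, doc) VALUES ?", [rows], (err) => {
    if (err) throw err;
    setImmediate(insertBatch);                            // keep going until stopped (Ctrl+C)
  });
}

connection.connect((err) => {
  if (err) throw err;
  insertBatch();
});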

So that was the MySQL saga so far. The other limbo we are floating in is the technology for the API: the client suggested that we use AWS Lambda + Gateway, and maybe AWS Elastic Search. These services are really nice (I will probably post about them if we get to learn more), but Lambda + Gateway seem to be a very simplified "framework" for an API - I am afraid that in the future we will have to modify it to be more robust and it will just not work. Although I would like to use them, I fear that the bureaucracy of AWS and its lack of intuitiveness will hurt us more than help.

cdot 

Why PostgreSQL is awesome


I was supposed to post this update on Friday (September 9th), but I forgot, so I decided to post it on Saturday (September 10th), but I forgot again; so then I decided to post it on Sunday (September 11th) and I forgot again; so I'll post it today.

One of the most important features (or maybe the only one) behind the popularity of NoSQL is its ability to store data without a schema; in other words: forget about tables, you store anything you want as JSON. This flexibility comes in really handy when the data you need to store is a bit unpredictable, but still needs to be indexed and searched - normally we would overcome this with complicated workarounds in the schemas, and that is where NoSQL really shines. Where it doesn't shine, however, is when you actually need relational schemas and to organize data in a very cohesive way.

Myself, I've never been a big fan of NoSQL: I love JSON, and I love to store information in JSON, but NoSQL never gave me the confidence of actually being reliable; thankfully, newer relational databases already support similar features for storing JSON. PostgreSQL accepts the JSON and JSONB data types, which let you store JSON documents in a column and still recognize and query their fields almost as if they were actual columns.

For instance, the entry below contains a JSON object with the data of a person called John Doe, 54 years old, who lives in Toronto, Ontario.

TABLE people

 id | doc
----+----------------------------------------------------------------
  1 | {
    |   "city": "Toronto",
    |   "province": "Ontario",
    |   "person": {
    |     "name": { "first": "John", "last": "Doe" },
    |     "age": 54
    |   }
    | }

His first and last name could be retrieved using the following SQL query:

SELECT doc->'person'->'name'->>'first', doc->'person'->'name'->>'last'
FROM people WHERE id=1;

The syntax is fairly simple and almost self-explanatory, with only one detail: the arrow '->' retrieves the value as a JSON object, while '->>' retrieves it as text.

The nice thing about this feature is that SQL can now be a mix of both worlds; it also means that, instead of pulling query results from the database and computing/filtering them in the API, this can be done directly in the SQL statement when necessary.
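
As a sketch of that last point (using the node-postgres "pg" package; the connection settings and function name are made up), the filtering below happens entirely inside the SQL statement, directly on the JSONB column, instead of in the API code:

const { Pool } = require("pg");
const pool = new Pool({ host: "localhost", database: "test", user: "tester" });

async function torontoPeopleOver50() {
  const result = await pool.query(
    `SELECT doc->'person'->'name'->>'first' AS first_name,
            doc->'person'->'name'->>'last'  AS last_name
       FROM people
      WHERE doc->>'city' = $1
        AND (doc->'person'->>'age')::int >= $2`, // the filtering happens in the database
    ["Toronto", 50]
  );
  return result.rows; // e.g. [{ first_name: 'John', last_name: 'Doe' }]
}

torontoPeopleOver50().then(rows => console.log(rows));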

cdot postgresql