hcoelho.com

my blog


Thoughts on API design


Captain's Log, Stardate 29092016.5. For the past few days, we have been working on the API that will be responsible for pushing the data sent by the client into the database. It is all very simple in theory, but it is always a challenge to come up with a good, consistent flow for an application - when it is just a simple task, such as pushing data, it is all very straightforward, but when it depends on foreign keys and relations, then we start having problems. For instance, it is easy to receive some information about a user (like login, password and email) and just put it in the database; but if the information is stored in multiple tables (like login, password, email, list of friends, list of messages), we need to be more careful: what if one of the inserts fails? Which one should we do first? Ensuring the correct flow for the application, while keeping modularity and recording and treating errors correctly, suddenly becomes very hard.

OOP versus Reality

We are talking about APIs, not OOP, but you get the idea

The problem is common and not easy to avoid, but it gets much easier to deal with when the parts require minimal effort to use and are powerful enough to perform operations consistently. It is tempting to make APIs that are very simple: they just put/get/delete a little piece of information, and we are done. It's easy and quick to get this job done, easy to test, easy to understand, and simplicity is good. But as a consequence, the code outside of it will pay the price: all the logic that you saved in the API may have to go somewhere else. For instance: say you need to design an API that logs items into the database. Easy, just design something like this:

http://www.api.com/insertItem?item=myItem1

It takes one item and records it. Done. The problem comes when you realize that you may receive a list of items to insert. What do you do then? You will have to make a loop that does this call over and over again - if it is an ajax request, you will have to use promises or nest several callbacks just to insert a few items. It would have been much better to design an API that is capable of accepting a comma-separated list of items after all:

http://www.api.com/insertItem?item=myItem1,myItem2,myItem3

If you are using an SQL database, it is easy to insert items in bulk; if you are using NoSQL, just split the string and you'll have an array ready to be inserted. Simplicity is nice, and it should always be present, but we also have to make sure that what we are designing fulfills the scenarios we are expecting: if you expect to receive several items, don't design something that only receives one and has to be called over and over again. People try to save time by making something very simple, but when it is way too simple, you are likely to pay the price later. It takes more time to build this base, but you will be able to code the rest of the application in a breeze.
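
As a rough sketch of how the second version could be handled on the server (assuming Node.js with Express; db.insertMany is a hypothetical stand-in for whatever bulk insert your database driver offers):

const express = require("express");
const app = express();

// Accepts one item or a comma-separated list: /insertItem?item=myItem1,myItem2,myItem3
app.get("/insertItem", async (req, res) => {
    const items = (req.query.item || "")
        .split(",")
        .map(item => item.trim())
        .filter(item => item.length > 0);

    if (items.length === 0) {
        return res.status(400).json({ error: "No items provided" });
    }

    try {
        await db.insertMany(items); // hypothetical bulk insert
        res.json({ inserted: items.length });
    } catch (err) {
        res.status(500).json({ error: err.message });
    }
});

The caller can now insert one or a hundred items with a single request, and the looping logic stays inside the API instead of leaking into every client.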

cdot api 

Cookies, Third-Party Cookies, and Local/Session Storage


In this post I will make a brief introduction to cookies, but more importantly, I want to talk about third-party cookies: What are they? Where do they live? How are they born? What do they eat? And what are the alternatives?

Due to the nature of HTTP (based on requests and responses), we don't really have a good way to store sessions (a fixed, persistent memory between visits to a webpage). This is solved by using cookies: the website creates a small text file on the client's computer, and this cookie can be accessed again by the same website. This solves the problem of having to ask for the client's username and password on every single page, for instance, or of storing information about their "shopping cart" (when not using databases). There is also a security feature: a website cannot access cookies that it didn't create; in other words, a cookie is only available to the domain that created it.

Cookie monster

According to scientists, this is what a cookie looks like

HTTP cookies first appeared in Mosaic Netscape (the first version of Netscape Navigator), followed by Internet Explorer 2. They can be set either on the server side (with PHP, for instance) or on the client side, using JavaScript (see the sketch after the list below). There are several types of cookies:

  • Session cookies are deleted when the browser is closed
  • Persistent cookies are not deleted when the browser is closed, but expire after a specific time
  • Secure cookies can only be transmitted over HTTPS
  • HttpOnly cookies cannot be accessed on the client side (JavaScript)
  • SameSite cookies can only be sent when originating from the same domain as the target domain
  • Supercookies are cookies set for a "top level" domain, such as .com, and are accessible to all websites within that domain
  • Zombie cookies get automatically recreated after being deleted
  • Third-party cookies (I will talk about them now)
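
As a rough illustration of how some of these attributes look when a cookie is set from JavaScript (names and values here are just examples; HttpOnly cookies cannot be created this way, since they must come from the server in a Set-Cookie header):

// A persistent cookie: expires in 30 days, sent only over HTTPS, only on same-site requests
const expires = new Date(Date.now() + 30 * 24 * 60 * 60 * 1000).toUTCString();
document.cookie = "username=hcoelho; expires=" + expires + "; path=/; Secure; SameSite=Strict";

// A session cookie: no "expires" attribute, so it is deleted when the browser is closed
document.cookie = "theme=dark; path=/";

// Reading cookies: document.cookie returns every cookie visible to this page
// as a single "key=value; key=value" string
console.log(document.cookie);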

Third-party cookies

Cookies are a lot more powerful than they seem, despite the obvious limitation of only being available to the domain that created them. Let's suppose I have a website called foo.com and I decide to put some ads on other websites: bar.com, baz.com and qux.com. Instead of simply offering these websites a static image with my advertisement, I could give them a PHP file that generates an image:

<a href="..."><img src="https://www.foo.com/banner.php" /></a>

This PHP file would just generate an image dynamically, but it could do more: it could send JavaScript to the clients and set cookies on the websites where I advertise. When someone accesses the website bar.com, my script could detect the website address (bar.com) using JavaScript (window.location) and record this information in a cookie. When the user navigates to other websites with my ads, such as baz.com or qux.com, my script would repeat the process. This information would be accessible to me: I would know exactly which websites the user visited.
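
A sketch of what the embedded script could do (the tracking endpoint and parameter names here are hypothetical): since the request for the tracking pixel goes to foo.com, the browser attaches foo.com's cookie to it, and that cookie is what makes this a third-party cookie on bar.com, baz.com and qux.com.

// track.js - served from https://www.foo.com and loaded on bar.com, baz.com and qux.com
(function () {
    // The website the user is actually visiting (e.g. "baz.com")
    const visitedSite = window.location.hostname;

    // Request a 1x1 "pixel" from foo.com, passing the visited website along.
    // The browser sends foo.com's cookie with this request, so foo.com can tie
    // the visit to the same user it has already seen on other websites.
    const pixel = new Image();
    pixel.src = "https://www.foo.com/collect?site=" + encodeURIComponent(visitedSite);
})();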

Third party cookies

Courtesy of Wikipedia

Problems with privacy and blocking

Needless to say, people who are slightly concerned with privacy do not like cookies; especially for non-techie people, this is a very convenient witch to hunt - I am surprised that magazines and newspapers are not abusing it. Most modern web browsers can block third-party cookies, which is a concern if you are planning a service that relies entirely on this feature.

It's not easy to find statistics about cookie usage, but I got one from Gibson Research Corporation:

Browser usage

Cookie configuration by browser, where: FP = First-Party cookie and TP = Third-Party cookie.

It seems that third-party cookies are disabled in Safari by default, while other web browsers are also getting stricter about them. Despite still being used, this practice seems to be reaching a dead end. On top of that, cookies are also not able to track users across different devices.

Alternative: Local/Session Storage

Apparently, cookies are dying. It may be a little too early to say this, but we don't want to create something that will be obsolete in 5 years, so it is a good idea to plan ahead. What is the future, then?

The most promising tool is probably Local and Session Storage, which also seems to be supported in the newest browsers:

Compatibility for Local and Session storage

The way Local and Session Storage work is very simple: they behave like a little database in the browser, storing key-value pairs of plain text. While Local Storage is persistent (it does not get deleted), Session Storage lasts only for the duration of the session (it is deleted when the tab or browser is closed, but not when the page is refreshed). They are great for storing persistent, non-sensitive data, but they are not accessible from the server: the storage is only accessible from the client side - if the server must have access to it, the data must be sent manually.
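
A minimal sketch of the API (both storages expose the same methods; the keys and values are just examples):

// Local Storage: persists across browser restarts
localStorage.setItem("theme", "dark");
console.log(localStorage.getItem("theme")); // "dark"

// Values are plain text, so objects have to be serialized manually
localStorage.setItem("cart", JSON.stringify({ items: ["book", "pen"] }));
const cart = JSON.parse(localStorage.getItem("cart"));

// Session Storage: same API, but cleared when the tab/browser is closed
sessionStorage.setItem("draft", "Hello!");

// Removing values
localStorage.removeItem("theme");
sessionStorage.clear();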

Using Local Storage, it is possible to build a system similar to third-party cookies, with methods similar to the ones I explained. Here is an article on how to do this: Cross Domain Localstorage.


cdot cookies 

Getting acquainted with DynamoDB


In my previous post about dynamodb I explained some limitations and quirks of DynamoDB, but I feel I focused too much on the negative sides, so this time I will focus on why it is actually a good database and how to avoid its limitations.

DynamoDB is not designed to be as flexible as common SQL databases when it comes to making joins, selecting anything you want, and creating arbitrary indexes: it is designed to handle big data (hundreds of gigabytes or terabytes), so it is natural that operations like "select all the movies from 1993, 1995 and 1998" would be discouraged - the operation would just be too costly. You can still do them, but it would involve scanning the whole database and filtering the results. With this in mind, DynamoDB appears to be useful only if you are working with big data; if not, you'll probably be better off with a more conventional database.

So, what is the deal with queries and secondary indexes, exactly (I mentioned them in my previous post)? To explain this, it is good to understand how indexes work in DynamoDB; this way we can understand why they are so important.

Suppose we have this table, where id is a primary key:

id (PK)   title                       year   category
1         The Godfather               1972   1
2         GoldenEye                   1995   1
3         Pirates of Silicon Valley   1999   2
4         The Imitation Game          2014   2

In this case, we could search for "movies whose id is 3", but not movies whose id is less than 3, more than 3, different from 3, or between 1 and 3 - this is because the primary key must always be a hash. Although the id is a number, the way it gets indexed (it is hashed, not sorted) makes it impossible to search by criteria that demand sorting; it can only be matched against an exact value.
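
For reference, this is roughly what an exact-match query on the primary key looks like (a sketch using the DocumentClient from the AWS SDK for JavaScript; error handling kept to a minimum):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    TableName: "Movies",
    KeyConditionExpression: "id = :id", // exact match only: no <, >, or "between" on a hash key
    ExpressionAttributeValues: { ":id": 3 }
};

docClient.query(params, (err, data) => {
    if (err) console.error(err);
    else console.log(data.Items); // [{ id: 3, title: "Pirates of Silicon Valley", ... }]
});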

Now, I already explained that in order to make queries, we always need to use the primary key. This is true, but not entirely: you can create "secondary primary keys" (global secondary indexes) and search based on them, and these secondary indexes do not have to be unique. I will explain what "local secondary indexes" are later; for now I'll focus on global indexes. We could make a global secondary index on the category of the movie:

id (PK)   title                       year   category (GSIH)
1         The Godfather               1972   1
2         GoldenEye                   1995   1
3         Pirates of Silicon Valley   1999   2
4         The Imitation Game          2014   2

Where GSIH = Global secondary index, hash. Indexes need a name, so I will call this one "CategoryIndex".

Now that we have a secondary index, we can use it to make queries:

{
    TableName: "Movies",
    IndexName: "CategoryIndex",
    ProjectionExpression: "id, title, #ye", // "year" is a reserved word in DynamoDB, so it needs a placeholder
    KeyConditionExpression: "#cat = :v",
    ExpressionAttributeNames: { "#cat": "category", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 2 }
}

This will get us the movies Pirates of Silicon Valley and The Imitation Game (the ones in category 2). The attribute "category", however, is still a hash, and this means we can only search it with exact values.

Not very intuitively, indexes can actually have two fields, the second one being optional: a hash (in the examples I showed, id and category) and a range. Ranges are stored sorted, meaning that we can perform searches with operators such as greater than, less than, between, etc. - but you still need to use the hash in the query. For instance, if we wanted to get the movies from category 2 released between 1995 and 2005, we could turn the attribute year into a range belonging to the index CategoryIndex:

id (PK)   title                       year (GSIR)   category (GSIH)
1         The Godfather               1972          1
2         GoldenEye                   1995          1
3         Pirates of Silicon Valley   1999          2
4         The Imitation Game          2014          2

Where GSIH = Global secondary index hash, and GSIR = Global secondary index range.

{
    TableName: "Movies",
    IndexName: "CategoryIndex",
    ProjectionExpression: "id, title, #ye",
    KeyConditionExpression: "#cat = :v and #ye between :y and :z",
    ExpressionAttributeNames: { "#cat": "category", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 2, ":y": 1995, ":z": 2005 }
}

This would give us the movie Pirates of Silicon Valley. Global secondary indexes can be created and deleted whenever you want, and you can have up to 5 of them per table.

Local secondary indexes are almost the same; the difference is that instead of choosing both a hash and an optional range, the table's primary key is used as the hash, meaning it will have to appear in the query. They are also used to partition your table, which means they cannot be created or changed after the table is created.

But after all, why do we still need to divide our data into smaller categories to search? Well, because if you are working with big data, you have to break your data into smaller pieces somehow, otherwise it will just be too hard to search. How can you divide it? Find something in common that separates the data nicely into homogeneous groups.

Remember my other example, where I only wanted to search for movies from 1992 to 1999, but without scanning the whole table? How could we do this? Let's think a bit about this example: why would you query this? If you are querying this because your website offers a list of "all movies released from the year X to Y in the Z decade", you could make use of this common ground, create an attribute for it, and index it like this (I'll call it DecadeIndex):

id (PK)   title                       decade (GSIH)   year (GSIR)   category
1         The Godfather               70              1972          1
2         GoldenEye                   90              1995          1
3         Pirates of Silicon Valley   90              1999          2
4         The Imitation Game          00              2014          2

Now look: we have a hash index (decade) that covers all the possible results that we want, and we also have a range field (year). We can search it with:

{
    TableName: "Movies",
    IndexName: "DecadeIndex",
    ProjectionExpression: "id, title, #ye",
    KeyConditionExpression: "#dec = :v and #ye between :y and :z",
    ExpressionAttributeNames: { "#dec": "decade", "#ye": "year" },
    ExpressionAttributeValues: { ":v": 90, ":y": 1992, ":z": 1999 }
}

If I didn't type anything wrong, we would get the movies GoldenEye and Pirates of Silicon Valley.

If you are like me, you are probably thinking: "Ok, but what if I wanted movies from 1992 to 2005? This will span more than one decade". This is also simple to solve: if this is a possibility, you could have another index with the same functionality, or simply query once per decade - it seems costly, but since the entries are indexed, the operation will still be far faster than doing a scan (and probably faster than doing the same operation in an SQL database).
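
A sketch of what "one query per decade" could look like with the DocumentClient, reusing the hypothetical DecadeIndex from above:

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Movies from 1992 to 2005 span two decades, so we run one query per decade
// and clamp the year range to the part that falls inside each decade.
// (Assuming the decade attribute is stored as a number, so the 2000s are 0.)
const ranges = [
    { decade: 90, from: 1992, to: 1999 },
    { decade: 0,  from: 2000, to: 2005 }
];

const queries = ranges.map(r =>
    docClient.query({
        TableName: "Movies",
        IndexName: "DecadeIndex",
        ProjectionExpression: "id, title, #ye",
        KeyConditionExpression: "#dec = :d and #ye between :from and :to",
        ExpressionAttributeNames: { "#dec": "decade", "#ye": "year" },
        ExpressionAttributeValues: { ":d": r.decade, ":from": r.from, ":to": r.to }
    }).promise()
);

// Merge the results of both queries into a single list
Promise.all(queries)
    .then(results => results.reduce((all, r) => all.concat(r.Items), []))
    .then(movies => console.log(movies));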

In conclusion, DynamoDB seems to be extremely efficient for operations on tables with enormous amounts of data, but it comes at a price: you must plan the structure of your database well and create indexes wisely, keeping in mind what searches you will be doing.

cdot dynamo