Have you tried restarting?

Evaluating Performance: A Benchmark Study of Serverless Solutions for Message Delivery to Containers on AWS Cloud

Nicola Cremaschini — Sun, 03 Mar 2024 21:48:21 GMT

In this article i'll show you how to forward events to private containers using serverless services and fan-out pattern.

I'll explore possible solutions within AWS ecosystem, but all are applicable regardless the actual service / implementation.

Context

Suppose you have a cluster of containers and you need to notify them when a database record is inserted or changed, and these changes apply to the internal state of the application. A fairly common use case.

Let's say you have the following requirements:

The tasks are in an autoscaling group, so their number may change over time.
A task is only healthy if it can be updated when the status changes. In other words, all tasks must have the same status. Containers that do not change their status must be marked as unhealthy and replaced.
When a new task is started, it must be in the last known status.
Status changes must be in near real- time. Status changes in the database must be passed on to the containers in less than 2 seconds.

Given these requirements, let's explore a few options.

Option 1: tasks directly querying the database

Pros:

easy to implement: The task is just to perform a simple query and get the current status, assuming it can be queried.
fast: It really depends on the DB resources and the complexity of the query, but there are not many hops and can be configured to be fast. You can configure polling time to match our requirement of 2 seconds requirement, e.g. every 1 second.
easy to mark as unhealthy tasks that fails to perform queries. The application could catch errors in queries and mark itself as unhealthy if it has enough resources. Otherwise, the load balancer's health check would fail.

Cons:

waste of resources: Your application queries the database even if no changes have been made. If your database does not change more frequently than the polling rate, most queries are useless.
your database is a single point of failure: If the database cannot serve queries, tasks cannot be notified.
it does not scale well: As the number of tasks grows, the number of queries grows and you may need to scale the database as well, or you may need a very large cluster running all the time to accommodate any scaling, wasting resources.
difficult to monitor: How can you check if an individual task is in the right state?

In such a scenario, I definitely don't like polling.

Let's try a different and opposite approach.

Option 2: Db streams changes to containers

Instead of having tasks asking to the database, let's have the database notifying them for changes.

Before go into the pros and cons, i must say that it would be very hard if not impossible to implement this solution exactly as i drown it. We can use a very popular pattern, called fan-out.

This is the wikipedia definition:

In message-oriented middleware solutions, fan-out is a messaging pattern used to model an information exchange that implies the delivery (or spreading) of a message to one or multiple destinations possibly in parallel, and not halting the process that executes the messaging to wait for any response to that message

To make things a little more concrete, let's use some popular AWS services that are commonly used to implement this pattern:

DynamoDB: NoSql database with native event streaming
SNS: pub/sub event bus
SQS: queue service

The solution looks like this:

Now let's explore pros and cons:

Pros:

first of all, you can see that arrows turned into dotted lines. This architecture is completely asynchronous
easy to implement: all integrations you need are native. You need just to configure serverless services and to implement a SQS consumer in your application.
very scalable: you can add as many task as you want without affecting the database, your limit here is SNS but is very high. As stated in official docs a single topic supports up to 12,500,000 subscriptions.
no waste of resources: a.k.a really cost-effective. This solution leverages on pay-per-use services, and they would be used only when actual changes occurs on db.
very easy to monitor: both SNS and SQS supports Dead Letter Topic / Queue: if a message isn't consumed within the timeout, it can be moved into a DLQ. You can set up an alarm if a DLQ is not empty, and kill the associated task.
easy to recover: If a container cannot consume a message, it can try again. In other words, it does not have to be online and ready to receive the message at the moment it is delivered, as the queues are persistent.
very fast: i did a benchmark on this solution, here the github repo with the actual code. Later on in this article we'll see results

Cons

more moving parts: even if the integration code is not required since it's provided by AWS, connecting things and tuning connections is not straightforward as performing a query.
not so easy to troubleshoot. As every distributed system, i would say.
it strongly depends on serverless services: if one link in the chain slows down or are not available, your containers can't be notified. We have to say that all involved services have a very good SLA: 3 nines for SQS and SNS and 4 nines for DynamoDB. Not sure about Dynamo stream, since it appears to be not included in DynamoDB SLA. I suppose dynamo streams are backed by Kinesis Streams, which also has 3 nines of availability.

Open points:

The main open point here, to me, was: is this fast enough? Let's verify it.

Trust, but verify

I couldn't find any official SLA about latency for involved services nor any AWS official benchmark.

So i decided to perform one myself, and i scripted a basic application using typescript and CDK / SDK.

Here the github repo with the actual code and details on how the system is implemented.

Before going ahead, bare in mind that i performed this benchmark with the goal to understand if this combination of services / configuration could fit for my specific context / use case. Your context may be different, and this configuration may not fit with it.

System design and data flow

The AppSync API receives mutations and stores derived data in the DynamoDB table
The DynamoDB stream the events
The Lambda function is triggered by the DynamoDB stream
The Lambda function sends the events to the SNS topic
The SNS topic sends the events to the SQS queues
The Fargate service reads the events from the SQS queues
If events are not processed within a timeout, they are moved to the DLQ
A Cloudwatch alarm is triggered if the DLQ is not empty

Key system parameters:

Region: eu-south-1
Number of tasks: 20
Event bus: 1 SQS per task, 1 DLQ per SQS, all SQS subscribed to one SNS
SQS Consumer: provided by AWS SDK, configured for long polling (20s)
Task configuration: 256 CPU, 512 Memory, Docker image based on Official Node Image 20-slim
DynamoDB Configured in PayPerUseMode, stream enabled to trigger Lambda
Lambda stream handler written in node20 bundled with ESBuild, configured with 128MB

Benchmark parameters

I used a basic postman collection runner to perform a mutation to Appsync every 5 seconds, for 720 iterations.

Goal

The goal was to verify if containers would be updated within 2 seconds.

Measures

i used the following Cloudwatch provided metrics:

Appsync latency
Lambda latency
Dynamo stream latency

and i created two custom metrics for measuring SQS and SNS time taken.

Time taken custom metrics are calculated from the SNS and SQS provided attributes:

SNS Timestamp: from AWS doc

The time (GMT) when the notification was published.

ApproximateFirstReceiveTimestamp: from AWS doc

returns the time the message was first received from the queue (epoch time in milliseconds).

SentTimestamp: from AWS doc

Returns the time the message was sent to the queue (epoch time in milliseconds).

The following code snippet shows you how attributes are used to calculate sns time taken in millis and sqs time taken in millis

//despite the name, this is the ISO Date the message was sent to the SNS topiclet snsReceivedISODate = messageBody.Timestamp;if (snsReceivedISODate && message.Attributes) {      clientReceivedTimestamp = +message.Attributes.ApproximateFirstReceiveTimestamp!;   sqsReceivedTimestamp = +message.Attributes.SentTimestamp!;   let snsReceivedDate = new Date(snsReceivedISODate);   snsReceivedTimestamp = snsReceivedDate.getTime();   clientReceivedDate = new Date(clientReceivedTimestamp!);   sqsReceivedDate = new Date(sqsReceivedTimestamp!);   snsTimeTakenInMillis = sqsReceivedTimestamp - snsReceivedTimestamp;   sqsTimeTakenInMillis = clientReceivedTimestamp - sqsReceivedTimestamp;

i didn't calculate the time taken by the client to parse the message because it really depends on the logic the client applies parsing the message.

Results

Here screenshots from my Cloudwatch dashboard

Few key data, from Average numbers:

Most of time is taken by Appsync, i couldn't do anything to lower this latency since i used native Appsync native integration with DynamoDB.
The only custom code is the Lambda stream processor code, and lamba duration is the second slowest component here. As you can see in the graph, the lambda cold start is the killer, but considering this we can observe a very good latency on avg (38 ms).
The average total time taken is 108.39 ms
The average response time measured by my client, that cover my client network latency, is 92 ms. Given Appsync AVG Latency is 60.5 ms, my Avg network latency is 29.5 ms. This means that from my client sending the mutation to consumers receiving the message there are 108.39 + 29.5 = 137.89 ms

Conclusion

This solution has proven to be fast and reliable and requires little configuration to set up.

Since almost everything is managed, there is little space for tuning and improvements. In this particular configuration, I could simply give the Stream Processor Lambda more memory, but memory and latency do not scale (inversely) together.

I could remove Lambda and replace it with Event Bridge Pipe. I haven't tried it yet, but i'm going to use the exact same benchmark and compare the results.

Last but not least, keep in mind that AWS does not always include latency in the service SLA. I've run this benchmark a few times with comparable results, but I can't be sure that I would always get the same results over time. If your system requires stable and predictable performance over time, you can't go with services that don't include performance metrics in their SLA. You're better off taking control of the layers below, which means you should consider going to a restaurant or even making your own pizza at home.

Wrap up

In this article, I have presented you with a solution that I had to design as part of my work and my approach to solution development: this includes clarifying the scope and context, evaluating different options and having a good knowledge of the parts involved and the performance and quality attributes of the overall system, writing code and benchmarking where necessary, but always with the clear awareness that there are no perfect solutions.

I hope it was helpful for you.

Bye 👋!

Serverless social login with AWS Cognito

Nicola Cremaschini — Sun, 24 Dec 2023 16:30:24 GMT

Disclaimer: This is not a step-by-step guide, just my trade-off analysis on using Amazon Cognito to provide social login for your app and some pitfalls I found in my experience.

In this article, I'll show you my serverless solution to add social identity providers as a login option for web and mobile applications, based on managed services and native integrations, and how I mitigated some issues I encountered.

Context

Let's assume you have an application for which your users do not have to register, but can log in with their social identity.

If you 're wondering why, think about it:

registration could be a barrier to entry for users as it requires more steps and sharing of data
most internet users have at least one social identity. All mobile users have at least one (Google identity for Android users, Apple identity for Apple users)
it is very easy for users to access your app if most of the login is done without a password
you can receive users data from social providers, if users allow your app to give their data to your app.

The most popular social IdPs are Facebook, Google, Apple, Amazon, LinkedIn, Github and many others.

Considering that every IdP should implement the OpenID Connect standard (we'll come back to this later...), which is a layer above the OAuth2 standard, and that every IdP requires some configuration, let's explore some options.

Option 1: Native integration

Every one has its own SDK and apis to integrate natively, so you can code in your app the integration for IdPs you want to use.

Pros

fine-grained control over each individual IdP integration. Since each IdP is natively integrated, you can customise the specific UX via configuration and handle IdP requests that are not included in the OAuth standard (we'll get to that later...)
direct integration, no intermediary, straightforward architecture. You can rely on robust implementations (Google / Facebook / Amazon provides good code in their SDK) and IdP resilience and H-A.
Cost-effective: usually IdPs provide a free-tier for their api, so there aren't any costs from that side.

Cons

Difficult to scale: Each IdP has its own SDK and its own customs (someone said "standard?"), and it takes a lot of code is required to handle them. Even if you put your authentication logic into a library, you have to distribute it to all clients to get a change.
Hard to test / troubleshoot: more code, more tests. Moreover, different integrations require you to know IdP customs.

Option 2: use an OAuth Provider

Since Social IdP adhere to a standard, it's easy to abstract the specific implementation (SDK) and work with interfaces, integrating with an OAuth 2 service provider.

Pros

Just one integration, between your client and di OAuth identity platform. Less code, less test, less releases, more speed.
Easy to scale: you can add/remove IdP without impacting clients (see previous bullet)
Authentication flow configuration and governance is now centralised. You can create consistent auth flow regardless the specific IdP you support, and you can monitor it and gather metrics and statistics in one place.
You build your auth flow on standards.
There are Identity Platform as a service out there (AWS Cognito, Auth0, Google Firebase and many others)

Cons

Your integration choices are limited to IdPs supported by your OAuth provider.
Your system complexity is higher, since you add components to it.
The OAuth provider could be a single point of failure. If it is not available, you cannot offer authentication to your customers. Therefore, you need to think carefully about the reliability and scaling of your OAuth provider.

My choice: Option 2 with AWS Cognito

I'm aware that you may have many constraints and to be brief, I cannot list them all: given my context, I went with option 2 and used AWS Cognito as OAuth provider. I did a spike on Auth0 and some other services.

I decided to accept the constraints and costs of Cognito in exchange for a low-code implementation and easy setup, in other words for faster delivery, because I wasn't sure if it would be worth it.

Here my actual implementation

All you need is to:

configure your integration on Social Provider side. Here a reference for each of provider i integrated with
configure Cognito integration. Here AWS Doc for each supported providers
Integrate your application with Amazon Cognito. Cognito provides an hosted ui for the login page, but you can create your own.

Pitfalls: things to be careful about

Here I list some of the pitfalls I have encountered in this integration. This is not an exhaustive list of what goes wrong with Amazon Cognito and the social login flow, but again my personal experience or in other words things I found during my working days.

Watch out for Cognito limits

Serverless does not mean infinite, and Cognito is one of the services that best demonstrates this.

In one sentence: Cognito's scaling policy is not designed for spiky patterns.

The scaling pattern is (reasonably) tied to the size of your user pool: the more users, the more TPS provided.

But, and here comes the first pitfall, the first threshold is up to 1 million users. From 1 to 999999 users, you have the same TPS.

This means that if your login pattern is fairly consistent, you probably won't have any problems. However, if your login pattern is spiky, perhaps because your app is tied to certain time periods in some way, your app will struggle with a lot of throttling errors from Cognito.

These diagram show successful federation logins and throttling errors:

i split into two distinct diagrams for better visualisation, but i want to point out that

around 20:50 i had ~7K throttling errors and ~1.5K of success (total requests: ~8.5K)
around 21:20 i had ~6K throttling errors and ~1.4K success (total requests: ~7.5K)
around 22:30 i had ~1.3K success with ZERO throttling errors

Cognito TPS calculation rules can be found at this specific section of Cognito docs, and you have to carefully consider them.

As you can see from the successfully login metric diagram, handling the throttling exception in your app can mitigate the user impact: they would be able to successfully login anyway, but waiting a little bit more.

I decided that it could be acceptable, and i traded it for easy setup and integration with Social Providers.

Since this decision would impact our customer experience i tried to mitigate it as much as possible, for instance sending push notifications before traffic spikes to encourage users to log in and spread log in requests.

Standards are not prescriptive

I love standards, everybody should love them in engineering.

Unfortunately, sometimes for good reasons and sometimes not, giants have bias to force standards a little bit.

Apple, i'm pointing my finger at you!

First, Apple's guidelines require you to log in to Apple if you want to distribute your app in the Apple Store and your app has a social login feature. That may be a bit rude, but it's fair.

Again,Apple prescribe you that the "User cancellation" function must be accessible and clear. That is also fair.

And here Apple does not adhere to the OAuth standard: if an Apple user allows Apple to share their data with your app, some kind of association between your app and the user also takes place in the Apple system, and if a user wants to cancel from your app (also known as your user pool), this association should also be removed.

To do that, you have to invoke Apple apis to:

generate a valid access or refresh token.
invalidate the freshly generated token.

Sounds weird, but this is exactly what this doc page prescribes.

And, guess what? Cognito doesn't handle it.

Even if Cognito could handle it because it has all the information it needs, especially the private key you created on the Apple side and provided to Cognito to request the tokens, that's reasonable from a product perspective: Cognito adheres to standards and can't track every specific implementation.

But it does mean that Apple won't include your app in the store if you don't take care of it.

So let's take a look at how to implement it.

You can't implement it in the app: i used Cognito to decouple the app from auth providers, and i don't want to violate that requirement. Besides, you don't want to store your private key on the device, do you?

So you need to implement it on the backend side. My first idea was to use events to respond to the Cognito event to delete a user and trigger a lambda that calls the Apple api to delete the user on the Apple side.

As far as I know, Cognito today has

Lambda triggers: user deletion not supported
Cloudtrail tracks all management api calls, and user cancellation is a management api. But Cloudtrail event doesn't have any reference to actual user (and it saved my day in an audit session, but this is another story)
Cognito Sync: it seems to handle user deletion. Quoting:
To remove a record, either set the op to remove, or set the value to null.

This is how it looks like:

I see two problems here:

first, you have to put your Apple's private key in Cognito and in Secret Manager. Cognito can't retrieve it from Secret Manager. I raised this issue to Cognito team, keep you posted on this.
second, Cognito user cancellation and Apple user cancellation are asynchronous: what if it success on Cognito side and than fails on Apple side? User wont be in our Cognito user pool anymore, so we can't rollback the operation. So you need to handle failures, and to handle it you need to store it. Let's add a DLQ for our deletion lambda

After saving, you must analyse why the deletion failed and try again. How long can this take? It depends on the cause and your process, but until you've done that, users will still see their user associated with your app, and I'm not sure Apple would like it and approve your app submission.

You need to reverse the order of deletion, first on the Apple side and then on the Cognito side. If the Apple deletion fails, you can send an error message to the user and inform he/she that the deletion cannot be performed and they should try again later.

In the case of a Cognito error, you will have to do this later, but at least the user will not see that their user is linked to your app and Apple should be satisfied and approve your request.

Let's see how it looks like

I still see two problems here:

Again , you have to put your Apple's private key in Cognito and in Secret Manager.
Your app now is integrated with two systems: Cognito for Sign-in operation and your custom api for user deletion

Both solutions somehow solve the problem and both raise new concerns, so I had to opt for the less bad one.

I decided to implement a custom api for Apple user deletion because it can be implemented just in half our code base (not for Android version of the app), the integration is quite simple and Apple would be happy with this solution, but probably not with the alternative solution. Still an error handling mechanism still need to be implemented to catch Cognito deletion errors and to recover them.

Wrap up

I have shown you my solution to real-world problems and how you can make informed decisions by carefully weighing trade-offs between different solutions that best fit your context and constraints.

In other words, the daily work of an architect, simplified.

Architectures need to evolve as the context and constraints change over time. So always design your solutions so that they can easily evolve with them.

I hope it was useful for you!

Bye 👋!

How to handle multiple git based systems on the same Mac(hine)

Nicola Cremaschini — Fri, 08 Dec 2023 12:45:09 GMT

Hello everyone 👋 ! This is my first article, and my first tech blog actually.

In this article, I'll show you how I configured my Mac to work on repos hosted on my personal Github, on my company's Github, on Gitlab, and on AWS CodeCommit with AWS SSO integration.

Not rocket science 🚀, but something I've struggled with a bit and can be achieved in a number of ways.

Let's see how I did it.

Why do I need many version control systems?

Here's my context: I work for a company as a cloud architect and need to access my company's repositories hosted by Github Enterprise with a corporate user.

My company also has an AWS organisation, and we use AWS SSO federated with Corporate ADFS to access the organisation's accounts, and we have some repositories hosted on CodeCommit.

We also have some repositories on an old Gitlab installation that is somewhere in the basement of our office.

And finally, I have my personal repositories hosted on Github under my good old username.

I assume I'm not alone in the world with this:

Org's Enterprise Github, accessed with XYZ user
Org's Gitlab, accessed with TYU user
Org's AWS CodeCommit, accessed with QWE (federated) user
Your personal Github, accessed with ZXC user

I'm used to working with only one Mac, both for professional and personal projects. Therefore I need to pull and push code from/to different version control systems, in my case all Git-based, with different users and of course without being asked for credentials with every command.

I already had a lot of my Company's repositories downloaded to my Mac when I added all the other repositories, and I didn't want to reconfigure all my local Git repositories.

Does this sound familiar? If so, go ahead...

OSX Key Chain, dear friend...

Okay, now what?

So the OSX keychain can store your credentials and you can configure your Git client to retrieve them, right? Wrong!

Of course you can, but the OSX keychain stores credentials by hostname, which means it can store your Github company credentials OR your personal credentials, because for both the host is github.com.

With Key Chain you can store ONE credential per host, or at least I haven't found a way to store more than one.

So if you configure Git to use OSX Key Chain as a credential helper and you store the credentials of your personal Github user, everything will work fine when interacting with your personal repos.

But if you try to interact with your organisation's repositories, you'll get a 403.

HTTPS vs SSH vs GRC

The three Version Control Systems supports different protocols support

VCS	HTTPS	SSH	GRC
Github
AWS CodeCommit
GitLab

Since I had already configured many Github Enterprise repositories to use HTTPS, I considered my Enterprise Github to be the default.

Enteprise Github via HTTPS

I set up git to use OSX Key Chain as credential helper in my git global config, as follow:

[credential "https://github.com"]    helper = osxkeychain

and i use https connection when i clone repositories from there.

To edit your git global configuration, use the following command:

git config --global --edit

This snippet shows local git configuration for HTTPS connection:

[remote "origin"]        url = https://github.com/your-org/your-repo.git        fetch = +refs/heads/*:refs/remotes/origin/*

For my enterprise Github, that's enough: It uses the credentials stored in my Credentials Helper.

This way, every time I clone a repo over HTTPS and don't specify a local Git configuration, the global configuration and the OSX keychain are used for the credentials.

Personal Github via SSH

Then, for my personal Github, i set up SSH connection.

You can do the same following this guide.

You need to tell git to use that key, and here is my local git config

[remote "origin"]        url = git@your-shh-key-alias:your-user/your-repo.git        fetch = +refs/heads/*:refs/remotes/origin/*

to edit your local git config, run the following command inside your repository's root folder

git config --edit

Look the url part of this configuration:

url = git@your-ssh-key-alias:your-user/your-repo.git

You can see that I used an alias for my ssh key.

To do that, you need to edit your .ssh/config file.

Here is my .ssh/config

Host github.com-personal   HostName github.com   User git   IdentityFile ~/.ssh/github/id_rsa_personal   TCPKeepAlive yes   IdentitiesOnly yes

The Host parameter is your ssh key alias, so my url looks like

url = git@github.com-personal:*your-user/your-repo*.git

The IdentityFile is the path to your key file on disk, so it depends on where you have saved it.

Pay attention to the user parameter: this is not your Git user, but the user for the ssh connection.

This way, every time I clone one of my personal repositories, I have to edit my local Git configuration.

In my case, I have a lot more new repositories from my company than personal repositories and therefore I decided to keep HTTPS as default for my company repositories: It requires less configuration.

You know, programmers are lazy 🦥...

Gitlab via SSH

I used ssh for connecting to Gitlab too, here Gitlab doc to ssh connection.

The local configuration is exactly the same as for Github. So you need another alias for your ssh key and have to set up your local Git repository to use your alias.

CodeCommit via GRC

As i said, our AWS Accounts are part of our AWS Organisation and we access them with AWS SSO federated with our ADFS.

Googling around i found this documentation page from AWS that starts with:

If you want to connect to CodeCommit using a root account, federated access, or temporary credentials, you should set up access using git-remote-codecommit.

Bingo! 🎯

Here how it works:

First you have to install git-remote-codecommit python package.

This documentation page tells you how.

Then you need to create a local AWS profile tied to the account/region where your repositories are stored and configure AWS SSO.

Just follow this guide from AWS to do so.

Finally, you have to clone your repository using that package.

Here's mine local git repository configuration:

[remote "origin"]        url = codecommit://your-aws-profile@your-repo        fetch = +refs/heads/*:refs/remotes/origin/*

let's take a look to the url parameter:

url = codecommit://your-aws-profile@your-repo

codecommit is the protocol, it tells git to use our python package.

your-aws-profile refers to your AWS Profile name.

This way, Git commands are executed on this repository with the Python package and with my AWS profile.

If you have a repository in different accounts, you need to set up different profiles and use them accordingly.

Wrap up

Here is how it looks like at the end of the story:

With the above configuration, I can easily switch between local folders and use Git without being asked for any credentials and without 403. 🎉

I tried to keep the configuration overhead low for the most commonly used version control system (Enterprise Github) and I used GRC for CodeCommit because I don't want to specify a profile when I run my Git command.

i have aliases and want to use them the same way regardless of the account, and this configuration hides the profile specification from the commands.

But if I didn't already have many cloned repositories with HTTPS, I would use SSH for enterprise Github as well and remove the OSX key chain from my configuration.

I hope this was helpful, thanks for reading!

👋👋