Software Engineer, free software enthusiast, aspiring musician, and an indie game developer prototype.

Data loss: and now what?
Jun 4, 2020
#serverless #data #dynamodb #infrastructure #postmortem

On March 29th I wrote a post about Quarantine Notes. I was really excited about the project: it had been really fun to work on and I learned a lot in the process.

A friend of mine kept pushing me to share it on the web so people could see it and give their opinions. Beyond the joy of getting the project done, I never had high expectations for it, but I decided to give it a shot and shared it on Twitter, Hacker News and Reddit. The results were, well, unexpected.

Quarantine Notes got roughly 400 visits per day, and in a few weeks we had thousands of notes there for people to read and reply to, if they wanted. It was… satisfying.


Hm... Yep, all good!

Everything was done: people were adding notes, interacting through replies and so on. There was still a to-do list of small enhancements to add, but things were fine and I started working on another project.

This new project of mine also relies on AWS services, using Lambda and DynamoDB through Serverless. So I took my notes from the previous project and started the new repository, replicating the same steps I had already done once. I deployed my new Serverless Flask API and it worked, awesome! The next morning, Quarantine Notes was broken: it couldn't reach my API due to the CORS policy. Weird. I immediately deployed what should have been the stable version and it started working again, but there were no notes to display. I had lost everything.


What I wanted to do.

After trying to understand what happened and trying to retrieve the lost data, I figured it might be worth writing about this incident as something like a postmortem: identifying the cause of the issue, how we can fix it, and what actions we can take to prevent it from happening again, either on this project or on any other one.

What Actually Happened

The Serverless framework does an amazing job of making it really easy to deploy your project to AWS, but you should understand how it works in order to avoid this kind of situation. Serverless uses AWS CloudFormation to automate the provisioning of resources in the AWS environment. All requirements and details are specified through a template that is generated from the serverless.yml file.

So basically, the serverless.yml file we create defines a Service that needs to have its resources created/updated on AWS, and to do so it uses CloudFormation.

Long story short, when working on my new project I followed the same steps I had used to create Quarantine Notes and ended up deploying a service with the same name but a completely different set of resources. Because Serverless derives the CloudFormation stack name from the service name and stage, I was basically UPDATING the existing stack with a totally different template instead of creating a separate one. That caused all the previous Resources (tables and Lambda functions) to be deleted.

Once I noticed it, I immediately deployed the correct project again, which recreated the deleted Resources, but empty. The data couldn't be restored.
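To make the collision concrete, here is a minimal Python sketch of it. It is a simplified model, not how CloudFormation is implemented: it assumes the default Serverless stack name of service plus stage, and the service name and resource names below are made up for illustration.

```python
from __future__ import annotations


def stack_name(service: str, stage: str = "dev") -> str:
    """Default stack name Serverless derives from `service` and the stage."""
    return f"{service}-{stage}"


def removed_resources(deployed: set[str], new_template: set[str]) -> set[str]:
    """Logical IDs in the deployed stack that the new template no longer declares.

    On a stack update these are deleted, unless a DeletionPolicy says otherwise.
    """
    return deployed - new_template


# Hypothetical resources, for illustration only.
quarantine_notes = {"NotesTable", "RepliesTable", "ApiLambdaFunction"}
new_project = {"ItemsTable", "ApiLambdaFunction"}

# Both serverless.yml files declare the same (hypothetical) service name,
# so both deploys target the very same stack.
same_stack = stack_name("quarantine-notes") == stack_name("quarantine-notes")
to_delete = removed_resources(quarantine_notes, new_project)

print("same stack:", same_stack)
print("deleted on update:", sorted(to_delete))
```

Giving the new project its own service name changes the stack name, so the deploy creates a new stack instead of rewriting the old one.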


AAAAAAAA

Action

One word action: BACKUP.

Back in 2017, Amazon announced DynamoDB backups and, later on, ways to schedule them. It didn't take long for Serverless to provide us with a pre-built solution for it.

Apparently you can automate your DynamoDB backups with Serverless in less than 5 minutes. I have tried it, and if you leave the deploy time out of the equation you can automate your backups in seconds.

Their approach to automating this process is really safe: it is a separate service on AWS that requires a different role with the dynamodb:ListTables and dynamodb:CreateBackup actions. You can set up a custom retention period for your tables and adjust the backup rate as needed. This service can handle multiple tables, so if you create a new DynamoDB table in any other project, just be sure to add the table's name to your backup service and you should be good to go.
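For a feel of what such a backup service does under the hood, here is a rough Python sketch using the two permissions mentioned above. This is not the Serverless solution's actual code, only an illustration with boto3's `list_tables` and `create_backup` calls; the table names, the backup naming scheme and the `handler` wiring are assumptions of mine.

```python
from datetime import datetime, timezone


def backup_name(table, now):
    """Hypothetical naming scheme: table name plus a UTC timestamp."""
    return f"{table}-{now.strftime('%Y%m%d%H%M%S')}"


def backup_tables(client, tables, now=None):
    """Create an on-demand backup for each requested table that exists.

    `client` is expected to behave like boto3.client("dynamodb").
    """
    now = now or datetime.now(timezone.utc)
    # Needs dynamodb:ListTables. A single call returns at most 100 names,
    # so an account with more tables would need pagination here.
    existing = set(client.list_tables()["TableNames"])
    arns = []
    for table in tables:
        if table not in existing:
            continue  # skip names that are not deployed
        # Needs dynamodb:CreateBackup.
        response = client.create_backup(
            TableName=table, BackupName=backup_name(table, now)
        )
        arns.append(response["BackupDetails"]["BackupArn"])
    return arns


def handler(event, context):
    """Lambda entry point, meant to be triggered on a schedule."""
    import boto3  # imported here so the helpers above work without boto3

    # "notes" and "replies" are made-up table names.
    return backup_tables(boto3.client("dynamodb"), ["notes", "replies"])
```

Passing the client in as an argument keeps the helper easy to try out locally with a fake client before pointing it at a real account.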

So, all done. Right? - Wrong.

After I realized I had lost all the data, I went after the logs on CloudWatch and I couldn't have been more frustrated. The logs didn't contain ANY useful information. The data from POST requests wasn't there because it has to be explicitly printed to STDOUT and, of course, I wasn't doing that either. That leads me to

Action Round 2


Last mile sprint. GOGOGO

It’s better to be safe than sorry. I’ve decided to print every JSON-formatted entry right after inserting it into the table. That way, if for some reason the backups don’t work, I’ll be able to do some extra work and retrieve the data from the logs. But that will only be necessary if the table gets deleted, and CloudFormation has a way to help us prevent that.
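In Python that can be as small as the sketch below. It is not the actual Quarantine Notes code: `save_note`, `log_entry` and the shape of the log line are names I made up, and `table` is assumed to behave like a boto3 DynamoDB Table resource. On Lambda, anything printed to STDOUT ends up in CloudWatch Logs.

```python
import json
import sys


def log_entry(kind, entry):
    """Print one entry as a single JSON line and return that line.

    default=str keeps the dump from failing on values such as Decimal
    or datetime, at the cost of turning them into strings.
    """
    line = json.dumps({"kind": kind, "entry": entry}, sort_keys=True, default=str)
    print(line, file=sys.stdout, flush=True)
    return line


def save_note(table, note):
    """Insert the note, then log it so it can be recovered from the logs."""
    table.put_item(Item=note)
    log_entry("note", note)
```

One JSON object per line makes it easier to filter the log stream later and replay the entries into a fresh table.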

Action Round 3

After establishing a backup routine and logging notes/replies as they are inserted into the database, there is one more thing left to do: prevent the table from being mistakenly deleted. When working on your serverless.yml file you can make use of CloudFormation attributes in their default format under the “resources” section, and CloudFormation provides us with a couple of attributes that come in handy: DeletionPolicy and UpdateReplacePolicy.

The names are self-explanatory but, just to be crystal clear: those attributes can be applied to any resource on your stack. With the value “Retain”, CloudFormation keeps the existing physical instance of a resource when the instructions are to delete it (DeletionPolicy) or replace it (UpdateReplacePolicy). For some resource types that support snapshots, a “Snapshot” value backs the resource up instead. It is another safety measure that is worth using!

That’s it. I know this was a long post but I feel that it was needed. Answering the question in the title: now we analyze it and understand what happened and why it happened. Only by doing that will we be able to work on prevention and data recovery methods.
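Since it is easy to forget these attributes on a new table, a small check can help. The sketch below is only an illustration: the dict mirrors the shape of the “resources” section of a serverless.yml, the table names are made up, the table Properties are trimmed down, and `unprotected_tables` is a helper of my own, not part of Serverless or CloudFormation.

```python
# Mirrors the "resources" section of serverless.yml (hypothetical tables,
# Properties trimmed for brevity).
resources = {
    "Resources": {
        "NotesTable": {
            "Type": "AWS::DynamoDB::Table",
            "DeletionPolicy": "Retain",       # keep the table if it leaves the stack
            "UpdateReplacePolicy": "Retain",  # keep the old table on replacement
            "Properties": {"TableName": "notes"},
        },
        "ScratchTable": {
            "Type": "AWS::DynamoDB::Table",
            "Properties": {"TableName": "scratch"},
        },
    }
}


def unprotected_tables(template):
    """Return logical IDs of DynamoDB tables missing Retain on either attribute."""
    missing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::DynamoDB::Table":
            continue
        policies = (
            resource.get("DeletionPolicy"),
            resource.get("UpdateReplacePolicy"),
        )
        if any(policy != "Retain" for policy in policies):
            missing.append(logical_id)
    return sorted(missing)


print(unprotected_tables(resources))
```

A check like this could run before each deploy and stop it when a table is not protected.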

Stay safe!