We partnered with DataBlade (DB) this year to build out a serverless data science solution for one of our clients’ main business needs, revenue forecasting. In this post I will discuss our architecture for the final product and how we overcame some implementation obstacles.
Problem
The client needed a way to forecast revenue for their business in order to better plan for resource and budget allocations. DB had already created a revenue forecasting model to solve that problem, based on the client’s historical data and various assumptions input by the end users. Now that DB had been acquired, however, the client needed a way to execute, access, and evolve the forecasting model outside of the DB platform.
Solution
Our solution was twofold:
- Deploy a standalone implementation of the model to be accessible on-demand
- Create a web application that can interact with the model, including managing versions, executing runs with different inputs, and downloading results
In the next two sections I will discuss each of these solutions at a high level and dive into the specific issues we faced in getting them working.
Standalone model
The forecasting model is a series of Python scripts that connect to the client’s database for accessing historical data. A user can submit various inputs, such as growth projections, that affect the model results.
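To make the discussion below concrete, here is a heavily simplified, hypothetical sketch of the shape such an entry point might take: user assumptions arrive as parameters, historical data comes from the database, and results land in S3. The names and the toy projection logic are illustrative, not the client’s actual code.

```python
# Hypothetical sketch of a containerized model entry point (all names are illustrative).
import json
import os

import boto3
import pandas as pd
from sqlalchemy import create_engine


def run_forecast():
    # User-supplied assumptions are passed in as environment variables by the job runner.
    growth_rate = float(os.environ.get("GROWTH_RATE", "0.05"))
    horizon_months = int(os.environ.get("HORIZON_MONTHS", "12"))

    # Pull historical revenue from the client's database.
    engine = create_engine(os.environ["DATABASE_URL"])
    history = pd.read_sql("SELECT month, revenue FROM revenue_history", engine)

    # Toy projection: compound the last observed month forward.
    last = history.sort_values("month")["revenue"].iloc[-1]
    forecast = [last * (1 + growth_rate) ** m for m in range(1, horizon_months + 1)]

    # Write results where the web app can pick them up later.
    boto3.client("s3").put_object(
        Bucket=os.environ["OUTPUT_BUCKET"],
        Key=f"runs/{os.environ['RUN_ID']}/forecast.json",
        Body=json.dumps(forecast),
    )


if __name__ == "__main__":
    run_forecast()
```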
Per DB’s suggestions, we used AWS as our primary stack, specifically:
- CodeBuild, linked directly to a GitHub repository, for automatically packaging model code updates into Docker containers
- Elastic Container Registry for managing the Docker container versions
- Batch for managing actual executions of the model, on demand
- S3 for storing cached data and outputs
- Virtual Private Cloud (VPC) for securing the execution environments
Setup was straightforward for the most part. We figured out early that Batch uses Elastic Container Service under the hood, which requires public IPs, so some subnet configuration was needed, and there was the usual iteration to get IAM roles and permissions working together correctly. One pain point was the lack of precise error logs and associated documentation when Batch executions failed. In certain cases, a Batch job can get “stuck” in the RUNNABLE state, where according to the AWS documentation the compute environment is either waiting for sufficient resources to run the job or the “log drivers” are not set up correctly. We found this StackOverflow answer more helpful, as it points out that the issue can also be due to permissions, internet access, an image problem, or EC2 limits.
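When a job is stuck like this, the console itself does not say much, so a few lines of boto3 can at least surface which jobs are stuck and which image and role they were launched with, which helps narrow down the permissions, image, and networking suspects. The queue name below is a hypothetical stand-in:

```python
# Inspect jobs stuck in RUNNABLE to see what they were launched with.
import boto3

batch = boto3.client("batch")

# "forecast-queue" is a hypothetical job queue name.
stuck = batch.list_jobs(jobQueue="forecast-queue", jobStatus="RUNNABLE")
job_ids = [j["jobId"] for j in stuck["jobSummaryList"]]

if job_ids:
    for job in batch.describe_jobs(jobs=job_ids)["jobs"]:
        container = job.get("container", {})
        print(job["jobName"], job["status"])
        print("  image:", container.get("image"))
        print("  jobRoleArn:", container.get("jobRoleArn"))
        print("  reason:", container.get("reason"))
```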
We eventually figured out that our VPC needed additional configuration, which led to another round of research and iteration. Our VPC setup ended up being a two-subnet system:
- Subnet A as a “public” subnet that routed requests through an Internet Gateway and contained a NAT Gateway with an Elastic IP Address (a fixed IP, so that external servers such as TST’s database server could whitelist access)
- Subnet B as a “private” subnet where we deployed our internal services like Batch execution jobs, and that routed any outbound requests to the NAT Gateway in Subnet A
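For reference, the routing arrangement above can be expressed in a handful of boto3 calls. This is a simplified sketch of the layout, not our actual provisioning code, and the CIDR blocks are illustrative:

```python
# Simplified sketch of the two-subnet VPC layout (CIDR blocks are illustrative).
import boto3

ec2 = boto3.client("ec2")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
public_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]
private_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")["Subnet"]["SubnetId"]

# Subnet A ("public"): route outbound traffic through an Internet Gateway.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_id)

# NAT Gateway in Subnet A with an Elastic IP, giving outbound traffic a fixed
# address that external servers can whitelist.
eip_alloc = ec2.allocate_address(Domain="vpc")["AllocationId"]
nat_id = ec2.create_nat_gateway(SubnetId=public_id, AllocationId=eip_alloc)["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Subnet B ("private"): route outbound traffic through the NAT Gateway in Subnet A.
private_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=private_rt, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat_id)
ec2.associate_route_table(RouteTableId=private_rt, SubnetId=private_id)
```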
Once everything was set up, this architecture made it easy to execute model runs in Batch, and to push model code updates as new container versions that Batch could readily pick up.
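Kicking off a run comes down to a single submit_job call against the job queue. The queue and job definition names below are hypothetical stand-ins, and the environment variables mirror the hypothetical entry point sketched earlier:

```python
# Kick off a model run in AWS Batch (queue and job definition names are hypothetical).
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="revenue-forecast-run",
    jobQueue="forecast-queue",
    jobDefinition="forecast-model:3",  # points at a specific image version in ECR
    containerOverrides={
        "environment": [
            {"name": "GROWTH_RATE", "value": "0.07"},   # user-supplied assumption
            {"name": "HORIZON_MONTHS", "value": "12"},
            {"name": "RUN_ID", "value": "example-run"},
        ]
    },
)
print("Submitted job:", response["jobId"])
```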
Web application
The standalone model at this point could be executed via the AWS web interface or the command line. For the end solution, however, we needed a simpler UI that abstracted away all of the details and allowed an easy click–submit–download flow. We stuck with AWS as the stack service provider for ease of integration and billing. A traditional path for a web app on AWS would have been a straight-up EC2 instance, perhaps with Elastic Beanstalk to simplify deployment. After some research we came across the idea of a “serverless” web app, and it seemed to fit our requirements for this project.
Serverless architecture is a newish computing pattern enabled by the ubiquity of cloud computing. Instead of having a server running continuously 24/7 (e.g. an AWS EC2 instance or a DigitalOcean Droplet), resources are spun up on demand to fulfill each request. The major advantages for us were not having to manage servers ourselves, and the cost savings from not keeping servers running when no one was using them. The downsides were the extra initial implementation work and a latency increase on first load (cold starts), which was acceptable given our requirements.
On AWS, the services we used for our serverless web app were:
- Lambda as the primary event-driven compute resource
- API Gateway for routing requests
- S3 for serving static assets (images, css, and js files)
- RDS for persistent data storage and service
Our web app stack is Python, Flask, and Postgres at the core, and we chose a Python–AWS deployment framework called Zappa that made deployment seriously easy: a single zappa update command.
The Flask web app code could be written in almost exactly the same way as with a “serverful” architecture. Zappa only required a settings file with various environment variables, and a @task decorator for longer-running functions that required asynchronous execution. We figured out this second part after our initial naive Flask deployment threw a lot of “timeout” exceptions, as Lambda was limiting the execution time of a single request. If any request needed more than 30 seconds to return, we had to separate the offending function out into an asynchronous task running as its own Lambda function. Examples of such asynchronous tasks include sending emails, submitting Batch job requests, and refreshing persistent data.
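To illustrate, here is roughly what that pattern looks like; the route and function body are placeholders rather than our actual code, and depending on the Zappa version the decorator is imported from zappa.asynchronous or the older zappa.async module:

```python
# Sketch of pushing a slow operation out of the request cycle with Zappa's task decorator.
# (Older Zappa releases expose this as zappa.async instead of zappa.asynchronous.)
from flask import Flask, jsonify
from zappa.asynchronous import task

app = Flask(__name__)


@task
def refresh_persistent_data():
    """Runs in its own Lambda invocation, so the web request is not held open."""
    # ... long-running refresh logic lives here (placeholder) ...
    pass


@app.route("/refresh", methods=["POST"])
def refresh():
    refresh_persistent_data()  # returns immediately; the work continues asynchronously
    return jsonify({"status": "refresh started"}), 202
```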
Putting it all together
The final product is a serverless, on-demand data science product with a clean web UI, a scalable job execution backend, and a simple deployment process. We took a risk in using the less familiar “serverless” compute architecture, and had to work through some obstacles along the way due to sparse documentation (mostly on the AWS side), but it was a worthwhile experience that sets us up for future projects. We will definitely continue to use the serverless architecture when the requirements fit.
If you are working on your own implementation and have questions, or would like help in building your own serverless data science solution, feel free to reach out.