Mage on Google Cloud Platform
My experience building and deploying data pipelines with Mage on GCP
TL;DR
Excellent features, but they still need to stabilize.
Why Mage?
I mean, there are a bunch of options for open-source workflow orchestration tools: Airflow, Dagster, Prefect, even Luigi. But Mage is the one that fits my use case the best. I need to deploy to GCP as fast as possible on serverless infrastructure to avoid idle cost, so the ideal solution should be easy to deploy on a Cloud Run Service.
Airflow is quite expensive to deploy. On GCP in particular, the managed option is Cloud Composer, which requires a minimum of three GKE nodes to be up and running even when idle. Deployment on a VM is not scalable, while deployment on Cloud Run Service is not that straightforward. Dagster and Prefect also don’t provide straightforward documentation on how to deploy their open-source solutions on GCP Cloud Run Service.
There are several features that I’m really interested in:
- Modern UI
- Managing pipelines from the UI
- Dynamic Blocks
- Straightforward inter-block data exchange
- Built-in Data Validation at runtime
- Reusable blocks
I’m fully aware that this tool is pretty young compared to its competitors, so I’m here to share my findings.
Original Deployment Architecture
I was exploring Mage and trying to deploy it in the GCP ecosystem, so I tried to follow their deployment best practices. My takeaway is that the deployment is super easy and the infrastructure is production grade from the start. Here’s the tech stack they use (a rough gcloud sketch of the Cloud Run part follows the list):
- Cloud Run Service to host the webserver and scheduler
- Cloud SQL to host the Mage database (Postgres)
- Cloud Filestore as remote persistent storage (storing logs and variables)
- VPC Access Connector to allow Cloud Run to connect to private resources
- Cloud Load Balancer to control traffic to Cloud Run Service
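For illustration, here’s a minimal sketch of the Cloud Run piece of that stack with gcloud. The service name, region, connector and Cloud SQL instance are placeholders, and the Filestore volume-mount flags vary by gcloud version, so treat this as a starting point rather than the exact commands from Mage’s deployment templates.

```bash
# Placeholder names and region; port 6789 is Mage's default,
# and --min-instances=1 keeps the scheduler alive.
gcloud run deploy mage-server \
  --image=mageai/mageai:latest \
  --region=us-central1 \
  --port=6789 \
  --cpu=2 \
  --memory=4Gi \
  --min-instances=1 \
  --vpc-connector=mage-connector \
  --add-cloudsql-instances=my-project:us-central1:mage-postgres \
  --set-env-vars=MAGE_DATABASE_CONNECTION_URL="postgresql+psycopg2://user:pass@/mage?host=/cloudsql/my-project:us-central1:mage-postgres"

# Plus a Filestore (NFS) volume mount for logs and variables; the
# --add-volume / --add-volume-mount flags depend on the gcloud version,
# and the load balancer in front is configured separately.
```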
Problems and Solutions
Here are the various issues I ran into with this deployment:
- Problem: Filestore somehow overwrote my code when I mounted it to Cloud Run. All the code I pushed to my project directory was overwritten by Mage’s default files.
- Solution: I ended up not mounting any persistent storage to Cloud Run; instead I use GCS to store variables and logs, configured in the project metadata.yaml via remote_variables_dir and the GCS logging config (see the metadata.yaml sketch after this list). This does create a problem with the logs: Mage doesn’t stream them to GCS in real time, because GCS is an object store rather than a file system and doesn’t support appending to a file, so we have to wait until a block finishes running to see its log.
- Problem: Overly specific dependency pinning. Mage pins some of its dependencies to exact versions (in my case the issues were with `PyJWT==2.6.0` and `aiofiles==22.1.0`). That might be good practice for a “ready to use” application that isn’t designed to be extended, but we’re talking about a workflow orchestrator that will most likely need additional dependencies, and exact pins mean a huge chance of dependency conflicts when we install another Python package, especially with a strict package manager like Poetry.
- Solution: I ended up using pip with a requirements.txt file for the packages that conflict with Mage’s dependencies, since there was no visible issue running the different versions. I think Mage should consider changing its dependency pinning strategy.
- Problem: Mage doesn’t support async functions as block functions. Mage uses asyncio to wrap its pipeline executor, so a block containing an async function would result in a nested event loop, which Python doesn’t support natively.
- Solution: I ended up using nest-asyncio, which enables nested event loops in Python (see the sketch after this list). The catch is that this package is archived and no longer maintained, because the approach is considered an anti-pattern, which is probably the reason Python doesn’t support nested event loops in the first place.
- Problem: When I attach a conditional block to a dynamic block, the pipeline raises `ZeroDivisionError` whenever the conditional returns False. Mage decides the number of child blocks based on the dynamic block’s output; when the conditional fails, the dynamic block never executes and produces no output, so the denominator for the child block count becomes 0, hence the `ZeroDivisionError`.
- Solution: To work around this, I add a dummy block before the dynamic block and attach the conditional to the dummy block instead of the dynamic block (see the sketch after this list). A failing conditional then behaves the way it’s supposed to.
- Problem: The Cloud Run Job executor only uses the Mage container’s resource settings (e.g. CPU and memory). Some blocks need more CPU and memory than others, and giving every job the same flat allocation can lead to over-provisioning or under-provisioning. On top of that, Cloud Run Jobs have per-region read and write quotas, so I might also want to deploy jobs in different regions.
- Solution: I created a patch file that overwrites `mage_ai/services/gcp/cloud_run/cloud_run.py` and adds more settings such as CPU, memory and region (see the Dockerfile sketch after this list). A block with a higher memory requirement can then be given more memory without over-provisioning the other blocks, and multi-region deployment lets us avoid exhausting Cloud Run’s read and write quotas.
- Problem: Outputs being wrapped in a list. This is one of the most annoying issues. On Mage v0.9.70 all outputs were passed to the downstream block as-is, but after upgrading to v0.9.71 every output arrives as a single-element list with the upstream block’s output as that element.
- Solution: The fix itself is clear enough: take the first element of the upstream output passed to the block. The problem is that I’m afraid this behavior won’t stay consistent in future versions, so I handle both cases (a small helper for this is sketched after this list), which makes my code a degree uglier.
- Problem: The notification template exposes only a limited set of variables. My use case requires more, such as execution duration, pipeline input variables and completed blocks, which Mage’s template doesn’t provide.
- Solution: Similar to the Cloud Run Job issue, I created a patch that overwrites `/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/notification/sender.py` and passes more variables to the message template (covered by the same Dockerfile sketch after this list).
- Problem: Mage deployed on Cloud Run Service does not scale to zero. Mage has a scheduler component that needs at least one instance up and running; if we set the Cloud Run Service’s minimum instances to 0, the scheduler won’t trigger any pipeline until someone opens the webserver. That behavior is obviously not desirable, but we still want to take advantage of Cloud Run’s scale-to-zero feature.
- Solution: Luckily, Mage can run the webserver and the scheduler on separate instances. So I put the scheduler on a free-tier Compute Engine VM and the webserver on a Cloud Run Service that can scale to zero. This means zero cost is charged while Mage is idle.
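Here are a few sketches of the fixes mentioned above. First, the GCS setup that replaces Filestore: this is roughly what the relevant part of the project metadata.yaml looks like. The bucket name and credentials path are placeholders, and the logging_config keys may differ between Mage versions, so double-check the docs for the version you run.

```yaml
# project metadata.yaml (excerpt), with a placeholder bucket name
remote_variables_dir: gs://my-mage-bucket/variables

# Send block logs to GCS as well; the exact keys vary by Mage version.
logging_config:
  type: gcs
  level: INFO
  destination_config:
    path_to_credentials: /path/to/service_account.json
    bucket: my-mage-bucket
    prefix: logs
```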
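For the async limitation, this is a minimal sketch of how nest-asyncio fits into a block. The decorator import follows Mage’s standard block template; fetch_rows is a made-up stand-in for whatever async client you actually call.

```python
import asyncio

import nest_asyncio

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

# Patch the event loop so run_until_complete() can be called even though
# Mage's executor is already running inside asyncio.
nest_asyncio.apply()


async def fetch_rows():
    # Stand-in for a real async client (HTTP API, async DB driver, ...).
    await asyncio.sleep(0.1)
    return [{'id': 1}, {'id': 2}]


@data_loader
def load_data(*args, **kwargs):
    # The block itself stays synchronous; the coroutine is driven
    # through the (nested) event loop.
    return asyncio.get_event_loop().run_until_complete(fetch_rows())
```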
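For the ZeroDivisionError workaround, the dummy block is just a pass-through placed right before the dynamic block, and the conditional is attached to it instead of to the dynamic block. The decorator names follow what Mage’s block templates generate for custom and conditional blocks in the versions I used; the condition itself is a made-up example.

```python
# Dummy pass-through block, placed right before the dynamic block.
if 'custom' not in globals():
    from mage_ai.data_preparation.decorators import custom


@custom
def passthrough(data, *args, **kwargs):
    # Forward the upstream output untouched.
    return data
```

And the conditional block attached to the dummy block:

```python
if 'condition' not in globals():
    from mage_ai.data_preparation.decorators import condition


@condition
def evaluate_condition(*args, **kwargs) -> bool:
    # Made-up example: only run downstream when a runtime variable says so.
    return bool(kwargs.get('run_downstream', False))
```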
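Both patches (the Cloud Run Job executor and the notification sender) are applied the same way: a couple of COPY lines in the image build that overwrite the installed files. The patches/ directory and the image tag are my own layout; the site-packages paths are the ones mentioned above and shift with the Python and Mage versions.

```dockerfile
FROM mageai/mageai:0.9.71

# Patched Cloud Run Job service that exposes extra settings
# (CPU, memory, region, ...).
COPY patches/cloud_run.py \
     /usr/local/lib/python3.10/site-packages/mage_ai/services/gcp/cloud_run/cloud_run.py

# Patched notification sender that passes more variables
# to the message template.
COPY patches/sender.py \
     /usr/local/lib/python3.10/site-packages/mage_ai/orchestration/notification/sender.py
```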
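And for the list-wrapping change between v0.9.70 and v0.9.71, the defensive helper looks like this. Note that it cannot distinguish a wrapped output from an output that genuinely is a one-element list, so only use it where that ambiguity doesn’t matter.

```python
def unwrap_upstream_output(data):
    """Handle upstream output from both Mage v0.9.70 and v0.9.71."""
    # v0.9.70 passes the upstream output as-is;
    # v0.9.71 wraps it in a single-element list.
    if isinstance(data, list) and len(data) == 1:
        return data[0]
    return data


# Usage inside a block:
# @transformer
# def transform(data, *args, **kwargs):
#     data = unwrap_upstream_output(data)
#     ...
```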
Post Adjustment Architecture
The iterations of problems and solutions above change the Mage architecture on GCP quite significantly.
- The scheduler and the webserver are now separated: the scheduler runs on Compute Engine and the webserver on a Cloud Run Service that can scale to zero (a rough sketch of the setup follows this list).
- No persistent storage is mounted to the webserver; instead we use GCS to store the logs and the variables. This also removes the ability to edit code in the staging and production environments, since any change would neither persist nor reach the executor. All code changes must happen in development, which is a good thing.
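A rough sketch of how the split can be wired up, assuming Mage picks its role from an INSTANCE_TYPE environment variable (that’s how I remember it working; check the docs for your version). The image, region and connection URL are placeholders, with the project baked into a custom image built on top of mageai/mageai.

```bash
# Scheduler on a free-tier Compute Engine VM.
docker run -d --restart=always \
  -e INSTANCE_TYPE=scheduler \
  -e MAGE_DATABASE_CONNECTION_URL="postgresql+psycopg2://user:pass@10.0.0.5:5432/mage" \
  gcr.io/my-project/mage:latest

# Webserver on Cloud Run, allowed to scale to zero.
gcloud run deploy mage-webserver \
  --image=gcr.io/my-project/mage:latest \
  --region=us-central1 \
  --port=6789 \
  --min-instances=0 \
  --set-env-vars=INSTANCE_TYPE=web_server,MAGE_DATABASE_CONNECTION_URL=postgresql+psycopg2://user:pass@10.0.0.5:5432/mage
```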
My Key Takeaways
- Mage has various excellent features
- Their Slack community is also really active; I can ask any question and get a prompt response.
- The deployment is quite simple and the documentation is easy to follow.
- As this tool is still new and not battle-tested, some improvements are needed to ensure stability on every platform.