Data Analytics in Early-Stage Startups

The whole process, the challenge, and the culture.

Faisal Malik Widya Prasetya
9 min read · Sep 2, 2020

Why early-stage startups?

As a fresh graduate, I was faced with a choice between working at a startup or at a corporate company. That sent me googling "startup vs corporate", and I ended up choosing a startup. But startups also differ by stage, from pre-seed through the alphabetical funding series. I remember Jack Ma once gave the career advice to follow a good boss when you're starting out, and that's only really possible in a small company, where I can reach the "boss" directly.

Initial Condition

When I first joined the startup, they had no idea how to implement good data analytics. Even in the interview they seemed confused about what to ask, because no one in the company knew anything about data analytics. So I had to build the whole data analytics process without any supervision or senior guidance, from the data pipeline to insight generation. The first question that came to my mind was: what should I do first? It turned out the answer was to learn about the company's data first.

Learn The Company Data

Most fresh graduates are used to ready-to-use datasets (like on Kaggle), but in reality the data can be anywhere and is rarely provided in one place. That makes learning about company data hard, because we don't know where to start. This part is tricky and also the most important one: the question is not what data the company has, but what data the company needs. It should start from the business point of view, since the business users are the ones who make decisions and take action (they are the ones who will use the insights from the data analytics). So, list all the actions each business unit can take and all the questions they need to answer. These can be their KPIs, North Star metrics, OKRs, or anything else that needs data to be achieved.

After those actions and questions are listed, check the data the company already has: is it sufficient to answer those questions and support those actions? If not, keep looking, and don't be shy about asking the team whether there's any way to gather the missing data, because data can come from anywhere and be gathered in many ways. It can exist almost anywhere, but most commonly in the places below:

  • Backend Data (Transactional Data inside Database)
  • Frontend Data (Tracking data on the platform)
  • Third-Party Data (Data generated by service used by the company)
  • External Data (Necessary data outside the company system)

Backend data is not a big deal, as it is already structured, and it is by far the most useful data compared to the others. We just have to learn its structure and the business logic behind it. Backend data can be stored in several kinds of databases: RDBMSs like MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, Oracle, and so on, or NoSQL databases like MongoDB, Redis, CouchDB, and so on. Understanding how they differ will be useful.
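
As an illustration, here is a minimal sketch of how you could start mapping a backend database's structure with SQLAlchemy; the connection string and table contents are placeholders, not an actual setup.

# A minimal sketch for exploring an unfamiliar backend database.
# The connection string below is a placeholder; swap in real credentials.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql+psycopg2://user:password@db-host:5432/app_db")
inspector = inspect(engine)

# List every table so you can map it to a business process,
# then walk its columns and foreign keys to learn the business logic.
for table_name in inspector.get_table_names():
    print(f"\nTable: {table_name}")
    for column in inspector.get_columns(table_name):
        print(f"  {column['name']}: {column['type']}")
    for fk in inspector.get_foreign_keys(table_name):
        print(f"  FK -> {fk['referred_table']} ({fk['referred_columns']})")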

Frontend data is harder to get, as not every platform provides tracking, though most do. The structure is quite messy but can be tidied up in some way. It comes in handy when we want to analyze user behavior on the platform.
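
As a rough sketch, nested tracking events can usually be flattened into a table with pandas; the event payloads below are made up for illustration, not any specific tracker's schema.

# A small sketch of tidying nested frontend tracking events with pandas.
# The event structure is a made-up example.
import pandas as pd

raw_events = [
    {
        "event": "add_to_cart",
        "timestamp": "2020-08-01T10:15:00Z",
        "user": {"id": 42, "platform": "android"},
        "properties": {"product_id": "SKU-123", "price": 15000},
    },
    {
        "event": "checkout",
        "timestamp": "2020-08-01T10:20:00Z",
        "user": {"id": 42, "platform": "android"},
        "properties": {"cart_size": 3, "payment": "bank_transfer"},
    },
]

# Flatten the nested fields into tabular columns for behavior analysis.
events_df = pd.json_normalize(raw_events, sep="_")
print(events_df.head())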

Third-party data is usually provided by the third-party service itself in a nice form, but sometimes it isn't provided at all or is limited. As technical people we usually want to work around those limits with engineering, but most of the time it can be solved by negotiating with the third party, so be open to that possibility. Third parties usually expose their data to us through a dashboard, but if we want to integrate it with other data, we need to access it through their API.
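
As an illustration, pulling third-party data over a REST API often looks something like the sketch below; the endpoint, token, and response fields are hypothetical placeholders rather than any specific vendor's API.

# A hedged sketch of pulling third-party data through a REST API.
# The base URL, token, and response shape are hypothetical placeholders.
import requests
import pandas as pd

BASE_URL = "https://api.example-vendor.com/v1/transactions"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}

rows = []
page = 1
while True:
    response = requests.get(BASE_URL, headers=headers, params={"page": page})
    response.raise_for_status()
    payload = response.json()
    rows.extend(payload["data"])          # assumed paginated response shape
    if not payload.get("has_next_page"):  # assumed pagination flag
        break
    page += 1

df = pd.DataFrame(rows)  # now it can be joined with other data sources
print(df.head())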

External data can be anything and come from anywhere. Before learning this data, first make sure it will actually be needed later and bring real value. It can be an Excel sheet, data on some random website, or data from other companies. Sometimes it's the easiest data to get, and sometimes the hardest.
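
As a rough illustration, pandas already covers the most common external sources, such as spreadsheets and tables on public web pages; the file name and URL below are placeholders.

# A small sketch of ingesting external data with pandas.
# The file path and URL are placeholders for whatever external source you need.
import pandas as pd

# An Excel sheet handed over by another team or company (needs openpyxl).
partner_df = pd.read_excel("partner_report.xlsx", sheet_name=0)

# A table scraped from a public web page (needs lxml installed).
tables = pd.read_html("https://example.com/public-statistics")
external_df = tables[0]

print(partner_df.shape, external_df.shape)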

Determine The Flow and The Tech Stacks

In an early-stage startup you may be the only data analytics person, so they will let you decide how the data analytics flow works. There are a lot of resources for learning about data analytics flows; the real problem is choosing the tech stack. The considerations can be the company budget, your ability to learn the tools, the pros and cons of each stack, and so on. There are two flows I have built so far (Data Engineering and Data Analysis), and I'm planning to create two more (Data Science and ML Engineering).

Data Engineering Flow and Tech Stacks

At first, I used a simple VM server and some Python scripts to perform the ETL jobs. The jobs were scheduled with crontab, and the data sources were connected using credentials saved inside the server. As we evolved from batch processing to micro-batch processing, we also started saving the data load state so that extraction is limited to data that hasn't been extracted yet. We haven't hit a case that requires stream processing, so I'd rather pick a relevant yet reliable approach. After several trials and errors, we finally settled on the approach below (a simplified DAG sketch follows the tech stack notes).

Our Data Engineering Flow
  1. Metadata gives the latest state of the data in the data warehouse, so already-loaded data won't be extracted again. Metadata also gives the connection credentials to access the data sources.
  2. The Extractor extracts data from the data source based on the given state and connection credentials.
  3. The Extractor stores the extracted raw data in the data staging area*.
  4. The Transformer reads the raw data from the data staging area and transforms it into the desired format.
  5. The Transformer stores the transformed data in the data staging area.
  6. The Loader reads the transformed data from the data staging area.
  7. The Loader loads the data into the data warehouse.
  8. The Loader sends the latest state of the data in the data warehouse to the Updater.
  9. The Updater writes that latest state back to the metadata.

Workflow Orchestrator: Apache Airflow
Metadata: Airflow Variable and Connection**
Extractor: Custom Operator (Pandas, Requests, SQLAlchemy)
Data Source: RDBMS, NoSQL, REST API, Flat Files
Data Staging: Google Cloud Storage, mounted to the Kubernetes Pod using FUSE and stored as Feather***
Transformer: Custom Operator (Pandas DataFrame)
Loader: Custom Operator (Pandas GBQ)
Data Warehouse: BigQuery

*Airflow has XCom for inter-task data communication, but it isn't recommended here because of its size and serialization (pickle) limits, so we use data staging instead. XCom is still used by the Loader to send the data load state to the Updater.

**Airflow encrypts the connection credentials and the data state stored in Variables with a Fernet key, so it should be secure.

***We use Feather as the data staging format because of its very fast read and write performance.
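
To make the flow above more concrete, here is a simplified sketch of how such a DAG could look, using plain PythonOperators instead of our actual custom operators; the table names, connection ID, Variable key, and staging path are illustrative placeholders, and the imports follow Airflow 2.x naming.

# A simplified sketch of the Extractor -> Transformer -> Loader -> Updater flow.
# Table names, connection IDs, Variable keys, and paths are placeholders.
# Requires pandas, pyarrow (Feather), pandas-gbq, and GCP credentials.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.hooks.base import BaseHook
from airflow.models import Variable
from airflow.operators.python import PythonOperator

STAGING_DIR = "/mnt/staging"  # GCS bucket mounted into the pod via FUSE


def extract(**context):
    # Metadata: last loaded state and connection credentials from Airflow.
    last_id = int(Variable.get("orders_last_id", default_var=0))
    source_uri = BaseHook.get_connection("source_db").get_uri()
    df = pd.read_sql(f"SELECT * FROM orders WHERE id > {last_id}", con=source_uri)
    df.to_feather(f"{STAGING_DIR}/orders_raw.feather")


def transform(**context):
    df = pd.read_feather(f"{STAGING_DIR}/orders_raw.feather")
    df["created_at"] = pd.to_datetime(df["created_at"])  # example cleanup step
    df.to_feather(f"{STAGING_DIR}/orders_clean.feather")


def load(**context):
    df = pd.read_feather(f"{STAGING_DIR}/orders_clean.feather")
    df.to_gbq("analytics.orders", if_exists="append")  # pandas-gbq to BigQuery
    # Send the newest state to the Updater through XCom (small payload only).
    return int(df["id"].max())


def update(**context):
    latest_id = context["ti"].xcom_pull(task_ids="load")
    Variable.set("orders_last_id", latest_id)


with DAG(
    dag_id="orders_micro_batch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/30 * * * *",  # micro-batch every 30 minutes
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    update_task = PythonOperator(task_id="update", python_callable=update)

    extract_task >> transform_task >> load_task >> update_task

In the real pipeline these steps live in custom operators, but the moving parts are the same: Airflow Variables and Connections act as the metadata, the GCS-mounted directory holds the Feather staging files, and XCom only carries the small load-state hand-off from the Loader to the Updater.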

Data Analysis and BI Flow and Tech Stacks

When I first learned about data analysis and BI, the only tools I knew were paid services like Tableau, MicroStrategy, Power BI, and so on. But early-stage startups often prefer open source, since it's free and they usually already have cloud service credits to deploy it.

The first open-source tool I learned was Redash. It's SQL-friendly and has an amazing parameterized query feature. The visualizations are also great, since they're built on Plotly. The only downside of Redash is that it's not friendly for non-technical users: if they want to create their own analysis, they have to write SQL queries, and since most of them can't, the data visualization and reporting workload piles up on me (as a single data fighter).

So I found Metabase, which is friendlier for non-technical users. Even though its visualizations are not as polished as Redash's, the most commonly used ones are all there. Both Metabase and Redash support data governance, but I only truly implemented it with Metabase. There's one case I feel Metabase cannot handle: reports for external users, such as a periodic report to a business regulator or a dashboard for an affiliate marketing partner. Metabase's data governance can't cover this kind of use case, and having the developers build it themselves would be hard.

So I looked for a BI tool that can filter data by the user's account email, and that's Google Data Studio. Data Studio lets us filter data using DS_USER_EMAIL, which works roughly like a user session. This feature fits the external-report case well, so I picked it too. This combination of BI tools makes the flow look like this.

Our Data Analysis Flow

a. Connect the data warehouse to the BI tools.
b. Organize external reports using Google Data Studio and let external users access them with their account-based filter.
c. Create internal questions using Metabase and set up MetaBot, the Alert scheduler, and the Pulse scheduler.
d. Metabase sends the Alerts and Pulses to Slack on schedule; users talk to MetaBot to gain insights.
e. Arrange dashboards from several internal questions using Metabase.
f. Let internal users access the dashboards with their accounts, managed by the Metabase admin.
g. Create advanced visualizations using Redash and let internal users access them for in-depth analysis.

For the next update, I’ll add the data science flow and the ML engineering flow.

The Challenge

Sometimes you have to be a single fighter handling all data analytics problems, even though you know data analytics is a broad topic. You have to handle ad hoc requests from stakeholders, business users, internal users, external users, and so on, while also building a reliable data infrastructure. Since a startup's goal is growth, the data analytics infrastructure has to be as efficient as possible yet scalable. Besides that, you're expected to have enough time to experiment with the data, find hidden patterns, generate actionable insights, and so on. To sum up, there is a huge demand for data analysis from the users, but to fulfill it you also have to build a reliable data infrastructure and still find time to experiment with the data. So, how do you handle all of that?

There are a lot of tools to manage those things, like JIRA to manage priorities, Slack to communicate effectively, and Notion to create and share documentation. I use JIRA to manage epics and break them down into several backlog items; each backlog item represents a task I have to do. A task can be developing data infrastructure, fulfilling an ad hoc analysis, experimenting with data, or even writing documentation in Notion. Of course, tasks are managed by priority: I usually set urgent and important ad hoc requests as P0 (highest priority), and sometimes such a request depends on data that hasn't been extracted yet, so I also set a P0 to build the data infrastructure for that data. I almost always put data experimentation at P2, since I do it in my free time. P1 is for tasks that are important but not urgent, most likely scaling the data infrastructure (up or out) and deeper ad hoc analysis.

As a single fighter, I have to make people understand that I have limitations. So, to fulfill their analytics needs, I have to take a new approach to analytics. In an early-stage startup, the employees are usually ready to learn new things, so it's important to teach them how to do their own analytics; that's why we use Metabase, as it's quite friendly for non-technical users. If they need a simple analysis, they can do it themselves; what we have to do is prepare the data and handle the complicated, advanced analytics. In terms of business understanding, they know the actual conditions better, so they should be able to generate insights better than us. What we have to do is educate them to be data-driven.

The internal challenge I feel most is the lack of guidance. There's no other data person in the company to ask, no one to review your code, no one to tell you when you've done something wrong, and so on. But if you think about it, experience is the best teacher, and it will tell you whether you did right or wrong. I made several mistakes before I figured out the best data infrastructure and framework for us, and I was really bad at managing tasks until I figured out how to manage priorities, expectations, and communication. This condition pushes you to grow even further; the responsibility you take on can make you a better person, and I believe I can keep growing together with the company. You get to understand the whole business process rather than just the best practice for one specific task.


Faisal Malik Widya Prasetya

Data Engineer with experience at various startups and in consulting.