Doing ML effectively at startup scale

The Post-Modern Stack

Joining the modern data stack and the modern ML stack

8 min read · Sep 22, 2021

Overview

As all good things come to an end, we have reached the last episode of our series, with a shiny new open-source repo that brings together many of the themes we discussed in the previous episodes, and which we recap here before we start:

MLOps without much Ops: where we introduce the principle of focusing on what matters;
ML and MLOps at reasonable scale: where we explain the “reasonable scale”. In-between planetary-scale infra for Tech Giants and no-code scenarios, there is a world of exciting work for sophisticated practitioners: we call it “the reasonable scale”, and it’s indeed where most dataOps and MLOps happen;
Hagakure for MLOps: where we discuss the principles of modern MLOps, and how small teams can be productive thanks to a blooming open ecosystem;
The modern data pattern: where we present (batteries, data and open source code included) a pragmatic solution to the problem of ingesting, transforming and querying data at scale.

If you followed us closely, Episode 4 brought us to the limit of Data Land and at the start of ML Land: it is now time to close the circle, and take those nicely transformed data rows into a machine learning model serving predictions to users.

TL;DR: in this post, we will combine once again technical content with organizational wisdom:

We introduce the “Post-Modern Stack”, that is, a deconstruction (see the pun now?) of the modern data stack we previously shared. We re-purpose the DataOps tools from Episode 4 (Snowflake + dbt) to power our favorite MLOps setup, a Metaflow pipeline seamlessly combining local and cloud computing, and bridging the gap between data, training and inference in a serverless fashion.
We go back to where we started, and discuss again, in the light of what we have learned, the basic principles of MLOps without Ops, and how they shape (or should shape) many traditional discussions about software organizations: staffing, build vs buy, etc.

Clone the repo, check the video, buckle up, and join us for one last trip together.

Joining the modern data stack with the modern ML stack

The modern data stack (MDS) has been consolidating a number of best practices around data collection, storage and transformation. Especially effective with structured or semi-structured data, the MDS will typically rely on three key pieces:

A scalable ingestion mechanism, either through tools or infrastructure;
A data warehouse for storage and computation (querying performances are pretty impressive at reasonable scale);
A transformation tool for DAG-like operations over raw data, possibly based on SQL as the lingua franca for different personae (data engineers, analysts, ML folks).

The web is full of examples (including our own!) of how to set up the MDS. However, they may leave you wondering what happens “on the ML side”: once data is pre-aggregated and features pre-computed, how is that consumed downstream to produce business value? This post sets out to answer this question by proposing a lightweight toolchain that leverages Metaflow as the backbone for ML operations: the community reaction to the “Bigger boat” repo has been overwhelmingly positive, but we thought we should also put forward a low-touch alternative for teams that want a quicker start.

The Post-Modern stack

As a flow chart is worth a thousand README, our Post-Modern Stack (PMS) looks like this:

We have four main “functional” phases, two in Data Land, two in ML Land:

Storage: we use Snowflake to store raw data — we re-use the fantastic open dataset for ecommerce released by Coveo last year, containing millions of real-world anonymized shopping events.
Transformation: we use dbt as our transformation framework — we run a DAG-like series of SQL queries inside Snowflake, and prepare our raw data to be consumed by Python code.
Training: we use a deep learning framework, Keras, to train a sequential model for shopping recommendations — given a list of products the shopper interacted with, what is the most likely next interaction?
Serving: we use SageMaker as our PaaS serving platform, so that i) we can use Python code to trigger the deployment, and ii) by using AWS, we get great interoperability with Metaflow (i.e. model artifacts are already in S3).
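To make the training task above more concrete, here is a minimal sketch of how an ordered shopping session could be turned into (context, next product) examples for a sequential model. It uses only the standard library: the function name, the `max_len` parameter and the product IDs are hypothetical and chosen for illustration, and this is not the code from the repo (which trains the model with Keras).

```python
from typing import List, Tuple


def make_next_item_pairs(
    session: List[str], max_len: int = 3
) -> List[Tuple[List[str], str]]:
    """Turn one ordered session into (context, next_item) training pairs."""
    pairs = []
    for i in range(1, len(session)):
        # Context: at most max_len products seen right before position i.
        context = session[max(0, i - max_len):i]
        pairs.append((context, session[i]))
    return pairs


# Hypothetical session: product IDs in order of interaction.
session = ["sku_a", "sku_b", "sku_c", "sku_d"]
pairs = make_next_item_pairs(session, max_len=2)

for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the question from the Training phase: given the products seen so far, which one comes next? A single-event session yields no pairs, since there is nothing to predict.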

The PMS is not particularly more complex than your vanilla Metaflow pipeline: by delegating aggregations to Snowflake, distributed computation is abstracted away for the reasonable scale; by introducing support for dbt, the end-to-end scientist can prepare her own features and version her dataset in one move; by using Metaflow, we can run all the Python code we want, where we want it: we can join dataOps and MLOps in a unified, principled manner, and we can pick and choose where hardware acceleration is needed.
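The idea of delegating aggregations to the warehouse and letting Python consume the prepared rows can be sketched in a few lines. In this toy version, an in-memory SQLite database stands in for Snowflake, and plain SQL strings stand in for the models dbt would manage: the table name, column names and events are all assumptions made for illustration, not the schema used in the repo.

```python
import sqlite3

# In-memory SQLite is a stand-in for the warehouse (Snowflake in the post).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_events (session_id TEXT, product_id TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("s1", "sku_a", 1),
        ("s1", "sku_b", 2),
        ("s2", "sku_c", 1),
        ("s1", "sku_c", 3),
        ("s2", "sku_a", 2),
    ],
)

# Aggregation runs inside the database, not in Python.
counts = conn.execute(
    "SELECT session_id, COUNT(*) FROM raw_events "
    "GROUP BY session_id ORDER BY session_id"
).fetchall()

# Transformation: order events within each session, again in SQL.
rows = conn.execute(
    "SELECT session_id, product_id FROM raw_events ORDER BY session_id, ts"
).fetchall()
conn.close()

# Python only consumes the prepared rows: one product sequence per session.
sessions = {}
for session_id, product_id in rows:
    sessions.setdefault(session_id, []).append(product_id)

print(counts)
print(sessions)
```

The division of labor is the point: the heavy lifting stays in SQL, and the Python side receives rows that are already ordered and ready to become model inputs.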

The PMS is a zero-fat, no-nonsense but fully realistic pipeline to start turning raw data into real-time predictions.

Better still, you have a pipeline that’s heavy on open-source and light on people’s time: development, training and deployment can be performed by one ML engineer without any infrastructure knowledge, and without asking for DevOps support.

Before exploring the full consequences of this setup for your organization, not just your code, it may be a good moment to mention some hidden gems for the reader interested in the nerdy details:

dbt cloud: dbt offers a SaaS version of its tool for collaboration within and across teams. To support this scenario, we include the possibility of running the same flow by connecting to a dbt cloud instance: while it is a bit less intuitive from a flow perspective, we do believe there is value in the cloud offering, especially in a bigger organization with a more diverse set of people involved with the data stack.
Model testing: we include a testing step before deployment to raise awareness of the importance of thorough testing. We combine the power of RecList with Metaflow cards to show how open-source software can help develop more trustworthy models, and more inclusive documentation. Stay tuned for a deeper integration in the near future!
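As a rough illustration of what a testing step before deployment can look like, the sketch below computes a hit rate on held-out examples and only allows deployment above a threshold. This is a plain-Python stand-in and does not use the RecList or Metaflow APIs: the function names, the toy model, the test cases and the thresholds are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[List[str], str]  # (context products, true next product)


def hit_rate_at_k(
    predict: Callable[[List[str]], List[str]],
    test_cases: Sequence[TestCase],
    k: int = 3,
) -> float:
    """Share of test cases whose true next product is in the top-k predictions."""
    if not test_cases:
        return 0.0
    hits = sum(1 for context, target in test_cases if target in predict(context)[:k])
    return hits / len(test_cases)


def should_deploy(score: float, threshold: float) -> bool:
    """Gate: deploy only if the offline score reaches the threshold."""
    return score >= threshold


def popularity_model(context: List[str]) -> List[str]:
    # Toy model (hypothetical): ignores the context, returns a fixed ranking.
    return ["sku_a", "sku_b", "sku_c"]


test_cases = [
    (["sku_x"], "sku_a"),
    (["sku_a"], "sku_b"),
    (["sku_b"], "sku_z"),
    (["sku_c"], "sku_c"),
]

score = hit_rate_at_k(popularity_model, test_cases, k=2)
print("hit rate@2:", score)
print("deploy?", should_deploy(score, threshold=0.6))
```

A single aggregate metric is only the simplest possible gate; behavioral test suites go further by checking how the model behaves on specific slices and scenarios.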

In a moment of blooming, but also confusing, growth of the space, we hope that our open stack will provide a reliable first step for teams testing the MLOps waters, showing how a few simple pieces go a very long way towards building ML systems at scale.

It may not be the end of your journey, but we do believe it can make for a very good start.

MLOps and peopleOps

If you recall our landscape overview, teams operating at reasonable scale are either small, fast-growing startups, or teams starting up an ML practice in a big but traditional company (a startup inside an enterprise, if you will): velocity to close the feedback loop is all they want, so NoOps is what they need. In particular, our approach to the ML life-cycle highlights the importance of not spending upfront engineering time to support a scale and sophistication that is certainly not needed at day 1 (and maybe not even at day 1000).

Compared to “toy-world” tutorials, our design has the advantage of growing with you: if you do need to swap X for Y at day 1000, the rest of the tools can keep playing nicely with each other.

We want to conclude our series highlighting some implications of this approach on how organizations work and conceptualize data and ML development for their products.

Efficiency beyond headcount. Consider traditional metrics, such as R&D headcount: a modern MLOps approach can call into question some well-established principles revolving around them. For instance, adopting a modern MLOps approach means that your Cost of Goods Sold (COGS) may include a larger AWS bill, yet the direct labor involved in the production of goods and services will arguably be lower. It also means that traditional metrics such as R&D headcount or number of patents filed may need to be reconsidered, and different benchmarks may be required. As the world of tech is changing rapidly, our approach to metrics should take that into account.
Flexible verticality. ML is becoming an important product component for many companies (and Coveo is certainly one of them). The uncomfortable truth about being able to embed ML capabilities in a product is that ML engineers need to be trained in the business problem as much as they are in hyperparameter optimization. With this in mind, having 10+ data scientists in a horizontal independent unit might not be the right way to go, as it detaches them from the field and slows the feedback loop between data and decisions. Alternatively, one might want to embed the ML ninjas directly into the business units so they can learn the business problems first-hand. The only trouble there is that if the business units are not prepared to absorb the ML engineers, they easily end up having less impact than one would expect. Adopting strong MLOps practices is a way to make the verticalization of the ML team more graceful, as business units can absorb data scientists more efficiently. The MLOps at “reasonable scale” topology sits midway between vertical and horizontal (a T shape, if you will): some horizontal components are in place to make everybody productive and to re-use knowledge and expertise (e.g. widespread Metaflow adoption); but then features are developed vertically within each line of business, taking into account the specificity of the use case.
Retain talent: stay away from infrastructure. Arguably the most important consequence of this approach is that MLOps can serve as part of the proposition to attract and retain critical talent. Trading more computing for less human effort will result in a small, happy ML team that is significantly better than a bigger, less focused group. Most technical talent gets excited about doing cutting-edge work with the best tools, focusing on challenging problems and seeing the impact of their work in production. Without the right MLOps practice in place, top talent will quickly become frustrated by working on transactional tasks and not seeing their work have a tangible business impact. So, a possibly larger AWS bill is often offset by a higher retention rate and greater ML productivity. As McKinsey put it in an article on “the Great Attrition,” companies are wanting the best, and keeping the worst, and one of the main reasons for turnover of ML practitioners is devoting a sizable portion of their time to low-impact tasks, such as data preparation and infrastructure maintenance.

Finally, adopting our MLOps approach will also clearly impact the strategic decisions made by CFOs and make them happier.

There’s a popular idea that an efficient way of managing R&D spending entails reducing infrastructure costs. But this is often a misleading way of looking at things. Buying rather than building can result in more accurate estimates and predictions of COGS, especially for less mature and more experimental lines of business: time is proverbially money, and infrastructure may look much cheaper when seen in the light of the opportunity cost of slow exploration. Moreover, we have often found that the actual costs of building and maintaining infrastructure in a startup are less predictable over time than most people would think. Not only is it extremely easy to underestimate the total effort required in the long run, but every time you create a team for the sole purpose of building a piece of infrastructure you introduce the quintessential unpredictability of the human factor.

Want to chat more about the future of ML “at reasonable scale”?

Our series stops here, but we would love to hear from you: get in touch with us, and follow us on Medium and LinkedIn (here, here and here) to see what our next “reasonable project” is.

See you, MLOps cowboys!

Acknowledgements

Special thanks to Sung Won Chung from dbt Labs, Hugo Bowne-Anderson, Gaurav Bhushan, Savin Goyal, Valay Dave from Outerbounds, and Luca Bigon and Patrick John Chia from Coveo for helpful comments on the draft and the repo.

TDS Archive