HPC Debrief: Matthew Shaxted, CEO of Parallel Works – HPCwire

In this installment of The HPC Debrief, we discuss a big topic in HPC — cluster provisioning. Getting hardware on-prem or in the cloud is often the easy part of standing up an HPC or AI cluster. Cloud technology makes it simple to access hardware, but turning that hardware into a usable and maintainable HPC or AI resource is not easy. In this interview, HPCwire talks with Matthew Shaxted, CEO of Parallel Works, to learn about creating effective on-prem, cloud, and hybrid cluster configurations.

HPCwire: Hi, this is Doug Eadline, managing editor of HPCwire. Welcome to the Executive Debrief, where we interview industry leaders who are shaping the future of HPC. Today, I’m honored to have Matthew Shaxted, CEO of Parallel Works, a company whose stated goal is democratizing HPC and AI by offering workflow technologies for R&D practitioners who use diverse computing environments. So we’re going to dig into that today and learn a lot more about it. Welcome, Matt.

Matthew Shaxted: Thanks, Doug. Happy to be here.

HPCwire: Yeah. So, I assume many people in our audience have some experience with HPC clusters: a collection of servers that can be used together to solve large problems. That is the way I like to describe them in very simple terms. What often gets lost in the top-line hardware specification for clusters, whether on-prem or in the cloud, is the software that glues everything together and actually makes them usable. And this is where the rubber meets the road for the end user. In particular, I like to learn about software solutions that don’t force users to become system administrators. I’m sure you know all about that based on your tool. To me, that is a key thing for HPC: keep people focused on their application, their research, their discovery, and don’t make them get down into the engine room of the cluster to make things work. So, with that little introduction, can you talk a little bit about your company and products and where Parallel Works started? And how do you see yourself in the HPC and AI market?

Matthew Shaxted: On your point about getting into the engine room, I use a similar analogy: we don’t necessarily want all R&D practitioners to become the plumbers, right? Other people can do the plumbing. That’s actually how I see us. Our company started about ten years ago. We’ve been focused on building a piece of software designed to do what you said: democratize HPC, move people out of a traditional terminal for accessing these systems, and make them more productive for end users. In 2015, my partner, who was at Argonne National Lab at the time, was a principal investigator on a workflow technology designed to take computing workloads that might run on a desktop and scale them up to the big iron computing machines that exist across the country. And I was a lowly engineer running simulations who wanted to scale up my runs. So we started working together and reached a conclusion. I remember the first time I ran on a large Cray machine, 20,000 cores, I got more simulation work done than I had in my entire life. And I said, you know what? Maybe there’s a market for this in this industry. That’s what started the journey. It took about six months, and by mid-2015 we had built a web-based front-end interface on top of this workflow orchestration software.

Matthew Shaxted: At the time, it was called Swift, but it’s since been replaced with other open-source, Python-based products. And we started going after opportunities. That was ten years ago. Now we’re about a 25-person company, still fairly small, run essentially as an organically grown business, and we’ve gotten into some larger engagements with organizations that see what we do and what we can do for them. We’ve been a traditional HPC company since our inception. When I say traditional, I mean running the big iron, building-size machines, and, in commercial organizations, operating clusters with HPC sysadmins. Maybe about four years ago, though, we started noticing a shift: the work that end users were doing on our platform was starting to evolve into more machine learning and AI model development and training, initially with object detection and inference, and replacing physics-based codes with prediction models. And now, especially with LLMs, we’ve seen a lot of that. The user base runs GPU clusters to bridge the gap between single-GPU, single-node runs and multi-GPU, multi-node training jobs. So that’s been interesting to watch.

HPCwire: I know your main product is called ACTIVATE, and I assume it works for on-prem systems, in the cloud, and maybe in between, to give you a consistent view of the workflow you’re trying to run. And you mentioned AI and HPC. So, with your tool, does it really matter what the user is trying to do? Is it developed enough that it doesn’t matter whether the user sits down to do GPU number crunching, GPU AI, or just CPU work? How does it play out for the user in that respect?

Matthew Shaxted: Yeah, we’re constantly building new capability to accommodate, I’d say, more diverse user bases. One thing I’ve noticed over the years, going into different communities that do computing (I was at the HPC on Wall Street event last week, and before that I was on a naval base in North Charleston), is that communities have different pieces of software for scheduling jobs on their computing resources, and it really does vary by community. A lot of the AI space is moving to microservice and Kubernetes workloads, and they need to be able to orchestrate tasks for that. So our system interacts with the scheduler layer of these systems and abstracts it away for the end users. When you want application portability between, say, traditional HPC SLURM or PBS cluster systems and more microservice-style Kubernetes systems, we’re a layer that can remove the differences between them. So, to your point, the end users come in and build a workload, whatever it may be, an ML training job or a traditional physics-based weather forecast model, and the difference between those sites doesn’t really matter; you can make them all look and feel the same. Right now, we do that by trying to unify the scheduler layer, which is the piece of software that makes the resources available and enforces fair-use policies.

Matthew Shaxted: We abstract that layer away using SLURM at the moment, primarily. We run in cloud environments on AWS, Google, and Azure, with Oracle in the works, and we run on OpenStack for on-prem, virtualized clouds. We unify those diverse interfaces with SLURM. So when you stand up one of these systems through our platform, you’re actually just talking to elastic SLURM clusters. And when I say elastic, I mean it scales compute nodes up and down as the scheduler needs them. That’s what makes it easy to move between on-prem sites, which typically run some type of scheduler like that. There are many others: LSF, PBS, Moab, Torque, etc. So the gap in moving into these other, diverse systems becomes lower, because it’s something people are already familiar with. We are having to accommodate new schedulers all the time. We’re working on a Kubernetes integration now, specifically for people running more operational AI workloads and things like that. The HPC+AI on Wall Street event talked a lot about IBM’s Symphony product, which is a low-latency task scheduler for high-volume tasks; we expect to be working in that space as well. It varies all the time, and we just try to bring them all together into one place.
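
To make the idea of “abstracting the scheduler layer” concrete, here is a minimal, hypothetical sketch of a submission shim that takes one job description and targets either a SLURM cluster or a Kubernetes cluster. It is not the ACTIVATE API; the JobSpec fields and the sbatch/kubectl plumbing are simply one way such a layer could look.

```python
# Hypothetical sketch of "scheduler abstraction": one job description, two
# backends. This is NOT the ACTIVATE API -- just an illustration of the idea.
import subprocess
from dataclasses import dataclass

@dataclass
class JobSpec:
    name: str
    command: str
    nodes: int = 1
    image: str = "ubuntu:22.04"  # only used by the Kubernetes backend

def submit_slurm(job: JobSpec) -> None:
    # sbatch --wrap submits a one-line command without needing a script file.
    subprocess.run(
        ["sbatch", f"--job-name={job.name}", f"--nodes={job.nodes}",
         f"--wrap={job.command}"],
        check=True,
    )

def submit_kubernetes(job: JobSpec) -> None:
    # Render a Kubernetes batch Job manifest and pipe it to kubectl.
    manifest = f"""
apiVersion: batch/v1
kind: Job
metadata:
  name: {job.name}
spec:
  template:
    spec:
      containers:
      - name: {job.name}
        image: {job.image}
        command: ["/bin/sh", "-c", "{job.command}"]
      restartPolicy: Never
"""
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=manifest, text=True, check=True)

# The user describes the work once; which backend runs it is a detail.
job = JobSpec(name="wrf-forecast", command="./run_forecast.sh", nodes=4)
submit_slurm(job)         # a traditional SLURM cluster
# submit_kubernetes(job)  # or a Kubernetes cluster
```

In a real platform the same idea extends to other schedulers (PBS, LSF, and so on) and to provisioning the clusters themselves, which is where most of the work lies.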

HPCwire: That gives me a little insight into the next question I was going to ask. Because you brought up SLURM, I wanted to dig a little more into that. So, in essence, when you start up one of your clusters, your SLURM nodes, particularly in the cloud, don’t exist yet until they get assigned a job?

Matthew Shaxted: Running in the cloud is generally expensive. If I were to compare a large on-prem system with doing the same thing in the cloud with, say, best-in-class networking, storage, and compute today, it could even be five times as expensive one to one. Part of the cloud story is that you don’t actually need to compare one to one, and the total cost of ownership can become less if you rightsize your infrastructure. But in the cloud, we provision SLURM clusters. AWS released its parallel cluster service last week; that’s essentially what we’re doing, and we’ve been doing it for years. We do it on AWS in a very similar way, but we do it across all the different clouds we work in.

It’s a standard SLURM cluster. It looks like a traditional SLURM scheduler with partitions and everything. You get to determine the shape of the cluster: what type of login node you want, how many partitions you want, the regions and zones and instance types, the shapes you want to use. When you log into it, and we make it easy to log in over a terminal for a conventional power user, it appears as normal partitions; SLURM just shows the nodes as idle. Except the nodes aren’t really there until you submit a job. When you submit a job, we scale them up. We go out to the cloud provider’s APIs or SDKs and spin up the nodes. It takes about 3 to 4 minutes, depending on the cloud. It connects them to SLURM, the job runs, and when the job finishes, they stick around for a little while, and then we shut them down because we’re incurring costs. Anything you touch, you’re incurring costs, so we want to be economical where we can.

HPCwire: This is actually an interest of mine because I’ve done some things with SLURM, with turning systems off when they’re not in use. So, are you using that kind of mechanism built into Slurm?

Matthew Shaxted: That is it exactly. It’s a power management plugin mechanism. Same thing. Yep.

Matthew Shaxted: It’s really just doing an intercept at the resume level of the SLURM power management plugin. When a job starts, SLURM basically says, here are the nodes I need. We go out and hit the SDK, OpenStack or any of the clouds, and get the nodes started up. SLURM runs on the remote side and connects them in, shows them as idle, runs the job, and then shuts them down; there’s another intercept at that point. So it’s not actually magic. It’s just getting all of that working reliably. As the technologies evolve and change over time, it’s a full-time job, we’ve found.
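
For readers who haven’t used SLURM’s power-saving hooks: in slurm.conf, ResumeProgram and SuspendProgram point at site-provided scripts that SLURM invokes with a compressed hostlist, and SuspendTime controls how long nodes may sit idle before they are suspended. Below is a minimal, hypothetical sketch of what a cloud-backed resume script could look like on AWS; the Name-tag lookup and pre-created instances are assumptions made for illustration, not how Parallel Works implements it.

```python
#!/usr/bin/env python3
# Hypothetical SLURM ResumeProgram for cloud-backed nodes (illustration only).
# slurm.conf would point at this script, for example:
#   ResumeProgram=/opt/slurm/bin/cloud_resume.py
#   SuspendProgram=/opt/slurm/bin/cloud_suspend.py
#   SuspendTime=600   # shut idle nodes down after 10 minutes
import subprocess
import sys

import boto3  # AWS SDK; other clouds would use their own SDKs

def expand_hostlist(hostlist: str) -> list[str]:
    # SLURM passes a compressed hostlist like "cloud-[001-004]";
    # `scontrol show hostnames` expands it to one name per line.
    out = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()

def main() -> None:
    nodes = expand_hostlist(sys.argv[1])
    ec2 = boto3.client("ec2")
    # Assumption: each SLURM node name is stored in a "Name" tag on a
    # pre-created (stopped) EC2 instance. A real system might instead
    # launch instances from a template and register them dynamically.
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:Name", "Values": nodes}]
    )
    ids = [
        inst["InstanceId"]
        for res in resp["Reservations"]
        for inst in res["Instances"]
    ]
    if ids:
        ec2.start_instances(InstanceIds=ids)

if __name__ == "__main__":
    main()
```

A matching suspend script would stop the same instances, and a real implementation also has to handle launch failures, timeouts, and nodes that never come up.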

HPCwire: Yeah, I can imagine. So, just to move back up a level here, I assume, as you mentioned, you support all the major cloud vendors. So it’s not just an AWS product at this point.

Matthew Shaxted: Well, it depends on how you define major cloud vendors. But yeah, that’s right.

HPCwire: Well, ask the cloud vendor. They’ll all tell you they’re a major cloud vendor.

Matthew Shaxted: We support AWS, Google, and Azure at the moment, with SLURM clusters on all of those. We are working on Kubernetes providers for all of them that would allow connection to existing Kubernetes clusters; I won’t talk too much about that. We are also working on an Oracle SLURM integration right now with the Oracle team. IBM Cloud, again, depends on the community you go to; they’re big in financial services, so that’s one we’re starting to look at. And then for on-prem clouds, OpenStack is where we went. For people who are operating on-prem, bare-metal resources as a cloud, we can integrate with OpenStack as the layer that does the virtualization, and then we run SLURM on top of OpenStack.

HPCwire: Got it. Okay, we’ve kind of answered my next question. Say I wanted to run a bunch of NAMD simulations, and let’s consider the cloud now. I set up my system and submit my jobs to SLURM. It goes out, starts the nodes, runs the jobs, and when I’m done, I assume it all collapses down to a single node. So when the queue is empty, that’s just your SLURM node, basically the controller.

Matthew Shaxted: It’s the equivalent of a front-end node staying around for the life of the cluster session. You or the administrators get to choose the cluster session time, whether they want it to run all the time or let the end users actually control it themselves.

HPCwire: So that makes sense. I mean, it’s a really economical way to do it.

Matthew Shaxted: Yeah. What we’ve seen following that is that different organizations want their users to interact with these clusters in different ways. Our platform allows the admins who manage the computing environments to come in, create one or a few clusters, share them with an entire organization or specific project teams, and operate them, kind of like a traditional shared HPC asset, right? Lots of users share the same cluster, there’s persistent storage attached to it, and it literally looks and feels like another computing site; it just so happens it’s in Azure, Google, or AWS. What we’ve found interesting is that when organizations put in a policy that gives more ownership of the clusters to the end users themselves, the end users are empowered to come in, build their own clusters, and choose the shapes of the things they want. They own them either as personal clusters, attaching whatever storage they want, or they share them with a small project team, and it becomes a collaboration environment. We’ve seen that happen pretty successfully.

Matthew Shaxted: People then start using these clusters in a way that takes the idea of a normally large, static, shared cluster and turns it into pretty agile computing systems. For example, NOAA R&D just published a press release on their hurricane forecast system. That team is about 20 people within NOAA R&D. They have 20 shared clusters in an account, and they run every individual ensemble member on a different cluster with its own dedicated Lustre file system. They automate that with our REST API: every morning at 6 a.m., they start 20 of these standalone clusters, which are shared across the entire team, and send an ensemble member to each one. So it’s been interesting to see how people have taken clusters, which in a traditional HPC sense are static things shared amongst a lot of users, and turned them into something pretty lightweight and agile.
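
As a rough illustration of that kind of automation, here is a hypothetical sketch of a script that starts one cluster per ensemble member over a REST API and submits a job to each. The base URL, endpoint paths, and payloads are invented for illustration; they are not the documented ACTIVATE API.

```python
# Hypothetical sketch of per-ensemble-member cluster automation over a REST API.
# The endpoints and payloads below are invented for illustration only.
import requests

API = "https://platform.example.com/api"          # hypothetical base URL
TOKEN = {"Authorization": "Bearer <api-token>"}   # placeholder credential

ENSEMBLE_MEMBERS = [f"member-{i:02d}" for i in range(1, 21)]

for member in ENSEMBLE_MEMBERS:
    cluster = f"hurricane-{member}"

    # 1. Start (or resume) a dedicated cluster for this ensemble member.
    requests.post(f"{API}/clusters/{cluster}/start",
                  headers=TOKEN, timeout=30).raise_for_status()

    # 2. Submit the member's forecast job to that cluster's scheduler.
    requests.post(
        f"{API}/clusters/{cluster}/jobs",
        headers=TOKEN, timeout=30,
        json={"script": "run_member.sh", "args": [member]},
    ).raise_for_status()
```

In practice something like this would run under cron or a workflow engine each morning, and it would also poll job status and tear the clusters down afterward.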

HPCwire: Yeah. That brings me to my next question: when you get to the nitty-gritty of defining and setting up a cluster, every cluster I’ve ever been involved with looks different from a hardware resource perspective. Some of them may have certain nodes with GPUs, certain amounts of memory, different interconnects, anything from a slow old Ethernet connection to InfiniBand, different numbers of cores, all that kind of stuff. And then the big one is, how do you bring in parallel file systems, if at all? I’ve worked with people who just did everything over NFS and then complained about why it was slow, which had to be explained to them. How do you encapsulate those issues? I guess that is what I’d like to know.

Matthew Shaxted: You mentioned a lot of the challenge points. Storage and file systems are fundamental to every workload, and they are pretty workload-specific, so it really depends on what you’re trying to do. But we’ve designed our platform to be flexible enough to allow a wide variety of the storage tiers that the cloud providers offer, and there are commercial solutions out there doing things like global namespaces. We try to make it so that the end users again have the autonomy to define what they want themselves, and our team exists to provide best practices. If you’re doing tightly coupled MPI jobs, we have an out-of-the-box deployment that can do that: take your highest-performance, fastest, but most expensive file system, make it ephemeral, attach it to the cluster, and stage your data in and out through object storage. We have best practices to follow based on what you’re trying to do. The bottom line is that our software lets you provision these things with some guardrails. We don’t allow you to do everything that you can in the cloud consoles directly. But the HPC- and AI-specific things, like a medium-to-low-performance but economical shared NFS-mountable file system, a Lustre file system for high performance, or the more specialized file systems each cloud provider has (Azure NetApp Files, Google’s GFS, etc.), you can provision these and then mix and match them on a cluster yourself or with your project team. When users make a cluster definition on our platform, they get to choose what storage is appropriate for their workflow and then just mount it along with the cluster.
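
The “ephemeral file system plus object-storage staging” pattern mentioned above is worth a small sketch. The bucket and scratch paths below are placeholders, and a real deployment would typically hang this off job prolog/epilog scripts or the workflow layer rather than a standalone script.

```python
# Minimal sketch of the ephemeral-Lustre + object-storage staging pattern.
# Bucket and paths are placeholders; this is not a Parallel Works artifact.
import subprocess

BUCKET = "s3://example-project-data"     # hypothetical durable object store
SCRATCH = "/lustre/scratch/job-1234"     # ephemeral high-performance file system

def stage_in() -> None:
    # Pull input data from durable object storage onto the fast, ephemeral FS.
    subprocess.run(["aws", "s3", "sync", f"{BUCKET}/inputs", f"{SCRATCH}/inputs"],
                   check=True)

def stage_out() -> None:
    # Push results back to object storage before the cluster (and Lustre) go away.
    subprocess.run(["aws", "s3", "sync", f"{SCRATCH}/outputs", f"{BUCKET}/outputs"],
                   check=True)

def run_job() -> None:
    # Placeholder for the actual tightly coupled MPI run.
    subprocess.run(["srun", "./solver", "--input", f"{SCRATCH}/inputs"], check=True)

if __name__ == "__main__":
    stage_in()
    try:
        run_job()
    finally:
        stage_out()  # results survive even though the file system is ephemeral
```

The point of the pattern is that the expensive Lustre file system only lives as long as the cluster, while the durable copies of inputs and results stay in cheap object storage.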

HPCwire: I want to mention what I have found to be one of the thornier issues with cloud computing, since we’re talking about cloud. I’ve heard many stories floating around about the end-of-month cloud surprise when the bill arrives. Usually, costs exceed expectations and, oftentimes, budgets. The reason often cited when I talk to people about it is that there’s sometimes a disconnect between running the application and knowing what the cost is. I call it the taxi meter: it’s running even when you’re not going anywhere. So it’s not always clear what something is going to cost, or what it’s costing while it’s running. I recall hearing last week at the HPC+AI on Wall Street show that you talked about some capabilities of ACTIVATE that help with this issue. I’d like to hear a little more about them.

Matthew Shaxted: First of all, I wish we didn’t have to solve this problem. I wish the cloud providers gave you the costs as they occur, in real time. But it’s a complicated problem. The cloud providers have sometimes thousands or tens of thousands of individual SKUs that all accumulate on the taxi meter as you’re getting a job done. Also, it’s perhaps not in their best interest to tell you exactly what you’re spending right away. Now, that is in direct conflict with many organizations’ requirements to have fixed budgets for an annual cycle. It is very difficult to enforce true fixed budgets in a cloud environment, because the cloud bills are delayed in coming back. If I start a job right now on Azure, say, with my ten nodes and my Lustre file system, and I think it costs $100 an hour, I actually don’t know the true cost of that job until twelve to twenty-four hours later, when the real bills are published to me. And what started happening is that our user base is spinning up 10,000 CPU cores or a cluster of 20 to 30 GPUs or something.

Matthew Shaxted: And they’re spending a lot of money. The managers of the organization say, “Hey, this project can only spend $10,000.” They would blow past that budget before they even got the bills back from the cloud provider. That became a problem about four years ago, when this was really becoming front and center for customers who wanted to truly enforce fixed budgets with very tight control, in direct conflict with the CSP (cloud service provider) billing models. So, unfortunately, we had to build a FinOps capability specific to HPC and AI workloads. And, again, we’re not a cloud service provider. We are not reselling cycles. We don’t get involved in the consumption. We are a piece of software that orchestrates other people’s cloud accounts. So we had to build a module that looks into the cloud accounts we’re running in and tracks spending at a three-minute resolution. Every three minutes, we have a system that looks at your organization’s cloud accounts, or the views we have access to, and gets a cost estimate.

Matthew Shaxted: We call it a real-time cost estimate, and we’ve dialed it in to be about 5% accurate for everything except egress charges. Egress, for some reason, we found difficult to reverse engineer and quantify, and we’ve tried a lot of different ways. So within a three-minute resolution, we can get a roughly 5% picture of what the spend is at any given time, and we let the user see that in real time. They can go to an interface and see, while their job is running, what their true spend is. Administrators can then enforce budget restrictions: if you’re starting to hit a certain quota, we can shut down the resources based on that real-time estimate. That solves a problem that has happened a lot: runaway jobs, where someone submits a 100-node job on a Friday night and then walks away. We’re able to get in front of that and shut things down before it becomes a really large expense. And we have some good case studies on doing just that.
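
To show the shape of that kind of enforcement loop, here is a minimal, hypothetical sketch. The estimate_spend() and shutdown_cluster() functions are placeholders for whatever cost source and control plane a platform actually uses; they are not real ACTIVATE calls, and the thresholds are made up.

```python
# Hypothetical budget-enforcement loop against a near-real-time cost estimate.
import time

POLL_SECONDS = 180          # three-minute resolution, as described above
BUDGET_USD = 10_000.0       # fixed project budget to enforce
SOFT_LIMIT = 0.9            # warn at 90% of budget

def estimate_spend(project: str) -> float:
    """Return the estimated spend-to-date for a project (placeholder)."""
    raise NotImplementedError("query your cost-estimation backend here")

def shutdown_cluster(project: str) -> None:
    """Tear down the project's cloud resources (placeholder)."""
    raise NotImplementedError("call your provisioning layer here")

def enforce(project: str) -> None:
    while True:
        spend = estimate_spend(project)
        if spend >= BUDGET_USD:
            shutdown_cluster(project)  # hard stop before the real bill arrives
            break
        if spend >= SOFT_LIMIT * BUDGET_USD:
            print(f"warning: {project} at {spend / BUDGET_USD:.0%} of budget")
        time.sleep(POLL_SECONDS)
```

The hard engineering lives inside estimate_spend(), which is where the per-SKU reverse engineering described above happens.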

HPCwire: That is a big problem. I mean, you almost could have a separate company that does that.

Matthew Shaxted: Well, there are separate companies that exist just for FinOps purposes. We used to integrate with them, but they didn’t get us the resolution that we needed for HPC and AI jobs. We’re doing bursty, large jobs generally, right? If someone comes in and spins up a thousand cores or 5000 or 10,000 cores, and they’re doing maybe lots of those in a day, you need a pretty fine resolution to do that. So I kind of wish we didn’t have to build it, but we did.

HPCwire: One final question I wanted to throw out there. You’ve already made some, what I believe are, good arguments, but what’s your elevator speech about how ACTIVATE saves money? Not everybody likes to roll their own stuff, but a lot of DIY users do. So it’s not like walking into a corporate environment and saying, “Hey, we can automate X, Y, and Z.” You’re sometimes walking into an environment where somebody has been doing this their way for 20 years, and maybe it works, but it doesn’t provide a lot of what the organization really needs. So that’s your biggest competitor, really.

Matthew Shaxted: It is. It’s exactly what you just said. When we come into an organization, we need to get end users to buy in. That’s really important, and it means making sure their lives are going to become easier by using this. But the people managing these computing systems also need to buy in and say, we are really going to try this as a way for people to access our systems. That means it’s generally a long cycle; it usually becomes part of a refresh cycle when they’re considering this. And like you said, a lot of people have built do-it-yourself solutions, or there are open-source tools that do what we do, Open OnDemand being one example. But you have to maintain these systems and keep them up to date, especially as the cloud has really emerged as a true competitor for HPC workloads. Clouds work; their performance is there now. Five years ago, we were starting to see that shift a bit, but it is here now, and there are obviously cost and opex implications, but it does work. And there are new clouds coming online with diverse technologies. There’s a whole tier of GPU cloud providers that have the GPU capacity you need today, when maybe the hyperscalers don’t. So it’s important to constantly be agile across these different environments. Do-it-yourself solutions are out there, and they work great, maybe for environments that have static single clusters or a few clusters.

Matthew Shaxted: But as soon as you move to supporting multiple environments and all the differences between them (this is what we’ve been spending ten years on, and we still haven’t fully cracked it), the do-it-yourself groups start running into the same issues we’ve been solving for years, even just operating a cloud program with fixed budgets, like I just said, or getting application portability between these different environments. You may have an organization, and I’ve seen this a lot, that is an AWS shop today and fine with that. But Azure or Google just put some new technology on the floor that they want to use, or another cloud is willing to give them a big discount. How do you get portability between those sites the conventional do-it-yourself way? “Oh, we need to bring in an Azure expert who knows all the roles and permissions.” What we’re trying to say is, “No, you don’t; use something like ours.” Then your team doesn’t have to become experts in each of these different clouds, because we’ve already taken care of that. So that’s the story we go in with. A lot of times they’ll say they’re fine, and then they’ll start running into these issues and come back and say, hey, maybe we should really look at doing something like that.

HPCwire: Yeah. Having grown up in this environment, I know what got us here, the roll-your-own rugged individualists. I’ve stepped into situations where they’re all gone, and now nobody knows how anything works. That’s another issue in some organizations. Well, we’re about to reach the end of our time. Is there anything I forgot to ask, or anything you want to mention in closing?

Matthew Shaxted: I think one last thing. We talked a lot about the cloud today, but our platform operates fully on-prem systems as well, and it’s used as a conventional portal interface, to use an old-school term. What we’ve been seeing a lot of is this hybrid, multi-cloud world, for all the reasons I said before. When people have on-prem resources and want to start introducing burst sites, or even colo, we are fully trying to accommodate those scenarios: to run a workload, use what you can on-prem first, and if you can’t get what you need in a certain amount of time, or if things are down, here’s a fallback option that is already compatible with the workloads you’re running. That’s a lot of what we’re running right now, with customers saying they’re not going to get rid of their systems. They have their annual refreshes; those are there to stay. It’s about growing small opex cloud budgets to help with the burst scenarios. That’s where we’re really trying to sit.

HPCwire: So, as I understand it, I can manage my cluster on-prem, and from a user standpoint, if we’ve got authorization, we can burst into the cloud, so finding more nodes is not really an issue. In other words, users can submit a job, it can get sent to the cloud, which spins up nodes as they’re needed and then brings them back down when it’s done.

Matthew Shaxted: We have workloads running every day where 80% of the time you can get the on-prem resources you need, and the other 20% of the time the pipelines just take care of sending work out to Azure or whatever cloud environment you’re running in. So that’s exactly the case.
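
As a final sketch, here is one hypothetical way that “on-prem first, burst to cloud as a fallback” routing could be expressed at the scheduler level, assuming a single SLURM instance with an on-prem partition and a cloud-backed partition. The partition names and the simple idle-node check are illustrative only; real policies usually consider queue wait times, budgets, and data locality as well.

```python
# Hypothetical "on-prem first, cloud fallback" job routing. Partition names
# and the threshold are invented; this is not how any specific product works.
import subprocess

ONPREM_PARTITION = "onprem"
CLOUD_PARTITION = "cloud-burst"

def idle_onprem_nodes() -> int:
    # Count idle nodes in the on-prem partition via sinfo.
    out = subprocess.run(
        ["sinfo", "-h", "-p", ONPREM_PARTITION, "-t", "idle", "-o", "%D"],
        check=True, capture_output=True, text=True,
    )
    return sum(int(x) for x in out.stdout.split())

def submit(script: str, nodes: int) -> None:
    # Route to on-prem if enough idle capacity exists, otherwise burst to cloud.
    partition = ONPREM_PARTITION if idle_onprem_nodes() >= nodes else CLOUD_PARTITION
    subprocess.run(
        ["sbatch", f"--partition={partition}", f"--nodes={nodes}", script],
        check=True,
    )

if __name__ == "__main__":
    submit("run_forecast.sh", nodes=4)
```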

HPCwire: This has all been very interesting. And, at HPCwire, we love to hear about the big systems and all that, but this is, in my opinion, a lot of where the rubber meets the road with getting things done in HPC and AI. Thank you for contributing to this. And I’m sure we’ll be hearing a lot more from Parallel Works in the future.

Matthew Shaxted: Yes, Doug, thanks so much. It’s really great to be here, and I appreciate the opportunity.