

Platform

What is an end-to-end ML platform?

Think of it as baking a loaf of bread. If you take ready-made bread mix and follow the recipe, but someone else eats it, that's not end-to-end. If you harvest your own wheat, mill it into flour, make your loaf from scratch (flour, yeast, water, etc.), try out several different recipes, take the best loaf, eat some of it yourself, and then watch to see whether it becomes moldy. That's end-to-end.

Single-tenant vs multi-tenant SaaS?

DataRobot supports both single-tenant and multi-tenant SaaS. Here's what that means.

Single-tenant: You rent an apartment. When you're not using it, neither is anybody else. You can leave your stuff there without being concerned that others will mess with it.

Multi-tenant: You stay in a hotel room. The room is yours while you're there, but the building and its services are shared with other guests.

Multi-tenant: Imagine a library with many individual, locked rooms, where every reader has a designated room for their personal collection, but the core library collection at the center of the space is shared, allowing everyone to access those resources. For the most part, you have plenty of privacy and control over your personal collection, but there's only one copy of each book at the center of the building, so it's possible for someone to check out the entire collection on a particular topic, leaving others to wait their turn.

Single-tenant: Imagine a library network of many individual branches, where each branch carries a complete collection while still providing private rooms. Readers don't need to share their branch's central collection with others, but the branches are maintained by the central library committee, ensuring that the contents of each branch are regularly updated for all readers.

On-prem: Some readers don't want to use our library space and instead want to make a copy to use in their own home. These folks make a copy of the library and resources and take them home, and then maintain them on their own schedule with their own personal resources. This gives them even more privacy and control over their content, but they lose the convenience of automated updates, new books, and library management.

Robot 1

What do we mean by single-tenant and multi-tenant SaaS, especially with respect to the DataRobot cloud?

Robot 2

Single-tenant and multi-tenant generally refer to the architecture of a software-as-a-service (SaaS) application. In a single-tenant architecture, each customer has their own dedicated instance of the DataRobot application. This means that their DataRobot instance is completely isolated from other customers, and the customer has full control over it (in some options, it is even self-managed). In our case, these deployment options fall into this category:

  • Virtual Private Cloud (VPC), customer-managed
  • AI Platform, DataRobot-managed

In a multi-tenant SaaS architecture, multiple customers share a single instance of the DataRobot application, running on shared infrastructure. This means that customers do not have their own dedicated instance of the software, and their data and operations are potentially stored and running alongside other customers', while still being isolated through various security controls. This is what our DataRobot Managed Cloud offers.

In a DataRobot context, multi-tenant SaaS is a single core DataRobot app (app.datarobot.com) running on a core set of instances/nodes. All customers use the same job queue and resource pool.

In single-tenant, we instead run a custom environment for each user & connect to them with a private connection. This means that resources are dedicated to a single customer and allows for more restriction of access AND more customizability.

Robot 3

  • Single-tenant = We manage a cloud install for one customer.
  • Multi-tenant = We manage multiple customers on one instance—this is https://app.datarobot.com/

Robot 2

In a single-tenant environment, one customer's resource load is isolated from any other customer's, which avoids someone's extremely large and resource-intensive job affecting others. That said, even in multi-tenant we isolate our workers, so a large job running for one user doesn't affect other users. We also have worker limits to prevent one user from hogging all the workers.

Robot 1

Ah okay, I see...

Robot 2

Single-tenant's more rigid separation is a way to balance the benefits of on-prem (privacy, dedicated resources, etc.) and the benefits of cloud (don't have to upkeep your own servers/hardware, software updating and general maintenance is handled by DR, etc.).

Robot 1

Thank you very much Robot 2 (and 3)... I understand this concept much better now!

Robot 2

Glad I could help clarify it a bit! Note that I'm not directly involved in single-tenant development, so I don't have details on how we're implementing it, but this is accurate as to the general motivation to host single-tenant SaaS options alongside our multi-tenant environments.

What is Kubernetes and why is running it natively important?

Kubernetes is an open source platform for hosting applications and scheduling dynamic application workloads.

Before Kubernetes, most applications were hosted by launching individual servers and deploying software to them—that's your database node, your webserver node, etc.

Kubernetes uses container technology and a control plane to abstract the individual servers, allowing application deployments to easily change size in response to load and handle common needs like rolling updates, automatic recovery from node failure, etc.

It's important for DataRobot to run natively on Kubernetes because Kubernetes has become the world's most popular application hosting platform. Users' infrastructure teams have Kubernetes clusters and want to deploy third-party vendor software to them rather than maintaining bespoke virtual machines for every application. This means easier installation because many infrastructure teams already know how to set up or provide a Kubernetes cluster.
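As a small taste of what that abstraction buys you, here's a sketch using the official Kubernetes Python client (the deployment name and namespace are made up for illustration): scaling an application is a single API call, and the control plane handles starting or stopping containers across the cluster.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g., ~/.kube/config).
config.load_kube_config()
apps = client.AppsV1Api()

# Scale a hypothetical "webserver" deployment to 5 replicas. The control
# plane schedules the containers onto nodes and keeps 5 running,
# automatically replacing any that fail.
apps.patch_namespaced_deployment_scale(
    name="webserver",
    namespace="default",
    body={"spec": {"replicas": 5}},
)
```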

Interesting links:

"Smooth sailing with kubernetes."

CPUs vs GPUs

Here’s a good image from NVIDIA that helps to compare CPUs to GPUs.

CPUs are designed to both coordinate and calculate: they have a lot of routing set up, and they have drivers (and operating systems) built to make that pathing and organizing as easy as the simple calculations. Because they're designed to be a "brain" for a computer, they're built to do it all.

GPUs are designed to be specialized for, well, graphics (hence the name). To quickly render video and 3D graphics, you want a bunch of very simple calculations performed all at once: instead of having one "thing" [a CPU core] calculate the color for a 1920x1080 display [a total of 2,073,600 pixels], maybe you have 1,920 "things" [GPU cores] dedicated to doing one line of pixels each, all running in parallel.

"Split this Hex code for this pixel's color into a separate R, G, and B value and send it to the screen's pixel matrix" is a much simpler task than, say, the "convert this video file into a series of frames, combine them with the current display frame of this other application, be prepared to interrupt this task to catch and respond to keyboard/mouse input, and keep this background process running the whole time..." tasks that a CPU might be doing. Because of this, a GPU can be slower and more limited than a CPU while still being useful, and it might have unique methods to complete its calculations so it can be specialized for X purpose [3d rendering takes more flexibility than "display to screen"]. Maybe it only knows very simple conversions or can't keep track of what it used to be doing - "history" isn't always useful for displaying graphics, especially if there's a CPU and a buffer [RAM] keeping track of history for you.

Since CPUs need to be usable for a lot of different things, there tend to be a lot of operating systems/drivers to translate between the higher-level code I might write and the machine's specific registers and routing. But since a GPU is built with the default assumption that "this is going to make basic graphics data more scalable," it often has more specialized machine functionality, and drivers can be much more limited in many cases. It might be harder to find a translator that can tell the GPU how to do the very specific thing that would be helpful in your use case, versus the multiple helpful translators ready to explain to your CPU how to do what you need.

ELI5 example

Think of a CPU (central processing unit) as a 4-lane highway with trucks delivering the computation, and a GPU (graphics processing unit) as a 100-lane highway with little shopping carts. GPUs are great at parallelism, but only for less complex tasks. Deep learning specifically benefits from this, since it's mainly batches of matrix multiplication, and these can be parallelized very easily. So training a neural network on a GPU can be 10x faster than on a CPU. But not all model types get that benefit.
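To see why batches of matrix multiplication parallelize so easily, here's a minimal NumPy sketch (the sizes are arbitrary). Each of the 64 products below is independent of the others, and so is every output element within each product, which is exactly the structure a GPU's many cores can exploit:

```python
import numpy as np

# A "batch" of 64 independent matrix multiplications, like one layer of a
# neural network applied to a batch of 64 inputs.
batch = np.random.rand(64, 128, 128)
weights = np.random.rand(128, 128)

# NumPy runs this on the CPU; a GPU library would compute the 64 products
# (and the elements within each) simultaneously.
out = batch @ weights  # shape: (64, 128, 128)
```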

Here's another one:

Let’s say there is a very large library, and the goal is to count all of the books. The librarian is knowledgeable about where books are, how they’re organized, how the library works, etc. The librarian is perfectly capable of counting the books on their own and they’ll probably be very good and organized about it.

But what if there is a big team of people who could count the books with the librarian—not library experts, just people who can count accurately.

  • If you have 3 people who count books, that speeds up your counting.
  • If you have 10 people who count books, your counting gets even faster.
  • If you have 100 people who count books...that’s awesome!

A CPU is like a librarian. Just like you need a librarian running a library, you need a CPU. A CPU can basically do any job that you need done. Just like a librarian could count all of the books on their own, a CPU can do math things like building machine learning models.

A GPU is like a team of people counting books. Just like counting books is something that can be done by many people without specific library expertise, a GPU makes it much easier to take a job, split it among many different units, and do math things like building machine learning models.

A GPU can usually accomplish certain tasks much faster than a CPU can.

  • If you’re part of a team of people who are counting books, maybe the librarian assigns every person a shelf. You count the books on your shelf. At the exact same time, everyone else counts the books on their shelves. Then you all come together, add up your books, and get the total number of books in the library.

This process is called parallelizing, which is just a fancy word for “we split a big job into small chunks, and these small chunks can be done at the same time.” We say parallelizing because we’re doing these jobs “in parallel.” You count your books at the same time as your neighbor counts their books. (Jobs that can't be done in parallel are usually done sequentially, which means "one after another.")

Let’s say you have 100 shelves in your library and it takes 1 minute to count all of the books on 1 shelf.

  • If the librarian was the only person counting, they couldn’t parallelize their work, because only one person can count one stack of books at one time. So your librarian will take 100 minutes to count all of the books.
  • If you have a team of 100 people and they each count their shelves, then every single book is counted in 1 minute. (Now, it will take a little bit of time for the librarian to assign who gets what shelf. It’ll also take a little bit of time to add all of the numbers together from the 100 people. But those parts are relatively fast, so you’re still getting your whole library counted in, maybe, 2-3 minutes.)

Three minutes instead of 100 minutes—that’s way better! Again: GPUs can usually accomplish certain tasks much faster than a CPU can.
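Here's that librarian-and-counters arithmetic as a minimal Python sketch (the shelf contents are made up). The main process plays the librarian: it hands out shelves, and a pool of workers counts them at the same time before the results are added together:

```python
from multiprocessing import Pool

# 100 "shelves," each holding 200 placeholder "books."
shelves = [[f"book-{s}-{b}" for b in range(200)] for s in range(100)]

def count_shelf(shelf):
    # One counter handles one shelf, independently of everyone else.
    return len(shelf)

if __name__ == "__main__":
    with Pool() as pool:                    # the team of counters
        per_shelf = pool.map(count_shelf, shelves)
    total = sum(per_shelf)                  # the librarian adds it all up
    print(total)                            # 20000
```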

There are some cases when a GPU probably isn’t needed

  1. Let’s say you only have one shelf of books. Taking the time for the librarian to assign 100 people to different parts of the shelf, counting, then adding them up probably isn’t worth it. It might be faster for the librarian to just count all of the books.

  2. If a data science job can’t be parallelized (split into small chunks where the small chunks can be done at the same time), then a GPU usually isn’t going to be helpful. Luckily for us, some very smart people have made the vast majority of data science tasks parallelizable.

Let’s look at a simple math example: calculating the average of a set of numbers.

If you calculate the average with a CPU, it’s kind of like using your librarian. Your CPU has to add up all of the numbers, then divide by the sample size.

If you leverage a GPU to help calculate the average, it’s kind of like using a full team. Your CPU splits the numbers up into small chunks, then each of your GPU workers (called cores) sums their chunk of numbers. Then, your CPU (librarian) will coordinate combining those numbers back together into the average.

If you’re calculating the average of a set of, say, a billion numbers, it will probably be much faster for your CPU to split that billion into chunks and have separate GPU workers do the addition than to do all of it by itself.
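The structure of that chunked average looks like this (a plain NumPy sketch, so everything below actually runs on the CPU; on real hardware, a GPU library such as CuPy or PyTorch would run the per-chunk sums on the device's cores simultaneously):

```python
import numpy as np

numbers = np.random.rand(1_000_000)   # stand-in for the billion numbers

# "Split the numbers up into small chunks..."
chunks = np.array_split(numbers, 100)

# "...each worker sums their chunk of numbers..." (sequential here; on a
# GPU, each chunk's sum would run on its own set of cores at once)
partial_sums = [chunk.sum() for chunk in chunks]

# "...then combine those numbers back together into the average."
average = sum(partial_sums) / len(numbers)
print(np.isclose(average, numbers.mean()))  # True
```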

Let’s look at a more complicated machine learning example: a random forest is basically a large number of decision trees. Let’s say you want to build a random forest with 100 decision trees.

  • If you build a random forest on a CPU, it’s kind of like using your librarian to do the entire counting on their own. Your CPU basically has to build a first tree, then build a second tree, then build a third tree, and so on.

  • If you leverage a GPU to help build a random forest, then it’s kind of like using a full team to count your books with the librarian coordinating everyone. One part of your GPU (called a core) will build the first tree. At the same time, another GPU core will build the second tree. This all happens simultaneously!
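You can see the same "one tree per worker" idea in everyday tools. For example, scikit-learn's random forest (a CPU library, so the parallel workers here are CPU cores rather than GPU cores) builds its trees in parallel when asked; GPU-accelerated forests, such as the one in RAPIDS cuML, apply the same idea across GPU cores. A small sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_estimators=100 -> build 100 trees; n_jobs=-1 -> use every available
# core, so many trees are built at the same time instead of one after another.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)
```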

Just like your librarian has to manage all sorts of things in the library (counting, organizing, staffing the front desk, replacing books), your CPU has a bunch of different jobs that it does. The green ALU boxes in the CPU image represent “arithmetic logic units.” These ALUs are used to do mathematical calculations. Your CPU can do some mathematical calculations on its own, just like your librarian can count books! But a lot of your CPU’s room is taken up by those other boxes, because your CPU is responsible for lots of other things, too. It’s not just for mathematical calculations.

Just like your team of counters are there to do one job (count books), your GPU is optimized to basically just do mathematical calculations. It’s way more powerful when it comes to doing math things.

So, in short:

  • CPUs have many jobs to do. CPUs can do mathematical calculations on their own.

  • GPUs are highly optimized to do mathematical calculations.

  • If you have a job that relies on math (like counting or averaging or building a machine learning model), then a GPU can probably do some of the math much faster than a CPU can. This is because we can use the discoveries of very smart people to parallelize (split big jobs into small chunks).

  • If you have a job that doesn’t rely on math or is very small, a GPU probably isn’t worth it.

Let's have more analogies

Here is a video about, quite literally, the "analog" in analogy: analog CPUs (as opposed to digital ones). The video is very interesting, very well presented, gives a full history of CPU and GPU usage with respect to AI, and explains why the next evolution could be analog computers. Well worth watching!

How about this: if your CPU is a teacher, your GPU is a classroom full of elementary school students.

Sometimes it might be worth having the teacher explain to the class how to help with a group project… but it depends on the cost of the teacher having to figure out how to talk to each student in a way they'll understand and listen to, plus the energy the teacher now has to spend making sure the students are doing what they're supposed to and getting the materials they need along the way. Meanwhile, your teacher came pre-trained and already knows how to do a bunch of more complicated tasks and organization!

If it's a project where a lot of unskilled but eager help can make things go faster, then it might be worth using a GPU. But before you can get the benefits, you need to make sure you know what languages each kid in the classroom speaks and what they already know how to do. Sometimes it's just easier and more helpful to focus on making sure your teacher can do the tasks themselves before recruiting the kids.


Updated January 26, 2024