Explain it like I'm 5 (ELI5) contains complex DataRobot and data science concepts, broken down into brief, digestible answers. Many topics include a link to the full documentation where you can learn more.
What is MLOps?
Machine learning operations (MLOps) is a derivative of DevOps: the thought is that there is an entire “Ops” (operations) industry that exists for normal software, and that such an industry needed to emerge for ML (machine learning) as well. Technology (including DataRobot AutoML) has made it easy for people to build predictive models, but to get value out of models, you have to deploy, monitor, and maintain them. Very few people know how to do this, even fewer than know how to build a good model in the first place.
This is where DataRobot comes in. DataRobot offers a product that performs the "deploy, monitor, and maintain" component of ML (MLOps) in addition to the modeling (AutoML), which automates core tasks with built-in best practices to achieve better cost, performance, scalability, trust, accuracy, and more.
Who can benefit from MLOps? MLOps can help AutoML users who have problems operating models, as well as organizations that do not want AutoML but do want a system to operationalize their existing models.
Key pieces of MLOps include the following:
- The Model Management piece in which DataRobot provides model monitoring and tracks performance statistics.
- The Custom Models piece makes it applicable to the 99.9% of existing models that weren’t created in DataRobot.
- The Tracking Agents piece makes it applicable even to models that are never brought into DataRobot—this makes it much easier to start monitoring existing models (no need to shift production pipelines).
What are stacked predictions?
Some of DataRobot's blender models are like a committee with a chairman: each "member model" votes and then the "chairman model" uses their votes to make the final decision—this is stacking.
The chairman is trained to map these votes to the target. However, the training votes must be based on data the members haven't seen (e.g., each member trains on 80% then predicts the remaining 20%, and repeats this for four other chunks of 20%). Otherwise, there would be leakage or unrealistically good scores. The unseen chunk predictions are stacked predictions.
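The 80%/20% rotation described above can be sketched in a few lines. This is only an illustration: the "member model" here is a hypothetical stand-in that always predicts the mean of its training targets, and the function names are made up for this example rather than taken from DataRobot.

```python
def fit_mean_model(targets):
    """Hypothetical member model: always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda row: mean


def stacked_predictions(rows, targets, n_folds=5):
    """Return one prediction per row, each made by a model that never saw that row."""
    n = len(rows)
    predictions = [None] * n
    for fold in range(n_folds):
        holdout = [i for i in range(n) if i % n_folds == fold]  # the unseen 20%
        train = [i for i in range(n) if i % n_folds != fold]    # the other 80%
        model = fit_mean_model([targets[i] for i in train])
        for i in holdout:
            predictions[i] = model(rows[i])
    return predictions


rows = list(range(10))
targets = [float(x) for x in range(10)]
oof = stacked_predictions(rows, targets)
```

Every entry in `oof` comes from a member that was trained without that row, which is what makes these predictions safe inputs for the chairman.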
Learn more about Stacked Predictions.
What's a rating table, why are generalized additive models (GAM) good for insurance, and what's the relation between them?
A rating table is a ready-made set of rules you can apply to insurance policy pricing, like, "if driving experience and number of accidents are in this range, set this price."
A GAM model is interpretable by an actuary because it models things like, "if you have this feature, add $100; if you have this, add another $50."
The way you learn GAMs allows you to automatically learn ranges for the rating tables.
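As a rough sketch of how an additive model reads like a rating table, the snippet below prices a policy by starting from a base price and adding one amount per feature range. The base price, ranges, and dollar amounts are invented for illustration; a real GAM would learn them from data.

```python
# Hypothetical rating table: feature -> list of (low, high, amount to add).
RATING_TABLE = {
    "driving_experience_years": [(0, 3, 100.0), (3, 10, 50.0), (10, 200, 0.0)],
    "accidents": [(0, 1, 0.0), (1, 3, 75.0), (3, 1000, 200.0)],
}
BASE_PRICE = 500.0  # assumed starting price


def price(policy):
    """Add up the contribution of each feature, one lookup per feature."""
    total = BASE_PRICE
    for feature, ranges in RATING_TABLE.items():
        value = policy[feature]
        for low, high, add_on in ranges:
            if low <= value < high:
                total += add_on
                break
    return total


quote = price({"driving_experience_years": 2, "accidents": 1})  # 500 + 100 + 75
```

Because each feature contributes its own amount independently, an actuary can read every row of the table and see how it moves the price.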
Learn more about rating tables.
Loss reserve modeling vs. loss cost modeling
You just got paid and have $1000 in the bank, but in 10 days your $800 mortgage payment is due. If you spend your $1000, you won't be able to pay your mortgage, so you put aside $800 as a reserve to pay the future bill.
Loss reserving is estimating the ultimate costs of policies that you've already sold (regardless of what price you charged). Let's say you sold 1000 policies this year, and at the end of the year you see that 50 claims have been reported and only $40,000 has been paid. You estimate that when you look back 50 or 100 years from now, you'll have paid out a total of $95k, so you set aside an additional $55k of "loss reserve". Loss reserves are by far the biggest liability on an insurer's balance sheet. A multi-billion dollar insurer will have hundreds of millions if not billions of dollars worth of reserves on their balance sheet. Those reserves are very much dependent on predictions.
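The arithmetic in this example is small enough to write out. The variable names are illustrative; the only modeled quantity is the estimated ultimate cost, which is the prediction the reserve depends on.

```python
paid_to_date = 40_000        # claims already paid this year
estimated_ultimate = 95_000  # predicted total payout once every claim is settled

# The reserve is whatever still has to be set aside to cover the estimate.
loss_reserve = estimated_ultimate - paid_to_date
```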
Algorithm vs. model
The following is an example model for sandwiches: a sandwich is a savory filling (such as pastrami, a portobello mushroom, or a sausage) and optional extras (lettuce, cheese, mayo, etc.) surrounded by a carbohydrate (bread). This model allows you to describe foods simply (you can classify all foods as "sandwich" or "not sandwich"), and allows you to predict new sets of ingredients to make a sandwich.
An algorithm for making a sandwich would consist of a set of instructions:
- Slice two pieces of bread from a loaf.
- Spread chunky peanut butter on one side of one slice of bread.
- Spread raspberry jam on one side of the other slice.
- Place one slice of bread on top of the other so that the sides with the peanut butter and jam are facing each other.
API vs. SDK
API: "This is how you talk to me."
SDK: "These are the tools to help you talk to me."
API: "Talk into this tube."
SDK: "Here's a loudspeaker and a specialized tool that holds the tube in the right place for you."
DataRobot's REST API is an API but the Python and R packages are a part of DataRobot's SDK because they provide an easier way to interact with the API.
API: Bolts and nuts
SDK: Screwdrivers and wrenches
Learn more about DataRobot's APIs and SDKs.
What does monotonic mean?
Let's say you collect comic books. You expect that the more money you spend, the more value your collection has (monotonically increasing relationship between value and money spent). However, there could be other factors that affect this relationship, like a comic book tears and your collection is worth less even though you spent more money. You don't want your model to learn that spending more money decreases value because it's really decreasing from a comic book tearing or other factor it doesn't consider. So, you force it to learn the monotonic relationship.
Let's say you're an insurance company, and you give a discount to people who install a speed monitor in their car. You want to give a bigger discount to people who are safer drivers, based on their speed. However, your model discovers a small population of people who drive incredibly fast (e.g., 150 MPH or more), that are also really safe drivers, so it decides to give a discount to these customers too. Then other customers discover that if they can hit 150 MPH in their cars each month, they get a big insurance discount, and then you go bankrupt. Monotonicity is a way for you to say to the model: "as top speed of the car goes up, insurance prices must always go up too."
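To make the speed example concrete, here is a toy sketch with invented numbers. It forces prices to never go down as top speed goes up by taking a running maximum. This is only one crude way to illustrate the constraint and is not a description of how DataRobot enforces monotonicity during training.

```python
from itertools import accumulate

top_speeds_mph = [60, 80, 100, 120, 150]
# Hypothetical unconstrained model output: note the "discount" at 150 MPH.
raw_prices = [500, 550, 620, 700, 450]

# Running maximum: each price is at least as high as every price before it.
monotonic_prices = list(accumulate(raw_prices, max))
```

After the constraint, the 150 MPH driver pays at least as much as the 120 MPH driver, so the loophole is closed.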
Learn more about monotonics.
What is ridge regressor?
If you have a group of friends in a room talking about which team is going to win a game, you want to hear multiple opinions and not have one friend dominate the conversation. So if they keep talking and talking, you give them a 'shush' and then keep 'shushing' them louder the more they talk. Similarly, the ridge regressor keeps any one variable from dominating the model and spreads the signal to more variables.
There are two kinds of penalized regression—one kind of penalty makes the model keep all the features but spend less on the unimportant features and more on the important ones. This is Ridge. The other kind of penalty makes the model leave some unimportant variables completely out of the model. This is called Lasso.
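A minimal sketch of the difference, assuming the simplest possible case: one feature, no intercept, with the ridge objective written as squared error plus `penalty * b**2` and the lasso objective as half the squared error plus `penalty * abs(b)`. Under those assumptions both have closed-form answers. The data is invented.

```python
def ridge_coefficient(xs, ys, penalty):
    """Ridge shrinks the coefficient toward zero but never exactly to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def lasso_coefficient(xs, ys, penalty):
    """Lasso subtracts the penalty first, so weak signals become exactly zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - penalty, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
strong_target = [2.0, 4.0, 6.0]   # clearly related to xs
weak_target = [0.1, -0.1, 0.1]    # barely related to xs

ridge_weak = ridge_coefficient(xs, weak_target, penalty=1.0)
lasso_weak = lasso_coefficient(xs, weak_target, penalty=1.0)
ridge_strong = ridge_coefficient(xs, strong_target, penalty=1.0)
lasso_strong = lasso_coefficient(xs, strong_target, penalty=1.0)
```

With the weak target, ridge keeps a small nonzero coefficient while lasso drops the feature entirely; with the strong target, both keep it.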
Anomaly detection vs. other machine learning problems DataRobot can solve
Anomaly detection is an unsupervised learning problem. This means that it does not use a target and does not have labels, as opposed to supervised learning which is the type of learning many DataRobot problems fall into. In supervised learning there is a "correct" answer and models predict that answer as close as possible by training on the features. There are a number of anomaly detection techniques, but no matter what way you do it, there is no real "right answer" to whether something is an anomaly or not—it's just trying to group common rows together and find a heuristic way to tell you "hey wait a minute, this new data doesn't look like the old data, maybe you should check it out."
Supervised = I know what I’m looking for.
Unsupervised = Show me something interesting.
Example: Network access and credit card transactions
In some anomaly detection use cases there are millions of transactions that would require a manual process of assigning labels. This is impossible for humans to do when thousands of new transactions arrive every day, so organizations end up with large amounts of unlabeled data. Anomaly detection is used to try and pick up the abnormal transactions or network access. A ranked list can then be passed on to a human to manually investigate, saving them time.
A parent feeds their toddler; the toddler throws the food on the floor and Mom gets mad. The next day, the same thing happens. The next day, Mom feeds the kid, the kid eats the food, and Mom's happy. The kid is particularly aware of Mom's reaction, and that ends up driving their learning (or supervising their learning), i.e., they learn the association between their action and mom's reaction—that's supervised.
Mom feeds the kid; the kid separates his food into two piles: cold and hot. Another day, the kid separates the peas, carrots, and corn. They're finding some structure in the food, but there isn't an outcome (like Mom's reaction) guiding their observations.
Learn more about machine learning problems DataRobot can help solve.
What are tuning parameters and hyperparameters?
Tuning parameters and hyperparameters are like knobs and dials you can adjust to make a model perform differently. DataRobot automates this process to make a model fit data better.
Say you are playing a song on an electric guitar. The chord progression is the model, but you and your friend play it with different effects on the guitar—your friend might tune their amplifier with some rock distortion and you might increase the bass. Depending on that, the same song will sound different. That's hyperparameter tuning.
Some cars, like a Honda Civic, have very little tuning you can do to them. Other cars, like a racecar, have a lot of tuning you can do. Depending on the racetrack, you might change the way your car is tuned.
What insights can a user get from Hotspots visualization?
Hotspots can give you feature engineering ideas for subsequent DataRobot projects. Since they act as simple IF statements, they are easy to add to see if your models get better results. They can also help you find clusters in data where variables go together, so you can see how they interact.
If Hotspots could talk to the user: "The model does some heavy math behind the scenes, but let's try to boil it down to some if-then-else rules you can memorize or implement in a simple piece of code without losing much accuracy. Some rules look promising, some don't, so take a look at them and see if they make sense based on your domain expertise."
If a strongly-colored, large rule was something like "Age > 65 & discharge_type = 'discharged to home'" you might conclude that there is a high diabetes readmission rate for people over 65 who are discharged to home. Then, you might consider new business ideas that treat the affected population to prevent readmission, although this approach is completely non-scientific.
Learn more about Hotspots visualizations.
What is target leakage?
Target leakage is like that scene in Mean Girls where the girl can predict when it's going to rain if it's already raining.
One of the features used to build your model is actually derived from the target, or closely related.
You and a friend are trying to predict who’s going to win the Super Bowl (the Rams, or the Patriots).
You both start collecting information about past Super Bowls. Then your friend goes, “Hey wait! I know, we can just grab the newspaper headline from the day after the Super Bowl! Every past Super Bowl, you could just read the next day’s newspaper to know exactly who won.”
So you start collecting all newspapers from past Super Bowls, and become really good at predicting previous Super Bowl winners.
Then, you get to the upcoming Super Bowl and try to predict who’s going to win. However, something is wrong: “where is the newspaper that tells us who wins?”
You were using target leakage that helped you predict the past winners with high accuracy, but that method wasn't useful for predicting future behavior. The newspaper headline was an example of target leakage, because the target information was “leaking” into the past.
Learn more about Target Leakage.
Scoring data vs. scoring a model
You want to compete in a cooking competition, so you practice different recipes at home. You start with your ingredients (training data), then you try out different recipes on your friend to optimize each of your recipes (training your models). After that, you try out the recipes on some external guests who you trust and are somewhat unbiased (validation), ultimately choosing the recipe that you will try in the competition. This is the model that you will be using for scoring.
Now you go to the competition where they give you a bunch of ingredients—this is your scoring data (new data that you haven't seen). You want to run these through your recipe and produce a dish for the judges—that is making predictions or scoring using the model.
You could have tried many recipes with the same ingredients—so the same scoring data can be used to generate predictions from different models.
Bias vs. variance
You're going to a wine tasting party and are thinking about inviting one of two friends:
- Friend 1: Enjoys all kinds of wine, but may not actually show up (low bias/high variance).
- Friend 2: Only enjoys bad gas station wine, but you can always count on them to show up to things (high bias/low variance).
Best case scenario: You find someone who isn’t picky about wine and is reliable (low bias/low variance).
However, this is hard to come by, so you may just try to incentivize Friend 1 to show up or convince Friend 2 to try other wines (hyperparameter tuning).
You avoid friends who only drink gas station wine and are unreliable about showing up to things (high bias/high variance).
Structured vs. unstructured datasets
Structured data is neat and organized—you can upload it right into DataRobot. Structured data is CSV files or nicely organized Excel files with one table.
Unstructured data is messy and unorganized—you have to add some structure to it before you can upload it to DataRobot.
Unstructured data is a bunch of tables in various PDF files.
Let’s imagine that you have a task to predict categories for new Wikipedia pages (Art, History, Politics, Science, Cats, Dogs, Famous Person, etc.).
All the needed information is right in front of you—just go to wikipedia.org and you can find all categories for each page. But the way the information is structured right now is not suitable for predicting categories for new pages. It would be hard to extract some knowledge from this data. On the other hand, if you query Wikipedia’s databases and form a file with columns corresponding to different features of articles (title, content, age, previous views, number of edits, number of editors) and rows corresponding to different articles, this would be a structured dataset. It is more suitable for extracting hidden value using machine learning methods.
Note that text is not ALWAYS unstructured. For example, you have 1000 short stories, some of which you liked, and some of which you didn't. As 1000 separate files, this is an unstructured problem. But if you put them all together in one CSV, the problem becomes structured, and now DataRobot can solve it.
Log scale vs. linear scale
In log scale the values keep multiplying by a fixed factor (1, 10, 100, 1000, 10000). In linear scale the values keep adding up by a fixed amount (1, 2, 3, 4, 5).
Going up one point on the Richter scale corresponds to roughly 30x more energy released. So a 7 is about 30 times stronger than a 6, and an 8 is about 30 times stronger than a 7, which makes an 8 about 900 (30x30) times stronger than a 6.
The octave numbers increase linearly, but the sound frequencies increase exponentially. So note A 3rd octave = 220 Hz, note A 4th octave = 440 Hz, note A 5th octave = 880 Hz, note A 6th octave = 1760 Hz.
- In economics and finance, log scale is used because it's much easier to translate to a % change.
- It's possible that a reason log scale exists is because many events in nature are governed by exponential laws rather than linear, but linear is easier to understand and visualize.
- If you have large linear numbers and they make your graph look bad, then you can log the numbers to shrink them and make your graph look prettier.
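The examples above can be checked with a few lines of arithmetic. This sketch takes the base-10 log of the multiplying sequence, the base-2 log of the octave frequencies, and the 30x-per-step Richter approximation from the text; the variable names are just for illustration.

```python
import math

# Multiplying by a fixed factor becomes adding a fixed amount after taking logs.
values = [1, 10, 100, 1000, 10000]
log_values = [math.log10(v) for v in values]

# Note A from the 3rd to the 6th octave: the frequency doubles at each step.
a_notes_hz = [220 * 2 ** step for step in range(4)]
octave_steps = [math.log2(freq / 220) for freq in a_notes_hz]

# Two steps on a scale where each step is about 30x.
two_step_ratio = 30 ** 2
```

Here `log_values` and `octave_steps` both come out evenly spaced, which is why exponential data looks like a straight line on a log-scale axis.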
What is reinforcement learning?
Let's say you want to find the best restaurant in town. To do this, you have to go to one, try the food, and decide if you like it or not. Now, every time you go to a new restaurant, you will need to figure out if it is better than all the other restaurants you've already been to, but you aren't sure about your judgement, because maybe a dish you had was good/bad compared to others offered in that restaurant. Reinforcement learning is the targeted approach you can take to still be able to find the best restaurant for you, by choosing the right number of restaurants to visit or choosing to revisit one to try a different dish. It narrows down your uncertainty about a particular restaurant, trading off the potential quality of unvisited restaurants.
Reinforcement learning is like training a dog—for every action your model takes, you either say "good dog" or "bad dog". Over time, by trial and error, the model learns the behavior so as to maximize the reward. Your job is to provide the environment to respond to the agent's (dog's) actions with numeric rewards. Reinforcement learning algorithms operate in this environment and learn a policy.
- Reinforcement learning works better if you can generate an unlimited amount of training data, like with Doom/Atari, AlphaGo games, and so on. You need to emulate the training environment so the model can learn its mechanics by trying different approaches a gazillion times.
- A good reinforcement learning framework is OpenAI Gym. In it you set some goal for your model, put it in some environment, and keep it training until it learns something.
- Tasks that humans normally consider "easy" are actually some of the hardest problems to solve. It's part of why robotics is currently behind machine learning. It is significantly harder to learn how to stand up or walk or move smoothly than it is to perform a supervised multiclass prediction with 25 million rows and 200 features.
What is target encoding?
Machine learning models don't understand categorical data, so you need to turn the categories into numbers to be able to do math with them. Some example methods to do this include:
One-hot encoding is a way to turn categories into numbers by encoding categories as very wide matrices of 0s and 1s. This works well for linear models.
Target encoding is a different way to turn a categorical into a number by replacing each category with the mean of the target for that category. This method gives a very narrow matrix, as the result is only one column (vs. one column per category with a one-hot encoding).
Although more complicated, you can also try to avoid overfitting while using target encoding—DataRobot's version of this is called credibility encoding.
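Here is a small sketch of the two encodings side by side, using made-up data and hypothetical helper names. Note that this naive target encoding computes each mean on the same rows it encodes, which is exactly the overfitting risk mentioned above; it is shown only to make the "wide vs. narrow" contrast visible.

```python
from collections import defaultdict


def one_hot_encode(categories):
    """One column per distinct category, filled with 0s and 1s."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


def target_encode(categories, targets):
    """One column total: each category is replaced by its mean target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]


colors = ["red", "blue", "red", "green"]
target = [1, 0, 0, 1]

wide = one_hot_encode(colors)           # columns are blue, green, red
narrow = target_encode(colors, target)  # a single numeric column
```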
What is credibility weighting?
Credibility weighting is a way of accounting for the certainty of outcomes for data with categorical labels (e.g. what model vehicle you drive).
For popular vehicle category types, e.g. Ford F-series was the top-selling vehicle in USA in 2018, there will be many people in your data, and you will be more certain that the historical outcome is reliable. For unpopular category types, e.g. Smart Fortwo was ranked one of the rarest vehicle models in the US in 2017, you may only have one or two people in your data, and you will not be certain that the historical outcome is a reliable guide to the future. Therefore, you will use broader population statistics to guide your decisions.
You know that when you toss a coin, you can't predict with any certainty whether it is going to be heads or tails, but if you toss a coin 1000 times, you are going to be more certain about how many times you see heads (close to 500 times if you're doing it correctly).
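One common way to express this idea is a weighted blend of the category's own average and the overall average, where the weight grows with the number of rows. The sketch below assumes the weight `n / (n + k)` with an arbitrary constant `k = 50`; the source does not state DataRobot's exact formula, so treat both the formula and the numbers as illustrative.

```python
def credibility_estimate(category_mean, n_rows, overall_mean, k=50):
    """Blend a category's own average with the population average.

    k is an assumed tuning constant: the larger it is, the more rows a
    category needs before its own history is trusted.
    """
    weight = n_rows / (n_rows + k)
    return weight * category_mean + (1 - weight) * overall_mean


overall_claim_rate = 0.05
# Popular vehicle: 1000 rows, so the estimate stays close to its own 10% rate.
popular = credibility_estimate(0.10, 1000, overall_claim_rate)
# Rare vehicle: 2 rows, so the estimate is pulled toward the overall 5% rate.
rare = credibility_estimate(0.50, 2, overall_claim_rate)
```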
What is ROC Curve?
The ROC Curve is a measure of how well a model can classify data, and it's also a good off-the-shelf method of comparing two models. You typically have several different models to choose from, so you need a way to compare them. If you can find a model that has a very good ROC curve, meaning the model classifies with close to 100% True Positive and 0% False Positive, then that model is probably your best model.
Example: Alien sighting
Imagine you want to receive an answer to the question "Do aliens exist?" Your best plan to get an answer to this question is to interview a particular stranger by asking "Did you see anything strange last night?" If they say "Yes", you conclude aliens exist. If they say "No", you conclude aliens don't exist. What's nice is that you have a friend in the army who has access to radar technology so they can determine whether an alien did or did not show up. However, you won't see your friend until next week, so for now conducting the interview experiment is your best option.
Now, you have to decide which strangers to interview. You inevitably have to balance whether you should conclude aliens exist now, or just wait for your army friend. You get about 100 people together and you will conduct this experiment with each of them tomorrow. The ROC curve is a way to decide which stranger you should interview, because some people are blind, some drink alcohol, some are shy, and so on. It represents a ranking of each person and how good they are, and at the end you pick the "best" person, and if the "best" person is good enough, you go with the experiment.
The ROC curve's y-axis is True Positives, and the x-axis is False Positives. You can imagine people that drink a lot of wine are ranked on the top right of the curve. They think anything is an alien, so they have a 100% True Positive ranking. They will identify an alien if one exists, but they also have 100% False Positive ranking—if you say everything is an alien, you're flat out wrong when there really aren't aliens. People ranked on the lower left don't believe in aliens, so nothing is an alien because aliens never existed. They have a 0% False Positive ranking, and 0% True Positive ranking. Again, nothing is an alien, so they will never identify if aliens exist.
What you want is a person with a 100% True Positive ranking and 0% False Positive ranking—they correctly identify aliens when they exist, but only when they exist. That's a person that is close to the top-left of the ROC Chart. So your procedure is, take 100 people, and rank them on this space of True Positives vs. False Positives.
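In model terms, each point on the curve is the (False Positive rate, True Positive rate) pair you get at one decision threshold. The following is a minimal sketch with invented labels and scores; `roc_point` is a hypothetical helper, not a library function.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


labels = [1, 1, 0, 1, 0, 0]              # 1 = an alien really showed up
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # how strongly the model says "alien"

says_yes_to_everything = roc_point(labels, scores, threshold=0.0)  # top right
says_no_to_everything = roc_point(labels, scores, threshold=1.1)   # bottom left
balanced = roc_point(labels, scores, threshold=0.5)
```

Sweeping the threshold from high to low traces the curve from the bottom left to the top right; the closer it passes to the top-left corner, the better.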
Learn more about ROC Curve.
What is overfitting?
You tell Goodreads that you like a bunch of Agatha Christie books, and you want to know if you'd like other murder mysteries. It says “no,” because those other books weren't written by Agatha Christie.
Overfitting is like a bad student who only remembers book facts but does not draw conclusions from them. Any life situation that wasn't specifically mentioned in the book will leave them helpless. But they'll do well on an exam based purely on book facts (that's why you shouldn't score on training data).
Standalone Prediction Server (SSE) vs. Dedicated Prediction Server (DPS)
- Dedicated: You have your own garage, but it's attached to your house.
- Standalone: You have your own garage, but it's down the street from your house.
- Standalone garage: You keep it down the street because you want to make sure that all your nice cars are somewhere that your teenage driver can’t bash into them when trying to park, and also you want to always know there is a space away from the hustle and bustle of the driveway.
SSE allows a prediction server to be deployed with no dependency on the rest of the DataRobot cluster. It can be helpful in cases where data segregation or network performance are barriers to more traditional scoring with a DPS. SSE might be a good option if you're considering using scoring code, but would benefit from the simplicity of the prediction API, or if you have a requirement to collect Prediction Explanations.
What is deep learning?
Imagine your grandma Dot forgot her chicken matzo ball soup recipe. You want to try to replicate it, so you get your family together and make them chicken matzo ball soup.
It’s not even close to what grandma Dot used to make, but you give it to everyone. Your cousin says “too much salt,” your mom says, “maybe she used more egg in the batter,” and your uncle says, “the carrots are too soft.” So you make another one, and they give you more feedback, and you keep making chicken matzo ball soup over and over until everyone agrees that it tastes like grandma Dot's.
That’s how a neural network trains—something called backpropagation, where the errors are passed back through the network, and you make small changes to try to get closer to the right answers.
What is federated machine learning?
The idea is that once a central model is built, you can retrain the model for use on different edge devices.
McDonald's sets its menu (central model) and gives the different franchises flexibility to use it. Then, McDonald's locations in India use that menu and tweak it to include the McPaneer Tikka burger. To do that tweaking, the Indian McDonald's does not need to reach out to the central McDonald's—they can update those decisions locally. It's advantageous because your models will be updated faster without having to send the data to some central place all the time. The model can use the device's local data (e.g., smartphone usage data on your smartphone) without having to store it in a central training data storage, which can also be good for privacy.
One example that Google gives is the smart keyboard on your phone. There is a shared model that gets updated based on your phone usage. All that computing is happening on your phone without having to store your usage data in a central cloud.
What are offsets?
Let's say you are a 5 year old who understands linear models. With linear regression, you find the betas that minimize the error, but you may already know some of the betas in advance. So you give the model the value of those betas and ask it to go find the values of the other betas that minimize error. When you give a model an effect that’s known ahead of time, you're giving the model an offset.
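A toy sketch of the idea, assuming a two-feature linear model with no intercept where the first beta is already known. The known effect is subtracted from the target, and ordinary least squares finds the remaining beta. The data and names are invented.

```python
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0]
y = [5.0, 4.0, 12.0, 11.0]  # generated as 2 * x1 + 3 * x2

known_beta_1 = 2.0  # the effect you already know ahead of time

# The offset is the part of the prediction that is fixed in advance.
offset = [known_beta_1 * a for a in x1]
remaining = [target - off for target, off in zip(y, offset)]

# Least squares for the one beta that is still unknown (no intercept).
beta_2 = sum(b * r for b, r in zip(x2, remaining)) / sum(b * b for b in x2)
```

The model only has to explain what the offset leaves over, so it recovers the second beta without ever re-estimating the first.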
What is F1 score?
Let's say you have a medical test (ML model) that determines if a person has a disease. Like many tests, this test is not perfect and can make mistakes (call a healthy person unhealthy, or vice versa).
We might care the most about maximizing the % of truly sick people among those our model calls sick (precision), or we might care about maximizing the % of detection of truly sick people in our population (recall).
Unfortunately, tuning towards one metric often makes the other metric worse, especially if the target is imbalanced. Imagine you have 1% of sick people on the planet and your model calls everyone on the planet (100%) sick. Now it has a perfect recall score but a horrible precision score. On the opposite side, you might make the model so conservative that it calls only one person in a billion sick but gets it right. That way it has perfect precision but terrible recall.
F1 score is a metric that considers precision and recall at the same time so that you could achieve balance between the two.
How do you consider precision and recall at the same time? Well, you could just take an average of the two (arithmetic mean), but because precision and recall are ratios with different denominators, arithmetic mean doesn't work that well in this case and a harmonic mean is better. That's exactly what an F1 score is—a harmonic mean between precision and recall.
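The "call everyone sick" case from above makes a good worked example. This sketch uses a population of 1000 with 10 truly sick people (1%), which is an assumed scale chosen to keep the numbers readable.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# The model calls all 1000 people sick; only 10 of them really are.
true_positives = 10
false_positives = 990
false_negatives = 0

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)
```

The arithmetic mean comes out a bit above 0.5, which makes this useless model look mediocre, while the F1 score is about 0.02 because the harmonic mean is dragged down by the weaker of the two numbers.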
Explanations of Harmonic Mean.
What is SVM?
Let's say you build houses for good borrowers and bad borrowers on different sides of the street so that the road between them is as wide as possible. When a new person moves to this street, you can see which side of the road they're on to determine if they're a good borrower or not. SVM learns how to draw this "road" between positive and negative examples.
SVMs are also called “maximum margin” classifiers. You define a road by the center line and the curbs on either side, and then try to find the widest possible road. The houses sitting right on the curbs are the “support vectors”.
Closely related term: Kernel Trick.
In the original design, SVM could only learn roads that are straight lines. However, kernels are a math trick that allows them to learn curve-shaped roads. Kernels project the points into a higher dimensional space where they are still separated by a linear "road," but in the original space, the road is no longer a straight line.
The ingenious part about kernels compared to manually creating polynomial features in logistic regression, is that you don't have to compute those higher-dimensional coordinates beforehand because kernel is always applied to a pair of points and only needs to return a dot product, not the coordinates. This makes it very computationally efficient.
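A small numeric check of that claim, using the degree-2 polynomial kernel `(a . b) ** 2` on 2-D points as an assumed example. The kernel works directly on the original coordinates, yet returns the same number as a dot product taken after mapping each point into a 3-D feature space.

```python
import math


def polynomial_kernel(a, b):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2


def explicit_features(p):
    """The 3-D coordinates that the kernel implicitly corresponds to."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


a = (1.0, 2.0)
b = (3.0, 4.0)

kernel_value = polynomial_kernel(a, b)                             # shortcut
explicit_value = dot(explicit_features(a), explicit_features(b))   # long way
```

Both routes give the same value (up to floating-point rounding), but the kernel never has to build the higher-dimensional coordinates.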
Help me understand Support Vector Machines on Stack Exchange.
What is an end-to-end ML platform?
Think of it as baking a loaf of bread. If you take ready-made bread mix and follow the recipe, but someone else eats it, that's not end-to-end. If you harvest your own wheat, mill it into flour, make your loaf from scratch (flour, yeast, water, etc.), try out several different recipes, take the best loaf, eat some of it yourself, and then watch to see if it doesn't become moldy—that's end to end.
What are lift charts?
You have 100 rocks. Your friend guesses the measurement of each rock while you actually measure each one. Next, you put them in order from smallest to largest (according to your friend's guesses, not by how big they actually are). You divide them into groups of 10 and take the average size of each group. Then, you compare what your friend guessed with what you measured. This allows you to determine how good your friend is at guessing the size of rocks.
Let's say you build a model for customer churn, and you want to send out campaigns to 10% of your customers. If you use a model to target the 10% with the highest probability of churn, then you have a better chance of reaching clients that might churn than if you don't use a model and just send your campaigns randomly. A cumulative lift chart shows this more clearly.
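The rock-sorting procedure can be sketched as follows. The function name and the six-rock data set are invented for illustration, and the sketch assumes the number of rows divides evenly into the number of bins.

```python
def lift_chart(predicted, actual, n_bins):
    """Sort rows by prediction, split into equal bins, average each bin."""
    pairs = sorted(zip(predicted, actual), key=lambda pair: pair[0])
    size = len(pairs) // n_bins  # assumes an even split
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        avg_predicted = sum(p for p, _ in chunk) / len(chunk)
        avg_actual = sum(a for _, a in chunk) / len(chunk)
        bins.append((avg_predicted, avg_actual))
    return bins


guessed_sizes = [3.0, 1.0, 2.0, 6.0, 5.0, 4.0]   # your friend's guesses
measured_sizes = [2.5, 1.5, 2.0, 7.0, 4.0, 5.0]  # what you measured

chart = lift_chart(guessed_sizes, measured_sizes, n_bins=3)
```

Each pair in `chart` is (average guess, average measurement) for one bin; the closer the two numbers track each other across bins, the better the guesser.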
Learn more about Lift Charts.
What is transfer learning?
Short version: When you teach someone how to distinguish dogs from cats, the skills that go into that can be useful when distinguishing foxes and wolves.
You are a 5-year old whose parents decided you need to learn tennis, while you are wondering who "Tennis" is.
Every day your parents push you out the door and say, “go learn tennis and if you come back without learning anything today, there is no food for you.”
Worried that you'll starve, you start looking for "Tennis." It takes a few days for you to figure out that tennis is a game and where tennis is played. It takes you a few more days to understand how to hold the racquet and how to hit the ball. Finally, by the time you figure out the complete game, you are already 6 years old.
Now imagine instead that your parents take you to the best tennis club in town and find Roger Federer to coach you. He immediately starts working with you, teaching you all about tennis, and makes you tennis-ready in just a week. Because he also happens to have a lot of experience playing tennis, you are able to take advantage of all his tips, and within a few months, you are already one of the best players in town.
The first scenario (finding tennis on your own) is similar to how a regular machine learning algorithm starts learning. With the fear of being punished, it looks for a way to learn what is being taught and slowly learns everything from scratch. The second scenario (being coached) is transfer learning: the same ML algorithm gets much better guidance and a better starting point. In other words, it uses a model that was already trained on similar data as an initialization point, so it can learn the new data much faster and sometimes with better accuracy.
What are summarized categorical features?
Let's say you go to the store to shop for food. You walk around the store and put items of different types into your cart, one at a time. Then, someone calls you on the phone and asks what you have in your cart, so you respond with something like "6 cans of soup, 2 boxes of Cheerios, 1 jar of peanut butter, 7 jars of pickles, 82 grapes..."
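In code, the phone answer is simply a count per category. The sketch below uses Python's `collections.Counter` on a made-up list of cart items. It illustrates the idea only and is not how DataRobot stores the feature internally.

```python
from collections import Counter

# Items as they went into the cart (the order is not preserved in this toy list).
cart_items = (
    ["soup"] * 6
    + ["Cheerios"] * 2
    + ["peanut butter"] * 1
    + ["pickles"] * 7
    + ["grapes"] * 82
)

# The "summary" you read out over the phone: one count per category.
cart_summary = Counter(cart_items)

for item, count in cart_summary.items():
    print(f"{count} x {item}")
```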
Learn more about Summarized Categorical Features.
Particle swarm vs. GridSearch
GridSearch takes a fixed amount of time but may not find a good result.
Particle swarm takes an unpredictable, potentially unlimited amount of time, but can find better results.
Particle swarm: You’ve successfully shown up to Black Friday at Best Buy with 3 of your friends and walkie-talkies. However, you forgot to look at the ads in the paper for sales. Not to worry: you decide that the way to find the best deal in the Big Blue Box is to spread out around the store, walk around for 1 minute, and then call your friends to tell them what you found. The friend with the best deal becomes the anchor, and the other friends start moving in that direction. You repeat this process every minute until the friends are all in the same spot (2 hours later), looking at the same deal, feeling accomplished and smart.
GridSearch: You’ve successfully shown up to Black Friday at Best Buy with 3 of your friends. However, you forgot to look at the ads in the paper for sales, and you also forgot the walkie-talkies. Not to worry: you decide that the way to find the best deal in the Big Blue Box is to spread out around the store in a 2 x 2 grid, grab the best deal in your area, and then meet your friends at the checkout counter to see who has the best deal. You meet at the checkout counter (5 minutes later), feeling that you didn’t do all you could, but happy that you get to go home, eat leftover pumpkin pie, and watch college football.
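The two shopping strategies can be sketched in a few lines. This is a deliberately simplified, deterministic version of the story, not full particle swarm optimization: real implementations give each particle a velocity and random nudges, and may run for an unpredictable number of rounds. The one-dimensional "store" and the `price` function are made up for illustration.

```python
def price(x):
    """Hypothetical price of the deal at aisle position x (lower is better)."""
    return (x - 3.0) ** 2


def grid_search(positions):
    """Everyone checks one fixed spot, then you compare at the checkout."""
    return min(positions, key=price)


def swarm_search(positions, rounds=20):
    """Everyone repeatedly walks halfway toward the best deal found so far."""
    positions = list(positions)
    best = min(positions, key=price)
    for _ in range(rounds):
        # Move each friend halfway toward the current anchor.
        positions = [x + 0.5 * (best - x) for x in positions]
        # Friends report back; the anchor changes if someone found a better deal.
        candidate = min(positions, key=price)
        if price(candidate) < price(best):
            best = candidate
    return best


starts = [1.0, 4.0, 7.0, 10.0]  # where the four shoppers begin

grid_best = grid_search(starts)
swarm_best = swarm_search(starts)
print("grid:", grid_best, price(grid_best))
print("swarm:", swarm_best, price(swarm_best))
```

The grid only ever sees the four starting spots, so it finishes quickly but is stuck with the best of those. The swarm keeps moving and comparing, which costs more rounds but lets it find deals between the grid points.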
What's GridSearch and why is it important?
Let’s say that you’re baking cookies and you want them to taste as good as they possibly can. To keep it simple, let’s say you use exactly two ingredients: flour and sugar (realistically, you need more ingredients but just go with it for now).
How much flour do you add? How much sugar do you add? Maybe you look up recipes online, but they’re all telling you different things. There’s not some magical, perfect amount of flour you need and sugar you need that you can just look up online.
So, what do you decide to do? You decide to try a bunch of different values for flour and sugar and just taste-test each batch to see what tastes best.
- You might decide to try having 1 cup, 2 cups, and 3 cups of sugar.
- You might also decide to try having 3 cups, 4 cups, and 5 cups of flour.
In order to see which of these recipes is the best, you’d have to test each possible combination of sugar and of flour. So, that means:
- Batch A: 1 cup of sugar & 3 cups of flour
- Batch B: 1 cup of sugar & 4 cups of flour
- Batch C: 1 cup of sugar & 5 cups of flour
- Batch D: 2 cups of sugar & 3 cups of flour
- Batch E: 2 cups of sugar & 4 cups of flour
- Batch F: 2 cups of sugar & 5 cups of flour
- Batch G: 3 cups of sugar & 3 cups of flour
- Batch H: 3 cups of sugar & 4 cups of flour
- Batch I: 3 cups of sugar & 5 cups of flour
If you want, you can draw this out, kind of like you’re playing the game tic-tac-toe.
| |1 cup of sugar|2 cups of sugar|3 cups of sugar|
|---|---|---|---|
|3 cups of flour|1 cup of sugar & 3 cups of flour|2 cups of sugar & 3 cups of flour|3 cups of sugar & 3 cups of flour|
|4 cups of flour|1 cup of sugar & 4 cups of flour|2 cups of sugar & 4 cups of flour|3 cups of sugar & 4 cups of flour|
|5 cups of flour|1 cup of sugar & 5 cups of flour|2 cups of sugar & 5 cups of flour|3 cups of sugar & 5 cups of flour|
Notice how this looks like a grid. You are searching this grid for the best combination of sugar and flour. The only way for you to get the best-tasting cookies is to bake cookies with all of these combinations, taste test each batch, and decide which batch is best. If you skipped some of the combinations, then it’s possible you’ll miss the best-tasting cookies.
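The nine batches can also be generated instead of written out by hand. A minimal sketch using Python's `itertools.product`, with the sugar and flour amounts from the example:

```python
from itertools import product
from string import ascii_uppercase

sugar_cups = [1, 2, 3]
flour_cups = [3, 4, 5]

# Every (sugar, flour) pair, in the same order as Batch A through Batch I.
combinations = list(product(sugar_cups, flour_cups))
batches = dict(zip(ascii_uppercase, combinations))

for name, (sugar, flour) in batches.items():
    print(f"Batch {name}: {sugar} cup(s) of sugar & {flour} cups of flour")
```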
Now, what happens when you’re in the real world and you have more than two ingredients? For example, you also have to decide how many eggs to include. Well, your “grid” now becomes a 3-dimensional grid. If you decide between 2 eggs and 3 eggs, then you need to try all nine combinations of sugar and flour for 2 eggs, and you need to try all nine combinations of sugar and flour for 3 eggs.
The more ingredients you include, the more combinations you'll have. Also, the more values of each ingredient (e.g., 3 cups, 4 cups, 5 cups) you include, the more combinations you have to test.
Applied to Machine Learning: When you build models, you have lots of choices to make. Some of these choices are called hyperparameters. For example, if you build a random forest, you need to choose things like:
- How many decision trees do you want to include in your random forest?
- How deep can each individual decision tree grow?
- What is the minimum number of samples that must be in the final “node” of each decision tree?
The way we test this is just like how you taste-tested all of those different batches of cookies:
- You pick which hyperparameters you want to search over (all three are listed above).
- You pick what values of each hyperparameter you want to search.
- You then fit a model separately for each combination of hyperparameter values.
- Now it’s time to taste test: you measure each model’s performance (using some metric like accuracy or root mean squared error).
- You pick the set of hyperparameters that had the best-performing model. (Just like your recipe would be the one that gave you the best-tasting cookies.)
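The steps above can be sketched as a short loop. To keep the example self-contained, `validation_error` is a made-up stand-in for "fit a model and measure its performance"; in practice you would train a real model there. The hyperparameter names and values are also only illustrative.

```python
from itertools import product

# Steps 1 and 2: the hyperparameters to search and the values to try for each.
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5],
}


def validation_error(n_trees, max_depth, min_samples_leaf):
    """Hypothetical score (lower is better). A real version would fit a model."""
    return (
        abs(n_trees - 100) / 100
        + abs(max_depth - 4) / 4
        + abs(min_samples_leaf - 5) / 5
    )


# Steps 3 and 4: "fit" and score one model per combination.
results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((validation_error(**params), params))

# Step 5: keep the combination with the best (lowest) error.
best_error, best_params = min(results, key=lambda r: r[0])
print(len(results), "models tested; best:", best_params)
```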
Just like with ingredients, the number of hyperparameters and number of levels you search are important.
- Trying 2 hyperparameters (ingredients) of 3 levels apiece → 3 * 3 = 9 combinations of models (cookies) to test.
- Trying 2 hyperparameters (ingredients) of 3 levels apiece and a third hyperparameter with two levels (when we added the eggs) → 3 * 3 * 2 = 18 combinations of models (cookies) to test.
The formula for that is: you take the number of levels of each hyperparameter you want to test and multiply them together. So, if you try 5 hyperparameters, each with 4 different levels, then you’re building 4 * 4 * 4 * 4 * 4 = 4^5 = 1,024 models.
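A quick way to check these counts is to multiply the number of levels directly. The helper name below is made up for this sketch:

```python
import math
from itertools import product


def n_models(levels_per_hyperparameter):
    """Number of models a full grid needs: the product of the level counts."""
    return math.prod(levels_per_hyperparameter)


print(n_models([3, 3]))     # sugar and flour
print(n_models([3, 3, 2]))  # sugar, flour, and eggs
print(n_models([4] * 5))    # 5 hyperparameters with 4 levels each

# Cross-check against actually enumerating the grid.
enumerated = len(list(product(range(3), range(3), range(2))))
```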
Building models can be time-consuming, so if you try too many hyperparameters and too many levels of each hyperparameter, you might get a really high-performing model but it might take a really, really, really long time to get.
DataRobot automatically uses GridSearch to find the best hyperparameters for its models. The search is not exhaustive, meaning it does not try every possible combination of hyperparameters, because that would take a very, very long time and might be impossible.
In one line, but technical: GridSearch is a commonly-used technique in machine learning that is used to find the best set of hyperparameters for a model.
Bonus note: You might also hear about RandomizedSearch, which is an alternative to GridSearch. Rather than setting up a grid to check, you specify a range for each hyperparameter (e.g., somewhere between 1 and 3 cups of sugar, somewhere between 3 and 5 cups of flour), and a computer randomly generates, say, 5 combinations of sugar and flour. It might look like this:
- Batch A: 1.2 cups of sugar & 3.5 cups of flour.
- Batch B: 1.7 cups of sugar & 3.1 cups of flour.
- Batch C: 2.4 cups of sugar & 4.1 cups of flour.
- Batch D: 2.9 cups of sugar & 3.9 cups of flour.
- Batch E: 2.6 cups of sugar & 4.8 cups of flour.
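A minimal sketch of how such random batches could be drawn, using Python's standard `random` module. The seed and the rounding to one decimal are arbitrary choices for this example, and the amounts it draws will differ from the batches listed above.

```python
import random

rng = random.Random(0)  # fixed seed so reruns produce the same batches

# Draw 5 random recipes from the allowed ranges instead of walking a fixed grid.
random_batches = [
    {
        "sugar": round(rng.uniform(1, 3), 1),  # between 1 and 3 cups
        "flour": round(rng.uniform(3, 5), 1),  # between 3 and 5 cups
    }
    for _ in range(5)
]

for name, batch in zip("ABCDE", random_batches):
    print(f"Batch {name}: {batch['sugar']} cups of sugar & {batch['flour']} cups of flour")
```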
Keras vs. TensorFlow
In DataRobot, “TensorFlow” really means “TensorFlow 0.7” and “Keras” really means “TensorFlow 1.x”.
In the past, TensorFlow had many interfaces, most of which were lower level than Keras, and Keras supported multiple backends (e.g., Theano and TensorFlow). However, TensorFlow consolidated these interfaces and Keras now only supports running code with TensorFlow, so as of TensorFlow 2.x, Keras and TensorFlow are effectively one and the same.
Because of this history, upgrading from an older TensorFlow to a new TensorFlow is easier to understand than switching from TensorFlow to Keras.
Keras vs. TensorFlow is like an automatic coffee machine vs. grinding and brewing coffee manually.
There are many ways to make coffee, meaning TensorFlow is not the only technology that can be used by Keras. Keras offers "buttons" (an interface) that are powered by a specific "brewing technology" (TensorFlow, CNTK, Theano, or something else, known as the Keras backend).
Earlier, DataRobot used the lower-level technology, TensorFlow, directly. But just like grinding and brewing coffee manually, this took a lot more effort and increased the maintenance burden, so DataRobot switched to a higher-level technology, Keras, that provides many nice things under the hood. For example, it lets DataRobot deliver more advanced blueprints in the product more quickly, which would have taken a lot of effort if implemented manually in TensorFlow.
How does a CPU differ from a GPU (in terms of training ML models)?
Think about a CPU (central processing unit) as a 4-lane highway with trucks delivering the computation, and a GPU (graphics processing unit) as a 100-lane highway with little shopping carts. GPUs are great at parallelism, but only for less complex tasks. Deep learning specifically benefits from that since it's mainly batches of matrix multiplication, and these can be parallelized very easily. So, training a neural network on a GPU can be 10x faster than on a CPU. But not all model types get that benefit.
Here's another one:
Let’s say there is a very large library, and the goal is to count all of the books. The librarian is knowledgeable about where books are, how they’re organized, how the library works, etc. The librarian is perfectly capable of counting the books on their own and they’ll probably be very good and organized about it.
But what if there is a big team of people who could count the books with the librarian—not library experts, just people who can count accurately.
- If you have 3 people who count books, that speeds up your counting.
- If you have 10 people who count books, your counting gets even faster.
- If you have 100 people who count books...that’s awesome!
A CPU is like a librarian. Just like you need a librarian running a library, you need a CPU. A CPU can basically do any job that you need done. Just like a librarian could count all of the books on their own, a CPU can do math things like building machine learning models.
A GPU is like a team of people counting books. Just like counting books is something that can be done by many people without specific library expertise, a GPU makes it much easier to take a job, split it among many different units, and do math things like building machine learning models.
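The book-counting story follows a "split the job, work in parallel, combine the answers" pattern, which the sketch below imitates with a small pool of worker threads. It runs on the CPU and does not use a GPU at all. It only illustrates how a simple task can be divided among many simple workers, and the shelf sizes are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def count_shelf(shelf):
    """One simple job that any helper can do: count the books on one shelf."""
    return len(shelf)


def count_with_team(shelves, workers):
    """Hand shelves out to a team of helpers, then add up their counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_shelf, shelves))


# A made-up library: five shelves holding different numbers of books.
shelves = [["book"] * n for n in (120, 80, 45, 200, 155)]

librarian_total = sum(count_shelf(shelf) for shelf in shelves)  # one worker, shelf by shelf
team_total = count_with_team(shelves, workers=4)                # many workers at once

print(librarian_total, team_total)
```

Both approaches give the same total; the difference is only in how many shelves are being counted at the same time.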
For more details and another analogy, see this post in the DataRobot Community.