Lecture 05 – Training Versus Testing

January 6, 2020. By Kody Olson


ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about
error and noise. And these are two notions that relate
the learning problem to practical situations. In the case of error measures, we
realized that in order to specify the error that is caused by your hypothesis,
we should try to estimate the cost of using your h, instead of f
which should have been used in the first place. And that is something the user can
specify, of the price to pay when they use h instead of f. And that is the principled way
of defining an error measure. In the absence of that, which happens
for quite a bit of time, we go to plan B and resort to analytic properties, or
practical properties of optimization, in order to choose the error measure. Once you define the error between the
performance of your hypothesis versus the target function on a particular
point, you can plug this in, into different error quantities, like the
in-sample error and the out-of-sample error, and get those values in
terms of the error measure by getting an average. In the case of the training set, you
estimate the error on the training points, and then you average with
respect to the N examples that you have. And in the case of out-of-sample,
theoretically the definition would be that you also evaluate the error between
h and f on a particular point x, give the weight of x according to its
probability, and get the expected value with respect to this x. The notion of noisy targets came from
the fact that what we are trying to learn may not be a deterministic
function, a function in the strict mathematical sense, where y is uniquely
determined by the value of x. Rather, y is affected by x: y is distributed according to
a probability distribution, which gives you y given x. And we talked about, for example, in
the case of credit application, two identical applications may lead
to different credit behavior. Therefore, the credit behavior is
a probabilistic thing, not a deterministic function of
the credit application. You can go back to our first example,
let’s say, of the movie rentals. If you rate a movie, you may rate the
same movie at different times differently, depending on your
mood and other factors. So there’s always a noise factor
in these practical problems. And that is captured by the transitional
probability from x to y, probability of y given x. When we look at the diagram involving
this probability– so now we replace the target function,
which used to be a function, by a probability distribution, which
can be modeled as a target function plus noise. And these feed into the generation
of the training examples. And when you look at the unknown input
distribution, which we introduced technically in order to get the benefit
of Hoeffding inequality, that also feeds into the
training example. This determines x. And this determines y given x. And then you generate these examples
independently, according to this distribution. So when we had x being the only
probabilistic thing, and y being a deterministic function of x, then
x_1 was independent of x_2, independent of x_N. And then you compute each y, according
to the function, on the corresponding x. When you have the noisy version, then
the pair x_1 and y_1 is generated according to the joint probability
distribution, which is P of x, the original one, times P of y given
x, the one you introduced to accommodate the noise. And then the independence lies
between different pairs. So x_1, y_1 would be independent of x_2,
y_2, independent of x_3, y_3 and so on. And when you get the expected values
for errors, you now have to take into consideration the probability with
respect to both x and y. So what used to be the expected value
with respect to x, is now the expected value with respect to x and y. And then you plug in x into h, and
correspond it to the probabilistic value of y that happened to occur. And that would be now the out-of-sample
error in this case. Now in this lecture, I’m going to start
the theory track that will last, on this particular route, three
lectures, followed by another theory lecture on a related but
different topic. And the idea is to relate training to
testing, in-sample and out-of-sample, in a realistic way. So the outline will be the following. We’ll spend some time talking about
training versus testing, a very intuitive notion. But we’d like to put the mathematical
framework that describes what is training versus testing. And then we will introduce quantities
that will be mathematically helpful in characterizing that relationship. And after I give you a number of
examples to make sure that the notion is clear, we are going to introduce
the key notion, the break point. And the break point is the one that will
later result in the VC dimension, the main notion in the
theory of learning. And finally, I end up with a puzzle. It’s an interesting puzzle that will
hopefully fix the ideas that we talked about in the lecture. So now let’s talk about training
versus testing. And I’m going to take a very simple
example that you can relate to. Let’s say that I’m giving
you a final exam. So now I want to help you out. So before the final exam, I give you
some practice problems and solutions, so you can work on and prepare
yourself for the final exam. That is very typical. Now if you look at the practice problems
and solutions, this would be your training set, so to speak. You’re going to look at the question. You’re going to answer. You’re going to compare it
with the real answer. And then you are going to adjust your
hypothesis, your understanding of the material, in order to do it better, and
go through them and perhaps go through them again, until you get them right or
mostly right or figure out the material. And now you are more ready
for the final exam. Now the reason I gave you the practice
problems and solutions is to help you do better on the final, right? Why don’t I just give you the
problems on the final, then? Excellent idea, I can see! Now the problem is obvious. The problem is that doing well
on the final is not the goal, in and of itself. The goal is for you to learn the
material, to have a small E_out. The final exam is only
a way of gauging how well you actually learned. And in order for it to gauge how well
you actually learned, I have to give you the final at the point you have
already fixed your hypothesis. You prepared. You studied. You discussed with people. You now sit down to take
the final exam. So you have one hypothesis. And you go through the exam. And therefore, your answer on, let’s
say, the 50 questions of the final– hopefully, it’s not going to be
that long if there’s a final– will reflect what your understanding
will be outside. So the distinction is conceptual. And now, let’s put mathematically
what is training versus testing? It will be an extremely simple
distinction, although it’s an important distinction. Here is what testing is, in terms
of a mathematical description. You have seen this before. This is the plain-vanilla Hoeffding. This part is how well you did
on the final exam. This is how well you understand
the material proper. And since you have only one
hypothesis– this is a final, you are fixed, and you just take the exam. Your performance on the exam tracks well
how you understand the material. And therefore, the difference
between them is small. And the probability that it’s not small
is becoming less and less, when the number of questions,
in this case, goes up. So that is what testing is. How about training? Almost identical, except
for one thing– this fellow. Because in the case of training,
this is how you performed on the practice problems. In the practice problems, you had
the answers, and you modified your hypothesis. And you looked at it, and you
got an answer wrong. So you modified your hypothesis again. You are learning better. That’s all very nice. But now the practice set
is contaminated. You pretty much almost
memorized what it is. And there’s a price to pay for that, in
terms of how your performance on the practice, which is E_in in this case,
tracks how well you understand the material, which is still E_out. And the price you pay is
how much you explored. And that was reflected by
the simple M, which was the number of hypotheses in the very simple
derivation we did. So if you want an executive summary of
this lecture, we are just going to try to get M to be replaced by something
more friendly, because you realize that M– if you just measure the
complexity of your hypothesis set by the number of hypotheses– this is next
to useless in almost all cases. Something as simple as the perceptron
has M equals infinity. And therefore, this guarantee
is no guarantee at all. If we can replace M with another
quantity, and justify that, and that quantity is not infinite even if
the hypothesis set is infinite, then we are in business. And we can start talking about the
feasibility of learning in an actual model, and be able to establish
the notion in a way that we can apply to a real situation. That’s the plan. We’re talking about M, so the
first question is to ask, where did this M come from? If we are going to replace it, we need
to understand where it came from, to understand the context
for replacing it. Well, there are bad events that
we have talked about. And the bad events are called
B, because they are bad. That’s good! And then– these are the bad events. What is the bad event that
we were trying to avoid? We were trying to avoid the situation
where your in-sample performance does not track the out-of-sample
performance. If their difference is bigger than
epsilon, this is a bad situation. And we’re trying to say that
the probability of a bad situation is small. That was the starting point. Now we applied the union bound,
and we got the probability of several bad events. This is the bad event for
the first hypothesis. You can see here that there
is m, a small m. m is 1, 2, 3, 4, up to M.
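The bad events and the union bound referred to here can be written out as a short sketch. The formulas are not read out in the transcript, so the block below is a reconstruction under the usual notation (epsilon for the tolerance, N for the sample size, h_m for the m-th hypothesis), with the Hoeffding inequality applied to each term of the sum:

```latex
\[
\mathcal{B}_m = \bigl\{\, |E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon \,\bigr\},
\qquad m = 1, 2, \dots, M
\]
\[
\mathbb{P}\bigl[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \cdots \text{ or } \mathcal{B}_M\bigr]
\;\le\; \sum_{m=1}^{M} \mathbb{P}[\mathcal{B}_m]
\;\le\; 2M\, e^{-2\epsilon^{2} N}
\]
```

The first inequality holds for any events, whatever their overlap. The factor M enters only because the sum has M terms.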
So there are M hypotheses, capital M hypotheses that I’m talking about. And I would like the probability of
any of them happening to be small. Why is that? Because your learning algorithm is free
to pick whichever hypothesis it wants, based on the examples. So if you tell me that the probability
of any of the bad events is small, then whichever hypothesis your algorithm
picks, they will be OK. And I want that guarantee to be there. So let’s try to understand the
probability of the B_1 or B_2 or B_M. What does it look like? Well, if you look at a Venn diagram,
and you place B_1 and B_2 and B_3 as areas here, these areas– these are different events. They could be disjoint, in which case
the circles will be far apart. Or they could be coincident, which
will be on top of each other. They could be independent, which means
that they are proportionately overlapping. There could be many situations. Now the point of the bound is that we
would like to make that statement regardless of the correlations
between the events. And therefore, we use the union bound,
which actually bounds it by the total area of the first one, plus the total
area of the second one, et cetera, as if they were disjoint. Well, that will always hold regardless
of the level of overlap. But you can see that this is a poor
bound because in this case, we are estimating it to be about three times
the area, when it’s actually closer to just the area, because the overlap
is so significant. And therefore, we would like to be able
to take into consideration the overlaps, because with no overlaps,
you just get M terms. And you’re stuck with M, and
infinity, in almost all the interesting hypothesis sets. Now of course, in principle, I can give you the hypothesis set,
which is the perceptron. And you can try to formalize, what
is this bad event in terms of the perceptron. And what happens when you go to the
other perceptron, and try to get the full joint distribution of all of these
guys, and solve this exactly. Well, you can, in principle–
theoretically. It’s a complete nightmare,
completely undoable. And if we have to do this for every
hypothesis set you propose, there wouldn’t be learning theory around. People will just give up. So what we are going to do, we are
going to try to abstract from the hypothesis set a quantity that is
sufficient to characterize the overlaps, and get us a good bound,
without having to go through the intricate details of analyzing how the
bad events are correlated. That would be the goal. And we will achieve it, through
a very simple argument. So that’s where M comes from. When we asked, can we improve on M? Maybe M is the best we can do. It’s not like we
wish to improve it, so it has to be improved. Maybe that’s the best we can say. If you have an infinite hypothesis set, then
you’re stuck, and that’s that. But it turns out that, no, the overlap
situation we talked about is actually very common. Yes, we can improve on M. And the reason
is that the bad events are extremely overlapping in
a practical situation. Let’s take the example we know, which
is the perceptron, to understand what this is. I’m going through the example because
now we have lots of binary things– +1 versus -1 for the target,
+1 versus -1 for the hypothesis, agreeing versus
disagreeing, et cetera. I want to pin down exactly what is
the bad event, in terms of this picture, so that we understand what
we are talking about. Here is the target function
for a perceptron. And it returns +1 for some
guys, -1 for some guys. That’s easy. And then you have a hypothesis,
a perceptron. And this is not the final hypothesis. This is a badly performing hypothesis. But it is a general perceptron. If you find any vector of weights,
you’ll find another blue line. So now in terms of this picture, could
someone tell me what is E_out? What is the out-of-sample error for this
hypothesis, when it’s applied to this target function? It’s not that difficult. It is actually just these areas,
the differential areas. This is where they disagree. One is saying +1. One is saying -1. So these two areas– if you get the total
area if it’s uniform, the total probability if it’s not– then this will give
you the value of E_out. That’s one quantity. How about E_in? For E_in, you need a sample. So first, you generate a sample. Here’s a constellation of points. Some of these points, as you
see, will fall into the bad region, here and here. And I color them red. So the fraction of red compared to
all the sample gives you E_in. That is understood. This is E_in and E_out. And these are the guys that I
want to track each other. OK, fine. I understand this part. Now look at this: what is the change in
E_in and E_out, when you change your hypothesis? So here’s your first hypothesis. Now take another perceptron. You probably already suspect that
this is hugely overlapping. Whatever you’re talking about, it must
be overlapping, because they’re so close to each other. But let’s pin down the specific event
that, we would like to argue, is overlapping. So the change in E_out when you go from,
let’s say, the blue hypothesis, this blue hypothesis, to
the green hypothesis– the change in E_out would be the area
of this yellow thing, not very much. A very thin area. That’s where E_out changed, right? So if you look at the area, that gives you
delta E_out. If you look at the delta E_in, the change of the labels
of data points– if one of the data points happens
to fall in this yellow region, then its error status will change from one
hypothesis to another, because one hypothesis got it right, and
the other one got it wrong. Now the chances of a point
falling here is small. So you can see why we are arguing that
the change delta E_out and the change delta E_in is small. The area is small, and the probability
of a point falling there is small. Moreover, they are actually moving in
the same direction because the change is actually depending on the
area of the yellow part. So this– let’s say that this is increasing. If they increase, they increase both,
because I get a net positive area for the delta E_out. And the probability of falling
there also increases. Now, the reason I’m saying that, is
because what we care about are these. We would like to make the statement
that, how E_in tracks E_out for the first hypothesis, for the blue
perceptron, is comparable to how E_in tracks E_out for the second one. Why are we interested in that? Because we would like to argue that this
exceeding epsilon happens often, when this exceeds epsilon. The events are overlapping. We are not looking for the absolute
value of those, we are just saying that, if this exceeds epsilon,
this also exceeds epsilon most of the time. And therefore, the picture we had
last time is actually true. These guys are overlapping. The bad events are overlapping. And at least we stand a hope that we
will get something better than just counting the number of hypotheses, for
the complexity we are seeking. So we can improve M. That’s good news. We can improve M. We’re going
to replace it with something. What are we going to replace it with? I’m going to introduce to you now the
notion that will replace M. It is not going to be completely
obvious that we can actually replace M with this quantity. That will require a proof. And that will take us
into next lecture. The purpose here is to define the
quantity, and make you understand it well, because this is the quantity that
will end up characterizing the complexity of any model you use. So we want to understand it well. And we are going to motivate that it can
replace M. It will be plausible. It makes sense. It’s not a crazy quantity. It also counts the number
of hypotheses, of sorts. And therefore, let’s define the quantity
and become familiar with it. And then next time, we will like the
quantity so much that we’ll bite the bullet, and go through the proof that we
can actually replace M with this quantity. So what is the quantity? The quantity is based
on the following. When we count the number of hypotheses,
we obviously take into consideration the entire input space. What does that mean? These are four different perceptrons. So I take the input space. And the reason these guys are
different is because they are different on at least one point
in the input space. That’s what makes two
functions different. And because the input space is infinite,
continuous, that’s why we get an infinite number of hypotheses. So let’s say that, instead of counting
the number of hypotheses on the entire input space, I’m going to restrict
my attention only to the sample. So I generate only the input points,
which are finite points, put them on the diagram. So I have this constellation
of points. And when I look at these points alone,
regardless of the entire input space, those perceptrons will classify them. These guys will turn into red and blue,
according to the regions they fall in. Now, in order to fully understand what
it means to count only on the number of points, we have to wipe
out the input space. So that’s what I’m going to do. That’s what you have. So you can imagine the perceptron
is somewhere. And it’s splitting the points. And now what I’m counting is– for this constellation, which is
a fixed constellation of points, how many patterns of red
and blue can I get? Now when you do this, you’re not
counting the hypotheses proper, because the hypotheses are
defined on the input space. You are counting them
on a restricted set. But still, you’re counting. You’re counting the number
of hypotheses. For example, if I give you a hypothesis
set where you get all possible combinations of red and blue,
that’s a powerful hypothesis set. If I give you a hypothesis set where you get
only a few, that’s not so powerful a hypothesis set. So the count here also corresponds in
our mind to the strength, or the power, of the hypothesis set, which in our mind
is what we try to capture by the crude M. So we are going to
count the number of hypotheses. I’m putting them between quotations. Why? Because now the hypotheses are defined
only on a subset of the points. So I’m going to give them a different
name, when I define them only on a subset of the points, in order not to
confuse the hypotheses, on the general input space, with this case. I’m going to call them dichotomies. And the idea is that
I give you N points. And there is a dichotomy between what
goes into red, and what goes into blue. That’s where the name came from. So when you look only at the points,
the pattern of which ones are blue and which ones are
red is a dichotomy. And if you want to understand
it, let’s look at this. Let’s say that you’re looking
at the full input space. And this is your perceptron. And this is the function
it’s implementing. And then you put a constellation
of points. The way to understand dichotomies is to
think that I have an opaque sheet of paper, that has holes in it. And you put that opaque sheet of paper
on top of your input space. So you don’t see the input space. You only see it through the
eyes of those points. So what do you see when you put this? You end up with this here. You don’t see anything. You don’t see where the hypothesis is. You just see that these guys
turned blue, and these guys turned red or pink. Now as you vary the perceptron, as you
vary the line here, you are not going to notice it here, until the line
crosses one of the points. So I could be running around here, here,
here, and here, and generating an infinite number of hypotheses, for
which I’m charging a huge M. And this guy is sitting here, looking. Nothing happened. It’s the same thing. I’m counting it as 1. And then when you cross, you end
up with another pattern. So all of a sudden, these
guys are blue. And these guys are red. That’s when, let’s say, this guy
is horizontal here rather than vertical here. So you can always think that we reduced
the situation to where we’re going to look at the problem exactly
as it is, except through this sheet that has only N holes. Let’s put, in mathematical terms, the
dichotomies which are the mini hypotheses, the hypotheses restricted
to the data points. A hypothesis formally is a function. And the function takes the full
input space X, and produces -1, +1. That’s the blue and red
region that we saw. A dichotomy, on the other hand,
is also a hypothesis. We can even give it the same name,
because it’s returning the same values for the points it’s allowed
to return values on. But the domain of it is not
the full input space, but very specifically, x_1 up to x_N. These are– each one of these points belongs to
X, to the input space. But now I’m restricting
my function here. And again, the result is -1,
+1, exactly as it was here. That’s what a dichotomy is. Now if I ask you how many hypotheses
there are, let’s say for the perceptron case? Very easy. It can be infinite. In the case of the perceptron,
it’s infinite. Why? Because this guy is seriously
infinite. So the number of functions is
just infinite, by a margin! So that’s fine. Now if you ask yourself, what is
the number of dichotomies? Let’s look at the notation first,
and then answer the question. The dichotomy is a function
h applied to one of those. So when I talk about it, the value, I
would say h of x_1 or h of x_2, one value at a time. If I decide to use the fancy notation,
I say I’m going to apply small h to the entire vector, x_1, x_2, up to x_N. I would be meaning that you tell me the
values of h of x on each of them. So you return a vector of
the values, h of x_1, h of x_2, up to h of x_N. That’s not an unusual notation. Now if you apply the entire set of
hypotheses H to that, what you are doing is that you are applying
each member here, which is h, to the entire vector. Each time you apply one of those
guys, you get -1, -1, +1, +1, -1, +1, -1, et cetera. So you get a full dichotomy. And then you apply another h, and you
get another dichotomy, and so on. However, as you vary h, which has
an infinite number of guys, many of these guys will return exactly the same
dichotomy, because the dichotomies are very restricted. I have these N points only. And I’m returning +1 or -1
on them only. So how many different ones
can I possibly get? At most, 2 to the N. If H is
extremely expressive, it will get you all 2 to the N. If not, it will get
you fewer than 2 to the N. So I can start with the most infinite
type of hypothesis. And if I translate it into dichotomies,
I have an upper bound of 2 to the N for the number
of dichotomies I have. So this thing now becomes a candidate
for replacing the number of hypotheses. Instead of the number of hypotheses,
we’re talking about the number of dichotomies. Now we define the actual quantity. Capital M is red. And I keep it red throughout. And we are going now to define small
m, which I will also keep as red. That will hopefully, and provably
as we will see next time, replace M. It’s called the growth function. What is the idea of
the growth function? The growth function counts
the most dichotomies you can get, using your hypothesis
set on any N points. So here is the game. I give you a budget N.
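The counting of dichotomies described so far can be sketched in code. This is only an illustration under stated assumptions: the helper names `perceptron` and `dichotomies`, the small integer weight grid standing in for the infinite perceptron set, and the two point placements are all hypothetical choices, not taken from the lecture.

```python
from itertools import product


def perceptron(w0, w1, w2):
    """Return h(x) = +1 if w0 + w1*x1 + w2*x2 > 0, else -1."""
    def h(x):
        return 1 if w0 + w1 * x[0] + w2 * x[1] > 0 else -1
    return h


def dichotomies(hypotheses, points):
    """Distinct label patterns that the hypotheses produce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}


# Assumption: a finite grid of integer weights stands in for the
# infinite set of perceptrons (integers keep the arithmetic exact).
weights = list(product(range(-3, 4), repeat=3))
hypotheses = [perceptron(*w) for w in weights]

# Two hypothetical placements of N = 3 points.
general = [(0, 0), (1, 0), (0, 1)]      # not on one line
collinear = [(0, 0), (1, 1), (2, 2)]    # all on one line

n_hyp = len(hypotheses)
n_general = len(dichotomies(hypotheses, general))
n_collinear = len(dichotomies(hypotheses, collinear))

print("hypotheses tried:", n_hyp)
print("dichotomies, general placement:", n_general)
print("dichotomies, collinear placement:", n_collinear)
print("upper bound 2**N:", 2 ** len(general))
```

Under these assumptions, the 343 weight vectors collapse to 8 patterns on the first placement and 6 on the second. Many hypotheses map to the same dichotomy, the count never exceeds 2 to the N, and it depends on where the points are placed.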
That’s my decision. You choose where to place
the points, x_1 up to x_N. Your choice is based on your attempt
to find as many dichotomies as possible, on the N points, using
the hypothesis set. So it would silly, for example, to take
the points and put them, let’s say, on a line, because now you are
restricted in separating them. But you can see the most I can
get if I put them in this general constellation. And then you count the number of
dichotomies you are going to get. And what you’re going to report to me is
the value of the growth function on the N that I passed on to you. So I give you N, you go through this
exercise, and you return a number that is the growth function. Let’s put it formally. The growth function is going to be
called m, in red as I promised. And it is the maximum. Maximum with respect to what? With respect to any choice of
N points from the input space. That is your part. I gave you the N. So I
told you what N is. And then you chose x_1 up to x_N with
a view to maximizing something. What are you maximizing? Well, we had this funny notation. H applied to this entire
vector is actually the set of dichotomies, the vectors, -1, +1,
-1, +1, and then the next guy and the next guy–
the actual vectors here. When you put this cardinality on top
of them, you’re just counting them. You’re asking yourself: how
many dichotomies do I get? So you’re maximizing, with respect to the
choice of x_1 up to x_N, this thing. That will give you the most expressive
facet of the hypothesis set on N points, that number. I tell you 10. And you come back with the number 500. It means that by your choice of the x_1 up
to x_10, you managed to generate 500 different guys, according to the
hypothesis set that I gave you. Now because of this, you can see now
that there is an added notation here. It used to be m, but it actually
depends on the hypothesis set, right? It’s the growth function for
your hypothesis set. So I’m making that dependency explicit,
by putting a subscript H. Furthermore, this is
a full-fledged function. M was a number. I give you a hypothesis set. It’s a number. Well, it happens to be infinite,
but it’s a number. Here, I’m giving you
a full function. That is, I tell you N, you tell me
what the growth function is. So it’s a little bit more complicated. And because it is this way, m_H is
actually a function of N. That’s the growth function. So that is the notion. Now what can we say about
the growth function? Well, if the number of dichotomies is
at most 2 to the N, because that’s as many +1, -1, N-tuples you can
produce, then the maximum of them is also bounded by the same thing, at most
2 to the N. Well, if we are going to replace M with m, I would say
2 to the N is an improvement over infinity, if we can afford to do it. Maybe it’s not a great improvement,
but it is an improvement nonetheless. Now, let’s apply the definition to
the case of perceptrons, in order to give it flesh, so we understand
what the notion is. It’s not just an abstract quantity. We take the perceptrons, and we would
like to get the growth function of the perceptrons. Well, getting the growth function of
the perceptron is quite a task. If I ask you, what is M
for the perceptron? Infinity. And then you go home. What is the growth function
of the perceptron? You have to tell me what is the growth
function at N equals 1, what is at N equals 2, at N equals
3, at N equals 4. It’s a whole function. So we say, 1 and 2 is easy. Let’s start with N equals 3. So I’m choosing 3 points. And I chose them wisely, so that I can
maximize the number of dichotomies. And now I’m asking myself, what is the
value of the growth function for the perceptron for the value
N equals 3? Well, it’s not that difficult. You can see, I can actually get
everything there is to get. Why? Because I can have my line here, or I
can have my line here, or I can have my line here. That’s 3 possibilities times 2 because
I can make it +1 versus two -1’s, or -1 versus two +1’s. We are counting 6 so far. And then I can have my hypothesis
sitting here. That will make them all +1. Or I can have it sitting here, which
makes them all -1. That’s 8. That’s all of them. The perceptron hypothesis is as strong
as you can get, if you only restrict your attention to 3 points. So the answer would be what? Is it already 8? Wait a minute. Someone else chose the points co-linear,
and then found out that if you want these guys to go to the -1
class, and this guy to go to the +1 class, there is no perceptron
that is capable of doing this. Correct? You cannot pass a line that will make
these two guys go to +1, and this guy go to -1, if these are co-linear. Does this bother us? No. Because we are taking the maximum. So this, the quantity you computed
here, since you got to the 8– you cannot go above 8. That defines it. And indeed, you can with authority
answer the question that the growth function for this case,
m at N equals 3, is 8. Now let’s see if we are still in
luck when we go to N equals 4. What is the growth function
for 4 points? We’ll choose the points in
general position again. We are not going to have any
co-linearity, in order to maximize our chances. But then we are stuck with
the following problem. Even if you choose the points in
general position, there is this constellation– there is this particular pattern on the
constellation, which is -1, -1, and +1, +1. Can you generate this
using a perceptron? No. And the opposite of it,
you cannot either. If this was -1, -1, and
this one, +1, +1. Can you find any other 4 points, where
you can generate everything? No. I can play around, and there are always
2 missing guys, or even worse. If I choose the points unwisely,
I will be missing more of them. So the maximum you are getting is that
you are missing 2 out of all the possibilities. And the growth function here is 14, not
16, as it might have been if you had the maximum. Now this is a very satisfactory
result, because perceptrons are pretty limited models. We use them because they are
simple, and there’s a nice algorithm that goes with them. So we have to expect that the quantity
we are measuring the sophistication of the perceptrons with, which is the
growth function, had better not be the maximum possible. Because if it’s the maximum possible,
then we are declaring: perceptrons are as strong as can be. Now they break. And they are limited. And if I pick another model, which,
let’s say– just for the extreme case– the set of all hypotheses. What would be the growth function
for the set of all hypotheses? It would be 2 to the N, because
I can generate anything. So now, according to this measure that
I just introduced, the set of all hypotheses is stronger
than the perceptrons. Satisfactory result, simple
but satisfactory. Now what I’m going to do– I’m going to take some examples, in
which we can compute the growth function completely for all values of
N. You can see that if I continued with this and say, let’s
go with the perceptron. 5 points. You put the 5 points,
and then you try. Am I missing this? Or maybe if I change the position
of the points. It’s just a nightmare, just to get 5. And basically, if you just do it by
brute force, it’s not going to happen. So I’m taking examples where we can
actually, by a simple counting argument, get the value of the growth
function for the entire domain, N from 1 up to infinity, in
order to get a better feel for the growth function. That’s the purpose of this portion. Our first model, I’m going
to call positive rays. Let’s look at what positive
rays look like. They are defined on the real line. So the input space is
R, the real numbers. And they are very simple. From a point on, which we are going to
call ‘a’– this is the parameter that decides one hypothesis versus
the other in this particular hypothesis set. All the points that are
bigger go to +1. All the points that are
smaller go to -1. And it’s called positive ray, because
here is the ray– very simple hypothesis set. Now in order to define the growth
function, I need a bunch of points. So I’m going to generate some points. I’m going to call them x_1 up to x_N. And I am going to choose them
as general as possible. I guess there is very little generality
when you’re talking about a line. Just make sure that they don’t
fall on each other. If they fall on each other, you cannot
really dichotomize them at all. If you put them separately,
you’ll be OK. So you have these N points. Now when you apply your hypothesis,
the particular hypothesis that is drawn on the slide, to these points,
you are going to get this pattern. True? And you’re asking yourself, how many
different patterns I can get on these N points by varying my hypothesis,
which means that I’m varying the value of ‘a’? That is the parameter that gives me
one hypothesis versus the other. Formally, the hypothesis set is a set of functions
from the real numbers to -1, +1. And I can actually find
an analytic formula here. If you want an analytic formula,
you remember the sign? This is, I think– If you apply it, that’s exactly
what I described. Now we ask ourselves, what
is the growth function? Here is a very simple argument. If you have N points, the value of the
dichotomy– which ones go to blue and which ones go to red– depends on
which segment between the points ‘a’ will fall in. If ‘a’ falls here, you get this pattern. If ‘a’ falls here, this guy will be red. And the rest of the guys will be blue. So I get a different dichotomy. I get different dichotomies when
I choose a different line segment. How many line segments are
there to choose from? I have N points. I have N minus 1 sandwiched ones, and
one here when all of them are red, and one here when all of them are blue. Right? So I have N plus 1 choices. And that’s exactly the number of
dichotomies I’m going to get on N points, regardless of what N is. So I found that the growth function,
for this thing, is exactly N plus 1. Let’s take a more sophisticated model,
and see if we get a bigger growth function. Because that’s
the whole idea, right? The next guy is positive intervals. What are these? They’re like the other guys, except
they’re a little bit more elaborate. Instead of having a ray,
you have an interval. Again, you’re talking
about the real line. And you are going to define
an interval from here to here. And anything that lies within
here, will map to +1 and will become blue. And anything outside, whether it’s right
or left, will go to -1. That’s obviously more powerful than the
previous one, because you can think of the positive ray as having
an infinite interval. That’s fine. So you put the points. We have done this before. And they get classified this way. And I’m asking myself, how many
different dichotomies I can get now by choosing really 2 parameters, the
beginning of the interval and the end of the interval. These are my 2 parameters, that will tell
me one hypothesis versus the other. How many different patterns can I get? Again, the function is very simple. It’s defined on the real numbers.
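The two models defined so far are small enough to check by brute force. The sketch below is an illustration added to the transcript, not something shown in the lecture: it assumes distinct points on the real line, takes the positive ray to be h(x) = sign(x - a) as described above, and treats the positive interval as an open interval (lo, hi). The function names are hypothetical. It tries one representative parameter value inside each of the N + 1 segments and collects the distinct patterns, so the printed counts can be compared with the counting arguments in the lecture.

```python
from itertools import combinations_with_replacement


def segment_representatives(points):
    """Return one value inside each of the N + 1 segments the points cut the line into."""
    xs = sorted(points)
    inner = [(left + right) / 2 for left, right in zip(xs, xs[1:])]
    return [xs[0] - 1.0] + inner + [xs[-1] + 1.0]


def positive_ray_dichotomies(points):
    """Patterns produced by h(x) = sign(x - a) as the threshold a varies."""
    return {
        tuple(+1 if x > a else -1 for x in points)
        for a in segment_representatives(points)
    }


def positive_interval_dichotomies(points):
    """Patterns produced when points inside the open interval (lo, hi) map to +1."""
    cuts = segment_representatives(points)  # already in increasing order
    return {
        tuple(+1 if lo < x < hi else -1 for x in points)
        # lo == hi is allowed: both ends in one segment capture no point.
        for lo, hi in combinations_with_replacement(cuts, 2)
    }


points = [0.5, 2.0, 3.1, 7.0]  # any distinct real numbers will do
print("positive rays:     ", len(positive_ray_dichotomies(points)))
print("positive intervals:", len(positive_interval_dichotomies(points)))
```

Only one value per segment is needed because moving a parameter within a segment never changes which points it separates.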
And now the counting argument, which is an interesting one. The way you get a different dichotomy
is by choosing 2 different line segments, to put the ends
of the interval in. If I start the interval here and
end it here, I get something. If I start the interval here and end
it here, I get something else. If I start the interval here and
end here, I get something else. And that is exactly one-to-one mapping
between the dichotomies and the choice of 2 segments. So if this is the case, then I can
very simply say that the growth function, in this case, is the number
of ways to pick 2 segments out of the N plus 1 segments. And that would be N plus 1 choose 2. There is only 1 missing. When you count, there are 2 rules– make sure that you count everything,
and make sure that you don’t count anything twice. Very simple. So we counted almost everything. But the missing guy here is what? Let’s say that all of them are blue. Is this counted already? Yes, because I can choose this
segment and this segment. And that is already counted in this. But if they’re all red,
what does that mean? It means that the beginning of the
interval, and the end of the interval, happen to be within the same segment. So they didn’t capture any point. And that, I didn’t count. And it doesn’t matter which segment
they’re in, because I will get just the all reds. So it’s one dichotomy. So all I need to do is just add 1. And that’s the number. Do a little algebra, and you get this. That is the growth function
for this hypothesis set. And now I’m happy, because
I see it’s quadratic. It’s more powerful than the previous
guy, which was linear. Now let’s up the ante, and
go to the third one. Convex sets. This time, I’m taking the
plane, rather than the line. So it’s R squared. And my hypotheses are simply
the convex regions. If you look at the values of x at
which the hypothesis is +1, this has to be a convex region,
any convex region. A convex region is a region where,
if you pick any 2 points within the region, the entirety of the line segment
connecting them lies within the region. That’s the definition. So this is my artwork
for a convex region. You take any 2 points and– So this is an example of that. The blue is the +1. And the red is the -1. That’s the entire space. So this is a valid hypothesis. Now you can see that there is
an enormous variety of convex sets that qualify as hypotheses. But there are some which
don’t qualify. For example, this one is not convex,
because of this fellow. Here’s the line segment, and
it went out of the region. So that’s not convex. We understand what the
hypothesis set is. Now we come to the task. What is the growth function
for this hypothesis set? In order to answer this, what you
need is– you put your points. I give you N, and you place them. So here is a cloud of points. I give you N, and you say, it seems
like putting them in general position is a good idea. So let’s put them in
a general position. And let’s try to see how many patterns
I can get out of these, using convex regions. Man, this is going to be tough
because I can see– Let’s see. First, I cannot get all
of them, right? Because let’s say I take the outermost
points, and map them all to +1. This will force all the internal points
to be +1, because I’m using a convex region. Therefore, I cannot get +1’s for
the outer guys, and any -1 whatsoever inside. So that excludes a lot of dichotomies. Now I have to do real
counting. But wait a minute. The criterion for choosing the cloud of
points was not to make them look good and general, but to maximize
your growth function. Is there another choice for the
points that gives me more hypotheses than these? As a matter of fact, is there another
choice, for where I put the points, that will give me all possible dichotomies
using convex regions? If you succeed in that, then you
don’t care about this cloud. The other one will count, because
you are taking the maximum. Here is the way to do it. Take a circle, and put your points
on the perimeter of that circle. Now I maintain that you can get any
dichotomy you want on these points. What is the argument? Well, pick your favorite one. I have a bunch of blues
and a bunch of reds. Can I realize this using
a convex region? Yes. I just connect these guys. And the interior of this
goes to +1. And whatever is outside
goes to -1. And I am assured it’s convex, because
the points are on the perimeter of a circle. That means what? That means that the growth
function is 2 to the N, notwithstanding the other guy. You realize now a weakness in
defining the growth function as the maximum, because in a real learning
situation, the chances are the points you’re going to get are not going to
end up on a perimeter of a circle. They are likely to be
all over the place. And some of them will be interior
points, in which case you’re not going to get all possibilities. But we don’t want to keep studying the
particular probability distribution, and the particular data
set you get, and so on. We would like to have
a simple quantity. And therefore, we’re taking the maximum
overall, which will have a simple combinatorial property. The price we pay is that, the chances are
the bound we are going to get is not going to be as tight as possible. But that’s a normal price. If you want a general result that
applies to all situations, it’s not going to be all that tight
in any given situation. That is the normal tradeoff. But here, the growth function is
indeed 2 to the N. Just as a term, when you get all
possible hypotheses, all possible dichotomies, you say that the hypothesis
set shattered the points– broke them in every possible way. So we can say, can we shatter
this set, et cetera? That’s what it means. You get all possible combinations
on them. Just as a term. Now let’s look at the 3 growth functions
on one slide, in order to be able to compare. We started with the positive rays, and
we got a linear growth function. And then we went on to the
positive intervals. And we had a quadratic function. And that is good, because we are getting
more sophisticated and the growth function is getting bigger. And then we went to convex
sets, which are– It’s powerful and two-dimensional
and all, but not that powerful. Convex sets are still– It’s really, although we got a bigger
one, it’s inordinately bigger. Maybe we should have gotten N cubed. But that’s what we have. At least it goes this way. So sometimes that thing
will be too much. But in general, you can see the trend
that, with a more sophisticated model, you get a bigger growth function. Now let’s go back to the big picture,
to see where that growth function will fit. Remember this inequality? Oh, yes. We have seen it. We have seen it often. We are tired of it! What we are trying to do is replace M.
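The inequality referred to here is the Hoeffding bound with the union bound from the earlier lectures, P[|E_in - E_out| > epsilon] <= 2 M exp(-2 epsilon^2 N). As a rough numerical sketch, added to the transcript and not part of the lecture, the code below evaluates the right-hand side with the three growth functions computed above plugged in where M appears. This assumes the growth function can simply stand in for M, which the lecture says is not a direct substitution and is only established later. The function names and the choice epsilon = 0.1 are illustrative. The computation is done in logarithms because 2 to the N is far too large for a floating-point number.

```python
import math


def growth_positive_rays(n):
    return n + 1


def growth_positive_intervals(n):
    return math.comb(n + 1, 2) + 1


def growth_convex_sets(n):
    return 2 ** n


def log_bound(growth, n, eps):
    """Natural log of 2 * m(N) * exp(-2 * eps**2 * N).

    A negative value means the bound is below 1, i.e. it says something.
    """
    return math.log(2) + math.log(growth(n)) - 2 * eps ** 2 * n


eps = 0.1  # illustrative tolerance, not a value from the lecture
models = (growth_positive_rays, growth_positive_intervals, growth_convex_sets)
for n in (100, 1_000, 10_000):
    print(n, [round(log_bound(g, n, eps), 1) for g in models])
```

For the two polynomial growth functions the logarithm starts out positive at small N and becomes strongly negative as N grows, while for 2 to the N it keeps increasing with N at this epsilon.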
And we decided to replace it with the growth function m. M can be infinity. m is a finite number, at most
2 to the N, so that’s good. What happens if we replace
M with small m? Let’s say that we can do that, which
we’ll establish in the next lecture. What will happen? If your growth function happens to
be polynomial, you are in great shape. Why is that? Because if you look at this quantity,
this is a negative exponential. epsilon can be very, very small. epsilon squared can be really,
really, really small. But this remains a negative exponential
in N. And for any choice of epsilon you wish, this will kill the
heck out of any polynomial you put here, eventually. Right? I can put a 1000th-order polynomial,
and can have epsilon equal 10 to the minus 6. And if you’re patient enough, or if your
customer has enough data, which would be an enormous amount of data, you
will eventually get this to win. And you will get the probability to be
diminishingly small, which means that you can generalize. That’s a very attractive observation,
because now all you need to do is just declare that this is
polynomial, and you’re in business. We saw that it’s not that easy
to evaluate this explicitly. But maybe, there is a trick that will
make us able to declare that it is polynomial. And once you declare that a hypothesis
set has a polynomial growth function, we can declare that learning is feasible
using that hypothesis, period. We may become finicky and ask ourselves,
how many examples do you need for what, et cetera? But at least, we know we can do it. If you’re given enough examples, you
will be able to generalize from a finite set, albeit big, to the general
space with a probability assurance. So that’s pretty good. I’m happy that this is the case. So maybe we can, as I mentioned, just
prove that m_H is polynomial, the growth function is polynomial. Can we do that? Maybe we can. Maybe we cannot. Here’s the key notion that
will enable us to do that. We are going to define what
is called the break point. You give me a hypothesis set, and
I tell you it has a break point. Perceptrons, 4. Another set, the break point is 7. Just one number. That’s much better than giving
me a full growth function for every N. Just one number. So what is the break point? The definition is the following. It’s the point at which you fail
to get all possible dichotomies. So you can see that, if the break point
is 3, this is not a very fancy hypothesis set. I can’t even generate all 8
possibilities on 3 points. If the break point is 100, well, that’s
a pretty respectable guy, because I can generate everything up to 99 points,
all 2 to the 99 of them. And then I start failing at 100. So you can see that the break point
also has a correspondence to the complexity of the hypothesis set. If no data set of size k can
be shattered by H– that is, if there is no choice of
k points in which you are able to generate all possible dichotomies. Then you call k a break point for H. So let’s look at– what is the– So that’s what it means. You can’t shatter, so less than
2 to the k, which are all the possibilities for k data points. So for the 2D perceptron, can you
think of what is the break point? We did it already. We didn’t explicitly say
it in those terms. But this is the hint. For 3, we did everything. For 4, we knew we cannot
do everything. So it doesn’t matter whether
it’s 14 or 15 or 12 or 5. As long as it breaks, it breaks. It’s not 16. And therefore, in this case,
the break point is 4. That number 4 will characterize the
perceptrons. Just to tell me, I have a hypothesis set. And it is defined– I don’t want to know the input space. Wait a minute. OK, I’m not going to tell
you the input space. I’m going to tell you the hypotheses. The hypotheses are produced by the– I don’t want to hear it. Just tell me the break point, and I will
tell you the learning behavior. Also, if you have a break point, every bigger point is also
a break point. That is, if you cannot get all
possibilities on 10 points, then you certainly cannot get
all of them on 11. If you could get them on 11,
just kill one. And you will have gotten them on 10. Let’s look at the 3 examples, and
find what are the break points. Positive rays had this guy. This is a formula. We can plug in for N. And we
ask ourselves, when do I get to the point where I no longer get 2 to the N,
numerically for a particular value. What is the break point here? N equals 1. I get 1 plus 1. That’s 2. That also happens to be 2 to the 1. N equals 2: N plus 1 is 3. Oh, that’s less than 4. So 2 must be a break point. Since we invested in computing the
function, we are just lazy now and just substituting. But you could go for the original thing,
and say that’s obvious. Because this particular combination
of points– if I want the rightmost point to be
red, and the left one to be blue, there is no way for the positive
ray to generate that. And therefore, that 2 is a break point. There’s something where I fail. Let’s go for this one. We need faster calculators now. 1, 1/2, et cetera. Wow. It’s exactly. When I put 1, it gives me 2. It must be the correct formula. Let’s try 2. At 2, I get 4. And– it calculates. What is the break point? It must be bigger than the other guy,
because it’s more elaborate. And you realize it’s 3. If you put 2 points,
you will get the 4. And if you put 3, you’ll get
7, which is short of 8. Again, that’s not a mystery. That’s what you cannot get
using the interval. You cannot get the middle point to be
red while the other ones are blue. So you cannot get all possibilities
on 3 points. Therefore, 3 is a break point. What is the break point
for the convex sets? Tell me how many points where I can fail. Well, I’m never going to fail. So if you like, you can
say this is infinity. Let’s define it this way. So also, the break point–
just a single number– has the property we want. It gets more sophisticated as the
model gets more sophisticated. So what is the main result? The main result is that the
first part will be– if you don’t have a break point,
I have news for you. The growth function is
2 to the N. OK, yes. That’s the definition. Thank you. So that cannot possibly
be the main result. So what is the main result? The main result is that if you
have a break point, any break point, 1, 5, 7000. Just tell me that there
is a break point. You don’t even have to tell
me what is the break point. We are going to make a statement
about the growth function. The growth function is– do I hear a drum roll? [MAKES DRUM SOUND] It’s guaranteed to be polynomial in N. Wow, we have come a long way. I used to ask you what
are the hypotheses, and count them. That was hopeless because
it’s infinity. We defined the growth function,
and we have to evaluate it. That was painful. Then we found the break point. Maybe it’s easier to compute
the break point. I just want to find a clever way,
and say that I cannot get it. Now all I need to hear from you
is that there is a break point. And I’m in business as far as the
generalization is concerned, because I know that regardless of what polynomial
you get, you will be able to learn eventually. I will become more particular, and ask you
what is the break point, when I try to find the budget of examples you need
in order to get a particular performance. But in principle, if I just want to say
you can use this hypothesis set, and you can learn, I just want you
to tell me I have a break point. That’s all I want. This is a remarkable result. And I have to give you a puzzle
to appreciate it. The idea of the puzzle
is the following. If I just tell you that there’s
a break point, the constraint on the number of dichotomies you get, because there is a break point, is enormous. If I tell you a break point is,
let’s say, 3, how many can you get on 100 points? On those 100 points, for any choice of
3 guys, you cannot have all possible combinations– at any 3 points,
all 100 choose 3 of them. So the combinatorial restriction
is enormous. And you will end up losing possible
dichotomies in droves, because of that restriction. And therefore, the thing that used to
be 2 to the N, if it’s unrestricted, will collapse to polynomial. Let’s take a puzzle, and try to
compute this in a particular case. Here is the puzzle. We have only 3 points. And for this hypothesis set, I’m telling
you that the break point is 2. So you cannot get all possible four
dichotomies on any 2 points. If you put x_1 and x_2, you cannot get
-1 -1, -1 +1, +1 -1, and +1 +1. All of them. You cannot get it. One of them has to be missing. So I’m asking you, given that this is
the constraint, how many dichotomies can you get on 3 points? You can see, this is what I’m trying to
do because I’m telling you that the restriction on 2 will– If I didn’t have the restriction,
I would be putting eight. So I’m just telling you this case. So how many do I get? For visual clarity, I’m going to
express them as either black or white circles, just for you to be able to– instead of writing -1 or +1. This dichotomy is fine. It doesn’t violate anything. I’ve only one possibility. So we keep adding. Everything is fine. As a matter of fact, everything will
remain fine until we get to four, because the whole idea is that I cannot
get all four on any of them. So if I have less than four, I cannot
possibly get four combinations. You see what the point is. This is still allowed. I’m going through it as a binary one. So this is 0 0 0, 0 0 1, et cetera. I’m still OK, right? Am I still OK? [MAKES BUZZER SOUND] You have violated the constraint. You cannot put the last row, because
it now violates the constraint. I have to take it out. So let’s take it out. Try the next guy. Maybe we are in luck. Are we OK? OK. That’s promising. So let’s go for the next guy. Maybe we’ll get it. Are we OK? [MAKES BUZZER SOUND] Tough. So we have to take out the last row. How about this one? Nope. We take it out. We don’t have too many
options left, right? Actually, this is the last guy. It had better work. Does it work? No. So that’s what we can do. We lost half of them. Now you may think, maybe you
messed it up because you started very regularly. Just started from all 0, 0 0 1. But if I started differently,
I may be able to achieve more. It’s conceivable. Please don’t lose sleep over it. The only row you are going to be able
to add to this table is this one. This is indeed the solution. And you can verify it at home. Now we know that indeed the
break point is a very good restriction. And we are going, in the next lecture,
to prove that it actually leads to a polynomial growth, which is
the main result we want. Let me stop here. And we will take the questions
after a short break. Let’s start with the questions. MODERATOR: The first question is,
what if the target or the hypotheses are not binary? PROFESSOR: There is a counterpart for the entire theory that
I’m going to develop, for real-valued functions and other
types of functions. The development of the theory is
technical enough, that I’m going to develop it only for the binary case,
because it is manageable. And it carries all of the
concepts that you need. The other case is more technical. And I don’t find the value of going to
that level of technicality useful, in terms of adding insight. What I’m going to do is, I’m going
to apply a different approach to real-valued functions, which is
the bias-variance tradeoff. And it’s a completely different approach
from this one, that will give us another angle on generalization
that is particularly suitable for real-valued functions. But the short answer is that, if the
function is not binary, there is a counterpart to what I’m
saying that will work. But it is significantly more technical
than the one I am developing. MODERATOR: Just as a sanity check. When the hypothesis set can
shatter the points, this is a bad thing, right? PROFESSOR: OK. There is a tradeoff that will stay
with us for the entire course. It’s bad and good. If you shatter the points, it’s good
for fitting the data, because I know that if you give me the data, regardless
of what your data is, I’m going to be able to fit them because, I
have something that can generate a hypothesis for any particular
set of combinations. So if your question is,
can I fit the data? Then shattering is good. When you go to generalization,
shattering is bad, because basically you can get anything. So it doesn’t mean anything
that you fit the data. And therefore, you have less hope
of generalization, which will be formalized through the
theoretical results. And the correct answer is, what is the
good balance between the two extremes? And then we’ll find a value for which
we are not exactly shattering the points, but we are not very restricted,
in which we are getting some approximation, and we’re getting some generalization. And that will come up. MODERATOR: Is there a similar
trick to the one you used for convex sets in higher dimensions? PROFESSOR: So if you– The principle I explained,
I explained it in terms of two-dimensional and perceptrons. If you look at the essence
of it, the space is X. It could be anything. The only restrictions I have
are binary functions. So this could be a high-dimensional space. And the surfaces will be very
sophisticated surfaces. And all I’m reading off, as far as this
lecture is concerned, is how many patterns do I get on a number
of N points. MODERATOR: Also a question
on the complexity. Why is polynomial time usually
considered acceptable? PROFESSOR: OK. Polynomial, in this case, is polynomial
growth in the number of points N. It just so happens that we are working
with the Hoeffding inequality that gives us a very helpful term, which
is the negative exponential. And therefore, if you get a polynomial,
as I mentioned, any polynomial, you are guaranteed that for
a large enough N, the probability– the right-hand side of the Hoeffding,
including the growth function, will be small. And therefore, the probability of
something bad happening is small. Now obviously, there are other functions
that also will be killed by the negative exponential. For example, if I had a growth function
of the form, let’s say, e to the square root of N, that’s
not a polynomial. But that will also be killed by the
negative exponential, because it’s square root versus the other one. It just so happens that we are in the
very fortunate situation that the growth function is either identically
2 to the N, or else it’s polynomial. There is nothing in between. If you draw something that is
super polynomial and sub exponential and try to find the hypothesis set
for which this is a growth function, you will fail. So I’m getting it for free. I’m just taking the simplicity
of the polynomial, because lucky for me, the polynomials are the
ones that come out. And they happen to serve the purpose. MODERATOR: OK. A few people are asking, could you
repeat the constraints of the puzzle? Because they didn’t get the– PROFESSOR: OK. Let’s look at the puzzle. I am putting 3 bits on every row. I’m trying to get as many different rows
as possible, under the constraint that if you focus on any 2 of them– so
if I focus on x_1 and x_2 and go down the columns, it must be that one
of the possible patterns for x_1 and x_2 is missing. Because I’m saying that 2 is
a break point, so I cannot shatter any 2 points. Therefore, I cannot shatter x_1 and x_2,
among others, meaning that I cannot get all possible patterns. There are only four possible patterns,
which is, if you take it as a binary 0 0, 0 1, 1 0, 1 1. And I’m representing them using
the circles. In this case, the x_1 and x_2
get 0 0, so to speak. If I keep adding a pattern– So let’s look at here. x_1 and x_2, how many patterns do they have? They have this pattern. They have it again. That doesn’t count. So there’s only one pattern
here, plus one, is two. So on x_1 and x_2, I have
only two patterns. So I haven’t violated anything, because
I will be only violating if I get all four patterns. So I’m OK, and similarly
for the other guys. Things become interesting when you
start getting the fourth row. Now again, if you look at the first 2
points, I get one pattern here and one pattern here. There are only two patterns. Nothing is violated as far as these
2 points are concerned. But the constraint has to be satisfied
for any choice of 2 points. So if you particularly choose x_2 and x_3,
and count the number of patterns, you realize, 0 0,
0 1, 1 0, 1 1. I am in trouble. That’s why we put it in red. Because now these guys have
all possible patterns. And I know, by the assumption of the
problem, that I cannot get all four patterns on any 2 points. So I cannot get this. So I’m unable to add this row
under those constraints. And therefore, I’m taking it away. And I’m going through the exercise. And every time I put a row, I keep
an eye on all possible combinations. So here, I put– let’s look at x_1 and x_2. 1 pattern, 2, 3. I’m OK. x_2 and x_3, 1 pattern, which
is here and here. 2, 3. I’m also OK. And then you put x_1, x_3. Here is a pattern. It repeats here. 0 0 and 0 0. So that’s one. And then I get this one
and this one, 3. So this one is perfect,
everyone. Not perfect in any sense, except
that I didn’t violate anything. So I’m allowed to put that row. Now when I extend this further, and start
putting the new guys, for this guy, there is a violation. And you can scan your eyes and
try to find the violation. And I’m highlighting it in red. So I’m showing you that for x_1 and
x_3, there are the four patterns. Here’s one pattern, the second one. I didn’t count this one, just because
it’s already happened. So I just highlight four different ones,
and then the third one and fourth one. So I cannot possibly add this row,
because it violates the constraint on these 2 points. So I take it out and keep adding. Another attempt, this is the next guy. It still violates. Why does it violate? For the same argument. Look at the red guys. You find all possible patterns. So I cannot have it. So we take it away. And then the last one that
is remaining is this guy. And that also doesn’t work, because
it violates it for those guys. You can look at it and verify. And the conclusion here is that
I cannot add anything. So that’s what I’m stuck with. And therefore, the number of different
rows I can get under the constraint that 2 is a break point– in this case, is 4. Obviously, the remark I mentioned is
that maybe you can start instead of gradually from 0 0 0, 0 0 1,
maybe you can start more cleverly or something. However, any way you try it, it’s
sufficiently symmetric in the bits that it doesn’t make a difference. You will be stuck with at most 4. MODERATOR: OK. In the slide with the Hoeffding
inequality, does anything change when you change– specifically, does a probability measure
change when you change from a hypothesis to dichotomy? PROFESSOR: For this one? MODERATOR: Yeah. PROFESSOR: Yeah. The idea here, M is the number of hypotheses, period. So it’s infinity for perceptrons. We have to live with that. In our attempt to replace it with the
growth function, we are going to replace it by something that is not
infinite, bounded above by 2 to the N. As you can see, 2 to the N is not
really helpful because I have a positive exponential and
negative exponential. And that’s not very decisive. Therefore I am trying to find if I can
put a growth function– not only put the growth function here, but also
show that the growth function is polynomial for the models
of interest that I have, and therefore be able to get
this to be a small quantity for a real learning model, like the perceptron
or the other one, neural networks, et cetera. All of these will have a polynomial
growth function, as we will see. So that’s where the number of
hypotheses, which is M, goes to the number of dichotomies, which
is the growth function. Not a direct substitution,
as we will see. There are some technicalities
involved. But that is what gets me the right-hand
side to be a manageable right hand side, and goes to 0 as N grows,
which tells me that the probability of generalization will be high. MODERATOR: OK. Is there a systematic way
to find the break points? PROFESSOR: There is. It’s not one size fits all. There are arguments, for example, you
can go for neural networks. And sometimes you find it by finding
a particular combination that you cannot break, and argue that this
is the break point. Sometimes you can argue
by– Let me try to find a crude estimate
for the growth function. Let’s say the growth function
cannot be more than this. And then as you go by, you realize
that this is not exponential. So there has to be a break point
at some point. This would be less than 2 to the N, and
therefore will be a break point. So in that case, the estimate for the
break point will be just an estimate. It will not be an exact value. But it will be a maximum. We have a question in house. STUDENT: Hi. So in this slide, the top N is the
number of testing points and the lower N is the number of training points. PROFESSOR: Yeah. N is always the size of the sample. And it’s a question of interpretation
between the two, whether that sample is used for testing, which means that you
have already frozen your hypothesis, and you are just verifying,
testing it. Or in the other case, you haven’t
frozen your hypothesis. And you are using the same sample
to go around and find one. And you are charged for the going
around aspect by M. STUDENT: So let’s say that our
customer gives us k sample points. How do we decide how many of them do
we reserve for testing points, how many for training? PROFESSOR: This is
a very good point. There will be a lecture down the
road called validation, in which this is going to be addressed
very specifically. There are some mathematical results, but there are also a few rules of thumb that I’m
going to state without proof, that have stood the test of time. And one of those rules of thumb has to
do with how many we reserve, in order not to diminish the
training set very much, and still have a big enough test set so that
the estimate is reliable? So this will come up. Thank you. There is another question. STUDENT: Hi, professor. I have one question. So for 2 hypotheses that have the same
dichotomy, is it true that the in-sample error is the same
for the 2 hypotheses? PROFESSOR: OK. If it has the same dichotomy, it’s even
a stronger condition than this, because it returns exactly
the same values. Now the in-sample error is
the fraction of points I got wrong. The target function is fixed. So that is not going to change. So obviously, I’m going to get
the same pattern of errors. And if I get the same pattern of errors,
then obviously I’m getting the same fraction of errors,
among other things. Now if you’re asking, for these
2 hypotheses, what is the out-of-sample error? That’s a different story, because for the
out-of-sample error, you take the hypothesis in its entirety. So in spite of the fact that it’s the
same on the set of points, it may be not the same on the entire input space,
which it isn’t because they’re different hypotheses. And therefore, you get
a different E_out. But the answer is yes. You will get the same in-sample error. STUDENT: Oh, yes. I see. That’s why I was asking. Because I think that the out-of-sample
error is different for 2 hypotheses. So can we replace the M with– PROFESSOR: Exactly. And the biggest technicality
in the proof– We were saying, we’re going to
replace M by the growth function. That’s a very helpful thing. Now, there has to be a proof. And I will argue for the proof, and
the overlapping aspects, and some of this. The key point is, what do
I do about this fellow? Because when I consider the sample, this
one is very much under control. As you said, if I have 2 hypotheses
that are the same here, they are the same here. But they are not the same here. So the statement here depends on
E_out, depends on the whole input space. So how am I going to
get away with that? That’s really the main technical
contribution for the proof. And that will come up next time. STUDENT: Sure, thank you. PROFESSOR: Sure. MODERATOR: So– why is it called a growth function? PROFESSOR: A growth function. I really– The person who introduced this
called it a growth function. I guess he called it a growth function,
because it grows, as you increase N. I don’t think there is any
particular merit for the name. MODERATOR: Is there– what is a real-life situation similar
to the one in the puzzle, where you realize that this break point
may be too small? PROFESSOR: OK. The first order of business is to
get the existence of the break point out of the way: if there is a break point, we are in business. The second one is, how does the value
of the break point relate to the learning situation? Do I need more examples when
I have a bigger break point? The answer is yes. What is the estimate? And there’s a theoretical
estimate, a bound. Maybe the bound is too loose. So we’ll have to find practical rules
of thumb that translate the break point to a number of examples. All of this is coming up. So the existence of the break point
means learning is feasible. The value of the break point tells us
the resources needed to achieve a certain performance. And that will be addressed. MODERATOR: Is there a probabilistic
statement for the Hoeffding inequality that is an alternative to the
case-by-case discussion on M’s growth rate in N? PROFESSOR: There are
alternatives to Hoeffding, and you can get different results
or emphasize different things. I am sticking to Hoeffding. And I’m not indulging too much into its
derivation, or the alternatives, because this is a mathematical
tool that I’m borrowing. And I’m taking it for granted. And I picked the one that will help
us the most, which is this one. So yes, there are variations. But I am deliberately not getting
into them, in order not to dilute the message. I want people to become so incredibly
familiar and bored with this one that they know it cold. Because when we get to modify it,
including the growth function and the other technical points, I’d like the
base point to be completely clear in people’s mind, so that they don’t get
lost with the modifications. So that’s why I’m sticking to this. MODERATOR: I think that’s it. PROFESSOR: Very good. We’ll see you next time.
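
Several points from the Q&A above can be checked numerically with a small sketch. It assumes a deliberately simple model chosen only for illustration, positive rays h(x) = sign(x - a) on the real line, and every function name in it is hypothetical. It counts dichotomies by brute force, finds the smallest break point, shows that two different hypotheses with the same dichotomy get the same in-sample error, and compares a count of 2 to the N against a polynomial count inside a Hoeffding-style expression. That last expression is a naive plug-in: as the professor says, the real result is not a direct substitution, so the numbers illustrate the trend only.

```python
import math


def ray_dichotomy(a, points):
    """Labels that the positive ray h(x) = sign(x - a) assigns to the points."""
    return tuple(+1 if x > a else -1 for x in points)


def positive_ray_dichotomies(points):
    """All distinct dichotomies that positive rays produce on the given points."""
    xs = sorted(points)
    # One threshold below every point, one between each adjacent pair,
    # and one above every point cover all the distinct cases.
    thresholds = [xs[0] - 1.0]
    thresholds += [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])]
    thresholds.append(xs[-1] + 1.0)
    return {ray_dichotomy(a, points) for a in thresholds}


def growth_positive_rays(n):
    """Number of dichotomies on n distinct points (the same for any such set)."""
    return len(positive_ray_dichotomies([float(i) for i in range(1, n + 1)]))


def smallest_break_point(growth, max_n=20):
    """Smallest k with growth(k) < 2**k, or None if none is found up to max_n."""
    for k in range(1, max_n + 1):
        if growth(k) < 2 ** k:
            return k
    return None


def in_sample_error(dichotomy, labels):
    """Fraction of sample points where the hypothesis disagrees with the target."""
    wrong = sum(1 for h, y in zip(dichotomy, labels) if h != y)
    return wrong / len(labels)


def naive_bound(count, n, eps):
    """Illustration only: 2 * count * exp(-2 * eps**2 * n), NOT the final theorem."""
    return 2.0 * count * math.exp(-2.0 * eps ** 2 * n)


if __name__ == "__main__":
    print("growth function for N = 1..5:",
          [growth_positive_rays(n) for n in range(1, 6)])
    print("smallest break point:", smallest_break_point(growth_positive_rays))

    # Two different hypotheses (a = 1.2 and a = 1.7) with the same dichotomy.
    points, target = [1.0, 2.0, 3.0], (+1, +1, -1)
    d1, d2 = ray_dichotomy(1.2, points), ray_dichotomy(1.7, points)
    print("same dichotomy:", d1 == d2,
          "| E_in:", in_sample_error(d1, target), in_sample_error(d2, target))

    eps = 0.1
    for n in (100, 1000):
        print(n, naive_bound(2 ** n, n, eps),
              naive_bound(growth_positive_rays(n), n, eps))
```

With the count 2 to the N, the positive exponential overwhelms the negative one and the expression blows up as N grows. With the polynomial count, the negative exponential eventually wins and the expression goes to 0, which is the behavior the right-hand side needs.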