# Lecture 05 – Training Versus Testing

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about

error and noise. And these are two notions that relate

the learning problem to practical situations. In the case of error measures, we

realized that in order to specify the error that is caused by your hypothesis,

we should try to estimate the cost of using your h, instead of f

which should have been used in the first place. And that is something the user can specify: the price to pay when they use h instead of f. And that is the principled way

of defining an error measure. In the absence of that, which happens quite often, we go to plan B and resort to analytic properties, or

practical properties of optimization, in order to choose the error measure. Once you define the error between the

performance of your hypothesis versus the target function on a particular

point, you can plug this into different error quantities, like the

in-sample error and the out-of-sample error, and get those values in

terms of the error measure by getting an average. In the case of the training set, you

estimate the error on the training points, and then you average with

respect to the N examples that you have. And in the case of out-of-sample,

theoretically the definition would be that you also evaluate the error between h and f on a particular point x, give the weight of x according to its probability, and get the expected value with respect to this x.
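Those two averages can be sketched in a few lines of code. This is an editorial illustration, not the lecture's own example: the target f, the hypothesis h, and the uniform input distribution are all hypothetical choices.

```python
import random

# Hypothetical target and hypothesis on X = [0, 1], with x ~ uniform:
f = lambda x: 1 if x > 0.5 else -1   # target function (unknown in practice)
h = lambda x: 1 if x > 0.6 else -1   # our hypothesis

def err(hx, fx):
    return 1 if hx != fx else 0       # binary pointwise error measure

random.seed(1)
train = [random.random() for _ in range(20)]                 # N = 20 examples
E_in = sum(err(h(x), f(x)) for x in train) / len(train)      # average over the sample

big = [random.random() for _ in range(100_000)]              # Monte Carlo estimate of
E_out = sum(err(h(x), f(x)) for x in big) / len(big)         # the expectation over x
print(E_in, E_out)    # E_out is close to P[0.5 < x <= 0.6] = 0.1
```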

The notion of noisy targets came from the fact that what we are trying to learn may not be a deterministic function, where y is uniquely determined by the value of x. Rather, y is affected by x probabilistically: y is distributed according to

a probability distribution, which gives you y given x. And we talked about, for example, in

the case of credit application, two identical applications may lead

to different credit behavior. Therefore, the credit behavior is

a probabilistic thing, not a deterministic function of

the credit application. You can go back to our first example,

let’s say, of the movie rentals. If you rate a movie, you may rate the

same movie at different times differently, depending on your

mood and other factors. So there’s always a noise factor

in these practical problems. And that is captured by the conditional probability from x to y, the probability of y given x. When we look at the diagram involving

this probability– so now we replace the target function,

which used to be a function, by a probability distribution, which

can be modeled as a target function plus noise. And these feed into the generation

of the training examples. And when you look at the unknown input

distribution, which we introduced technically in order to get the benefit

of Hoeffding inequality, that also feeds into the

training example. This determines x. And this determines y given x. And then you generate these examples

independently, according to this distribution. So when we had x being the only

probabilistic thing, and y being a deterministic function of x, then

x_1 was independent of x_2, independent of x_N. And then you compute each y, according

to the function, on the corresponding x. When you have the noisy version, then

the pair x_1 and y_1 is generated according to the joint probability distribution, which is P of x, the original one, times P of y given x, the one you introduced to accommodate the noise.
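As a sketch (with a made-up input distribution and a made-up noise level), generating such i.i.d. pairs looks like this:

```python
import random

random.seed(2)

def draw_example():
    """One (x, y) pair drawn from P(x) * P(y | x) (hypothetical distributions)."""
    x = random.random()               # x ~ P(x), here uniform on [0, 1]
    y = 1 if x > 0.5 else -1          # deterministic part of the target
    if random.random() < 0.1:         # P(y | x): flip the label with probability 0.1
        y = -y
    return x, y

data = [draw_example() for _ in range(10)]   # pairs are independent of each other
print(data)
```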

And then the independence lies between different pairs. So x_1, y_1 would be independent of x_2,

y_2, independent of x_3, y_3 and so on. And when you get the expected values

for errors, you now have to take into consideration the probability with

respect to both x and y. So what used to be the expected value

with respect to x, is now the expected value with respect to x and y. And then you plug in x into h, and

correspond it to the probabilistic value of y that happened to occur. And that would be now the out-of-sample

error in this case. Now in this lecture, I’m going to start

the theory track that will last, for this particular route, three

lectures, followed by another theory lecture on a related but

different topic. And the idea is to relate training to

testing, in-sample and out-of-sample, in a realistic way. So the outline will be the following. We’ll spend some time talking about

training versus testing, a very intuitive notion. But we'd like to put in place the mathematical framework that describes what training versus testing is. And then we will introduce quantities

that will be mathematically helpful in characterizing that relationship. And after I give you a number of

examples to make sure that the notion is clear, we are going to introduce

the key notion, the break point. And the break point is the one that will

later result in the VC dimension, the main notion in the

theory of learning. And finally, I end up with a puzzle. It’s an interesting puzzle that will

hopefully fix the ideas that we talked about in the lecture. So now let’s talk about training

versus testing. And I’m going to take a very simple

example that you can relate to. Let’s say that I’m giving

you a final exam. So now I want to help you out. So before the final exam, I give you

some practice problems and solutions, so you can work on and prepare

yourself for the final exam. That is very typical. Now if you look at the practice problems

and solutions, this would be your training set, so to speak. You’re going to look at the question. You’re going to answer. You’re going to compare it

with the real answer. And then you are going to adjust your

hypothesis, your understanding of the material, in order to do it better, and

go through them and perhaps go through them again, until you get them right or

mostly right or figure out the material. And now you are more ready

for the final exam. Now the reason I gave you the practice

problems and solutions is to help you do better on the final, right? Why don’t I just give you the

problems on the final, then? Excellent idea, I can see! Now the problem is obvious. The problem is that doing well

on the final is not the goal, in and of itself. The goal is for you to learn the

material, to have a small E_out. The final exam is only

a way of gauging how well you actually learned. And in order for it to gauge how well

you actually learned, I have to give you the final at the point you have

already fixed your hypothesis. You prepared. You studied. You discussed with people. You now sit down to take

the final exam. So you have one hypothesis. And you go through the exam. And therefore, your answer on, let’s

say, the 50 questions of the final– hopefully, it’s not going to be

that long if there’s a final– will reflect what your understanding

will be outside. So the distinction is conceptual. And now, let’s put mathematically

what is training versus testing? It will be an extremely simple

distinction, although it’s an important distinction. Here is what testing is, in terms

of a mathematical description. You have seen this before. This is the plain-vanilla Hoeffding. This part is how well you did

on the final exam. This is how well you understand

the material proper. And since you have only one

hypothesis– this is a final, you are fixed, and you just take the exam. Your performance on the exam tracks well

how you understand the material. And therefore, the difference

between them is small. And the probability that it’s not small

is becoming less and less, when the number of questions,

in this case, goes up. So that is what testing is. How about training? Almost identical, except

for one thing– this fellow. Because in the case of training,

this is how you performed on the practice problems. In the practice problems, you had

the answers, and you modified your hypothesis. And you looked at it, and you

got an answer wrong. So you modified your hypothesis again. You are learning better. That’s all very nice. But now the practice set

is contaminated. You pretty much almost

memorized what it is. And there’s a price to pay for that, in

terms of how your performance on the practice, which is E_in in this case,

tracks how well you understand the material, which is still E_out. And the price you pay is

how much you explored. And that was reflected by

the simple M, which was the number of hypotheses in the very simple

derivation we did. So if you want an executive summary of

this lecture, we are just going to try to get M to be replaced by something

more friendly, because you realize that M– if you just measure the

complexity of your hypothesis set by the number of hypotheses– this is next

to useless in almost all cases. Something as simple as the perceptron

has M equals infinity. And therefore, this guarantee

is no guarantee at all. If we can replace M with another

quantity, and justify that, and that quantity is not infinite even if

the hypothesis set is infinite, then we are in business. And we can start talking about the

feasibility of learning in an actual model, and be able to establish

the notion in a way that we can apply to a real situation. That’s the plan. We’re talking about M, so the

first question is to ask, where did this M come from? If we are going to replace it, we need

to understand where it came from, to understand the context

for replacing it. Well, there are bad events that

we have talked about. And the bad events are called

B, because they are bad. That’s good! And then– these are the bad events. What is the bad event that

we were trying to avoid? We were trying to avoid the situation

where your in-sample performance does not track the out-of-sample

performance. If their difference is bigger than

epsilon, this is a bad situation. And we’re trying to say that

the probability of a bad situation is small. That was the starting point. Now we applied the union bound,

and we got the probability of several bad events. This is the bad event for

the first hypothesis. You can see here that there

is m, a small m. m is 1, 2, 3, 4, up to M.
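To see numerically why this capital M hurts, here is a small editorial sketch of the union-bound version of the Hoeffding inequality, P <= 2 M exp(-2 epsilon^2 N); the particular values of M, N, and epsilon are only illustrative.

```python
import math

def hoeffding_union_bound(M, N, epsilon):
    """Bound on P[|E_in - E_out| > epsilon for any of M hypotheses]."""
    return 2 * M * math.exp(-2 * epsilon**2 * N)

# One fixed hypothesis (testing): the bound is meaningful.
print(hoeffding_union_bound(M=1, N=1000, epsilon=0.05))        # about 0.013
# Many hypotheses (training): the same N gives a vacuous bound.
print(hoeffding_union_bound(M=10**6, N=1000, epsilon=0.05))    # in the thousands
```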

So there are M hypotheses, capital M hypotheses that I’m talking about. And I would like the probability of

any of them happening to be small. Why is that? Because your learning algorithm is free

to pick whichever hypothesis it wants, based on the examples. So if you tell me that the probability

of any of the bad events is small, then whichever hypothesis your algorithm

picks, they will be OK. And I want that guarantee to be there. So let’s try to understand the

probability of the B_1 or B_2 or B_M. What does it look like? Well, if you look at a Venn diagram,

and you place B_1 and B_2 and B_3 as areas here, these areas– these are different events. They could be disjoint, in which case

the circles will be far apart. Or they could be coincident, which

will be on top of each other. They could be independent, which means

that they are proportionately overlapping. There could be many situations. Now the point of the bound is that we

would like to make that statement regardless of the correlations

between the events. And therefore, we use the union bound,

which actually bounds it by the total area of the first one, plus the total

area of the second one, et cetera, as if they were disjoint. Well, that will always hold regardless

of the level of overlap. But you can see that this is a poor

bound because in this case, we are estimating it to be about three times

the area, when it’s actually closer to just the area, because the overlap

is so significant. And therefore, we would like to be able

to take into consideration the overlaps, because with no overlaps,

you just get M terms. And you’re stuck with M, and

infinity, in almost all the interesting hypothesis sets. Now of course, in principle, you can take the hypothesis set I give you, which is the perceptron, and try to formalize what this bad event is in terms of the perceptron. And what happens when you go to the

other perceptron, and try to get the full joint distribution of all of these

guys, and solve this exactly. Well, you can, in principle–

theoretically. It’s a complete nightmare,

completely undoable. And if we have to do this for every

hypothesis set you propose, there wouldn’t be learning theory around. People will just give up. So what we are going to do, we are

going to try to abstract from the hypothesis set a quantity that is

sufficient to characterize the overlaps, and get us a good bound,

without having to go through the intricate details of analyzing how the

bad events are correlated. That would be the goal. And we will achieve it, through

a very simple argument. So that’s where M comes from. When we asked, can we improve on M? Maybe M is the best we can do. It’s not like we

wish to improve it, so it has to be improved. Maybe that’s the best we can say. If you have an infinite hypothesis, then

you’re stuck, and that’s that. But it turns out that, no, the overlap

situation we talked about is actually very common. Yes, we can improve on M. And the reason

is that the bad events are extremely overlapping in

a practical situation. Let’s take the example we know, which

is the perceptron, to understand what this is. I’m going through the example because

now we have lots of binary things– +1 versus -1 for the target,

+1 versus -1 for the hypothesis, agreeing versus

disagreeing, et cetera. I want to pin down exactly what is

the bad event, in terms of this picture, so that we understand what

we are talking about. Here is the target function

for a perceptron. And it returns +1 for some

guys, -1 for some guys. That’s easy. And then you have a hypothesis,

a perceptron. And this is not the final hypothesis. This is a badly performing hypothesis. But it is a general perceptron. If you find any vector of weights,

you’ll find another blue line. So now in terms of this picture, could

someone tell me what is E_out? What is the out-of-sample error for this

hypothesis, when it’s applied to this target function? It’s not that difficult. It is actually just these areas,

the differential areas. This is where they disagree. One is saying +1. One is saying -1. So these two areas– if you get the total

area if it’s uniform, the total probability if it’s not– then this will give

you the value of E_out. That’s one quantity. How about E_in? For E_in, you need a sample. So first, you generate a sample. Here’s a constellation of points. Some of these points, as you

see, will fall into the bad region, here and here. And I color them red. So the fraction of red compared to

all the sample gives you E_in. That is understood. This is E_in and E_out. And these are the guys that I

want to track each other. OK, fine. We understand this part. Now let's look at: what is the change in E_in and E_out when you change your hypothesis? So here's your first hypothesis. Now take another perceptron. You probably already suspect that

this is hugely overlapping. Whatever you’re talking about, it must

be overlapping, because they’re so close to each other. But let’s pin down the specific event

that, we would like to argue, is overlapping. So the change in E_out when you go from,

let’s say, the blue hypothesis, this blue hypothesis, to

the green hypothesis– the change in E_out would be the area

of this yellow thing, not very much. A very thin area. That’s where E_out changed, right? So if you look at the area, that gives you

delta E_out. If you look at the delta E_in, the change of the labels

of data points– if one of the data points happens

to fall in this yellow region, then its error status will change from one

hypothesis to another, because one hypothesis got it right, and

the other one got it wrong. Now the chances of a point falling here are small. So you can see why we are arguing that the changes delta E_out and delta E_in are both small. The area is small, and the probability

of a point falling there is small. Moreover, they are actually moving in

the same direction, because both changes depend on the area of the yellow part. Let's say that this area is increasing. Then they increase together, because I get a net positive area for delta E_out. And the probability of falling

there also increases. Now, the reason I’m saying that, is

because what we care about are these. We would like to make the statement

that, how E_in tracks E_out for the first hypothesis, for the blue

perceptron, is comparable to how E_in tracks E_out for the second one. Why are we interested in that? Because we would like to argue that one difference exceeding epsilon happens mostly when the other exceeds epsilon. The events are overlapping. We are not looking for the absolute

value of those, we are just saying that, if this exceeds epsilon,

this also exceeds epsilon most of the time. And therefore, the picture we had

last time is actually true. These guys are overlapping. The bad events are overlapping. And at least we stand a hope that we

will get something better than just counting the number of hypotheses, for

the complexity we are seeking. So we can improve M. That’s good news. We can improve M. We’re going

to replace it with something. What are we going to replace it with? I’m going to introduce to you now the

notion that will replace M. It is not going to be completely

obvious that we can actually replace M with this quantity. That will require a proof. And that will take us

into next lecture. The purpose here is to define the

quantity, and make you understand it well, because this is the quantity that

will end up characterizing the complexity of any model you use. So we want to understand it well. And we are going to motivate that it can

replace M. It will be plausible. It makes sense. It’s not a crazy quantity. It also counts the number

of hypotheses, of sorts. And therefore, let’s define the quantity

and become familiar with it. And then next time, we will like the

quantity so much that we’ll bite the bullet, and go through the proof that we

can actually replace M with this quantity. So what is the quantity? The quantity is based

on the following. When we count the number of hypotheses,

we obviously take into consideration the entire input space. What does that mean? These are four different perceptrons. So I take the input space. And the reason these guys are

different is because they are different on at least one point

in the input space. That’s what makes two

functions different. And because the input space is infinite,

continuous, that’s why we get an infinite number of hypotheses. So let’s say that, instead of counting

the number of hypotheses on the entire input space, I’m going to restrict

my attention only to the sample. So I generate only the input points,

which are finite points, put them on the diagram. So I have this constellation

of points. And when I look at these points alone,

regardless of the entire input space, those perceptrons will classify them. These guys will turn into red and blue,

according to the regions they fall in. Now, in order to fully understand what it means to count only on the points, we have to wipe

out the input space. So that’s what I’m going to do. That’s what you have. So you can imagine the perceptron

is somewhere. And it’s splitting the points. And now what I’m counting is– for this constellation, which is

a fixed constellation of points, how many patterns of red

and blue can I get? Now when you do this, you’re not

counting the hypotheses proper, because the hypotheses are

defined on the input space. You are counting them

on a restricted set. But still, you’re counting. You’re counting the number

of hypotheses. For example, if I give you a hypothesis

set where you get all possible combinations of red and blue,

that’s a powerful hypothesis. If I give you a hypothesis where you get

only few, that’s not so powerful a hypothesis. So the count here also corresponds in

our mind to the strength, or the power, of the hypothesis set, which in our mind

is what we try to capture by the crude M. So we are going to

count the number of hypotheses. I’m putting them between quotations. Why? Because now the hypotheses are defined

only on a subset of the points. So I’m going to give them a different

name, when I define them only on a subset of the points, in order not to

confuse the hypotheses, on the general input space, with this case. I’m going to call them dichotomies. And the idea is that

I give you N points. And there is a dichotomy between what

goes into red, and what goes into blue. That's where the name came from. So when you look only at the points, which ones are blue and which ones are red, that is a dichotomy. And if you want to understand

it, let’s look at this. Let’s say that you’re looking

at the full input space. And this is your perceptron. And this is the function

it’s implementing. And then you put a constellation

of points. The way to understand dichotomies is to

think that I have an opaque sheet of paper, that has holes in it. And you put that opaque sheet of paper

on top of your input space. So you don’t see the input space. You only see it through the

eyes of those points. So what do you see when you put this? You end up with this here. You don’t see anything. You don’t see where the hypothesis is. You just see that these guys

turned blue, and these guys turned red or pink. Now as you vary the perceptron, as you

vary the line here, you are not going to notice it here, until the line

crosses one of the points. So I could be running around here, here,

here, and here, and generating an infinite number of hypotheses, for

which I’m charging a huge M. And this guy is sitting here, looking. Nothing happened. It’s the same thing. I’m counting it as 1. And then when you cross, you end

up with another pattern. So all of a sudden, these

guys are blue. And these guys are red. That’s when, let’s say, this guy

is horizontal here rather than vertical here. So you can always think that we reduced

the situation to where we’re going to look at the problem exactly

as it is, except through this sheet that has only N holes. Let’s put, in mathematical terms, the

dichotomies which are the mini hypotheses, the hypotheses restricted

to the data points. A hypothesis formally is a function. And the function takes the full

input space X, and produces -1, +1. That’s the blue and red

region that we saw. A dichotomy, on the other hand,

is also a hypothesis. We can even give it the same name,

because it’s returning the same values for the points it’s allowed

to return values on. But the domain of it is not

the full input space, but very specifically, x_1 up to x_N. These are– each one of these points belongs to

X, to the input space. But now I’m restricting

my function here. And again, the result is -1,

+1, exactly as it was here. That’s what a dichotomy is. Now if I ask you how many hypotheses

there are, let’s say for the perceptron case? Very easy. It can be infinite. In the case of the perceptron,

it’s infinite. Why? Because this guy is seriously

infinite. So the number of functions is

just infinite, by a margin! So that’s fine. Now if you ask yourself, what is

the number of dichotomies? Let’s look at the notation first,

and then answer the question. The dichotomy is a function

h applied to one of those. So when I talk about it, the value, I

would say h of x_1 or h of x_2, one value at a time. If I decide to use the fancy notation,

I say I’m going to apply small h to the entire vector, x_1, x_2, up to x_N. I would be meaning that you tell me the

values of h of x on each of them. So you return a vector of

the values, h of x_1, h of x_2, up to h of x_N. That’s not an unusual notation. Now if you apply the entire set of

hypotheses H to that, what you are doing is that you are applying

each member here, which is h, to the entire vector. Each time you apply one of those

guys, you get -1, -1, +1, +1, -1, +1, -1, et cetera. So you get a full dichotomy. And then you apply another h, and you

get another dichotomy, and so on. However, as you vary h, which has

an infinite number of guys, many of these guys will return exactly the same

dichotomy, because the dichotomies are very restricted. I have these N points only. And I’m returning +1 or -1

on them only. So how many different ones

can I possibly get? At most, 2 to the N. If H is

extremely expressive, it will get you all 2 to the N. If not, it will get

you smaller than 2 to the N. So I can start with the most infinite type of hypothesis set. And if I translate it into dichotomies,

I have an upper bound of 2 to the N for the number of dichotomies I have.
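Here is a quick numerical sketch of this bound (the sample points and the random perceptrons are hypothetical choices): however many perceptrons we try on 3 fixed points, the number of distinct dichotomies cannot exceed 2 to the 3, which is 8.

```python
import random

random.seed(0)
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # three fixed points x_1, x_2, x_3

def dichotomy(w0, w1, w2):
    # The +1/-1 pattern that the perceptron sign(w1*x + w2*y + w0) produces on the sample.
    return tuple(1 if w1 * x + w2 * y + w0 > 0 else -1 for (x, y) in sample)

# Try a huge number of hypotheses; collect only the distinct dichotomies.
patterns = {dichotomy(*(random.uniform(-1, 1) for _ in range(3)))
            for _ in range(50_000)}
print(len(patterns))   # at most 2 to the 3 = 8, despite the 50,000 hypotheses tried
```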

So this thing now becomes a candidate for replacing the number of hypotheses. Instead of the number of hypotheses,

we’re talking about the number of dichotomies. Now we define the actual quantity. Capital M is red. And I keep it red throughout. And we are going now to define small

m, which I will also keep as red. That will hopefully, and provably

as we will see next time, replace M. It’s called the growth function. What is the idea of

the growth function? The growth function counts

the most dichotomies you can get, using your hypothesis

set on any N points. So here is the game. I give you a budget N.

That’s my decision. You choose where to place

the points, x_1 up to x_N. Your choice is based on your attempt

to find as many dichotomies as possible, on the N points, using

the hypothesis set. So it would be silly, for example, to take

the points and put them, let’s say, on a line, because now you are

restricted in separating them. But you can see that I get the most if I put them in a general constellation. And then you count the number of

dichotomies you are going to get. And what you’re going to report to me is

the value of the growth function on the N that I passed on to you. So I give you N, you go through this

exercise, and you return a number that is the growth function. Let’s put it formally. The growth function is going to be

called m, in red as I promised. And it is the maximum. Maximum with respect to what? With respect to any choice of

N points from the input space. That is your part. I gave you the N. So I

told you what N is. And then you chose x_1 up to x_N with

a view to maximizing something. What are you maximizing? Well, we had this funny notation. H applied to this entire

vector is actually the set of dichotomies, the vectors, -1, +1,

-1, +1, and then the next guy and the next guy–

the actual vectors here. When you put this cardinality on top

of them, you’re just counting them. You’re asking yourself: how

many dichotomies do I get? So you’re maximizing, with respect to the

choice of x_1 up to x_N, this thing. That will give you the most expressive

facet of the hypothesis set on N points, that number. I tell you 10. And you come back with the number 500. It means that by your choice of the x_1 up

to x_10, you managed to generate 500 different guys, according to the

hypothesis set that I gave you. Now because of this, you can see now

that there is an added notation here. It used to be m, but it actually

depends on the hypothesis set, right? It’s the growth function for

your hypothesis set. So I’m making that dependency explicit,

by putting a subscript H. Furthermore, this is

a full-fledged function. M was a number. I give you a hypothesis set. It's a number. Well, it happens to be infinite,

but it’s a number. Here, I’m giving you

a full function. That is, I tell you N, you tell me

what the growth function is. So it’s a little bit more complicated. And because it is this way, m_H is

actually a function of N. That’s the growth function. So that is the notion. Now what can we say about

the growth function? Well, if the number of dichotomies is

at most 2 to the N, because that's as many +1, -1 N-tuples as you can

produce, then the maximum of them is also bounded by the same thing, at most

2 to the N. Well, if we are going to replace M with m, I would say

2 to the N is an improvement over infinity. If we can afford to do it. Maybe it’s not a great improvement,

nonetheless improvement. Now, let’s apply the definition to

the case of perceptrons, in order to give it flesh, so we understand

what the notion is. It’s not just an abstract quantity. We take the perceptrons, and we would

like to get the growth function of the perceptrons. Well, getting the growth function of

the perceptron is quite a task. If I ask you, what is M for the perceptron? Infinity. And then you go home. What is the growth function

of the perceptron? You have to tell me what is the growth

function at N equals 1, what is at N equals 2, at N equals

3, at N equals 4. It's a whole function. So we say, 1 and 2 are easy. Let's start with N equals 3. So I'm choosing 3 points. And I chose them wisely, so that I can

maximize the number of dichotomies. And now I’m asking myself, what is the

value of the growth function for the perceptron for the value

N equals 3? Well, it’s not that difficult. You can see, I can actually get

everything there is to get. Why? Because I can have my line here, or I

can have my line here, or I can have my line here. That’s 3 possibilities times 2 because

I can make it +1 versus two -1’s, or -1 versus two +1’s. We are counting 6 so far. And then I can have my hypothesis

sitting here. That will make them all +1. Or I can have it sitting here, which

makes them all -1. That’s 8. That’s all of them. The perceptron hypothesis is as strong

as you can get, if you only restrict your attention to 3 points. So the answer would be what? Is it already 8? Wait a minute. Someone else chose the points co-linear,

and then found out that if you want these two outer guys to go to the -1 class, and this middle guy to go to the +1 class, there is no perceptron that is capable of doing this. Correct? You cannot pass a line that will make these two guys go to +1, and this guy go to -1, if they are co-linear. Does this bother us? No. Because we are taking the maximum. So this, the quantity you computed

here, since you got to the 8– you cannot go above 8. That defines it. And indeed, you can with authority

answer the question that the growth function for this case,

m at N equals 3, is 8. Now let’s see if we are still in

luck when we go to N equals 4. What is the growth function

for 4 points? We’ll choose the point in

general position again. We are not going to have any

co-linearity, in order to maximize our chances. But then we are stuck with

the following problem. Even if you choose the points in

general position, there is this constellation– there is this particular pattern on the

constellation, which is -1, -1, and +1, +1. Can you generate this

using a perceptron? No. And the opposite of it,

you cannot either. If this was -1, -1, and

this one, +1, +1. Can you find any other 4 points, where

you can generate everything? No. I can play around, and there is always

2 missing guys, or even worse. If I choose the points unwisely,

I will be missing more of them. So the maximum you are getting is that

you are missing 2 out of all the possibilities. And the growth function here is 14, not 16, as it might have been if you had the maximum.
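These two values can be checked by brute force. The sketch below is an editorial addition, not part of the lecture: it enumerates the dichotomies a 2D perceptron can realize on a given constellation, using the standard observation that, for points in general position, any separable dichotomy can be realized by a line through two of the points, slightly perturbed.

```python
from itertools import product

def perceptron_dichotomies(points):
    """Count the dichotomies a 2D perceptron realizes on points in general
    position (no three collinear). Candidate separators: the line through each
    ordered pair of points, with the two anchor points assigned to either side."""
    n = len(points)
    found = {(1,) * n, (-1,) * n}              # lines beyond all the points
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (xi, yi), (xj, yj) = points[i], points[j]
            for si, sj in product((1, -1), repeat=2):
                labels = []
                for k, (xk, yk) in enumerate(points):
                    if k == i:
                        labels.append(si)
                    elif k == j:
                        labels.append(sj)
                    else:   # which side of the line through points i and j?
                        cross = (xj - xi) * (yk - yi) - (yj - yi) * (xk - xi)
                        labels.append(1 if cross > 0 else -1)
                found.add(tuple(labels))
                found.add(tuple(-v for v in labels))
    return len(found)

print(perceptron_dichotomies([(0, 0), (1, 0), (0, 1)]))          # 8
print(perceptron_dichotomies([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 14
```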

Now this is a very satisfactory result, because perceptrons are pretty limited models. We use them because they are

simple, and there’s a nice algorithm that goes with them. So we have to expect that the quantity

we are measuring the sophistication of the perceptrons with, which is the

growth function, had better not be the maximum possible. Because if it’s the maximum possible,

then we are declaring: perceptrons are as strong as can be. Now they break. And they are limited. And if I pick another model, which,

let’s say– just for the extreme case– the set of all hypotheses. What would be the growth function

for the set of all hypotheses? It would be 2 to the N, because

I can generate anything. So now, according to this measure that

I just introduced, the set of all hypotheses is stronger

than the perceptrons. Satisfactory result, simple

but satisfactory. Now what I’m going to do– I’m going to take some examples, in

which we can compute the growth function completely for all values of

N. You can see that if I continued with this and say, let’s

go with the perceptron. 5 points. You put the 5 points,

and then you try. Am I missing this? Or maybe if I change the position

of the points. It’s just a nightmare, just to get 5. And basically, if you just do it by

brute force, it’s not going to happen. So I’m taking examples where we can

actually, by a simple counting argument, get the value of the growth

function for the entire domain, N from 1 up to infinity, in

order to get a better feel for the growth function. That’s the purpose of this portion. Our first model, I’m going

to call positive rays. Let’s look at what positive

rays look like. They are defined on the real line. So the input space is

R, the real numbers. And they are very simple. From a point on, which we are going to

call ‘a’– this is the parameter that decides one hypothesis versus

the other in this particular hypothesis set. All the points that are

bigger go to +1. All the points that are

smaller go to -1. And it’s called positive ray, because

here is the ray– very simple hypothesis set. Now in order to define the growth

function, I need a bunch of points. So I’m going to generate some points. I’m going to call them x_1 up to x_N. And I am going to choose them

as general as possible. I guess there is very little generality

when you’re talking about a line. Just make sure that they don’t

fall on each other. If they fall on each other, you cannot

really dichotomize them at all. If you put them separately,

you’ll be OK. So you have these N points. Now when you apply your hypothesis,

the particular hypothesis that is drawn on the slide, to these points,

you are going to get this pattern. True? And you’re asking yourself, how many

different patterns I can get on these N points by varying my hypothesis,

which means that I’m varying the value of ‘a’? That is the parameter that gives me

one hypothesis versus the other. Formally, the hypothesis set is a set

from the real numbers to -1, +1. And I can actually find

an analytic formula here. If you want an analytic formula,

you remember the sign function? This is the sign of x minus a. If you apply it, that's exactly

what I described. Now we ask ourselves, what

is the growth function? Here is a very simple argument. If you have N points, the value of the

dichotomy– which ones go to blue and which ones go to red– depends on

which segment between the points ‘a’ will fall in. If ‘a’ falls here, you get this pattern. If ‘a’ falls here, this guy will be red. And the rest of the guys will be blue. So I get a different dichotomy. I get different dichotomies when

I choose a different line segment. How many line segments are

there to choose from? I have N points. I have N minus 1 sandwiched ones, and

one here when all of them are red, and one here when all of them are blue. Right? So I have N plus 1 choices. And that’s exactly the number of

dichotomies I’m going to get on N points, regardless of what N is. So I found that the growth function,

for this thing, is exactly N plus 1. Let’s take a more sophisticated model,

and see if we get a bigger growth function. Because that’s

the whole idea, right? The next guy is positive intervals. What are these? They’re like the other guys, except

they’re a little bit more elaborate. Instead of having a ray,

you have an interval. Again, you’re talking

about the real line. And you are going to define

an interval from here to here. And anything that lies within

here, will map to +1 and will become blue. And anything outside, whether it’s right

or left, will go to -1. That’s obviously more powerful than the

previous one, because you can think of the positive ray as having

an infinite interval. That’s fine. So you put the points. We have done this before. And they get classified this way. And I’m asking myself, how many

different dichotomies I can get now by choosing really 2 parameters, the

beginning of the interval and the end of the interval. These are my 2 parameters, that will tell

me one hypothesis versus the other. How many different patterns can I get? Again, the function is very simple. It’s defined on the real numbers.
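As a sanity check on the counting arguments in this section, the patterns can also be enumerated by brute force. Here is a minimal Python sketch (the helper names and sample points are my own, not from the lecture) that counts the dichotomies of positive rays and positive intervals directly, and matches the counts the arguments give: N plus 1, and N plus 1 choose 2, plus 1.

```python
from itertools import combinations

def ray_dichotomies(xs):
    """Dichotomies realizable by positive rays h(x) = sign(x - a)."""
    xs = sorted(xs)
    # One candidate 'a' inside each of the N + 1 segments around the points;
    # only the segment containing 'a' matters, not its exact value.
    cuts = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in xs) for a in cuts}

def interval_dichotomies(xs):
    """Dichotomies realizable by positive intervals: +1 strictly inside (lo, hi)."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
    dichos = {tuple(1 if lo < x < hi else -1 for x in xs)
              for lo, hi in combinations(cuts, 2)}
    dichos.add(tuple(-1 for _ in xs))  # both ends in the same segment: all -1
    return dichos

points = [0.5, 1.7, 3.2, 4.4, 6.0]  # any distinct points on the line
for N in range(1, 6):
    xs = points[:N]
    assert len(ray_dichotomies(xs)) == N + 1                      # growth function N + 1
    assert len(interval_dichotomies(xs)) == N * (N + 1) // 2 + 1  # (N+1 choose 2) + 1
```

Each candidate cut is taken in the middle of a segment, mirroring the argument that only the segment containing the parameter matters, not its exact value.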

And now the counting argument, which is an interesting one. The way you get a different dichotomy

is by choosing 2 different line segments, to put the ends

of the interval in. If I start the interval here and

end it here, I get something. If I start the interval here and end

it here, I get something else. If I start the interval here and

end here, I get something else. And that is exactly one-to-one mapping

between the dichotomies and the choice of 2 segments. So if this is the case, then I can

very simply say that the growth function, in this case, is the number

of ways to pick 2 segments out of the N plus 1 segments. And that would be N plus 1 choose 2. There is only 1 missing. When you count, there are 2 rules– make sure that you count everything,

and make sure that you don’t count anything twice. Very simple. So we counted almost everything. But the missing guy here is what? Let’s say that all of them are blue. Is this counted already? Yes, because I can choose this

segment and this segment. And that is already counted in this. But if they’re all red,

what does that mean? It means that the beginning of the

interval, and the end of the interval, happen to be within the same segment. So they didn’t capture any point. And that, I didn’t count. And it doesn’t matter which segment

they’re in, because I will get just the all reds. So it’s one dichotomy. So all I need to do is just add 1. And that’s the number. Do a little algebra, and you get this. That is the growth function

for this hypothesis set. And now I’m happy, because

I see it’s quadratic. It’s more powerful than the previous

guy, which was linear. Now let’s up the ante, and

go to the third one. Convex sets. This time, I’m taking the

plane, rather than the line. So it’s R squared. And my hypotheses are simply

the convex regions. If you look at the values of x at

which the hypothesis is +1, this has to be a convex region,

any convex region. A convex region is a region where,

if you pick any 2 points within the region, the entirety of the line segment

connecting them lies within the region. That’s the definition. So this is my artwork

for a convex region. You take any 2 points and– So this is an example of that. The blue is the +1. And the red is the -1. That’s the entire space. So this is a valid hypothesis. Now you can see that there is

an enormous variety of convex sets that qualify as hypotheses. But there are some which

don’t qualify. For example, this one is not convex,

because of this fellow. Here’s the line segment, and

it went out of the region. So that’s not convex. We understand what the

hypothesis set is. Now we come to the task. What is the growth function

for this hypothesis set? In order to answer this, what you

need is– you put your points. I give you N, and you place them. So here is a cloud of points. I give you N, and you say, it seems

like putting them in general position is a good idea. So let’s put them in

a general position. And let’s try to see how many patterns

I can get out of these, using convex regions. Man, this is going to be tough

because I can see– Let’s see. First, I cannot get all

of them, right? Because let’s say I take the outermost

points, and map them all to +1. This will force all the internal points

to be +1, because I’m using a convex region. Therefore, I cannot get +1’s for

the out guys, and any -1 whatsoever inside. So that excludes a lot of dichotomies. Now I have to do real

counting. But wait a minute. The criterion for choosing the cloud of

points was not to make them look good and general, but to maximize

your growth function. Is there another choice for the

points that gives me more hypotheses than these? As a matter of fact, is there another

choice, for where I put the points, that will give me all possible dichotomies

using convex regions? If you succeed in that, then you

don’t care about this cloud. The other one will count, because

you are taking the maximum. Here is the way to do it. Take a circle, and put your points

on the perimeter of that circle. Now I maintain that you can get any

dichotomy you want on these points. What is the argument? Well, pick your favorite one. I have a bunch of blues

and a bunch of reds. Can I realize this using

a convex region? Yes. I just connect these guys. And the interior of this

goes to +1. And whatever is outside

goes to -1. And I am assured it’s convex, because

the points are on the perimeter of a circle. That means what? That means that the growth

function is 2 to the N, notwithstanding the other guy. You realize now a weakness in

defining the growth function as the maximum, because in a real learning

situation, the chances are the points you’re going to get are not going to

end up on a perimeter of a circle. They are likely to be

all over the place. And some of them will be interior

points, in which case you’re not going to get all possibilities. But we don’t want to keep studying the

particular probability distribution, and the particular data

set you get, and so on. We would like to have

a simple quantity. And therefore, we’re taking the maximum

overall, which will have a simple combinatorial property. The price we pay is that, the chances are

the bound we are going to get is not going to be as tight as possible. But that’s a normal price. If you want a general result that

applies to all situations, it’s not going to be all that tight

in any given situation. That is the normal tradeoff. But here, the growth function is

indeed 2 to the N. Just as a term, when you get all

possible hypotheses, all possible dichotomies, you say that the hypothesis

set shattered the points– broke them in every possible way. So we can say, can we shatter

this set, et cetera? That’s what it means. You get all possible combinations

on them. Just as a term. Now let’s look at the 3 growth functions

on one slide, in order to be able to compare. We started with the positive rays, and

we got a linear growth function. And then we went on to the

positive intervals. And we had a quadratic function. And that is good, because we are getting

more sophisticated and the growth function is getting bigger. And then we went to convex

sets, which are powerful and two-dimensional

and all, but not all that powerful. And although we expected a bigger

growth function, what we got is inordinately bigger. Maybe we should have gotten N cubed. But that's what we have. At least it goes in the right direction. So sometimes that thing

will be too much. But in general, you can see the trend

that, with more sophisticated, you get a bigger growth function. Now let’s go back to the big picture,

to see where that growth function will fit. Remember this inequality? Oh, yes. We have seen it. We have seen it often. We are tired of it! What we are trying to do is replace M.
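The payoff of that replacement can be sketched numerically before any of the theory. The sketch below computes only the shape of the right-hand side, 2 times m(N) times exp(-2 epsilon squared N), working in log base 10 so that huge and tiny values do not overflow; the particular polynomial, the 2-to-the-N alternative, and epsilon = 0.1 are arbitrary choices of mine, not values from the lecture.

```python
import math

def log10_bound(log10_growth, N, eps):
    """log10 of the right-hand-side shape 2 * m(N) * exp(-2 * eps**2 * N),
    computed in logs to avoid overflow for large N."""
    return math.log10(2.0) + log10_growth(N) - 2.0 * eps**2 * N / math.log(10)

eps = 0.1
log_poly = lambda N: 10 * math.log10(N + 1)  # m(N) = (N + 1)**10, a high-order polynomial
log_expo = lambda N: N * math.log10(2)       # m(N) = 2**N, a model with no break point

assert log10_bound(log_poly, 100, eps) > 0       # bound still meaningless at N = 100
assert log10_bound(log_poly, 10_000, eps) < -40  # essentially zero once N is large
assert log10_bound(log_expo, 10_000, eps) > 100  # with 2**N the bound explodes instead
```

Any polynomial, however high its order, eventually loses to the negative exponential in N; 2 to the N never does, which is why a polynomial growth function is the goal.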

And we decided to replace it with the growth function m. M can be infinity. m is a finite number, at most

2 to the N, so that’s good. What happens if we replace

M with small m? Let’s say that we can do that, which

we’ll establish in the next lecture. What will happen? If your growth function happens to

be polynomial, you are in great shape. Why is that? Because if you look at this quantity,

this is a negative exponential. epsilon can be very, very small. epsilon squared can be really,

really, really small. But this remains a negative exponential

in N. And for any choice of epsilon you wish, this will kill the

heck out of any polynomial you put here, eventually. Right? I can put a 1000th-order polynomial,

and can have epsilon equal 10 to the minus 6. And if you’re patient enough, or if your

customer has enough data, which would be an enormous amount of data, you

will eventually get this to win. And you will get the probability to be

diminishingly small, which means that you can generalize. That’s a very attractive observation,

because now all you need to do is just declare that this is

polynomial, and you’re in business. We saw that it’s not that easy

to evaluate this explicitly. But maybe, there is a trick that will

make us able to declare that it is polynomial. And once you declare that a hypothesis

set has a polynomial growth function, we can declare that learning is feasible

using that hypothesis, period. We may become finicky and ask ourselves,

how many examples do you need for what, et cetera? But at least, we know we can do it. If you’re given enough examples, you

will be able to generalize from a finite set, albeit big, to the general

space with a probability assurance. So that’s pretty good. I’m happy that this is the case. So maybe we can, as I mentioned, just

prove that m_H is polynomial, the growth function is polynomial. Can we do that? Maybe we can. Maybe we cannot. Here’s the key notion that

will enable us to do that. We are going to define what

is called the break point. You give me a hypothesis set, and

I tell you it has a break point. Perceptrons, 4. Another set, the break point is 7. Just one number. That’s much better than giving

me a full growth function for every N. Just one number. So what is the break point? The definition is the following. It’s the point at which you fail

to get all possible dichotomies. So you can see that, if the break point

is 3, this is not a very fancy hypothesis set. I can’t even generate all 8

possibilities on 3 points. If the break point is 100, well, that’s

a pretty respectable guy, because I can generate everything up to 99 points,

all 2 to the 99 of them. And then I start failing at 100. So you can see that the break point

also has a correspondence to the complexity of the hypothesis set. If no data set of size k can

be shattered by H– that is, if there is no choice of

k points for which you are able to generate all possible dichotomies. Then you call k a break point for H. So that's what it means. You can't shatter, so you get fewer than

2 to the k dichotomies, where 2 to the k is all the possibilities for k data points. So for the 2D perceptron, can you

think of what is the break point? We did it already. We didn’t explicitly say

it in those terms. But this is the hint. For 3, we did everything. For 4, we knew we cannot

do everything. So it doesn’t matter whether

it’s 14 or 15 or 12 or 5. As long as it breaks, it breaks. It’s not 16. And therefore, in this case,

the break point is 4. That number 4 will characterize the

perceptrons. Just to tell me, I have a hypothesis set. And it is defined– I don’t want to know the input space. Wait a minute. OK, I’m not going to tell

you the input space. I’m going to tell you the hypotheses. The hypotheses are produced by the– I don’t want to hear it. Just tell me the break point, and I will

tell you the learning behavior. Also, if you have a break point, every bigger point is also

a break point. That is, if you cannot get all

possibilities on 10 points, then you certainly cannot get

all of them on 11. If you could get them on 11,

just kill one. And you will have gotten them on 10. Let’s look at the 3 examples, and

find what are the break points. Positive rays had this guy. This is a formula. We can plug in for N. And we

ask ourselves, when do I get to the point where I no longer get 2 to the N,

numerically for a particular value. What is the break point here? For N equals 1, I get 1 plus 1. That's 2, which also happens to be 2 to the 1. For N equals 2, N plus 1 is 3. Oh, that's less than 4. So 2 must be a break point. Since we invested in computing the

function, we are just being lazy now and substituting. But you could go back to the original thing,

and say that’s obvious. Because this particular combination

of points– if I want the rightmost point to be

red, and the left one to be blue, there is no way for the positive

ray to generate that. And therefore, that 2 is a break point. There's something where I fail. Let's go for this one. We need faster calculators now: half N squared, plus half N, plus 1, et cetera. When I put N equals 1, it gives me 2. It must be the correct formula. When I put 2, I get 4. And it checks out. What is the break point? It must be bigger than the other guy,

because it’s more elaborate. And you realize it’s 3. If you put 2 points,

you will get the 4. And if you put 3, you’ll get

7, which is short of 8. Again, that’s not a mystery. That’s what you cannot get

using the interval. You cannot get the middle point to be

red while the other ones are blue. So you cannot get all possibilities

on 3 points. Therefore, 3 is a break point. What is the break point

for the convex sets? Tell me at how many points I fail. Well, I'm never going to fail. So if you like, you can

say this is infinity. Let’s define it this way. So also, the break point–

just a single number– has the property we want. It gets more sophisticated as the

model gets more sophisticated. So what is the main result? The main result is that the

first part will be– if you don’t have a break point,

I have news for you. The growth function is

2 to the N. OK, yes. That’s the definition. Thank you. So that cannot possibly

be the main result. So what is the main result? The main result is that if you

have a break point, any break point, 1, 5, 7000. Just tell me that there

is a break point. You don’t even have to tell

me what is the break point. We are going to make a statement

about the growth function. The growth function is– do I hear a drum roll? [MAKES DRUM SOUND] It’s guaranteed to be polynomial in N. Wow, we have come a long way. I used to ask you what

are the hypotheses, and count them. That was hopeless because

it’s infinity. We defined the growth function,

and we have to evaluate it. That was painful. Then we found the break point. Maybe it’s easier to compute

the break point. I just want to find a clever way,

and say that I cannot get it. Now all I need to hear from you

is that there is a break point. And I’m in business as far as the

generalization is concerned, because I know that regardless of what polynomial

you get, you will be able to learn eventually. I will become more particular, and ask you

what is the break point, when I try to find the budget of examples you need

in order to get a particular performance. But in principle, if I just want to say

you can use this hypothesis set, and you can learn, I just want you

to tell me I have a break point. That’s all I want. This is a remarkable result. And I have to give you a puzzle

to appreciate it. The idea of the puzzle

is the following. If I just tell you that there’s

a break point, the constraint on the number of dichotomies you get, because there is a break point, is enormous. If I tell you a break point is,

let’s say, 3, how many can you get on 100 points? On those 100 points, for any choice of

3 guys, you cannot have all possible combinations– at any 3 points,

all 100 choose 3 of them. So the combinatorial restriction

is enormous. And you will end up losing possible

dichotomies in droves, because of that restriction. And therefore, the thing that used to

be 2 to the N, if it’s unrestricted, will collapse to polynomial. Let’s take a puzzle, and try to

compute this in a particular case. Here is the puzzle. We have only 3 points. And for this hypothesis set, I’m telling

you that the break point is 2. So you cannot get all possible four

dichotomies on any 2 points. If you put x_1 and x_2, you cannot get

-1 -1, -1 +1, +1 -1, and +1 +1. All of them. You cannot get it. One of them has to be missing. So I’m asking you, given that this is

the constraint, how many dichotomies can you get on 3 points? You can see, this is what I’m trying to

do because I’m telling you that the restriction on 2 will– If I didn’t have the restriction,

I would be putting eight. So I’m just telling you this case. So how many do I get? For visual clarity, I’m going to

express them as either black or white circles, just for you to be able to see them, instead of writing -1 or +1. This dichotomy is fine. It doesn't violate anything. I have only one possibility so far. So we keep adding. Everything is fine. As a matter of fact, everything will

remain fine until we get to four, because the whole idea is that I cannot

get all four on any of them. So if I have less than four, I cannot

possibly get four combinations. You see what the point is. This is still allowed. I'm going through them in binary order. So this is 0 0 0, 0 0 1, et cetera. I'm still OK, right? Am I still OK? [MAKES BUZZER SOUND] You have violated the constraint. You cannot put the last row, because

it now violates the constraint. I have to take it out. So let’s take it out. Try the next guy. Maybe we are in luck. Are we OK? OK. That’s promising. So let’s go for the next guy. Maybe we’ll get it. Are we OK? [MAKES BUZZER SOUND] Tough. So we have to take out the last row. How about this one? Nope. We take it out. We don’t have too many

options left, right? Actually, this is the last guy. It had better work. Does it work? No. So that’s what we can do. We lost half of them. Now you may think, maybe you

messed it up because you started very regularly. Just started from all 0, 0 0 1. But if I started differently,

I may be able to achieve more. It’s conceivable. Please don’t lose sleep over it. The only row you are going to be able

to add to this table is this one. This is indeed the solution. And you can verify it at home. Now we know that indeed the

break point is a very good restriction. And we are going, in the next lecture,

to prove that it actually leads to a polynomial growth, which is

the main result we want. Let me stop here. And we will take the questions

after a short break. Let’s start with the questions. MODERATOR: The first question is,

what if the target or the hypotheses are not binary? PROFESSOR: There is a counterpart for the entire theory that

I’m going to develop, for real-valued functions and other

types of functions. The development of the theory is

technical enough, that I’m going to develop it only for the binary case,

because it is manageable. And it carries all of the

concepts that you need. The other case is more technical. And I don’t find the value of going to

that level of technicality useful, in terms of adding insight. What I’m going to do is, I’m going

to apply a different approach to real-valued functions, which is

the bias-variance tradeoff. And it’s a completely different approach

from this one, that will give us another angle on generalization

that is particularly suitable for real-valued functions. But the short answer is that, if the

function is not binary, there is a counterpart to what I’m

saying that will work. But it is significantly more technical

than the one I am developing. MODERATOR: Just as a sanity check. When the hypothesis set can

shatter the points, this is a bad thing, right? PROFESSOR: OK. There is a tradeoff that will stay

with us for the entire course. It’s bad and good. If you shatter the points, it’s good

for fitting the data, because I know that if you give me the data, regardless

of what your data is, I'm going to be able to fit them, because I

have something that can generate a hypothesis for any particular

set of combinations. So if your question is,

can I fit the data? Then shattering is good. When you go to generalization,

shattering is bad, because basically you can get anything. So it doesn’t mean anything

that you fit the data. And therefore, you have less hope

of generalization, which will be formalized through the

theoretical results. And the correct answer is, what is the

good balance between the two extremes? And then we’ll find a value for which

we are not exactly shattering the points, but we are not very restricted,

in which we are getting some approximation, and we’re getting some generalization. And that will come up. MODERATOR: Is there a similar

trick to the one you used for convex sets in higher dimensions? PROFESSOR: So if you– The principle I explained,

I explained it in terms of two dimensions and perceptrons. If you look at the essence

of it, the space is X. It could be anything. The only restrictions I have

are binary functions. So this could be a high-dimensional space. And the surfaces will be very

sophisticated surfaces. And all I’m reading off, as far as this

lecture is concerned, is how many patterns do I get on a number

of N points. MODERATOR: Also a question

on the complexity. Why is usually polynomial time

considered as acceptable? PROFESSOR: OK. Polynomial, in this case, is polynomial

growth in the number of points N. It just so happens that we are working

with the Hoeffding inequality that gives us a very helpful term, which

is the negative exponential. And therefore, if you get a polynomial,

as I mentioned, any polynomial, you are guaranteed that for

a large enough N, the probability– the right-hand side of the Hoeffding,

including the growth function, will be small. And therefore, the probability of

something bad happening is small. Now obviously, there are other functions

that also will be killed by the negative exponential. For example, if I had a growth function

of the form, let’s say, e to the square root of N, that’s

not a polynomial. But that will also be killed by the

negative exponential, because it’s square root versus the other one. It just so happens that we are in the

very fortunate situation that the growth function is either identically

2 to the N, or else it’s polynomial. There is nothing in between. If you draw something that is

super-polynomial and sub-exponential and try to find the hypothesis set

for which this is a growth function, you will fail. So I’m getting it for free. I’m just taking the simplicity

of the polynomial, because lucky for me, the polynomials are the

ones that come out. And they happen to serve the purpose. MODERATOR: OK. A few people are asking, could you

repeat the constraints of the puzzle? Because they didn’t get the– PROFESSOR: OK. Let’s look at the puzzle. I am putting 3 bits on every row. I’m trying to get as many different rows

as possible, under the constraint that if you focus on any 2 of them– so

if I focus on x_1 and x_2 and go down the columns, it must be that one

of the possible patterns for x_1 and x_2 is missing. Because I’m saying that 2 is

a break point, so I cannot shatter any 2 points. Therefore, I cannot shatter x_1 and x_2,

among others, meaning that I cannot get all possible patterns. There are only four possible patterns,

which is, if you take it as a binary 0 0, 0 1, 1 0, 1 1. And I’m representing them using

the circles. In this case, the x_1 and x_2

get 0 0, so to speak. If I keep adding patterns– So let's look here. x_1 and x_2, how many patterns do they have? They have this pattern. They have it again. That doesn't count. So there's only one pattern

here, plus one, is two. So on x_1 and x_2, I have

only two patterns. So I haven’t violated anything, because

I will be only violating if I get all four patterns. So I’m OK, and similarly

for the other guys. Things become interesting when you

start getting the fourth row. Now again, if you look at the first 2

points, I get one pattern here and one pattern here. There are only two patterns. Nothing is violated as far as these

2 points are concerned. But the constraint has to be satisfied

for any choice of 2 points. So if you particularly choose x_2 and x_3,

and count the number of patterns, you realize, 0 0,

0 1, 1 0, 1 1. I am in trouble. That’s why we put it in red. Because now these guys have

all possible patterns. And I know, by the assumption of the

problem, that I cannot get all four patterns on any 2 points. So I cannot get this. So I’m unable to add this row

under those constraints. And therefore, I’m taking it away. And I’m going through the exercise. And every time I put a row, I keep

an eye on all possible combinations. So here, I put– let’s look at x_1 and x_2. 1 pattern, 2, 3. I’m OK. x_2 and x_3, 1 pattern, which

is here and here. 2, 3. I’m also OK. And then you put x_1, x_3. Here is a pattern. It repeats here. 0 0 and 0 0. So that’s one. And then I get this one

and this one, 3. So this one is perfect,

everyone. Not perfect in any sense, except

that I didn’t violate anything. So I’m allowed to put that row. Now when I extend this further, and start

putting the new guys, for this guy, there is a violation. And you can scan your eyes and

try to find the violation. And I’m highlighting it in red. So I’m showing you that for x_1 and

x_3, there are the four patterns. Here’s one pattern, the second one. I didn’t count this one, just because

it’s already happened. So I just highlight four different ones,

and then the third one and fourth one. So I cannot possibly add this row,

because it violates the constraint on these 2 points. So I take it out and keep adding. Another attempt, this is the next guy. It still violates. Why does it violate? For the same argument. Look at the red guys. You find all possible patterns. So I cannot have it. So we take it away. And then the last one that

is remaining is this guy. And that also doesn’t work, because

it violates it for those guys. You can look at it and verify. And the conclusion here is that

I cannot add anything. So that’s what I’m stuck with. And therefore, the number of different

rows I can get under the constraint that 2 is a break point– in this case, is 4. Obviously, the remark I mentioned is

that maybe you can start instead of gradually from 0 0 0, 0 0 1,

maybe you can start more cleverly or something. But any way you try it, it's

sufficiently symmetric in the bits that it doesn’t make a difference. You will be stuck with at most 4. MODERATOR: OK. In the slide with the Hoeffding

inequality, does anything change when you change– specifically, does a probability measure

change when you change from a hypothesis to dichotomy? PROFESSOR: For this one? MODERATOR: Yeah. PROFESSOR: Yeah. The idea here, M is the number of hypotheses, period. So it’s infinity for perceptrons. We have to live with that. In our attempt to replace it with the

growth function, we are going to replace it by something that is not

infinite, bounded above by 2 to the N. As you can see, 2 to the N is not

really helpful because I have a positive exponential and

negative exponential. And that’s not very decisive. Therefore I am trying to find if I can

put a growth function– not only put the growth function here, but also

show that the growth function is polynomial for the models

of interest that I have, and therefore be able to get

this to be a small quantity for a real learning model, like the perceptron

or the other one, neural networks, et cetera. All of these will have a polynomial

growth function, as we will see. So that’s where the number of

hypotheses, which is M, goes to the number of dichotomies, which

is the growth function. Not a direct substitution,

as we will see. There are some technicalities

involved. But that is what gets me the right-hand

side to be a manageable right hand side, and goes to 0 as N grows,

which tells me that the probability of generalization will be high. MODERATOR: OK. Is there a systematic way

to find the break points? PROFESSOR: There is. It's not one size fits all. There are arguments– for example, you

can go for neural networks. And sometimes you find it by finding

a particular combination that you cannot break, and argue that this

is the break point. Sometimes you can argue

by– Let me try to find a crude estimate

for the growth function. Let’s say the growth function

cannot be more than this. And then as you go by, you realize

that this is not exponential. So there has to be a break point

at some point. This would be less than 2 to the N, and

therefore will be a break point. So in that case, the estimate for the

break point will be just an estimate. It will not be an exact value. But it will be a maximum. We have a question in house. STUDENT: Hi. So in this slide, the top N is the

number of testing points and the bottom N is the number of training points. PROFESSOR: Yeah. N is always the size of the sample. And it’s a question of interpretation

between the two, whether that sample is used for testing, which means that you

have already frozen your hypothesis, and you are just verifying,

testing it. Or in the other case, you haven’t

frozen your hypothesis. And you are using the same sample

to go around and find one. And you are charged for the going

around aspect by M. STUDENT: So let’s say that our

customer gives us k sample points. How do we decide how many of them do

we reserve for testing points, how many for training? PROFESSOR: This is

a very good point. There will be a lecture down the

road called validation, in which this is going to be addressed

very specifically. There are some mathematical results, but mostly there are a few rules of thumb that I’m

going to state without proof, that have stood the test of time. And one of the rules of thumb has to

do with how many we reserve, in order first not to diminish the

training set very much, and still have a big enough test set so that

the estimate is reliable? So this will come up. Thank you. There is another question. STUDENT: Hi, professor. I have one question. So for 2 hypotheses that have the same

dichotomy, is it true that the in-sample error is the same

for the 2 hypotheses? PROFESSOR: OK. If they have the same dichotomy, it’s even

a stronger condition than this, because they return exactly

the same values. Now the in-sample error is

the fraction of points I got wrong. The target function is fixed. So that is not going to change. So obviously, I’m going to get

the same pattern of errors. And if I get the same pattern of errors,

then obviously I’m getting the same fraction of errors,

among other things. Now if you’re asking, for these

2 hypotheses, what is the out-of-sample error? That’s a different story, because for the

out-of-sample error, you take the hypothesis in its entirety. So in spite of the fact that it’s the

same on the set of points, it may not be the same on the entire input space,

which it isn’t because they’re different hypotheses. And therefore, you get

a different E_out. But the answer is yes. You will get the same in-sample error. STUDENT: Oh, yes. I see. That’s why I was asking. Because I think that the out-of-sample

error is different for 2 hypotheses. So can we replace the M with– PROFESSOR: Exactly. And the biggest technicality

in the proof– We were saying, we’re going to

replace M by the growth function. That’s a very helpful thing. Now, there has to be a proof. And I will argue for the proof, and

the overlapping aspects, and some of this. The key point is, what do

I do about this fellow? Because when I consider the sample, this

one is very much under control. As you said, if I have 2 hypotheses

that are the same here, they are the same here. But they are not the same here. So the statement here depends on

E_out, which depends on the whole input space. So how am I going to

get away with that? That’s really the main technical

contribution for the proof. And that will come up next time. STUDENT: Sure, thank you. PROFESSOR: Sure. MODERATOR: So– why is it called a growth function? PROFESSOR: A growth function. I really– The person who introduced this

called it a growth function. I guess he called it a growth function,

because it grows, as you increase N. I don’t think there is any

particular merit for the name. MODERATOR: Is there– what is a real-life situation similar

to the one in the puzzle, where you realize that this break point

may be too small? PROFESSOR: OK. The first order of business is to

get the break point out of the way– once there is a break point, we are in business. The second is, how does the value

of the break point relate to the learning situation? Do I need more examples when

I have a bigger break point? The answer is yes. What is the estimate? And there’s a theoretical

estimate, a bound. Maybe the bound is too loose. So we’ll have to find practical rules

of thumb that translate the break point to a number of examples. All of this is coming up. So the existence of the break point

means learning is feasible. The value of the break point tells us

the resources needed to achieve a certain performance. And that will be addressed. MODERATOR: Is there a probabilistic

statement for the Hoeffding inequality that is an alternative to the

case-by-case discussion on M’s growth rate in N? PROFESSOR: There are

alternatives to Hoeffding, and you can get different results

or emphasize different things. I am sticking to Hoeffding. And I’m not indulging too much into its

derivation, or the alternatives, because this is a mathematical

tool that I’m borrowing. And I’m taking it for granted. And I picked the one that will help

us the most, which is this one. So yes, there are variations. But I am deliberately not getting

into them, in order not to dilute the message. I want people to become so incredibly

familiar and bored with this one that they know it cold. Because when we get to modify it,

including the growth function and the other technical points, I’d like the

base point to be completely clear in people’s mind, so that they don’t get

lost with the modifications. So that’s why I’m sticking to this. MODERATOR: I think that’s it. PROFESSOR: Very good. We’ll see you next time.
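The argument above can be illustrated numerically. The following sketch (mine, not from the lecture) plugs an exponential versus a polynomial count of effective hypotheses into a Hoeffding-style bound of the form 2·m·exp(−2ε²N), using the positive-rays growth function N + 1 as the polynomial example:

```python
import math

def hoeffding_style_bound(m, n, eps):
    """Union-style bound 2 * m * exp(-2 * eps^2 * N) on the probability
    that in-sample and out-of-sample errors differ by more than eps."""
    return 2.0 * m * math.exp(-2.0 * eps ** 2 * n)

eps = 0.1
for n in (100, 500, 1000):
    # m = 2^N: the positive exponential beats the negative one; the bound blows up
    exponential = hoeffding_style_bound(2.0 ** n, n, eps)
    # m = N + 1 (positive rays): the negative exponential wins; the bound shrinks
    polynomial = hoeffding_style_bound(n + 1.0, n, eps)
    print(f"N={n}: 2^N gives {exponential:.3g}, N+1 gives {polynomial:.3g}")
```

With m = 2^N the positive exponential overwhelms the negative one, so the bound is useless; with a polynomial m the right-hand side goes to 0 as N grows, which is the sense in which the probability of generalization becomes high.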

A genius who enjoys sharing.

I have not seen a better professor than Prof. Yaser! He explains concepts with such ease and with so much depth! Simply awesome 🙂 Thanks to Caltech for this course!

This is one of the best lectures! But I thought the puzzle was rushed; I didn't really follow it.

Excellent lecture. How the puzzle relates to the growth function and break point concepts is not clear at all.

Hi! I think I got the ideas so I will try to explain them to you as I understood them =) The whole point in this lecture as I understood it was to give as small a probability bound as possible for the Hoeffding inequality in the training part at 8:55; that is, we wanted to replace the big capital M with a smaller number, because M could have been infinitely big, because the whole input space is infinite. The Hoeffding inequality gave us a guarantee that we can generalize, that is, get a small out-of-sample error.

Note that M was the total number of hypotheses. Mostafa argued next that in practice many of the hypotheses are extremely overlapping; for example, Hypothesis 1 would give almost exactly the same performance as Hypothesis 2, so we don't have to consider both of them. This is where he defined the concept of 'dichotomies' at 21:55, that is: hypotheses that give significantly different results, i.e. hypotheses which are not very overlapping in performance.

He next argued that the number of all possible dichotomies can be AT MOST 2^N. The growth function was defined as the maximum number of dichotomies on given inputs (x1, x2, …, xN); note that this is not the same as the number of all possible dichotomies = 2^N. This might be a bit unclear, so I will use Mostafa's examples at 36:15 and 52:50. Notice that with the positive rays when N = 2, that is, we have inputs (x1, x2), the number of all possible dichotomies is 2^2 = 4, correct?

The cases are: (red, red), (blue, blue), (red, blue), (blue, red), so we have 4 of them =) But with positive rays we cannot get all the possible dichotomies. We can get N + 1 of them, which is 2 + 1 = 3. We can get the cases (red, red), (blue, blue), (red, blue), but WE CANNOT get the case (blue, red) with positive rays. I suggest you look at 52:50 at this point a few times =). The 'break point' k is the minimum number of inputs (x1, x2, …, xk) for which you CANNOT get all the possible dichotomies.

At 52:50, if you look at the case of positive rays, when the number of inputs is 1, then the growth function and the number of all possible dichotomies are equal: g(1) = 1 + 1 = 2^1 = 2. Therefore 1 is NOT a break point, because you CAN get all possible dichotomies; the cases are (red), (blue), and you can do this with positive rays. The break point is 2 with positive rays because you can't get all possible dichotomies, that is, you can't get the case (blue, red): g(2) = 2 + 1 = 3 != 2^2 = 4.

Sorry by the way about my notation: by g(N) I meant the growth function, which Mostafa labeled as small m. Another way of defining the break point is the smallest number for which the growth function does not equal the number of all possible dichotomies, that is, g(k) != 2^k. Lastly he concluded that if we have a break point with the set of hypotheses, the growth function is POLYNOMIAL IN N, which is good news, because now we can bound the Hoeffding inequality to be a small probability.
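The g(N) = N + 1 claim for positive rays is easy to verify by brute force. Here is a small sketch (my own, not from the lecture or the comment thread) that slides a threshold a across the sorted points and collects the distinct sign patterns of h(x) = sign(x − a):

```python
def positive_ray_dichotomies(points):
    """All distinct dichotomies that positive rays h(x) = sign(x - a)
    can generate on the given 1-D points."""
    xs = sorted(points)
    # one threshold below all points, one between each consecutive pair,
    # and one above all points covers every distinct pattern
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

print(len(positive_ray_dichotomies([0.5, 1.7, 3.2, 4.4])))  # N + 1 = 5
print(len(positive_ray_dichotomies([0.5, 1.7])))            # 3, not 2^2 = 4
```

The missing pattern for N = 2 is exactly the (blue, red) case described above: a ray can never classify a smaller point +1 and a larger point −1.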

This means that we can learn, that is, generalize well =) He also says that he will prove this result in later lectures. The puzzle is about how the break point restricts the number of all possible inputs. Here you should watch 56:35. If your break point is 3, but you have 100 inputs (x1, x2, …, x100), the break point constraint will eliminate most of all 2^100 possible dichotomies. In the puzzle at 57:25, he tells us that the break point constraint is 2, but we have 3 points.

Now freeze the image at 58:57 and watch it =) The constraint was 2 points, correct? This means we CANNOT get all 2^2 = 4 possible different dichotomies on any pair of points. Look at the columns x2, x3. There are (0,0), (0,1), (1,0), (1,1), so there are four of them! But this wasn't allowed! So we must discard the (1,1) case. Next look at 59:58. There are the cases (0,0), (0,0), (0,1), (1,0), (1,1). Again we have four DIFFERENT dichotomies there, so the constraint is violated. Lastly look at 59:31, columns x1, x3.

There are (0, 0), (0, 1), (0, 0), (1, 0), (1, 1). Again the constraint is violated, so we must discard the dichotomy. I suggest you look at the example again when Mostafa explains it. The key thing is (as I understood), the break point will discard MOST of ALL POSSIBLE DICHOTOMIES, which will make the growth function polynomial in N, and therefore we get a good bound for the Hoeffding inequality and we can generalize, that is, learn =) I hope I helped you… hope I didn't confuse you even more x)
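The puzzle's answer can also be checked mechanically. This sketch (mine, not from the lecture) searches the subsets of the 8 possible rows on 3 points for the largest table in which no pair of points is shattered, i.e. break point 2:

```python
from itertools import combinations, product

ROWS = list(product((0, 1), repeat=3))  # all 2^3 = 8 rows on points x1, x2, x3

def break_point_2_ok(table):
    """True if no pair of columns exhibits all four patterns 00, 01, 10, 11."""
    return all(
        len({(row[i], row[j]) for row in table}) < 4
        for i, j in combinations(range(3), 2)
    )

# search from the largest candidate size downward
largest = next(
    k for k in range(8, 0, -1)
    if any(break_point_2_ok(t) for t in combinations(ROWS, k))
)
print(largest)  # 4, matching the lecture's answer
```

One maximal table is {000, 001, 010, 100}, the same one built row by row in the lecture; the exhaustive search confirms no fifth row can be added without shattering some pair.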

Hi, if you look at the page again, I tried to explain the puzzle for user montintinmontintin in my comments (the last ones). Hope it helps =)

Sorry, I made a mistake in one of my comments. It was: "The puzzle is about how the break point restricts the number of all possible INPUTS." It should be: "The puzzle is about how the break point restricts the number of all possible DICHOTOMIES."

Thanks for your explanations. Helpful.

No probs =) Glad if it helped

Wouldn't another solution for the puzzle be:

0 0 0

0 0 1

0 1 1

1 1 1

This was a great lecture 🙂 A nontrivial piece of the theory presented clearly and concisely, kudos for that 🙂

Yes

How do we decide on N or the number of points to be taken for calculating dichotomies?

The growth function for the perceptron is N*(N-1)+2
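That formula, N² − N + 2, is the 2-D perceptron growth function for points in general position. As a rough sanity check (my own sketch, not course code), one can sample many random lines and count the distinct sign patterns they produce on the four corners of the unit square; the only unreachable patterns are the two XOR colorings, so the count is 2⁴ − 2 = 14 = 4² − 4 + 2:

```python
import random

def sampled_perceptron_dichotomies(points, trials=100_000, seed=0):
    """Collect the distinct sign patterns sign(w1*x + w2*y + b) produced
    by randomly sampled 2-D perceptrons on the given points."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        w1, w2, b = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        seen.add(tuple(1 if w1 * x + w2 * y + b > 0 else -1 for x, y in points))
    return seen

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
n = len(square)
print(len(sampled_perceptron_dichotomies(square)), n * n - n + 2)
```

Sampling can only miss separable patterns, never produce non-separable ones, so the sampled count is a lower bound that happens to reach the formula here.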

In slide 7: why at most 2 to the power N?!!

Thank you Caltech, and Prof. Yaser Abu-Mostafa in particular. It's good to be here.

Really great lecture.

Wow, so refreshing; I just feel intelligent after carefully listening to lectures 1 to 5.

Till now I have not seen machine learning lectures better than these.

In fact he is very good at choosing the words

Finally, a tie and a shirt that match the suit !

I am not a native English speaker and can say that the lecture is hard but very useful.

It's called a "growth function" because the maximum number of dichotomies (in the case of 2 classes only, or as we call it, binary classification) is 2 to the Nth (where N is the number of input points), and 2 to the Nth is an exponential function, and it is known that exponential functions grow very quickly (with each small change in x, a vast change in y happens), so that is where the term "growth function" comes from.

I did not understand the break point concept. It is confusing.

He's giving solid answers in the Q&A-session. Nice job!

I finally understood concepts that were "taught" in my course.

Prof Abu-Mustafa – you're a star!

I understood the growth function upper limit (2^N)… but I have some doubt about how to define the growth function (the maximum number of possible dichotomies) for the perceptron when N = 4. Why is it not 2^4, if I must get the growth function over all possible configurations (the max)? Is there some link where I can read something about it?

I have a question. For positive rays, why is the break point 2? If you consider positive rays on 2 points, we can get 3 dichotomies.

You are an honor for Egypt.

Related to positive rays: won't that growth function be equal to 2(N+1)?

The question asked about the binary hypothesis used is an interesting one…

I don't understand the puzzle at the end even though it was explained twice. OK, so for x1, x2, x3, if the break point is 2, it means that we will not be able to get all dichotomies for the given x1, x2, x3, right? So why does a row like x1 (white), x2 (black), x3 (black) fail? For this row, why can't we get all dichotomies? Or why doesn't a row like x1 (white), x2 (white), x3 (white) fail?

I think I have some trouble understanding the concept of the puzzle itself, so would be great if someone could help me.

What are the prerequisites of this course? I'm starting to think I'm ill-equipped to understand what he is saying.

I have a problem with the definition of break point.

Finally got it… thanks, professor.

He is really awesome; I totally get the concept. Is there any way to get more examples that I can work through by myself for more understanding? Understanding a concept is one thing, and applying it is something else.

27:23. How do we know for certain the number of dichotomies is at most 2^N? I think it's because we have 2 outputs (hence the 2) and we have N inputs, hence we raise it to the power of N? That's the number of output combinations we could possibly have.

Can someone recommend a source to quickly go through all the probability concepts used in this course?

I didn't understand the puzzle at all. Can anyone help me get what's going on? Please!!!

Considering the two-dimensional input space, if the data set size is 2 (i.e. 2 data points), doesn't the perceptron behave like a positive ray? In that way the break point for the 2D perceptron would be 2 rather than 4. But this is obviously wrong, as for 3 data points all 8 possible dichotomies exist, thereby breaking the rule "if there is a break point for a smaller data set, the break point is still there for the bigger data set". What am I missing here?

OMG … You are awesome… Such a great lecture… Thank you Caltech and Prof. Yaser Abu-Mostafa..

Could someone explain what merging P(x) and P(y|x) means?

What does P(x,y) try to capture?

What do you mean by the negative exponential "completely killing off" the Hoeffding inequality's RHS?

This prof is the best in ML and data science. I have watched all the touted lectures from fancy schools; none matches this.

Wow! This is the first time that I see tough concepts like the VC dimension exposed to undergraduates in such a comprehensible way. Thanks so much, Prof. Abu-Mostafa! BTW, to understand the puzzle at the end of class, I strongly suggest you watch it twice. Then you will really appreciate what Prof. Abu-Mostafa is trying to explain. His key idea was to show how the existence of a finite break point brings the growth function (which could have been exponential, 2^N) down to polynomial order.

Dear Prof. Abu-Mostafa, thank you. Thank you for such a clear and lucid explanation of these abstractions, so trumped up in other texts. Bless you, sir, for finally enabling me to really understand what I had simply given up hope of ever understanding.

"It will kill the heck out of any polynomial…"

Quips such as these often make me grin. Combined with how neatly I understand what you're saying, viewing these videos is simply a pleasurable experience.

By far the best machine learning course I've ever seen, kudos to Prof. Yaser Abu-Mostafa.

Tell me if I am right so far:

1. One dichotomy maps to one or more hypotheses; or equivalently, several hypotheses are mapped to one dichotomy. That is, a configuration of size N has several equivalent solutions. Thus, for the maximum size of the set of dichotomies m, we can say that m <= M.

2. In the binary case, the number of dichotomies for a particular sample of size N and particular configuration in the sample space of dimension D is twice the number of different ways that you can dissect the sample space with a hyperplane of dimension D-1, producing a unique pair of sets of labeled elements (either +1 or -1).

3. If xi is the coordinate of the i-th element of the sample in the space X of dimension D, then h(xi) is either +1 or -1 according to the sample configuration and hypothesis tested. Therefore h(x), where x is a vector of xi elements and length N, is an output vector of +-1. Thus, the maximum number of possible different h(x) is 2^N. So, for the perceptron (a root node of a classification tree), the growth function is m = 2^N.

4.

"We are in business!!" Love this expression!

It clearly shows from his teaching that Prof. Abu Mostafa is in love with the subject. The way he teaches it, having fun in the process lets the student have fun too and in turn fall in love with the subject !

57:25 We have only 3 (actually 2) points.

In case you need more comprehensive lecture slides with extra examples and supplementary materials, you can have a look at the corresponding ML foundations course by Prof. Hsuan-Tien Lin (co-author of the book Learning from Data by Abu-Mostafa) at NTU. Link: https://www.csie.ntu.edu.tw/~htlin/mooc/ It has the same slides but with more information.

Best lectures in the Milky Way Galaxy…! Love Him.

In case anyone wonders why there are a lot of different Hoeffding inequalities – I think it depends on the bound on Xi.

Theorem 1 of Wassily Hoeffding's paper states that for 0 <= Xi <= 1, P[Xbar_n - E[Xbar_n] >= eps] <= exp(-2n*eps^2). This is multiplied by 2 if we are using the absolute value. Theorem 2 of the paper, which is for bounded random variables a <= Xi <= b, gives P[Xbar_n - E[Xbar_n] >= eps] <= exp(-2n*eps^2 / (b - a)^2).

All of Yaser's ideas also follow in the more general case, without the limitation that your Xi's are bounded between 0 and 1.

You can check the original paper here: https://www.csee.umbc.edu/~lomonaco/f08/643/hwk643/Hoeffding.pdf
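For the version used in the lecture, P[|nu − mu| > eps] <= 2·exp(−2·eps²·N), a quick coin-flip simulation (my own sketch, not from the paper or the course) shows the bound is honored, if loose:

```python
import math
import random

def violation_rate(mu=0.5, n=100, eps=0.1, experiments=10_000, seed=1):
    """Fraction of experiments in which the sample frequency nu of heads
    deviates from the coin's bias mu by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(experiments):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / experiments

rate = violation_rate()
bound = 2.0 * math.exp(-2.0 * 0.1 ** 2 * 100)
print(rate, "<=", round(bound, 4))  # empirical rate sits well below the bound
```

The gap between the empirical rate and the bound is exactly the looseness the professor alludes to when he says practical rules of thumb are needed on top of the theory.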

17:05 that's where E_out changed

If you have difficulties understanding the puzzle, keep in mind that there is a PARTICULAR constraint in that puzzle (the break point is 2); if there were no constraint, you could get all 2^3 = 8 rows, as in the previous example.

1:13:00 "for 2 hypotheses that have the same dichotomy, the in-sample error is the same"

Is it true in the case of MSE (mean squared error)? I think not.

It is true here because the selected metric for the in-sample error was the fraction of incorrect predictions.

Am I understanding this correctly?

Thank you
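That reading looks right to me. A tiny check with made-up numbers (mine, purely illustrative): two real-valued hypotheses with the same sign pattern (the same dichotomy) on the sample share the binary in-sample error, but not the mean squared error:

```python
def zero_one_error(predictions, labels):
    """Fraction of incorrect +-1 predictions."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

def mean_squared_error(scores, labels):
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

def sign(v):
    return 1 if v > 0 else -1

xs = [-2.0, -0.5, 1.0, 3.0]
ys = [-1, -1, 1, 1]            # target labels on the sample

h1 = [0.4 * x for x in xs]     # two different real-valued hypotheses
h2 = [2.0 * x for x in xs]     # with the same dichotomy (sign pattern)
assert [sign(v) for v in h1] == [sign(v) for v in h2]

print(zero_one_error([sign(v) for v in h1], ys) ==
      zero_one_error([sign(v) for v in h2], ys))             # True: same 0-1 E_in
print(mean_squared_error(h1, ys) != mean_squared_error(h2, ys))  # True: MSE differs
```

So the professor's "same in-sample error" statement is tied to the 0-1 error measure; dichotomies only pin down the signs, not the real-valued scores an MSE would compare.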

You’re the best! You explain concepts in a clear way! Thanks, Dr. Mustafa