Measurement of Genetic Exposure

December 4, 2019 By Kody Olson


[Dr. Teri Manolio]
So, what we thought we’d
talk about next is kind of basic measurement approaches,
some of the technology and the ways that these things are
measured and reported, just to kind of get some
of the lingo together. We’re going to talk about
measuring genetic variation with a variety of different
measures listed here and in your handout, then a little
bit about linkage disequilibrium and why it’s important
in these measurements, and then shift a little bit to
talk about familial resemblance and family history. And I’m a big Gary Larson
fan, and this is, “Hey, what are you
looking at, buddy? You want trouble,
you found it.” And it’s — “Understanding
only German, Fritz was unaware the clouds
were becoming threatening,” as you can see. So, Tom has just thrown a fair
amount of terminology at you, and I’ll throw a little bit
more, and really a lot of the
challenges in communication
between epidemiologists and geneticists are merely
because of a little bit of language difficulty. So when we first started trying
to measure genetic variation there weren’t very good
measures of it. There were certain things that
were known to be genetic, and among them were
blood group markers, because they clearly clustered
in families and were inherited in families, or enzymes and
that, and one of the very first linkage studies (linkage
is looking for coinheritance in families of
a trait and a genetic marker) was this one from the fellow
that I actually trained with when I did my PhD, Alec Wilson,
looking at relationships between the catechol-O-methyltransferase
gene, COMT, which is a gene
related to adrenergic signaling, and
only 25 polymorphic marker systems, and they describe here
that they measured the COMT activity in five
large families. These were very large families
from Ohio — 518 individuals. And then they tested
associations with 25 genetic markers, including the
ABO, the Rh blood group, and then a variety of others,
and there were only 25 across the entire genome,
and found a LOD score, which at the time was thought
to be quite respectable, 1.27. LOD stands for
log of the odds score; we won’t go into it now, Tom’s gonna
do linkage a little bit later. But anyway, 1.27 with only
25 markers was actually pretty respectable. And a close estimated
recombination fraction, meaning that the marker and the
presumed trait locus were close together for this particular
enzyme here. So this actually worked,
which was exciting. Moving from that relatively
rapidly into about the 1980s or so, there were
restriction fragment length polymorphisms, and we were just
talking about that in response to the question, in terms of
bacterial endonucleases that actually chop a DNA sequence
at a certain point. So, they’ll sort of find a
string of DNA, you know, CCGAT, and wherever they see
a CCGAT, they chop the DNA, and that’s probably the way
bacteria insert things into their own and other bacteria’s
genomes that allows them to evolve. But for whatever reason they’re
there, and they do define polymorphic marker loci that can
be detected as differences in the length of DNA after
you digest the DNA with these endonucleases. So, depending on where it chops,
you may get a longer or a shorter piece
of DNA. And you can use that then to
establish linkage relationships and pedigrees. And this is an example of this
from one of the first papers to describe it. Assume you had — here’s
a string of DNA, actually two different strings
of DNA, and you have your two — this does not want to stay on —
your two places where endonuclease B can chop right
here, and here’s endonuclease A, and it may chop here and here
in this particular person, but that CCGAT may be here
and over here in this particular person. And so then when you go
to run these in a gel, you chop them up, you can label
them, and you see that this particular person, or this —
if these were both the chromosomes of one person,
they would have two different fragments here, suggesting that
they have a polymorphism there, whereas they have the same
site for endonuclease B, and you’d only see
one fragment. And this was the basis for RFLP-based
measurements of genetic variability, a very laborious,
very challenging process by which you had to find all of
these endonucleases and then actually chop up the
DNA with them. And what was foreseen then,
and certainly came to pass, was that since they’re being
used simply as genetic markers, any trait that segregates in a
pedigree could be mapped (segregates means that it’s inherited
in different ways, in the ways that Tom showed,
either dominant inheritance or recessive or whatever),
and that such a procedure would not require any knowledge
at all of the biochemical nature of the trait, or of the nature
of the alterations in the DNA responsible. All you’re doing is putting
little signposts in the DNA and then trying to find them,
you know, based on the size of the fragment that you’re
able to detect. And this was used very
successfully in a variety of traits. It was used to identify the
neurofibromatosis gene: Barker and colleagues in 1987
looked at 15 Utah kindreds and showed the gene responsible
for neurofibromatosis was located near the centromere,
near the middle part of chromosome 17, the part that
attaches to the spindle apparatus, and allows the
chromosomes to separate during cell division. And this is an example of this,
it’s kind of a nice one, but you’d need many, many, many,
many of these in order to come up with
a LOD score. This is a family where mom has
two chromosomes that have the same polymorphism,
dad has one of each, and all of
the kids are affected, and so is dad. And basically
you can see, here’s one band here at the 2.4, and a second
band in everybody but the person with the 1-1 variants,
who was not affected. So this is just demonstrating
co-segregation of the disease with the A2, the 1.9 kilobase
allele, and not with the A1 allele in each of four
affected offspring. And as I say, you need to do
this in many, many people in order to be confident with
it — confident in it. So those were RFLPs,
they were cumbersome, difficult to work with,
and there weren’t very many of them across
the genome. Tom mentioned variable numbers
of tandem repeats, minisatellites and microsatellites.
I haven’t been able to find anyone who can
explain why these are called satellites, but regardless,
they’re repetitions in tandem of a short, maybe six to 100
base pair motif that spans about half a kb
to several kbs. And this really opened the
way to DNA fingerprinting. This is still used in forensic
sciences for identification; actually the
Genome Institute and NCBI were involved in identifying the
remains from the 9/11 disasters, and in Katrina
and in other things. And these are still used
in forensic databases. They provided the first highly
polymorphic multiallelic markers for linkage studies,
and were associated with many interesting features of human
genome biology and evolution. There are a lot of these
across the genome. They’re sort of curiosities
at this point, but one that’s quite well known,
I think, to cardiovascular epidemiologists is the 5 kb
kringle 4 repeat. Kringle is actually a name given
to a particular region of a protein that kind
of goes in loops, and it looks like a Danish
pastry that’s called a kringle; you may see them around
here in the morning. But at any rate, kringles are found in the
apolipoprotein(a) protein, and in plasminogen,
so these are common in cardiovascular epidemiology. And this is just an
example of that. Here’s the gene for apo(a), the cDNA;
this is the complementary DNA. What you do is you find the
RNA; it’s very easy to pick up an RNA with that
poly-A tail that Tom mentioned. If you have a column of
basically Ts that the A binds to, you just run your
mix of things that might include a messenger RNA
down that column, and the As will stick to the Ts,
and you can pull them out. This is probably how the
cell does it, too, just in a much more
elegant way. But then you can make a
complement to that RNA, which is much more stable,
and that’s called the cDNA, and it’s just a way of
looking at structure. So here’s a kringle 4 repeat,
and there could be anywhere from one to 37 of
these in humans. There’s also a kringle five
region that tends not to be repeated. And what’s shown very nicely
here is just kind of a run in order — gels from a variety
of people that have either a 12-repeat — here’s the 12 here,
and they also have in the other allele 24 — and there it is
there, or 13 and the 25, I think, 14 and on up, and you
can just see them kind of laddering up here, which I
thought was a nifty picture, so that’s what that
looks like. And this just shows that the
molecular weight of this, which is related to the number
of kringle repeats, as you can see here — here’s
the number of kringle repeats; it’s a ratio, actually, to
that single kringle 5 repeat, and the molecular weight goes
up, and as molecular weight goes up, the lipoprotein(a)
levels go down. Lp(a) has been associated
with coronary disease; it’s not entirely clear how that
association works or why, but at any rate, this was
a nice example of it. Also called, sometimes,
variable number of tandem repeats are microsatellites,
which are much shorter. They’re two to six
base pair motifs; most of them actually are di-,
tri- or tetranucleotides, so 2, 3, or 4 bases repeated anywhere
from 20 to 50 times. And these are highly polymorphic
in a population. They were extremely useful for
mapping and linkage studies in families, and you may
be familiar with the Marshfield Clinic, which produced
the Marshfield map. There were similar maps,
the deCODE map and a number of others. They placed about 400 of these
microsatellites across the genome, and provided the primers
so that you could, you know, test these in your own studies,
and these could be highly automated. So the National Heart, Lung,
and Blood Institute and the Center for Inherited
Disease Research at NIH both funded very, very
large linkage studies — not only in humans, but there
was a dog map and a couple of other animal
maps as well. And this was used in great
abundance up until about probably five years
ago or so. In fact, CIDR retired its
microsatellite pattern just last year, and there was,
you know, sort of sighs of relief or sighs of sadness,
depending on how you look at it. So these were used for linkage
studies, and they produced things that look — graphs that
looked a lot like this. This is from my former
colleague Dan Levy at the Framingham study,
where basically one had one of these markers maybe
every 10 megabases or so, if you had 400 of them
across the genome, so every 10 million bases. And you really didn’t need them
any more frequently than that, because studying families,
particularly smaller families that are closely related,
you don’t get any additional information in this
interval, because families share such large pieces of their
chromosomes, essentially. So once you’ve put in 400
markers, you really don’t get much more independent
information from 800 markers or 1200
or 1600 across the genome.
When you wanted to look at a specific region,
then you well might, particularly in
unrelated people. But anyway, this is what
microsatellites did for us. But unfortunately, a lot of
these really didn’t turn out to come up with much
in the way of genes, and some new tricks
were needed. “High above the hushed crowd,
Rex tried to remain focused; he couldn’t shake one nagging
thought: he was an old dog, and this was a new trick,”
so it was time for some new tricks in
this field. And the new trick,
as Tom mentioned, was single nucleotide
polymorphisms. These had been identified and
sort of discovered along the way that other polymorphisms
had been identified. And they were thought not
to be terribly useful, because the dogma had been
you needed something that was highly polymorphic
in a population, meaning that most people in this
room would have two different copies,
and those two different copies
would be likely to be different from the person sitting next to
them, and different from the person sitting next to them,
so that there was lots of variability in that. Well, with the SNP, most
SNPs are bi-allelic, so there are only
two possibilities: it’s only an A or a T,
as you can see, here C or an A,
or C or a T here. Most of the rest of the genome,
99.9 percent of it is the same, but in just these couple of
spots you have a little bit of a difference, a single
base pair spelling change in your DNA. And how could that possibly,
you know, tell you much of anything unless you measured
thousands of them, people said, or maybe tens of thousands
or hundreds of thousands, and the technology was not
available at the time these were first identified
to be able to do that. Of course the technology has
caught up, and actually far surpassed our abilities
to understand it. But now we have the technology
to be able to measure these and analyze them. What was needed was some way
of mapping the relationships among these, so the linkage maps
that Marshfield and deCODE and others put together they
were able to put together because they had large families
that they could follow, they could genotype,
and look at segregation of their markers throughout
those families. With these markers, families
really wouldn’t help you, because so often
they would be shared among family members. You really needed to look
across unrelated people. So, just to give you an idea
of what this looks like, here’s sort of a
generic chromosome, and here’s, like, a segment
of it that contains a gene. And your generic gene has these
red things that are exons, and then there’s maybe
some SNPs in the exons, and there may also be some SNPs
in the introns in between, or in their promoter or
untranslated regions on either end. Usually, there are more SNPs,
as we mentioned, in those regions than
there are in the exons, because they tend not to be as
well tolerated through natural selection in
the exons. And then there are these sort
of patterns of association among these. And these triangles tend
to throw people. I know when I first
saw them I was like, “What in the world
are these things?” You see them a lot in
diagrams; here are some stretches of DNA,
here are the genes, and then you see
these triangles, and they’re labeled with various
numbers and that sort of thing. And really, we’ve all been
looking at these for a very long time, we just
didn’t realize it, these are essentially
correlation matrices, and if you’ve ever gotten,
you know, maps or tables from the AAA, you ask, you know,
sort of, “How far is it from Boston to Providence?” “It’s 59 miles. From Boston to
New York it’s 210, Boston to Philadelphia,
et cetera.” Well if you were, say,
to instead of putting these numbers in maybe you
color-code them so that the cities that were close
together were dark red, and the cities that were far
apart were bright white, you could color-code them like
this, turn them on their side, make them into squares, and
there’s your linkage diagram. So all of this is —
sorry, your LD diagram. So all this is is a relationship
among various SNPs. And when you see these,
you know, don’t let them throw you. It’s really just Boston to
Providence when it’s nice and dark red
like that. So what that meant then is that
one tag SNP can serve as the proxy for many, many SNPs,
and so you have these stretches of — here are,
you know, two chromosomes in one person and two in another
and two in another. And you can see that these
white places are where everybody is the same,
and then there are some polymorphisms here. And for instance, here’s this
SNP 3, which is actually, you know, very closely
related to SNP 4. Every place that you have a G in
SNP 3, you have an A in SNP 4. Every place you have
a C in SNP 3, you have a
G in SNP 4. And likewise, or in contrast,
in SNP 5, sometimes when you have an A in SNP 4,
you’ve got a G in SNP 5. Sometimes when you have an A —
I’m sorry. This — these are perfectly
well correlated as well. So these are a block,
and so SNP 2 and SNP 1 — and so these form
a linkage block. And so this is a little
hard to see — yeah, so here you have an A,
and there’s a G here. Sometimes you have a G
and there’s a G here. So knowing SNP 4 doesn’t tell
you a lot about SNP 5. But looking at SNP 5 and SNP 6,
they actually are very closely correlated, as is
SNP 6 and SNP 7. And they form
another block. So these are just linkage blocks
of SNPs that travel together, and could be measured together.
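
The idea of SNPs that travel together can be made concrete with a little arithmetic. The sketch below is illustrative only: the eight haplotypes are made up to mimic SNP 3 through SNP 7 on the slide, the function names `ld_stats` and `ld_blocks` are hypothetical, and the rule of joining neighbouring SNPs whenever r² is at least 0.8 is a simplifying assumption, not a published block definition. It uses the textbook forms of D, D′ and r², the measures the talk turns to shortly.

```python
# Hypothetical phased haplotypes: one string per chromosome, columns = SNP3..SNP7.
HAPLOTYPES = [
    "GAGCT", "GAATC", "CGGCT", "CGATC",
    "GAGCT", "CGGCT", "GAATC", "CGATC",
]
SNP_NAMES = ["SNP3", "SNP4", "SNP5", "SNP6", "SNP7"]


def ld_stats(haps, i, j):
    """Return (D, D', r^2) for the biallelic SNPs in columns i and j."""
    n = len(haps)
    a, b = haps[0][i], haps[0][j]          # arbitrary reference alleles
    p_a = sum(h[i] == a for h in haps) / n
    p_b = sum(h[j] == b for h in haps) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haps) / n
    d = p_ab - p_a * p_b                   # observed minus expected haplotype frequency
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def ld_blocks(haps, names, threshold=0.8):
    """Toy rule: extend the current block while adjacent r^2 >= threshold."""
    blocks = [[names[0]]]
    for k in range(1, len(names)):
        r2 = ld_stats(haps, k - 1, k)[2]
        if r2 >= threshold:
            blocks[-1].append(names[k])
        else:
            blocks.append([names[k]])
    return blocks


if __name__ == "__main__":
    for k in range(1, len(SNP_NAMES)):
        d, d_prime, r2 = ld_stats(HAPLOTYPES, k - 1, k)
        print(SNP_NAMES[k - 1], SNP_NAMES[k], f"D'={d_prime:.2f}", f"r2={r2:.2f}")
    print(ld_blocks(HAPLOTYPES, SNP_NAMES))
```

On this made-up data, SNP 3 and SNP 4 come out perfectly correlated, SNP 4 and SNP 5 are uncorrelated, and the toy rule returns two blocks.
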
Sometimes you have one that’s just kind of
out there by itself. And so, taking away the
intervening sequence that doesn’t contribute a whole
lot of information, you can just pick
one of these SNPs, and you’d get all of the
information that was in between, so you
just pick one. I pick the one with
the prettiest color, but you could pick whichever
one you want. And, similarly, you could
just pick one here, and you’d still get all of
this information intervening, and you can kind of stick those
together and the sequence of those, what are called tag SNPs
because they tag that whole area, is also known
as a haplotype. And maybe 35 percent
of your population has this particular haplotype and
30 percent has that and 10 percent has this one,
et cetera.
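
As a rough illustration of the haplotype frequencies just described, the sketch below reads off the alleles at two tag SNPs and counts how often each combination occurs. The 20 chromosomes, the choice of tag positions, and the name `haplotype_frequencies` are all hypothetical; the counts were simply chosen to reproduce the 35, 30 and 10 percent figures mentioned above.

```python
from collections import Counter

# Hypothetical phased chromosomes over five SNPs (made-up data).
CHROMOSOMES = ["GAGCT"] * 7 + ["CGATC"] * 6 + ["CGGCT"] * 5 + ["GAATC"] * 2


def haplotype_frequencies(chromosomes, tag_positions):
    """Return {tag haplotype: frequency}, most common first."""
    tags = ["".join(c[p] for p in tag_positions) for c in chromosomes]
    counts = Counter(tags)
    return {hap: n / len(tags) for hap, n in counts.most_common()}


if __name__ == "__main__":
    # One tag SNP from each block of this toy data: column 0 and column 2.
    for hap, freq in haplotype_frequencies(CHROMOSOMES, (0, 2)).items():
        print(hap, f"{freq:.0%}")
```

Two tag SNPs are enough here because, in this toy data, the remaining columns are perfectly correlated with one of the two.
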
You basically identify different sort of types
within a population, and then use those to look at
associations with various traits. So there are a number of ways
of sort of estimating the correlation between SNPs. The two most common are
D prime and R squared. Lewontin’s D is
shown here. It’s just the probability of the
two, say, ancestral alleles traveling together,
minus the probability of the variant allele
and the ancestral allele traveling together. So in order for the variants and
the ancestral to get hooked up together, you have to have a
recombination event there. And the further
apart, in general, that SNPs are, the more
likely there is to be a recombination event. So if this doesn’t happen
very often, D is very big. There’s a D prime, and I confess
I’ve forgotten what the max D is, but it’s just a way of
correcting D by a constant. But one of the problems with
this measure is that it tends to overestimate linkage
disequilibrium, particularly for rare alleles, because you’re
looking at the probability of a crossover event measured
across populations. If the alleles are very rare,
the probability is going to be low that there’s a crossover
event just because the alleles are rare. Whereas
just a simple correlation coefficient, R squared,
is actually a much better, more reliable measure,
and there’s a better discussion of this in
Devlin and Risch. So D prime varies from
zero to one: zero is they’re completely
in equilibrium, one is they’re in complete
disequilibrium, and when D prime is zero,
typing one SNP gives no information at all
about the other SNP. But as I mentioned,
it doesn’t account for allele frequencies,
and R squared is the preferred measure. So when R squared is 1.0,
two SNPs are in perfect LD, so every
time you see, you know, an A in one of them,
you see a G in
the other. And the allele frequencies
are identical for both SNPs, and typing one SNP provides
complete information on the other; that’s what you get
when you have an LD of 1.0. You might have an LD of,
you know, .98, and perhaps that’s because the
allele frequencies are not quite the same, but, you know,
for the most part, they travel together. So, what can
LD do for us? It’s actually very,
very useful. It can mess you up as well
as really being helpful, and in design, it’s used
to estimate the theoretical power to detect associations,
because if you knew the two SNPs were correlated
with an R squared of 1.0, you’d know that your power would
be the same measuring SNP A as measuring SNP B. If, on the other hand,
your R squared is only .5, your power is going to be
much less to detect an association with SNP A if
you’re measuring SNP B, because they’re not
well correlated, so you’re adding some
noise, essentially. And it does help you then
to evaluate the degree of completeness of your sampling,
and the choice of the most informative genetic
variants to genotype. And just note that sample size
increases by a factor of about one over R squared to achieve the same
power to detect an association with your SNP that is not quite
as tightly correlated as the one that you really
want to measure, what you hope would be
the disease-causing SNP. So I realize that went
by a little fast; any questions on
that LD concept? Okay, all right. So what you’ll often see, then,
in genome-wide association studies is basically a plot
across the, you know — one of the nice things about DNA
is that it’s a linear molecule, so you can just sort of line
up all the SNPs as they occur on the DNA, and what’s shown
here is for a group of British cases and controls with
coronary artery disease, and then German families
with coronary disease. You see the association
statistics here, and what’s generally
plotted on the Y axis is the minus log 10 of the p-value,
just because it makes it easy to sort of relate to them,
so a P of 10 to the minus second, or .01,
would be a 2 down here. A P of 10 to the minus 10th
would be a 10 up here, and these, as you can see,
are very strong associations, so 10 to the minus 16th,
14th, 16th, et cetera. And then you’ll see this
linkage block here, and remember this is just —
here’s a block, you know, Boston to Providence these
things are very close together, they travel together,
so that if you were to be looking at, say, these two SNPs,
they travel together, they’re not going to
give you much — too much independent
information. Up here, for example, these now
seem as though they may be in slightly different blocks,
and certainly these are in different blocks
from those. So if you were trying to pick
SNPs that you would then type in a follow-up study,
you might want to type those that are in different
LD blocks.
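
Two pieces of arithmetic from this stretch of the talk are easy to sketch: the minus log 10 scale used on the Y axis, and the rule of thumb that sample size grows by about 1/r² when you genotype a correlated proxy and not the causal SNP itself. The third function picks the strongest SNP from each LD block for follow-up. All function names, SNP identifiers, block labels and p-values below are hypothetical, chosen only for illustration.

```python
import math


def neg_log10_p(p):
    """Value plotted on the Y axis: p = 0.01 gives about 2, p = 1e-10 about 10."""
    return -math.log10(p)


def inflated_sample_size(n_direct, r2):
    """Approximate N needed via a proxy SNP with LD r2 to the causal SNP (N / r2)."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return math.ceil(n_direct / r2)


def top_snp_per_block(results):
    """results: iterable of (snp, block, p). Return {block: (snp, p)} with the smallest p."""
    best = {}
    for snp, block, p in results:
        if block not in best or p < best[block][1]:
            best[block] = (snp, p)
    return best


if __name__ == "__main__":
    hits = [("snpA", "block1", 1e-16), ("snpB", "block1", 1e-14),
            ("snpC", "block2", 1e-10), ("snpD", "block3", 1e-2)]
    for block, (snp, p) in top_snp_per_block(hits).items():
        print(block, snp, round(neg_log10_p(p), 1))
    print(inflated_sample_size(1000, 0.5))
```

With an r² of 0.5, a study that would need 1,000 people when typing the causal SNP directly needs roughly 2,000 when typing the proxy.
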
One of the neat things about genetics is that
it is constantly changing, and things that were, you know,
held to be God’s, you know, solid truth last year
are no longer. One of the things that was
widely known and widely taught was that recombination happens
at random across the genome, and there’s no rhyme
or reason to it; it’s a totally
random event. That is clearly not the case.
What happens here is you can see there’s been a recombination
event here, but this block tends to be pretty much intact,
as does that block. This is just shown in
this family study, and shown here in the HapMap
where there were many more SNPs typed, and many more
people examined. But what’s become very clear
is that there are hot spots of recombination, and so
recombination is not a random event. It actually happens in
particular regions much more often than in other
regions, and that really threw off people when they were
sort of trying to map genes and figure out where they
were located based on linkage information. This is another kind of similar
example of the kind of statistics that you get out
of these kinds of studies. So, again, plotted the minus
log 10 of the p-value. And in this particular region,
there are three genes: there’s the interleukin 12
receptor beta 2, the interleukin 23 receptor,
and then sort of a hypothetical protein. When they say hypothetical
protein what they mean is that there’s a region of the
genome that’s called an open reading frame, which could
be coding for protein. It basically doesn’t have
a stop codon for a while. And so, that’s a good thing,
and probably it codes for protein. Yes, sir.

[Male Speaker]
[Inaudible]

[Dr. Teri Manolio]
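
Going back for a moment to the hypothetical protein mentioned just before the question: an open reading frame can be found by scanning for a start codon and reading in threes until a stop codon turns up. The sketch below is a deliberately minimal version that looks only at the forward strand and assumes ATG as the start codon; the function name `longest_orf`, the `min_codons` cutoff and the example sequence are hypothetical, and real gene-prediction tools do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna, min_codons=3):
    """Return (start, end, sequence) of the longest ATG..stop ORF, or None.

    Only the three forward-strand reading frames are scanned; `end` is exclusive
    and the stop codon is included in the returned sequence.
    """
    dna = dna.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                long_enough = (end - start) // 3 >= min_codons
                if long_enough and (best is None or end - start > best[1] - best[0]):
                    best = (start, end, dna[start:end])
                start = None                   # look for the next start codon
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAACCCGGGTAACC"))
```

The longer a stretch runs without hitting a stop codon, the less likely it is to be chance, which is why a long open reading frame is taken as a hint that the region codes for protein.
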
Okay, so this was
the SNP typing done in this particular study here,
where they only typed a relatively small number
of SNPs in this region, so you notice that the
blocks are bigger. The same region was typed much
more densely in the HapMap. So, you know,
there’s like three million in the HapMap across the genome,
and here there were only probably 300,000 or so. And there are more people,
so you can still see sort of the same blocks. They’re not lined up very well,
and that was a mistake of the editors. But you still see sort of the
same blocks there, okay? Great, okay. So — yeah, so here you
have these three genes, this hypothetical protein
and then these other two, and here’s your association
signal, and you’re thinking, “Well, gee, it kind of looks
like it’s in that gene.” But if you then look
at the LD patterns, you can see that there
are your genes, and that there are actually two blocks
of linkage disequilibrium. They’re not real great, I mean,
they’re not real solid, but they’re certainly there,
and pretty obvious that, you know, it’s probably not this
gene that’s associated with this signal, nor this
hypothetical protein, but it’s probably something in
here, in these two LD blocks. So it can be very helpful for
sort of narrowing down an association region. And these are used — they’re
plotted in different ways. Sometimes you’ll see people plot
D prime against R squared, back in the earlier days,
you know, like way back in 2006, when people sort
of were used to the D prime measure, which is shown
here in blue, and weren’t as used to
the R squared measuring and didn’t like it because it
didn’t make as pretty pictures as the blue one. You sometimes see them
plotted together. This is TCF7L2, it’s the
strongest genome-wide association signal found for
type 2 diabetes to date. And the
gene is shown; this is the direction of transcription,
and then how the various SNPs are associated. And this is a similar sort of
plot of linkage disequilibrium now in the three populations
studied in the HapMap, and I’ll talk about the HapMap,
I think, a little bit more later. But what they did was to look
at the Yoruba people from Ibadan, Nigeria, which is a
population of African ancestry. African ancestry
populations are very old if you follow the Out-of-Africa
hypothesis, which is no longer a hypothesis, it’s really, you
know, pretty well established. Most human variation was
in Africa and remained there, and a small piece of it then
left and went into Europe, Asia, and colonized
the Americas. So the African populations,
recent African populations, tend to have less linkage
disequilibrium because they’re an older population;
there’s been more time for it to break up than in
younger populations. The CEU is the CEPH population
that’s a European ancestry group, and this is the
Han Chinese and Japanese from Tokyo, an Asian population,
and they also have had less time for their
LD to break up. And so you can see these
triangles are a little bit denser in these two populations
than they are in the Yoruba. And you see that over and
over again in populations of recent African ancestry. And we’ll show you in a bit
how useful that can be. So what was desired, then,
was to produce a HapMap, to do more efficient association
studies in unrelated people. We wanted to use just the
density of SNPs that you needed to find association
between SNPs and disease. So you don’t want to type
any more than you have to, but you don’t want to miss
any regions that have a disease association. And the goal was really to
produce a tool to assist in finding genes affecting
health and disease. Recognizing, as I just
mentioned, that ancestral populations differ in
their degree of LD, recent African ancestry
populations have shorter stretches of linkage
disequilibrium, so you need more SNPs for a
complete genome coverage in that group. SNPs were really a gateway then
to genome-wide association studies, and Tom has mentioned
those, and we’ll be talking about them a lot. In fact, a lot of the
perspective that you’re getting from Tom and me comes
from the fact that genome-wide association is sort
of all the rage, and it’s all the rage because
it’s working where, you know, many of the previous methods
of interrogating the genome didn’t work, in terms of
identifying genetic variants, likely because, particularly
for complex diseases, you were dealing with genes
of very small effect, whereas linkage studies worked
great for Mendelian diseases where the genes are
of very large effect. So SNPs are much more
numerous than the other kinds
of markers that I mentioned; they’re much easier
to assay. Genome-wide studies attempt to
capture the majority of the genomic variation, which is
10 million common SNPs: SNPs that are present in about
5 percent or greater of the population. And this variation is
inherited in groups, as I mentioned, so you
don’t have to test all 10 million points. And the blocks are shorter,
as I mentioned, so you need to test more points the less
closely people are related. And now we can do studies
with hundreds of thousands of markers. And this was then the impetus
for developing the HapMap. This was published in
“Nature” in 2005, but the data actually were made
available almost as they were produced, as soon as they were
QC’d they were made available through the HapMap Web site,
and basically were used for many, many genomic discoveries,
including the TCF7L2 example that I showed you. The more expansive and expanded
HapMap was published in 2007, last year, with over
3.1 million SNPs. These, again, are the common
SNPs that were identified and put into
linkage patterns. At the same time, and perhaps
stimulated by the HapMap, genomic technology
improved dramatically. So this is a slide I borrowed
from my colleague Stephen Chanock
at the NCI. Back in 2001, we thought we
were driving a really hard bargain if we could get a single
SNP genotype for about a dollar. So here’s, you know, 10 to
the second cost per genotype in cents, in American cents. So back in 2001, with the TaqMan
assay, which was sort of the gold standard at the time,
a dollar a genotype was really good, and
at the NIH, you know, people wanted
three and four dollars, because they weren’t using
efficient platforms. It was one of the reasons that
we produced some of the large-scale genotyping services
that we did, because they could be done much more
efficiently. And over time, these
costs came down. These are the various platforms
and the various producers. And you’ll notice also that
the numbers of SNPs that were genotyped went up,
and in fact the flexibility of the platforms went
down a little bit, too, because you basically had to
buy into whatever 10,000 SNP platform a particular
manufacturer was providing, or 100,000 SNP
or whatever. Early on when these things were
expensive, people didn’t want to measure 100,000. They just wanted to measure
10 or five, or maybe 50. But over time, this sort
of paradigm has shifted. And the cost has continued
to come down. I haven’t updated this slide
in a very long time, but believe me, it continues
to look like this. The million-SNP chip was
introduced by both of these companies about
6-8 months ago or so, and the costs of those are
running in or around the $500 to $600 range now,
so truly, you know, dramatically increased capacity
and decreased cost. So what that means is that in
2001, if you wanted to type all 10 million SNPs,
which is what you’d have to do, since you didn’t
have linkage disequilibrium patterns, at a dollar a SNP,
a 2,000-person study would cost about $20 billion, roughly the budget of the entire National
Institutes of Health, which wasn’t likely to happen.
In 2008, we can type about
a million SNPs at a cost of about .05 cents per SNP, so that same
2,000-person study runs about a million dollars. So about $500 per person
for a million-SNP chip. And really, these are, you know,
still a good piece of change, but it’s manageable, whereas
before it really was not. This is just a — sort of an
overview of the coverage of the various, more
recent platforms. The Affymetrix GeneChip 500K
was used for the Wellcome Trust Case-Control Consortium
that we’ll talk about, I think, at some length. And in several of the other
studies that were reporting out in early 2007, it gave
relatively poor coverage at an r-squared of 0.8; that’s
the question of what proportion of the SNPs in the
genome you are covering at an r-squared of
0.8 or better, and in the Yoruba
it was only 46 percent. In the European population and
the Asian population it was a little better. On the SNP Array 6.0
(and I left out 5.0, sorry) these numbers
are much, much better, and on the Illumina platforms,
similarly, these numbers went up and up. On
the Perlegen 600K, about these kinds
of numbers. So we’re getting very,
very good coverage now, and it’s only continuing
to improve. Something just to be aware
of is that the polymorphism literature can be a little
bit difficult to follow, because sometimes the
polymorphism is named for the amino acid change,
the angiotensinogen gene M235T is the methionine
to threonine, I believe. Or for the nucleotide change:
here, and I forgot what this is, the angiotensin
receptor, I believe, this is a nucleotide change,
so it’s an A to C change in the cDNA, the complementary DNA we
talked about, at position 1166. It could be in the promoter
region; this is a minus-six. Usually, when you’re
numbering promoters, it starts upstream of
the initiation site, so it has a negative sign.
It could be for a restriction enzyme site; these are
various restriction enzymes that cut the DNA in
different places. They could be for the gene
product, such as APOE e2. This is a particular protein
that’s produced by the APOE gene. There are a number
of legacy systems, particularly for the major
histocompatibility complex — the immune system that’s used
for typing for bone marrow donation and that
sort of thing. And that has — it’s a very,
very, very, very polymorphic locus, and it has a legacy
system of naming that goes way back. So it could be from the
reference SNP numbers. These are from dbSNP that
Tom mentioned to you. Reference SNP is sort of
the consensus sequence; the submitted SNP is what’s
submitted by, you know, whoever submits something to
dbSNP, says, “We found a new SNP, here it is, and here’s
our SS number.” And as Tom mentioned,
good sources for this information are OMIM, HUGO,
and the UCSC Genome Browser actually is
a neat one. If you haven’t looked at it,
it’s — we won’t show it to you here, but you can Google
UCSC Genome Browser; that’s how I found
most things genomic. And if you put in a
gene (I tend to remember APOE because it’s
cardiovascular) and just ask it to, you know,
show me the segment of the genome around APOE,
it will show you all of the SNPs
in the region. It will show you the
conservation in various different species,
and a whole bunch of other things, so it’s really
pretty cool. I don’t have time to go into
other genomic technologies. One to be aware of that’s sort
of coming on the horizon and will probably drive genome-wide
association out of business is sequencing, which is to measure
variation at every point in every gene or candidate region
in dozens to hundreds of people to find all of the
functional variants. That’s the way that
it’s used now. We anticipate that within
probably not too many years, the thousand-dollar genome,
as it’s been called, will be a reality, which means
we can sequence an individual’s genome for about $1,000. Remember that the first genome
project probably cost about $2.5 billion, so that’s
several orders of magnitude improvement in cost. And those costs are coming down,
you know, day by day. Gene expression is measuring
changes in messenger RNA, which is the transcription
part, in cases and controls, or in response to stimulation,
and you’ll see some expression studies. Epigenetics measures
changes on top of the DNA, that’s what the epi part means,
that either turn the DNA on or off, or at least make
it more or less available for transcription. So depending on how DNA
is methylated, the RNA polymerase
may not recognize a site as a transcription start site,
and may kind of skip over it and then not transcribe that,
or the DNA may be wrapped around histones, which are the
proteins that kind of bind it up into chromatin. And it may be wrapped so tightly
or in such a way that it’s not accessible to unwinding
to then be transcribed. That’s what histone
deacetylation does, that can turn genes
on and off. So we’re not going to talk
about those very much. So let’s just pause for
a breath for a second, and this is, “Gee, I never
realized we’d have to know so much geography.” And you may not have realized
you’d have to know quite so much molecular biology,
but that’s probably, you know, the most, at least of
genetic structure and function, that we’ll need
to know. So just to summarize on
genotyping points before I get to familial information,
there’s been unbelievably rapid progress from a small
number of blood group markers to more than 10 million
SNPs, CNVs, structural variants, sequence variants, and the
technology’s continuing to change; it’s one of the
challenging things about this field. I haven’t talked at all about
copy number variants. They’re sort of the latest,
greatest new thing, and they are — they basically
are being typed through SNPs, so I won’t go into them much,
but we can talk about them if you like. And as I mentioned, there’s more
to come in lecture four on genome-wide association studies. Quality control is a major
issue, and we’ll be talking about that
as well. But I did want to talk a little
bit about familial resemblance. This may be a group
of gentlemen — whoops, no video signal,
that’s not good. So — familial relationships,
okay. Basically, there are a couple
of ways of looking — I’m gonna touch it anyway,
let’s see. Come on, touch screen
to enlarge, yes. So traits being more similar
among related than among unrelated persons, that makes sense,
that would be resemblance, and clustering is often measured
as the risk of disease in the relative of somebody who
has it being greater than the risk in the relative of somebody who
doesn’t have it, or in people in the
general population. This has been called the
sibling-relative risk. I like to call it the
relative-relative risk, or Risch’s lambda sub S,
it’s also referred to. One can also look at
distributions of a continuous trait. This doesn’t have to be
in related individuals; it’s also called
mixtures of distributions, or commingling analysis,
where, say, you find two or three means in a population,
so instead of a nice mean distribution, you see, like,
a big group and then a smaller group and then a
smaller group. That suggests that maybe there’s
a major gene that’s producing each of those three. You don’t often see those kinds
of things, and when you do, they’re not necessarily
related to genetics, but in cholesterol measures,
for example, people with heterozygous familial
hypercholesterolemia will give you a bump in kind of
the middle of the distribution, with a long tail. And then those who have the
homozygous state will be way, way out here, but a little
bump in that, too. So that’s another way
of looking at them. This is an example of relative
risk; this is sibling-relative risk, and it’s actually a
good — a risk of a good thing, living to age 90
at various ages, depending on whether you had a
sibling that was a centenarian or a sibling who had died at
age 73. Shown here, in people who were age 64 who
had a centenarian as a sibling, there was really not any greater
chance that they would live to be age 90, but as they got
older, there was much greater risk, and particularly when
they got up into their 80s, they were much more likely to
make it to age 90 if they had had somebody before them to
whom they were related who had made it
to age 100. So that’s a nice example
of a relative risk. You can also find these with
larger families; it’s easier to at least assess
relative risks in larger families. This is a group — the group
in Iceland is blessed by having a relatively small
country that has not had a lot of in-migration and
out-migration and does have a total national obsession with
genealogy, so they absolutely love genealogies; they can all
trace their ancestry back to the — like the
10th century or so. When they meet each
other, they say, “Oh, I knew your
grandmother. She was my uncle’s, you know,
school teacher,” so anyway. And this is a representative —
truly representative pedigree of people with atrial
fibrillation here, going as you can see here,
six generations with the various affected
individuals shown. And this allowed us to then
look at the risk ratio — these were basically prevalence
ratios of atrial fibrillation in first-degree relatives,
in second-degree relatives, third, fourth, and fifth,
and you notice that this kind of decreases, almost
halving each time, exponentially, which is very consistent with
the inheritance of a major gene. And in fact, Arnar and others
then published the genome-wide association study of atrial
fibrillation just last year and showed that they found
a genetic variant related to this. So sibling-relative risks are
one way of looking at these for discrete traits. For continuous traits,
you can look at correlations among relatives. This is when Garrod was looking
at — Garrod Archibald, I think, one of the earliest
geneticists — looked at relationships among relatives;
he studied height and showed that basically an offspring’s
height is the midpoint of the two parents’ heights,
and one can regress that, basically. So you can regress one
relative’s value on the other in just a simple
regression analysis, shown here. The height of the offspring is
the mid-parent mean times a beta coefficient,
plus a population mean. And then twice this parent-offspring
correlation is an estimate of heritability,
or the proportion of variance in the entire population that’s
explained by, presumably, genetics; probably some shared
family environment as well. If the trait is under genetic
control, you expect the correlations among closer
relatives to be greater than those among
distant relatives. And here are some familial
correlations, after Wendy Post et al.,
in hypertension. Spouse correlations are often
used as sort of a control for familial correlations. If there’s a high
spouse correlation, we generally assume in the
U.S. spouses are unrelated, and so that suggests that
shared environment may be more important
than genetics. But in 855 pairs,
the correlation between spouses was .05;
the expected would be zero. Parent/offspring pairs,
it was .15; the expected, if it was a single
gene that was causing this, would be .5, because parents
and offspring share half — exactly half their genes. Siblings share, on average,
half their genes — their variants, and the
correlations here were similar, suggesting that there
might be some environmental factors as well that are
bringing this down. And avuncular pairs, which are
niece-uncle, nephew-aunt, et cetera, were smaller than
that, and that would be expected as well. So this is suggestive,
it’s not real strong, but it’s some suggested
familial correlations for a continuous trait. And as I mentioned, assessing
the familial and genetic nature is generally done by
looking at heritability. It’s often designated as either
a capital H or an h squared, or sometimes sigma squared G
over sigma squared P, which is the proportion of
the phenotypic variance, P, explained by the genotypic,
or genetic, variance, G. And I just reiterate here that
it’s both a population- and an environment-specific parameter,
so it changes from population to population depending on how
much environmental influence there is;
if there’s more environmental influence adding to the
total phenotypic variance, this proportion is
going to go down. If you can keep the
environment constant, it’s going to — everything
is going to look genetic, and so this proportion
will go up. Keep in mind that its value
does not indicate the role of genes or variants in any
specific individual, but it allows you to sort of
predict the expected degree of familial aggregation
of a trait. And it was anticipated
that traits that had high heritability should prove
fruitful in identifying trait-related genes. Probably the trait with the
highest heritability that’s known is height. Height actually did not lend
itself very well to identifying genetic variants
or genes in linkage studies,
but actually has been really a gold mine
in genome-wide association
studies. And just another way of looking
at this, percent of variance explained for
angiotensin-converting enzyme activity, ACE activity, in fathers,
mothers and siblings, and this is just a
major gene effect affecting this, and the proportion
of variance explained. And just to point
out that up until now, we really hadn’t
found any genes at all, but even those that we’ve found
really don’t seem to explain the vast majority of the
heritability that had previously been
identified. So height, 90 percent
heritability; the variants found to date
explain only about 3 percent of it. Does that mean that, you know,
there are many, many, many more variants to be found,
or does it mean that environmental influences haven’t
been taken into account as well? It’s not quite clear. Type 2 diabetes has
a lambda sub S, the risk to your sibling if you have
diabetes, of about three- to fourfold. So far the variants that
have been identified have a lambda sub S of
only about 1.07. C-reactive protein
has been estimated in the Reiner
and Ridker papers that were recently published;
they’ve estimated about 10.5 percent of the variance
explained by the variants that they identify. I’m not sure that I trust that;
that seems awfully high, and the total variance
is 30-50 percent. This needs to be replicated;
it’s — these are new studies. And a recent psoriasis study,
for example, a lambda sub S in siblings is 4 to 11, maybe
about 7 to 8 on average. There were about nine variants
in this particular paper that were
at 1.3. If you were to multiply
all of those out, if you had each one of them,
you might be explaining, you know, a lambda sub S of —
in the 8 to 9 range. So it seems as though
you’re getting more and more of the variance
explained. These are also newer
and newer studies, and I suspect that they
won’t replicate. Keep in mind that the first
estimates that you get of a relative risk in
any risk factor, whether it’s smoking or
whatever, tend to be overestimates, because you’ve
had some variability in order to be able to find that
estimate, and we’ll talk about that in
a bit as well. Tom had asked me to comment just
briefly, and that’s all I’ll do, on Hardy-Weinberg equilibrium
because he’ll be talking about it a fair amount
in the next talk. Remember that he talked about
Mendel’s second law: the occurrence of two alleles of
the SNP in the same individual are two independent events,
and those basically segregate separately. There are ideal conditions
under which an equilibrium is established and maintained
among them; that was described by two,
actually, epidemiologists, Hardy and Weinberg. And those conditions are:
random mating, which we generally do
not have in the U.S.; no in- or out-migration;
no inbreeding; no selection, that is, equal survival of
the offspring; no mutation; large population sizes;
and gene frequencies that are equal in
males and females. Very few of these conditions
actually hold, but they’re not all that
critical for estimating Hardy-Weinberg equilibrium. And if the alleles, big A
and little a, of a given SNP have frequencies P
and 1 minus P, then the expected frequencies
of the three genotypes, and probably all of us learned
this in high school, are P squared; 2 times PQ,
that is, 2 times P times 1 minus P; and 1 minus P, quantity squared. And this is a very useful
equation to test. It used to be used to sort of
identify whether there was selection pressure against
one genotype or another. These days, a departure from it is actually more
likely to indicate genotyping error, particularly because
heterozygotes on the current platforms are much tougher to
type than the homozygotes, so what you tend to have
is fewer heterozygotes than you would expect by
Hardy-Weinberg equilibrium. So it’s worthwhile keeping
that one in mind. I think that’s about
where I’m at. So keep in mind that familial
clustering is an indicator of possible genetic influence;
it’s just a hint. It doesn’t necessarily mean
that there are genes at play. It may overestimate the genetic
component due to either poor assessment of the environment,
or poor adjustment for shared environment among families. And methods for assessing
it include twin studies, parent-offspring correlations,
sibling- or relative-relative risk, and percent of
variance explained. And current genes that we’ve
identified so far for complex diseases really explain only a
tiny fraction of heritability, and that unexplained
heritability has been called the dark matter of
complex disease genetics. So I think I’ll stop
at that point, and I believe there’s a question
there in the back, so thank you. Questions?

[Male Speaker]
I’m just curious about
the [Inaudible] —

[Dr. Teri Manolio]
So, I think they’d like you
to use the microphone. I’m sorry — we wanted
to tape this. We’re actually not
live-webcasting it, but we wanted to have it
available for posterity, so that when Martha or others
ask, you know, “Can you give a course?” we can say,
“Look at our Web site.”

[Male Speaker]
Oh, I’m just curious
about the heights. I mean, you said it’s
a gold mine, that the genome-wide association
is a gold mine. Is there any —

[Dr. Teri Manolio]
Oh, it’s a gold mine because
there are, like, 20 different variants for
it now, but each one explains a very,
very, very small proportion of the variance. So that’s variants with a T,
and then variance with a C-E. Yeah, so it’s
done very well. Diabetes has been
another biggie. Crohn’s disease has come
up with 15 or 20 or so. But again, they don’t explain,
you know, the heritability that has been estimated. And my personal belief is
that we’ve overestimated the heritability; we’re not
accounting for the shared environment nearly well enough,
but that’s just my belief.
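
[Editor's illustration, not part of the talk] The heritability ratio and the Hardy-Weinberg expectations described in the lecture can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the variance values, and the genotype counts are all hypothetical and chosen only to make the arithmetic easy to follow. It shows heritability as genetic variance over phenotypic variance, the "twice the parent-offspring correlation" estimate, and a simple chi-square comparison of observed genotype counts against P squared, 2 times P times 1 minus P, and 1 minus P squared.

```python
# Illustrative sketch only: function names and example numbers are hypothetical,
# not taken from the lecture slides or from any genetics library.

def heritability_from_variances(var_genetic, var_phenotypic):
    """Proportion of phenotypic variance explained by genetic variance."""
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_genetic / var_phenotypic


def heritability_from_parent_offspring(correlation):
    """Twice the parent-offspring correlation, as described in the talk."""
    return 2.0 * correlation


def hardy_weinberg_expected(p):
    """Expected genotype frequencies for allele frequencies p and 1 - p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-square of observed genotype counts against the
    Hardy-Weinberg expectation (one degree of freedom)."""
    n = n_AA + n_Aa + n_aa
    if n <= 0:
        raise ValueError("need at least one genotyped individual")
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    if p in (0.0, 1.0):
        raise ValueError("SNP is monomorphic; the test is undefined")
    expected = hardy_weinberg_expected(p)
    observed = {"AA": n_AA, "Aa": n_Aa, "aa": n_aa}
    return sum(
        (observed[g] - n * expected[g]) ** 2 / (n * expected[g])
        for g in observed
    )


if __name__ == "__main__":
    # Same genetic variance, different total variance: more environmental
    # variance lowers the ratio, a constant environment raises it.
    print(heritability_from_variances(2.0, 10.0))  # 0.2
    print(heritability_from_variances(2.0, 4.0))   # 0.5

    # Made-up counts with fewer heterozygotes than expected at p = 0.5.
    print(hardy_weinberg_chi_square(30, 40, 30))   # 4.0
```

With these made-up counts the expected numbers are 25, 50, and 25, so the statistic is 4.0, just above the usual 3.84 cutoff for one degree of freedom at the 0.05 level. A heterozygote deficit of that kind is the pattern the talk associates with genotyping error.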