Module 14 - Using Markers to Predict Breeding Values

The goal of this module is to cover the fundamentals of
marker aided selection in forest trees, concentrating on integration
of markers into basic quantitative genetic theory.
In previous modules we have discussed when and where in the tree
breeding cycle markers may be employed to improve breeding efficiency.
That is, where marker-assisted selection
(MAS) would be beneficial.
In this module, the author introduces mathematical approaches to incorporating
marker information in the selection process.
This is done by carefully defining the mixed linear
models that capture random genetic variables and fixed
environmental variables influencing the trait of interest.
The models are expressed and solved using
matrix algebra. Familiarity with matrix algebra
will significantly enhance your understanding of the module’s content.
This module is organized in four sections
and begins with a discussion of the best linear unbiased
prediction (BLUP) of breeding values using a
numerical relationship matrix, called the A matrix.
This matrix is usually derived from pedigrees of
individuals. The relationships are sometimes called additive
genetic covariances. The second section
focuses on incorporation of markers in models and the various approaches
to marker aided selection we have introduced in previous modules.
The third section addresses a key issue
related to the use of markers for selection, namely, how missing
genotypes are inferred or imputed so that breeding values
can be calculated. The final section is about the
realized genetic matrix for G-BLUP.
The genetic relationship matrix based on the markers is called
the G matrix, and BLUP analysis based on the G matrix is
called G-BLUP. We will start, however, with
definitions for a number of terms.
When an individual is crossed,
either as a female or a male, with a number of other parents,
we measure the performance of the progeny of those crosses
and estimate a mean value of that parent.
This is typically done for trees used as females with many pollen donors.
The deviation of the parent mean (X) from
the population mean (μ)
is called the general combining ability (GCA) of that tree.
We can use a linear model to define the GCA of a tree
as y sub i is equal to μ
plus GCA plus error.
GCA can be defined as the expected value of offspring from a given parent
after it has been crossed with many pollen parents.
The GCA of a given tree is
half of the parental additive genetic value
(½ a sub f).
The other half, unmeasured here, comes from the pollen parents.
The breeding value (BV)
is the value of genes transmitted to progeny.
We can define breeding value of a tree in a linear model as the average
parental breeding values of the male (a sub m) and the
female (a sub f), the fixed effects
(that is, the intercept), and the error (e sub i).
The genetic value is the value of genes to the parent tree itself.
It includes both the additive and non-additive
(dominance) effects. Dominance effects cannot
be passed on to progeny by breeding.
The difference between the genetic and the breeding value is, therefore, largely dominance
deviation (assuming epistatic effects are negligible).
We can use a linear model to define genetic values
as the sum of the average of parental breeding values, the Mendelian sampling
(m sub i), and the error.
Mendelian sampling is the deviation of offspring from mid-parent breeding values.
In other words,
offspring from a cross deviate from the mid-parent breeding values because of
random sampling of parental genes caused by segregation and assortment.
They receive different sets of genes from parents,
which affects their phenotype (deviation from
mid-parental mean).
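For reference, the definitions above can be collected in one place. The following is a sketch in standard quantitative genetic notation; the subscripts and layout are assumptions and may differ from what is shown on the slide.

```latex
\begin{align}
y_i &= \mu + \mathrm{GCA}_f + e_i, & \mathrm{GCA}_f &= \tfrac{1}{2}\, a_f, \\
\mathrm{BV}_f &= a_f = 2\,\mathrm{GCA}_f, \\
a_i &= \tfrac{1}{2}\,(a_m + a_f) + m_i .
\end{align}
```

Here a_i is the breeding value of an offspring and m_i is its Mendelian sampling deviation from the mid-parent value.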
There are two approaches to calculating breeding values (BVs)
based on progeny test information that is available for a sample of parents.
We may fit a general combining ability model
(GCA, or parental model)
to obtain the parental GCA value and
multiply this estimate by two to get breeding values.
These models are easy to run in programs like SAS Statistical Software
because there are relatively few mixed model equations
to be solved. If we need to estimate breeding values of genetic groups, parents,
and progeny simultaneously,
individual-tree models are preferred. These models
are called ‘animal models’ in the animal breeding literature.
In the individual-tree models, trees are no longer independent.
Progeny from the same female are correlated
(that is, they are genetically related).
The individual-tree models rely on both additive genetic variance
and a matrix of information that
describes the relationship of every tree to every other tree
(both parents and progeny) in the data set.
This is a key feature of the BLUP approach to estimating breeding values.
This approach, as noted in module 5, is desired in programs
that have advanced beyond one or two generations of breeding.
So, let’s look at the details of the BLUP approach to see how general combining ability,
breeding value, and genetic value are calculated.
The traditional approach of using the BLUP procedure to predict
breeding values is based on individual tree phenotypes
and the genetic relationship matrix (A)
derived from the individual tree pedigrees. The A matrix must be known;
it can also be estimated if enough genetic markers exist.
Best linear unbiased prediction (BLUP)
is performed using matrix algebra. While this may
be foreign to many of you, bear with the presentation to see if you can extract
the essence of the analysis. The type
of model used will depend on the nature of the trait being measured. We use
linear mixed models to predict breeding values
of continuous response variables, such as growth,
and generalized linear mixed models for binary traits, such as disease
incidence. The statistical model chosen
is the foundation of progeny test analysis and must be defined with utmost care.
We will now take a few moments to describe
each basic element in a linear mixed model.
y is the n by 1 column vector
of phenotypic observations
(think of a vector like a column of data for trees).
“n by 1” represents the dimensions of the vector, where n is the
number of trees (that is, the number of rows) and 1
is the number of columns. X is the
design matrix that relates elements of the fixed effect
vector b to the vector y.
b is the p by 1 column vector of fixed effects
(for example, the intercept, sites, and
blocks within sites), where p is the number of rows
and 1 is the number of columns.
These are non-genetic factors that contribute to the phenotype observed.
Z is the design matrix that relates
elements of the random effect vector a to y.
a is the q by 1 column vector of random, that is, genetic effects
(for family and family by site interaction, for instance).
q is the number of levels
of random effects or number of trees in the data.
e is the n by 1 column vector of random residuals,
where n is the number
of observations (trees).
Although these terms likely seem abstract to you now,
hopefully they will become clearer after our example on the next slide.
To achieve a thorough understanding of linear mixed models in the
context of predicting breeding values, you will likely need to
review this slide and the next slide several times.
In this slide,
we adapt the linear mixed model described in the previous slide
to develop a model for predicting breeding values.
Suppose that we have measured height of five trees grown in two different locations
(L1 and L2).
We can assume that the trees come from a large population, which is a reasonable assumption,
and therefore, we will treat the trees as random.
The linear mixed model is shown in a standard statistical format
(MODEL 1) as (y sub ij
= l sub i + t sub j +
e sub ij) where y sub ij is the
j-th tree height measured in the i-th location;
l sub i is the i-th location effect, t sub j is the j-th tree effect (breeding value),
and e sub ij is the error term associated with the j-th tree in location i.
The same model is shown in compact matrix notation as MODEL 2.
In reality we may have many more fixed and random terms and writing linear models in statistical format makes them much longer.
The other advantage of matrix format is that it is easier to talk about the assumptions of the model
and it is easier to describe different variance-
covariance structures of the model for the matrix format.
The full matrix format of the same model is given in MODEL 3.
This is like taking the x-ray pictures of matrices (X and Z)
and vectors (y, b, a, and e).
We will now describe each element of the mixed model
as it applies to predicting breeding values.
y is a 5 by 1 column vector of height observations
where 5 is the number of trees.
This is the number of rows of the vector.
X is the design matrix that relates the
fixed location effects (b) to the height observations (y).
For example, the first column of X is for location 1
and the second column is for location 2.
Looking at the first column of X matrix
for location 1 shows that observations 87 and 90 are
coming from location 1 (because they have element 1 instead of 0).
b is the 2 by 1 column vector of fixed effects.
These are non-genetic factors that contribute to the phenotype observed.
In this example, we treat location as a fixed effect,
thus b has two rows, one for each location.
Those are the coefficients we solve for fixed effects.
Z is the design matrix that relates height observations to the random tree effects (a).
a is the 5 by 1 column vector of random or genetic effects; here these are the tree effects
(in other analyses they could be, e.g., family and family by site interaction).
This is a column of random effects for which we want to predict breeding values.
And e is the vector of residuals
associated with each measurement (and tree).
We assume that all trees have the same error variance
and that errors are independent and have a normal distribution.
In matrix notation, error variance
(the R matrix) can be expressed as
R = I sub n times sigma squared sub e,
where I sub n is the identity matrix with n by n dimensions
(where n is the number of observations)
and sigma squared sub e is the error variance calculated from data.
The diagonals of the identity matrix are all 1
and off-diagonal elements are all 0.
We also assume that the random effects are associated
with a variance and that the variance is homogeneous (that is, the same for all).
In matrix notation,
variance of the random effects (the G matrix)
can be expressed as G
= I sub t times the variance of the random effects,
where I sub t is the identity matrix with t by t dimensions
(and t is the number of random effects).
The diagonals of the I sub t matrix are all 1 and off diagonal elements are all 0.
Then we can define the variance of observations as the sum of two variances
(the error variance and the random effects variance).
The variance of the y vector is equal to V,
which is equal to the Z matrix
(that is, the matrix that relates the phenotypic observations to the random effects),
multiplied by the G matrix,
multiplied by the transposed Z matrix,
plus the error variance (the R matrix).
It is important to keep in mind that the above assumptions may not hold.
For example, if we have two locations,
they may have two different growth rates and thus the error variance might be different (heterogeneous).
In addition, errors associated with observations might be correlated.
Similarly, genetic variance might be different in different locations.
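The variance structure described above can be sketched numerically for the five-tree example, under the homogeneous-variance assumptions. Only the heights 87 and 90 come from the text; the other three heights, the row order, and both variance values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Heights of five trees; 87 and 90 are from the text, the rest are made up.
y = np.array([87.0, 90.0, 98.0, 103.0, 95.0])

# X relates observations to the two fixed location effects (L1, L2).
# Assumption: the first two records are the ones from location 1.
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [0, 1]], dtype=float)

# Z relates observations to the five random tree effects (one record per tree).
Z = np.eye(5)

sigma2_e = 6.0  # hypothetical error variance
sigma2_a = 4.0  # hypothetical variance of the random tree effects

R = np.eye(len(y)) * sigma2_e      # R = I_n * sigma2_e
G = np.eye(Z.shape[1]) * sigma2_a  # G = I_t * variance of random effects
V = Z @ G @ Z.T + R                # variance of the y vector

print(V)
```

With these assumptions V is diagonal, which is exactly what the caveats about heterogeneous or correlated errors would change.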
To solve the linear mixed model given in the previous slide,
the model is presented in matrix form.
Working out the solutions yourself will take in-depth knowledge of matrix algebra, but
try to understand the overall concept.
In this model, X is the matrix that relates fixed effects to y,
X prime is the transposed
X matrix, G is the genetic variance-covariance matrix,
G to the power of -1 is the inverse of G,
R is the error variance-covariance matrix,
and R to the power of -1
is the inverse of R. The vectors a
(random effects) and b (fixed effects)
are as described previously.
When the R matrix has a simple structure (that is, errors are normally
and independently distributed), R can be factored out
to provide the simplified model shown here.
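For readers without the slide, the mixed model equations described here are conventionally written in the following form (Henderson's mixed model equations). This is the standard textbook layout using the symbols defined above; the slide's exact notation may differ.

```latex
\begin{equation}
\begin{bmatrix}
X'R^{-1}X & X'R^{-1}Z \\
Z'R^{-1}X & Z'R^{-1}Z + G^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}
\end{equation}

% With R = I \sigma^2_e, multiplying both sides by \sigma^2_e gives:
\begin{equation}
\begin{bmatrix}
X'X & X'Z \\
Z'X & Z'Z + G^{-1}\sigma^2_e
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\end{equation}
```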
The solutions for fixed effect factors
(e.g., location or site effects)
are called best linear unbiased estimates (BLUE).
The solutions for random effects are called best linear
unbiased predictions (BLUP).
In the mixed model equations, the inverse of the G matrix of random effects is
replaced with the inverse of the additive genetic relationship matrix A
(that is, A to the power of -1).
This is a powerful tool in BLUPs
to predict breeding values using all the information from relatives,
and thus increase the accuracy.
The inverse of the A matrix in the mixed model equations is multiplied by the ratio
of sigma squared sub e
(error variance) and sigma squared sub a (additive genetic variance).
These variances are assumed unknown
and are estimated from data using the restricted maximum likelihood method
(REML).
The ratio of these two variances is sometimes called the shrinkage factor.
It is inversely related to heritability: the lower the heritability, the larger the ratio.
For low heritability traits,
the predicted values shrink towards the mean
more when compared to traits with high heritability.
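A minimal sketch of solving the simplified mixed model equations with the inverse of A scaled by the variance ratio is given below. It assumes an individual-tree model in which the ratio equals (1 - h²)/h²; the heights (apart from 87 and 90), the relationship matrix, and the two heritability values are hypothetical. The function name `solve_mme` is introduced here for illustration only.

```python
import numpy as np

def solve_mme(X, Z, y, A, lam):
    """Solve [X'X X'Z; Z'X Z'Z + A^-1 * lam] [b; a] = [X'y; Z'y]."""
    A_inv = np.linalg.inv(A)
    C = np.block([[X.T @ X, X.T @ Z],
                  [Z.T @ X, Z.T @ Z + A_inv * lam]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    solution = np.linalg.solve(C, rhs)
    p = X.shape[1]
    return solution[:p], solution[p:]  # BLUE of b, BLUP of a

# Hypothetical data: five trees, two locations.
y = np.array([87.0, 90.0, 98.0, 103.0, 95.0])
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
Z = np.eye(5)

# Hypothetical relationship matrix: trees 1 and 2 are unrelated parents,
# trees 3 and 5 are offspring of tree 1, tree 4 is an offspring of tree 2.
A = np.array([[1.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0, 0.0, 0.25],
              [0.0, 0.5, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.25, 0.0, 1.0]])

for h2 in (0.2, 0.8):
    lam = (1.0 - h2) / h2  # shrinkage factor sigma2_e / sigma2_a
    b_hat, a_hat = solve_mme(X, Z, y, A, lam)
    print(f"h2={h2}: lambda={lam:.2f}")
    print("  BLUE (locations):", np.round(b_hat, 2))
    print("  BLUP (trees):    ", np.round(a_hat, 2))
```

Running the loop with a low and a high heritability shows the predicted tree effects being pulled closer to zero when the ratio is large.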
In this slide we have presented an example of a simple additive genetic relationship
matrix based on a pedigree of five trees.
Trees that are not genetically related have 0 coefficients,
for example, trees 1 and 2,
or 4 and 5.
Since trees 3 and 5 are half-sibs
(they have female 1 as their common mother),
the coefficient of additive genetic relationships between 3 and 5 is 0.25.
The relationship between tree 4 and its parent, female tree 2,
is 0.5.
Similarly, the coefficient between trees 3 and 5
with their common mother tree is 0.5.
These coefficients are based on the probability of alleles being
identical by descent.
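The coefficients quoted above can be reproduced with the tabular method for building A from a pedigree. The sketch below assumes the pedigree implied by the text (trees 1 and 2 are unrelated females, trees 3 and 5 have mother 1, tree 4 has mother 2) and treats all pollen parents as unknown, which is an assumption. The function name is hypothetical.

```python
import numpy as np

def additive_relationship(pedigree):
    """Tabular method. pedigree[i] = (sire, dam) as 0-based indices or None.
    Parents must appear before their offspring."""
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (sire, dam) in enumerate(pedigree):
        # Relationship with every earlier individual j:
        # half the relationship of j with each known parent of i.
        for j in range(i):
            rel = 0.0
            if sire is not None:
                rel += 0.5 * A[j, sire]
            if dam is not None:
                rel += 0.5 * A[j, dam]
            A[i, j] = A[j, i] = rel
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship).
        inbreeding = 0.0
        if sire is not None and dam is not None:
            inbreeding = 0.5 * A[sire, dam]
        A[i, i] = 1.0 + inbreeding
    return A

# Trees 1..5 are indices 0..4; pollen parents (sires) are unknown.
pedigree = [(None, None), (None, None), (None, 0), (None, 1), (None, 0)]
A = additive_relationship(pedigree)
print(A)
```

The half-sib coefficient of 0.25 between trees 3 and 5 and the parent-offspring coefficients of 0.5 appear as off-diagonal elements of the printed matrix.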
As noted previously,
the additive genetic relationship matrix is the backbone of parental
and individual tree models. Variances and means are seldom constant;
they change from one generation to another because we improve the population.
The A matrix,
as it is called, is typically constructed from known pedigrees.
The introduction of numerous, low cost markers may eventually
improve our ability to define the level of “kinship”
between trees by looking at the proportions of shared alleles.
As we shall see in a few slides,
using markers to calculate the proportion of alleles that are “identical by descent”
(IBD)
between parents and progeny is a powerful tool for improving breeding value estimates.
We now shift attention back to the general combining ability
model first introduced on slide 5.
You may recall that the GCA,
or parental, model may be used to predict parental breeding values.
When a GCA model is fit, the objective
is to obtain GCA values of parents which will be used for backward selection
(selection of the best parents).
Such selection might be useful, say, for thinning out seed orchards to the best parents.
Within family selection is not considered here.
Note that the equation shown on this slide
is the same equation first presented on slide 10.
The b part of the solution vector in the mixed model
equations includes the estimates for fixed effects,
such as the intercept, sites, age, and so forth.
The estimates for fixed effects are best linear unbiased estimates
(BLUE).
The solutions for the a vector are the best linear unbiased predictions
(BLUP) for random effects. We sometimes call the large matrix
(marked with the red square) a C matrix.
The diagonal elements of the inverse of the C matrix are proportional to the variances associated with the predictions.
Parents with a large number of progeny
would have small variances (that is, they would be estimated
quite accurately). The equation given in this slide
shows how the mixed model equations are solved for fixed effect
(intercept) and random effects (e.g. GCA or BV)
in the linear mixed model given in slide 3.
We shall now shift our attention to the individual tree model first introduced in slide 5.
Recall that when a breeding value
model for individual trees is fit,
the objective is to obtain the breeding value of grandparents, genetic groups,
parents, and offspring simultaneously.
The a vector now contains values for all genetic groups of interest.
The BLUP solutions of mixed model equations for the parents,
offspring, or grandparents are breeding values.
There is no need for extra calculation to obtain breeding values.
Selection is subsequently completed,
typically, by taking into consideration the performance of both the family
and the individual offspring within family,
with carefully defined weightings for each.
This is called an index selection strategy,
and it is used to create advanced breeding populations.
This is defined as forward selection.
In order to obtain genetic values,
we need to have full-sib family structure or cloned progeny.
Since GCA and breeding values are predictions they are associated with an error.
Breeders are interested in knowing the degree of reliability of these predictions when making decisions.
There is more than one statistic used to measure the accuracy or reliability of breeding values.
The most common statistic used to measure their reliability is the accuracy value.
The correlation between true and predicted breeding values is called the accuracy of breeding values (BV).
We can use the standard error of a prediction to calculate accuracy.
In the formula shown on the slide,
SE is the standard error of the prediction,
F is the inbreeding coefficient,
and sigma squared sub A is the additive genetic variance.
The additive genetic variance is the variance of the true breeding values (BV).
For non-inbred trees, F is zero.
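The slide formula is not reproduced in this transcript. A commonly used form with the same three ingredients is accuracy = sqrt(1 - SE² / ((1 + F) × sigma squared sub A)); the sketch below assumes that form, and the function name and input values are hypothetical.

```python
import math

def bv_accuracy(se, sigma2_a, f=0.0):
    """Accuracy of a predicted breeding value.

    se       : standard error of the prediction
    sigma2_a : additive genetic variance
    f        : inbreeding coefficient of the tree (0 if non-inbred)
    """
    return math.sqrt(1.0 - se ** 2 / ((1.0 + f) * sigma2_a))

# Hypothetical values: SE = 1.2 and additive variance = 4.0, non-inbred tree.
print(round(bv_accuracy(se=1.2, sigma2_a=4.0), 3))  # about 0.8
```

A smaller standard error relative to the additive genetic variance gives an accuracy closer to 1.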
During this portion of the module, we focus on marker aided selection (MAS).
In previous modules we introduced the concepts of Linkage Equilibrium (LE) MAS,
Linkage Disequilibrium (LD) MAS,
and Gene MAS to describe the distinction between the proximity,
or relationship between a marker
and the causal mutation affecting a trait phenotype.
Here we review the concept and illustrate the expectations at the population level.
The optimal condition for marker based selection would be Gene MAS,
where the marker is actually the causal mutation.
Unfortunately, such markers are rarely found,
and targeting them would be time-consuming and costly.
In the population, there is only one option under Gene MAS -
the allele (genotype) that you see reflects the phenotype you will get all the time.
With association genetic approaches to selection
we are realistically targeting markers that are in tight linkage disequilibrium with the causal mutation.
In the diagram shown here you can see that the marker,
depicted by the uppercase M and lowercase m alleles,
is always in the same allelic configuration
with the causal mutation depicted by the uppercase Q and lower case q alleles.
Depending on the extent of LD,
the presence of the marker allele will, more or less, always reflect the actual genotype.
As can be seen above,
with Linkage Equilibrium (LE) MAS,
the correlation between marker and causal mutation alleles is poor at the population level.
While the desired correlation may be excellent within a family,
it is likely to disintegrate across families and have no utility for selection at the population level.
Finally, we introduce the concept of including markers in the selection process.
The linear statistical model to describe the variance in a
trait of interest (say, y)
is depicted in its simplest form where a phenotype is described by the population mean,
a marker effect (M), and the residual error.
For the remainder of this module
let us assume that we are dealing with markers that are in strong LD with causal mutations.
An LD marker will have a measurable effect on the trait,
though it may be very small. In the above DNA sequence,
all the nucleotides are the same, except at the base position highlighted with red letters
(A/C) and a rectangle.
By calculating the mean performance of individuals grouped by their SNP genotype at this locus,
we can estimate the effects of allelic substitution.
In this case, there appears to be a completely additive genetic effect of
substituting the nucleotide base C with the base A.
That is, for each A allele there is a plus
five increase in the trait being measured.
The relevance of the difference depends on the scale of measurement,
the proportionate increase, and the value of each point of increase.
Several possibilities exist for how the marker genotype
may be fit in the linear model to predict breeding values.
We assume that markers in question are in LD with the trait.
In the first approach, markers can be treated as covariates
(fixed effects) in the linear mixed model.
In the second approach, markers are considered as random effects in the linear mixed model.
In the third approach, markers are not fit as factors
in the model, but they are used to generate genomic relationships between individuals.
The matrix of relationships based on markers
is called the G matrix.
The A matrix based on pedigree that we discussed previously is
replaced by the G matrix in linear mixed models to obtain BVs.
The third approach is fundamentally different from the first two approaches.
We shall explain all three methods in the following slides.
When marker effects are considered fixed,
the marker can be fitted as a covariate.
Alternatively, the marker alleles may be fit as a class variable,
allowing for different means for each genotype.
This latter approach also allows one to model for dominance effects and
can be extended to fitting haplotypes
(that is, multi-locus SNP genotypes).
We draw your attention to the
marker effect m.
The m vector for a given marker has the elements of -1,
0, or 1,
which designates the number of a specific allele, say
“allele A”, carried by an individual.
If the individual is homozygous for the “A” allele,
that is, it has two copies of the A allele,
then it would receive a value of “one”.
A heterozygous individual would have a single copy of the “A” allele,
and would be assigned a value of “zero”.
A homozygote for an alternative “C” allele
would have no allele “A”, and would be scored as a “minus one”.
Now let’s look at the allelic substitution effect
more closely in the next slide.
The average effect of allelic substitution represents the average change in
phenotypic value when the A allele is
randomly substituted for the C allele. For example,
substituting C with A may increase the phenotypic
mean by 5 units.
The additive effect can be estimated as the difference between the two homozygous means
divided by two.
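The coding of genotypes and the estimate of the additive effect from the two homozygous means can be sketched as follows. The six phenotypes are made up so that each A allele adds five units, matching the example in the text; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical SNP genotypes and phenotypes for six trees.
genotypes = ["AA", "AA", "AC", "AC", "CC", "CC"]
phenotypes = np.array([109.0, 111.0, 104.0, 106.0, 99.0, 101.0])

# Code genotypes as 1 (two A alleles), 0 (heterozygote), -1 (no A allele).
coding = {"AA": 1, "AC": 0, "CC": -1}
m = np.array([coding[g] for g in genotypes])

mean_AA = phenotypes[m == 1].mean()
mean_AC = phenotypes[m == 0].mean()
mean_CC = phenotypes[m == -1].mean()

# Additive effect: half the difference between the two homozygous means.
additive_effect = (mean_AA - mean_CC) / 2.0
# Dominance deviation: heterozygote mean minus the homozygote midpoint.
dominance_dev = mean_AC - (mean_AA + mean_CC) / 2.0

print("additive effect:", additive_effect)
print("dominance deviation:", dominance_dev)
```

A dominance deviation of zero, as in this made-up data, corresponds to the completely additive case.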
If the marker-QTL effects are treated as fixed effects,
there is a strong tendency to overestimate them.
Fitting markers as fixed effects tends to lead to larger estimation errors.
For example,
SNPs with small minor allele frequency are
susceptible to this estimation error.
For this reason, and others, we may want to fit markers as random effects.
In this slide, the marker effect
(m) is fitted as a random effect in a mixed model.
We can shrink the estimate of marker effects
according to the amount of data used by treating markers
as a random effect. The less data there is to estimate m,
the more the estimate will shrink towards the mean.
In other words, fitting markers as random effects regresses,
or shrinks, estimates back to zero to account for the lack of information.
Given a suitable choice of the variance explained by markers,
the resulting estimates are BLUPs.
Differences between random and fixed effects are small if the amount of data is large
(small error variance).
Furthermore, treating markers as random allows calculation of percent of phenotypic variation explained by markers.
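The shrinkage described above can be illustrated for a single marker. In this sketch the fixed-effect estimate is an ordinary regression slope, and the random-effect estimate adds a variance ratio (error variance over marker variance) to the denominator. The data, the ratio of 4, and the function name are hypothetical, and repeating the same four records is only a device to mimic a larger data set.

```python
import numpy as np

def marker_effect(x, y, lam=0.0):
    """Slope of y on marker code x; lam > 0 shrinks it toward zero."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (xc @ xc + lam))

# Hypothetical marker codes and phenotypes for four trees.
x_small = np.array([-1.0, 0.0, 1.0, 1.0])
noise = np.array([1.0, -1.0, 0.5, -0.5])
y_small = 100.0 + 5.0 * x_small + noise

# A larger data set made by repeating the same records 25 times.
x_large = np.tile(x_small, 25)
y_large = np.tile(y_small, 25)

lam = 4.0  # assumed ratio of error variance to marker variance

for label, x, y in (("4 trees", x_small, y_small),
                    ("100 trees", x_large, y_large)):
    fixed = marker_effect(x, y)
    shrunk = marker_effect(x, y, lam)
    print(label, "fixed:", round(fixed, 3), "random:", round(shrunk, 3),
          "ratio:", round(shrunk / fixed, 3))
```

With four trees the random-effect estimate is well below the fixed-effect estimate, while with 100 trees the two nearly coincide.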
In general, trees are genotyped using a large number of markers.
Using more markers provides more information to predict genetic merit of individuals.
We will have more confidence to rank them and make selections.
This is of particular importance because most of the traits we are interested in,
such as growth, are controlled by many genes.
Even if there is a large quantitative trait locus (QTL)
that is explaining a large proportion of phenotypic variance,
we still would like to use more markers to genotype trees for that particular QTL.
In this way, more than one marker is associated with a large QTL.
They all can be significant in association testing
(explaining the same variance).
The question is how to account for multiple markers in the prediction of genetic merit?
The simplest way is the multiple regression approach,
fitting all the markers simultaneously as given in this slide.
In the example, tree1 has marker 2 but not
marker 1.
p is the number of significant markers from the association study
and g is the vector of markers.
Linear regression for multiple markers can be written in matrix form as we saw in slide 20.
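A minimal sketch of the multiple regression approach in matrix form is shown below, fitting an intercept and two markers simultaneously by least squares. The genotype table and phenotypes are hypothetical; the first row follows the text in giving tree1 marker 2 but not marker 1.

```python
import numpy as np

# Hypothetical marker scores for five trees (rows) at p = 2 markers (columns).
# Row 1: tree1 carries marker 2 but not marker 1.
g = np.array([[0, 1],
              [1, 0],
              [1, 1],
              [0, 0],
              [2, 1]], dtype=float)

# Made-up, noise-free phenotypes: mean 100, marker effects 3 and 5.
y = 100.0 + 3.0 * g[:, 0] + 5.0 * g[:, 1]

# Design matrix with a column of ones for the mean.
W = np.column_stack([np.ones(len(y)), g])

# Least-squares solution for the mean and both marker effects.
coef, *_ = np.linalg.lstsq(W, y, rcond=None)
print("mean and marker effects:", np.round(coef, 3))

# Predicted genetic merit of each tree across markers.
merit = g @ coef[1:]
print("merit:", np.round(merit, 3))
```

With many markers and few trees this least-squares fit becomes unstable, which is one reason for treating marker effects as random.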
For the sake of simplicity,
most of the models introduced in this module have been simplistic
in that they model the effects of only one or a few markers at a time.
In forest trees, application of a single marker in prediction of breeding values is not likely to occur.
Instead, with ever decreasing genotyping costs and
marker availability, breeding programs
are moving to utilize large numbers of markers
for genome-wide selection. Unfortunately,
as the number of markers increases, the number of missing data points also tends to increase.
It rapidly becomes economically and
logistically difficult to fill the missing data points
with new genotyping runs.
Removing records (trees in the test, for instance, or specific SNPs)
that do not have complete data sets will
likely result in a dramatic reduction in power of association of markers and traits -
so much so that the analyses are virtually meaningless.
Yet, predictions of the genetic merit of trees across markers require complete genotyping information or gene content.
It is therefore important to use efficient statistical methods to accurately infer missing genotypes.
The term ‘imputation’ refers to
using a reference panel of haplotypes to replace
missing genotypes of a subsample of individuals.
We will discuss the concept and process of imputing missing data points
in the next few slides.
Human geneticists rely on genetic maps
and linkage disequilibrium information from nearby markers to replace missing genotypes.
Many software programs exist to impute missing data,
which rely on haplotype reference datasets,
such as HapMap, for population haplotypes or reference panels.
Most programs employ maximum likelihood or Bayesian methods to
predict missing genotypes.
They can be computationally intensive
and convergence can be a problem for large data sets.
Reference data sets for forest trees generally do not exist
since we do not yet have a complete sequenced genome.
Plus we do not know the order and location of markers on the genome of trees.
Consequently, methods developed by human geneticists do not work well for trees.
Thus, alternative approaches have been developed by animal breeders,
which can be employed in tree breeding. One of them is to use
pedigrees, that is, expected additive genetic relationships between trees.
We explain this concept in the next slide.
The minor allele is the allele with a frequency less than 0.5 in the population.
Geneticists often expect minor alleles to have a greater effect on the phenotype than major alleles.
We calculate minor allele frequency for each locus across all the trees genotyped,
and convert genotypes to the number of minor alleles each individual has.
The number of minor alleles is sometimes called ‘gene content’.
Each tree would have 0, 1, or 2
depending on the number of minor alleles it has.
These counts are sometimes converted to
-1, 0, and 1, for ease of the matrix calculations.
Minor allele frequencies are used to impute missing genotypes, as we shall see in the next few slides.
Minor allele frequency can also be used to estimate allele additive and dominance substitution effects on a phenotype,
genetic merit of trees across all markers,
or genetic merit based on additive, dominance, and total genetic marker effects.
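Calculating minor allele frequency and converting genotypes to gene content can be sketched as follows. The genotype table is hypothetical, and scoring genotypes as counts of one reference allele (here called A) with `np.nan` for missing values is an assumption about the input format.

```python
import numpy as np

# Counts of allele "A" (0, 1, or 2) for four trees (rows) at two loci (columns).
# np.nan marks a missing genotype. All values are made up.
geno = np.array([[2, 0],
                 [1, 0],
                 [2, 1],
                 [2, np.nan]])

# Frequency of allele A per locus, ignoring missing genotypes.
freq_a = np.nanmean(geno, axis=0) / 2.0
# Minor allele frequency is the smaller of the two allele frequencies.
maf = np.minimum(freq_a, 1.0 - freq_a)

# Gene content = number of minor alleles. Where A is the major allele,
# count the other allele instead (2 minus the A count).
gene_content = np.where(freq_a > 0.5, 2.0 - geno, geno)

# Optional -1, 0, 1 coding for matrix calculations.
centered = gene_content - 1.0

print("MAF per locus:", maf)
print(gene_content)
```

Missing genotypes stay missing at this stage; they are what the imputation step fills in.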
We will follow the method of Gengler and colleagues from 2007 for imputation of missing genotypes.
This method follows the logic of genetic covariance among relatives.
In other words, the covariance between gene content is proportional to the additive relationship between animals, or in our case, trees.
Genetic covariance arises because two related individuals have alleles that are identical by descent.
In other words, both copies of an allele can be traced back to a single copy in a recent common ancestor.
For example, the mother receives half of its genes from each of its parents.
If the maternal genotype is not known, but the maternal grand-sire has been genotyped,
the expected value of lowercase q for the mother
is the population mean plus one half of d subscript mgs,
where d subscript mgs is the deviation of the maternal grand-sire's gene content from the population mean.
In the previous slide, we described how relatedness
among trees can be used to predict the gene content of trees with missing genotypes.
The calculations in the previous slide are done by solving mixed model equations as shown in the above slide
to predict missing genotypes.
These predicted imputed genotypes are BLUPs.
y is a vector of allele content of trees like this:
y = [0 1 0 . 1 2] for six trees.
For the fourth individual, the genotype is missing (shown as a dot).
But it is going to be estimated using information coming from parents and other relatives, for example, sibs.
M is the design matrix connecting trees to the allele content number vector y.
e is the error variance; M’ (M prime) is the transpose of the M matrix.
1 is a vector of 1s used to fit the mean allele content, and 1’ (one prime) is its transpose.
d sub y is the vector of allele content deviations for trees with genotype records.
d sub x is the vector of gene content deviations for unobserved trees.
A is the additive relationship matrix derived from pedigrees.
The A matrix is composed of four different sub-matrices as follows:
A sub yy is the sub-matrix of covariances among trees with known genotypes.
A sub xx is the sub-matrix of covariances among trees with unknown genotypes.
A sub xy and A sub yx are the covariances between trees with known genotypes and trees with unknown genotypes.
Epsilon (ε) is the shrinkage factor, which is the ratio of the error variance
(sigma-squared sub e) to the genetic variance (sigma-squared sub a) explained by the markers.
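The idea of predicting gene content from relatives can be sketched with the partitioned A matrix. This is a simplified illustration, not the full mixed model equations of the method: it assumes the error variance is negligible and estimates the mean as the simple average of observed gene contents. The pedigree, genotypes, and variable names are hypothetical.

```python
import numpy as np

# Hypothetical relationship matrix for five trees: trees 1 and 2 are
# unrelated parents, trees 3 and 5 are offspring of tree 1,
# tree 4 is an offspring of tree 2.
A = np.array([[1.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0, 0.0, 0.25],
              [0.0, 0.5, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.25, 0.0, 1.0]])

known = [0, 1, 2, 3]  # trees with genotype records
unknown = [4]         # tree 5 has a missing genotype
y_obs = np.array([1.0, 0.0, 1.0, 0.0])  # made-up gene contents

mu = y_obs.mean()  # simplification: plain average as the mean
d_y = y_obs - mu   # deviations for genotyped trees

A_yy = A[np.ix_(known, known)]
A_xy = A[np.ix_(unknown, known)]

# Predicted deviation for ungenotyped trees: A_xy * A_yy^-1 * d_y
d_x = A_xy @ np.linalg.solve(A_yy, d_y)
imputed = mu + d_x

print("predicted gene content for tree 5:", imputed)
```

The prediction is a continuous value between the possible genotype scores, and here it is driven by the genotype of the mother (tree 1).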
When we solve the mixed model equations (as shown previously),
we obtain allele content. The predictions
shown in the slide are predicted allele content number for each tree.
Normally the allele content number would be
0 (homozygous, no minor allele),
1 (heterozygous, or one copy of minor allele),
2 (homozygous, or two copies of minor allele).
However, solutions from the mixed models are continuous
because of the nature of predictions from solving mixed models.
For example, the predicted gene content number for treeID 2
is 0.3504.
Having continuous gene content number is not a problem for most software to predict breeding values of trees.
This method is efficient when the population
has relatedness among the members because
information from relatives is
used to predict marker genotype for trees.
When we plot the predicted gene content
number we see the frequency of trees having gene content number
close to 0, 1, or 2 as shown in this slide.
After imputing missing genotypes,
we would have a complete table of trees and marker genotypes. Rows would be treeID
and columns would be genotype or gene content number.
Now we are ready to use the new predicted genotypes in
predictions of breeding values in mixed models as described in earlier slides.
We are now switching to a fundamentally different way of using markers to predict breeding values.
The method is based on using markers to construct genetic relationships between individuals and using
the genomic relationship matrix in a BLUP approach to predict breeding values.
When we use pedigree relationships to make genetic evaluations, we assume that full-sibs on average share 50% of their DNA.
In reality,
they may share 45% or 55%,
or a proportion that deviates even further from 50% in either direction,
because each sib inherits a different mixture of alleles from the parents.
This is the Mendelian segregation effect.
With the large number of markers now available,
we can estimate the number of alleles shared by individuals instead of assuming constants
(for example, a constant of 0.5 for full-sibs
or 0.25 for half-sibs).
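To see why the realized sharing between full-sibs scatters around 0.5, here is a small simulation sketch. It assumes the genome can be treated as 50 independently inherited segments, which is a simplification, and the function name realized_sharing is hypothetical.

```python
import random

def realized_sharing(n_segments, rng):
    """Proportion of parental haplotypes two full-sibs share by descent.

    For every segment each sib inherits one of two haplotypes from each parent.
    Segments are treated as independent, which is a simplifying assumption.
    """
    shared = 0
    for _ in range(n_segments):
        # Maternal side: do both sibs receive the same maternal haplotype?
        shared += rng.randint(0, 1) == rng.randint(0, 1)
        # Paternal side: same question for the paternal haplotype.
        shared += rng.randint(0, 1) == rng.randint(0, 1)
    return shared / (2 * n_segments)

rng = random.Random(1)  # fixed seed so the run is repeatable
shares = [realized_sharing(50, rng) for _ in range(200)]
mean_share = sum(shares) / len(shares)
print(f"mean {mean_share:.3f}, min {min(shares):.2f}, max {max(shares):.2f}")
```

The average over many sib pairs stays close to 0.5, while individual pairs fall on either side of it. That spread is the information a pedigree-based constant cannot capture.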
Determining genomic relationships from markers is based on the proportion of chromosome segments shared by individuals.
This includes identification of genes that are identical by state.
In the next few slides, we will introduce the Genomic-BLUP and cover
the concept of genetic similarity.
Before tackling genomic relationship
derived from marker genotypes, it is useful to first introduce the concept of genetic similarity between individuals using markers.
In the above picture, we expect that the two mice on the right have more alleles
that are identical (either by descent or state)
than the mice on the left. Because they share more, or
fewer, alleles that are identical,
individuals are phenotypically more or less similar.
There are many methods proposed to obtain
genetic similarities between individuals using markers.
They can be categorized as follows:
the frequency method, the regression method,
and the normalized method.
For ease of illustration we give an example of the frequency method.
In this slide, the M matrix is
the gene content number of trees. Trees are in the rows and
gene content for each marker is in the columns.
Tree1 has 2 homozygous loci
(loci 1 and 3),
Tree2 has no homozygous loci, and
Tree3 has 3 homozygous loci.
When we multiply the M matrix with its transpose,
we obtain the MM prime matrix.
The MM prime matrix is important because we obtain the number
(frequency) of shared alleles between individuals from this matrix.
These frequencies are then used to calculate the genomic relationship matrix, G.
The G matrix is then used to predict breeding values
of trees with no phenotype.
The elements of MM prime are:
1. Tree1 and Tree2 share no alleles
because the off-diagonal element between the two trees is 0.
2. Tree1 and Tree3 share 2 alleles.
3. Tree2 and Tree3 share 0 alleles.
The frequencies of alleles shared by relatives in the MM prime matrix are
weighted by the allele frequencies in the
population to obtain the final coefficients
of genomic relationships, as we shall demonstrate in the next slide.
To estimate genomic relationships between trees based on markers
(G matrix), we need to have allele frequencies
in addition to the number of alleles shared by trees
(MM prime matrix), as shown in the previous slide.
The P matrix written in the first step
is the matrix of allele frequencies.
The elements of the P matrix in the slide
are allele frequencies for three loci.
Allele frequencies are expressed as 2(p sub i - 0.5),
where p sub i is the minor allele frequency.
The second step to obtain the G matrix is to
subtract the P matrix from the M matrix to obtain the Z matrix.
Subtraction of P from M has certain advantages.
First, it sets the mean values of the allele effects to 0.
Second,
subtraction of P gives more credit to rare alleles than to
common alleles when calculating genomic relationships.
Third, the genomic inbreeding coefficient is greater
if the individual is homozygous for rare alleles
than if it is homozygous for common alleles.
In the third step, the ZZ
prime matrix is divided by 2
times the sum over all loci of p sub i (1 - p sub i)
to scale G to be analogous to the numerator relationship matrix (A).
p sub i are the observed minor allele frequencies of all genotyped subjects regardless of inbreeding and selection.
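The three steps can be sketched as follows. The M matrix is hypothetical (the slide's values are not in the transcript) and assumes a -1 / 0 / 1 coding, so that p sub i is the observed frequency of the allele the coding counts. With only three trees and three loci the resulting coefficients are not meaningful in themselves; the point is the sequence of operations.

```python
# Hypothetical gene content matrix (trees in rows, loci in columns),
# assumed to be coded -1 / 0 / 1.
M = [[1, 0, -1], [0, 0, 0], [1, 1, -1]]
n_trees, n_loci = len(M), len(M[0])

# Observed allele frequency per locus. Under -1/0/1 coding the column mean
# equals 2 * (p_i - 0.5), so p_i = (mean + 1) / 2.
col_means = [sum(row[j] for row in M) / n_trees for j in range(n_loci)]
p = [(m + 1) / 2 for m in col_means]

# Step 1: every row of P holds 2 * (p_i - 0.5) for each locus.
P_row = [2 * (pi - 0.5) for pi in p]

# Step 2: Z = M - P, which centres each locus at zero.
Z = [[M[i][j] - P_row[j] for j in range(n_loci)] for i in range(n_trees)]

# Step 3: G = ZZ' / (2 * sum of p_i * (1 - p_i)).
scale = 2 * sum(pi * (1 - pi) for pi in p)
G = [[sum(Z[i][k] * Z[j][k] for k in range(n_loci)) / scale
      for j in range(n_trees)] for i in range(n_trees)]

for row in G:
    print([round(g, 3) for g in row])
```

Because the frequencies here are computed from the same trees, each column of Z sums to zero, which is the first advantage listed above.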
Genetic relationships
at individual loci are important for quantifying covariances between
individuals for the effects of a QTL at that locus,
so that marker data can be incorporated in genetic evaluation
by marker-assisted BLUP.
With these new tools we can incorporate non-additive genetic effects
in BLUPs to estimate BVs.