  • The goal of this module is to cover the fundamentals of

  • marker aided selection in forest trees, concentrating on integration

  • of markers into basic quantitative genetic theory.

  • In previous modules we have discussed when and/or where in the tree

  • breeding cycle markers may be employed to improve breeding efficiency.

  • That is, where marker-assisted selection

  • (MAS) would be beneficial.

  • In this module, the author introduces mathematical approaches to incorporation

  • of marker information in the selection process.

  • This is done by carefully defining the mixed linear

  • models that capture random genetic variables and fixed

  • environmental variables influencing the trait of interest.

  • The models are expressed and solved using

  • matrix algebra. Familiarity with matrix algebra

  • will significantly enhance your understanding of the module’s content.

  • This module is organized in four sections

  • and begins with a discussion of the best linear unbiased

  • prediction (BLUP) of breeding values using a

  • numerical relationship matrix, called the A matrix.

  • This matrix is usually derived from pedigrees of

  • individuals. The relationships are sometimes called additive

  • genetic covariances. The second section

  • focuses on incorporation of markers in models and the various approaches

  • to marker aided selection we have introduced in previous modules.

  • The third section addresses a key issue

  • related to the use of markers for selection, namely, how missing

  • genotypes are inferred or imputed so that breeding values

  • can be calculated. The final section is about the

  • realized genetic matrix for G-BLUP.

  • The genetic relationship matrix based on the markers is called

  • the G matrix, and BLUP analysis based on the G matrix is

  • called G-BLUP. We will start, however, with

  • definitions for a number of terms.

  • When an individual is crossed,

  • either as a female or a male, with a number of other parents,

  • we measure the performance of the progeny of those crosses

  • and estimate a mean value of that parent.

  • This is typically done for trees used as females with many pollen donors.

  • The deviation of the parent mean (X) from

  • the population mean (μ)

  • is called the general combining ability (GCA) of that tree.

  • We can use a linear model to define the GCA of a tree

  • as y sub i is equal to μ

  • plus GCA plus error.

  • GCA can be defined as the expected value of offspring from a given parent

  • after it has been crossed with many pollen parents.

  • The GCA of a given tree is

  • half of the parental additive genetic value

  • (½ a sub f).

  • The other half, unmeasured here, comes from the pollen parents.
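
The GCA definition above can be sketched with raw progeny means. This is only an illustration under stated assumptions: the progeny heights and parent names are hypothetical, and a real analysis would use the mixed model (BLUP) approach described later rather than simple means. Doubling the GCA follows from the statement that GCA is half of the parental additive value.

```python
from statistics import mean

# Hypothetical progeny heights (not from the module), grouped by female parent.
progeny_heights = {
    "female_1": [12.0, 14.0, 13.0],
    "female_2": [9.0, 11.0, 10.0],
}

def gca_and_breeding_values(progeny_by_parent):
    """Return {parent: (GCA, BV)} computed from raw progeny records."""
    all_records = [y for ys in progeny_by_parent.values() for y in ys]
    population_mean = mean(all_records)          # mu
    results = {}
    for parent, ys in progeny_by_parent.items():
        gca = mean(ys) - population_mean         # deviation of parent mean from mu
        results[parent] = (gca, 2.0 * gca)       # GCA is half the additive value
    return results

results = gca_and_breeding_values(progeny_heights)
print(results)
```

With these made-up records, female 1 sits above the population mean and female 2 below it by the same amount.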

  • The breeding value (BV)

  • is the value of genes transmitted to progeny.

  • We can define breeding value of a tree in a linear model as the average

  • parental breeding values of the male (a sub m) and the

  • female (a sub f), the fixed effects

  • (that is, the intercept), and the error (e sub i).

  • The genetic value is the value of genes to the parent tree itself.

  • It includes both the additive and non-additive

  • (dominance) effects. Dominance effects cannot

  • be passed on to progeny by breeding.

  • The difference between the genetic and the breeding value is, therefore, largely dominance

  • deviation (assuming epistatic effects are negligible).

  • We can use a linear model to define genetic values,

  • as the average of parental breeding values, the Mendelian sampling

  • (m sub i), and the error variance.

  • Mendelian sampling is the deviation of offspring from mid-parent breeding values.

  • In other words,

  • offspring from a cross deviate from the mid-parent breeding values because of

  • random sampling of parental genes caused by segregation and assortment.

  • They receive different sets of genes from parents,

  • which affects their phenotype (deviation from

  • mid-parental mean).
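
The three linear models described in words above can be written side by side. The notation is a sketch: the subscripts follow the spoken symbols (a sub f, a sub m, m sub i, e sub i), and including the mean μ in the third model is an assumption, since the narration does not state it.

```latex
\begin{align}
\text{GCA model:} \quad & y_i = \mu + \mathrm{GCA}_f + e_i,
  \qquad \mathrm{GCA}_f = \tfrac{1}{2}\,a_f \\
\text{Breeding value model:} \quad & y_i = \mu + \tfrac{1}{2}\,(a_m + a_f) + e_i \\
\text{Genetic value model:} \quad & y_i = \mu + \tfrac{1}{2}\,(a_m + a_f) + m_i + e_i
\end{align}
```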

  • There are two approaches to calculating breeding values (BVs)

  • based on progeny test information that is available for a sample of parents.

  • We may fit a general combining ability model

  • (GCA, or parental model)

  • to obtain the parental GCA value and

  • multiply this estimate by two to get breeding values.

  • These models are easy to run in programs like SAS Statistical Software

  • because there are relatively few mixed model equations

  • to be solved. If we need to estimate breeding values of genetic groups, parents,

  • and progeny simultaneously,

  • individual-tree models are preferred. These models

  • are called “animal models” in the animal breeding literature.

  • In the individual-tree models, trees are no longer independent.

  • Progeny from the same female are correlated

  • (that is, they are genetically related).

  • The individual-tree models rely on both additive genetic variance

  • and a matrix of information that

  • describes the relationship of every tree to every other tree

  • (both parents and progeny) in the data set.

  • This is a key feature of the BLUP approach to estimating breeding values.

  • This approach, as noted in module 5, is desired in programs

  • that have advanced beyond one or two generations of breeding.

  • So, let’s look at the details of the BLUP approach to see how general combining ability,

  • breeding value, and genetic value are calculated.

  • The traditional approach of using the BLUP procedure to predict

  • breeding values is based on individual tree phenotypes

  • and the genetic relationship matrix (A)

  • derived from the individual tree pedigrees. The A matrix must be known,

  • and can be estimated if enough genetic markers exist.

  • Best linear unbiased prediction (BLUP)

  • is performed using matrix algebra. While this may

  • be foreign to many of you, bear with the presentation to see if you can extract

  • the essence of the analysis. The type

  • of model used will depend on the nature of the trait being measured. We use

  • linear mixed models to predict breeding values

  • of continuous response variables, such as growth,

  • and generalized linear mixed models for binary traits, such as disease

  • incidence. The statistical model chosen

  • is the foundation of progeny test analysis and must be defined with utmost care.

  • We will now take a few moments to describe

  • each basic element in a linear mixed model.

  • y is the n by 1 column vector

  • of phenotypic observations

  • (think of a vector like a column of data for trees).

  • “n by 1” represents the dimensions of the column vector, where n is the

  • number of trees (that is, the number of rows) and 1

  • is the number of columns. X is the

  • design matrix that relates elements of the fixed effect

  • vector b to the vector y.

  • b is the p by 1 column vector of fixed effects

  • (for example, the intercept, sites, and

  • blocks within sites), where p is the number of rows

  • and 1 is the number of columns.

  • These are non-genetic factors that contribute to the phenotype observed.

  • Z is the design matrix that relates

  • elements of the a vector to y.

  • a is the q by 1 column vector of random (that is, genetic) effects

  • (for family and family by site interaction, for instance).

  • q is the number of levels

  • of random effects or number of trees in the data.

  • e is the n by 1 column vector of random residuals,

  • where n is the number

  • of observations (trees).

  • Although these terms likely seem abstract to you now,

  • hopefully they will become clearer after our example on the next slide.

  • To achieve a thorough understanding of linear mixed models in the

  • context of predicting breeding values, you will likely need to

  • review this slide and the next slide several times.

  • In this slide,

  • we adapt the linear mixed model described in the previous slide

  • to develop a model for predicting breeding values.

  • Suppose that we have measured height of five trees grown in two different locations

  • (L1 and L2).

  • We can assume that the trees come from a large population, which is a reasonable assumption,

  • and therefore, we will treat the trees as random.

  • The linear mixed model is shown in a standard statistical format

  • (MODEL 1) as (y sub ij

  • = l sub i + t sub j +

  • e sub ij) where y sub ij is the

  • j-th tree height measured in the i-th location;

  • li is the i-th location effect, tj is the j-th tree effect (breeding value)

  • and eij is the error term associated with the j-th tree in location i.

  • The same model is shown in compact matrix notation as MODEL 2.

  • In reality we may have many more fixed and random terms and writing linear models in statistical format makes them much longer.

  • The other advantage of matrix format is that it is easier to talk about the assumptions of the model

  • and it is easier to describe different variance-

  • covariance structures of the model for the matrix format.

  • The full matrix format of the same model is given in MODEL 3.

  • This is like taking the x-ray pictures of matrices (X and Z)

  • and vectors (y, b, a, and e).

  • We will now describe each element of the mixed model

  • as it applies to predicting breeding values.

  • y is a 5 by 1 column vector of height observations

  • where 5 is the number of trees.

  • This is the number of rows of the vector.

  • X is the design matrix that relates the

  • fixed location effects (b) to the height observations (y).

  • For example, the first column of X is for location 1

  • and the second column is for location 2.

  • Looking at the first column of X matrix

  • for location 1 shows that observations 87 and 90 are

  • coming from location 1 (because they have element 1 instead of 0).

  • b is the 2 by 1 column vector of fixed effects.

  • These are non-genetic factors that contribute to the phenotype observed.

  • In this example, we treat location as a fixed effect,

  • thus b has two rows, one for each location.

  • Those are the coefficients we solve for fixed effects.

  • Z is the design matrix that relates height observations to random tree effects.

  • a is the 5 by 1 column vector of random, or genetic, effects

  • (here, the tree effects; in other models, family and family by site interaction).

  • This is a column of random effects for which we want to estimate breeding values.

  • And e is the vector of residuals

  • associated with each measurement (and tree).
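
A minimal sketch of how the design matrices for the five-tree example could be built. Only the heights 87 and 90 at location 1 come from the narration; the other three heights, the record order, and the tree labels are hypothetical.

```python
# Heights 87 and 90 at location 1 follow the narration; the remaining
# heights and their assignment to location 2 are made up for illustration.
records = [
    ("L1", "tree1", 87.0),
    ("L1", "tree2", 90.0),
    ("L2", "tree3", 95.0),
    ("L2", "tree4", 101.0),
    ("L2", "tree5", 98.0),
]

def incidence_matrix(labels, levels):
    """One row per record, one column per level; 1 where the record has that level."""
    return [[1 if label == level else 0 for level in levels] for label in labels]

locations = ["L1", "L2"]
trees = ["tree1", "tree2", "tree3", "tree4", "tree5"]

y = [height for _, _, height in records]                          # 5 x 1 observations
X = incidence_matrix([loc for loc, _, _ in records], locations)   # 5 x 2, fixed effects
Z = incidence_matrix([tree for _, tree, _ in records], trees)     # 5 x 5, random effects

for row in X:
    print(row)
```

Because each tree has a single record here, Z is an identity matrix; with repeated measurements or progeny records it would have more rows than columns.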

  • We assume that all trees have the same error variance

  • and that errors are independent and have a normal distribution.

  • In matrix notation, error variance

  • (the R matrix) can be expressed as

  • R = I sub n multiplied by sigma squared sub e,

  • where I sub n is the identity matrix with n by n dimensions

  • (and n is the number of observations)

  • and sigma squared sub e is the error variance calculated from data.

  • The diagonals of the identity matrix are all 1

  • and off-diagonal elements are all 0.

  • We also assume that the random effects are associated

  • with a variance and that the variance is homogeneous (that is, the same for all).

  • In matrix notation,

  • variance of the random effects (the G matrix)

  • can be expressed as G

  • = I sub t multiplied by sigma squared sub a,

  • where I sub t is the identity matrix with t by t dimensions

  • (and t is the number of random effects)

  • and sigma squared sub a is the variance of the random effects.

  • The diagonals of the I sub t matrix are all 1 and off-diagonal elements are all 0.

  • Then we can define the variance of observations as the sum of two variances

  • (the error variance and the random effects variance).

  • The variance of the y vector is equal to V,

  • which is equal to the Z matrix

  • (that is, the matrix that relates the phenotypic observations to the random effects),

  • multiplied by the G matrix,

  • multiplied by the transposed Z matrix,

  • plus the error variance (the R matrix).
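
The variance structure just described can be written compactly. The symbol σ²ₐ for the variance of the random effects is an assumption here, chosen to match the notation used later in the module.

```latex
\begin{align}
\operatorname{Var}(\mathbf{a}) &= \mathbf{G} = \mathbf{I}_t\,\sigma^2_a,
\qquad
\operatorname{Var}(\mathbf{e}) = \mathbf{R} = \mathbf{I}_n\,\sigma^2_e \\
\operatorname{Var}(\mathbf{y}) &= \mathbf{V}
  = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}
  = \mathbf{Z}\mathbf{Z}'\,\sigma^2_a + \mathbf{I}_n\,\sigma^2_e
\end{align}
```

In the five-tree example each tree has one record, so Z is an identity matrix and V reduces to a diagonal matrix with σ²ₐ + σ²ₑ on the diagonal.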

  • It is important to keep in mind that the above assumptions may not hold.

  • For example, if we have two locations,

  • they may have two different growth rates and thus the error variance might be different (heterogeneous).

  • In addition, errors associated with observations might be correlated.

  • Similarly, genetic variance might be different in different locations.

  • To solve the linear mixed model given in the previous slide,

  • the model is presented in matrix form.

  • Working out the solutions yourself will take in-depth knowledge of matrix algebra, but

  • try to understand the overall concept.

  • In this model, X is the matrix that relates fixed effects to y,

  • X prime is the transposed

  • X matrix, G is the genetic variance-covariance matrix,

  • G to the power of -1 is the inverse of G,

  • R is the error variance-covariance matrix,

  • and R to the power of -1

  • is the inverse of R. The vectors a

  • (random effects) and b (fixed effects)

  • are as described previously.

  • When the R matrix has a simple structure (that is, errors are normally

  • and independently distributed), R can be factored out

  • to provide the simplified model shown here.

  • The solutions for fixed effect factors

  • (e.g., location or site effects)

  • are called best linear unbiased estimates (BLUE).

  • The solutions for random effects are called best linear

  • unbiased predictions (BLUP).
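
Since the slide itself is not reproduced in this transcript, the standard form of Henderson's mixed model equations is sketched below, using the symbols defined in the narration. The second form assumes R = I σ²ₑ, so that σ²ₑ can be factored out.

```latex
\begin{equation}
\begin{bmatrix}
X'R^{-1}X & X'R^{-1}Z \\
Z'R^{-1}X & Z'R^{-1}Z + G^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}
\end{equation}

\begin{equation}
\begin{bmatrix}
X'X & X'Z \\
Z'X & Z'Z + G^{-1}\sigma^2_e
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\end{equation}
```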

  • In the mixed model equations, the inverse of the G matrix of random effects is

  • replaced with the inverse of the additive genetic relationship matrix A

  • (that is, A to the power of -1).

  • This is a powerful tool in BLUPs

  • to predict breeding values using all the information from relatives,

  • and thus increase the accuracy.

  • The inverse of the A matrix in the mixed model is multiplied by the ratio

  • of sigma squared sub e

  • (error variance) and sigma squared sub a (additive genetic variance).

  • These variances are assumed unknown

  • and are calculated from data using the restricted maximum likelihood method

  • (REML).

  • The ratio of these two variances is sometimes called the shrinkage factor.

  • It is inversely related to heritability: the lower the heritability, the larger the ratio.

  • For low heritability traits,

  • the predicted values shrink towards the mean

  • more when compared to traits with high heritability.
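
The shrinkage effect can be illustrated by solving a small set of mixed model equations. This is a sketch under simplifying assumptions that are not in the module: trees are unrelated (A is an identity matrix), each tree has one record, the only fixed effect is an overall mean, and the variance ratio is taken as (1 − h²)/h², which holds when additive and residual variance are the only components. The heights are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(v)] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def blup_unrelated(y, h2):
    """Overall mean plus one random effect per tree, assuming A = I."""
    n = len(y)
    lam = (1.0 - h2) / h2                      # sigma2_e / sigma2_a
    # Mixed model equations: [X'X  X'Z; Z'X  Z'Z + I*lam] [b; a] = [X'y; Z'y]
    lhs = [[float(n)] + [1.0] * n]
    for i in range(n):
        lhs.append([1.0] + [(1.0 + lam) if j == i else 0.0 for j in range(n)])
    rhs = [sum(y)] + list(y)
    solution = solve(lhs, rhs)
    return solution[0], solution[1:]           # (BLUE of mean, BLUPs of tree effects)

heights = [87.0, 90.0, 95.0, 101.0, 98.0]      # hypothetical values
mean_hi, blup_hi = blup_unrelated(heights, h2=0.8)
mean_lo, blup_lo = blup_unrelated(heights, h2=0.2)
print(blup_hi)
print(blup_lo)
```

In this special case each prediction equals the heritability times the tree's deviation from the mean, so the predictions at a heritability of 0.2 are pulled much closer to zero than those at 0.8.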

  • In this slide we have presented an example of a simple additive genetic relationship

  • matrix based on a pedigree of five trees.

  • Trees that are not genetically related have 0 coefficients.

  • For example, trees 1 and 2,

  • or 4 and 5.

  • Since trees 3 and 5 are half-sibs

  • (they have female 1 as their common mother),

  • the coefficient of additive genetic relationships between 3 and 5 is 0.25.

  • The relationship between tree 4 and its parent, female tree 2,

  • is 0.5.

  • Similarly, the coefficient between trees 3 and 5

  • with their common mother tree is 0.5.

  • These coefficients are based on the probability of alleles being

  • identical by descent.
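
The coefficients quoted above can be reproduced with the tabular method for building an additive relationship matrix. This sketch assumes that the pollen parents are unknown and unrelated, which the narration does not state explicitly; the function name is hypothetical.

```python
def additive_relationship_matrix(pedigree):
    """Tabular method. pedigree is a list of (mother, father) index pairs
    (0-based, None if unknown), ordered so parents precede their progeny."""
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (dam, sire) in enumerate(pedigree):
        # Diagonal: 1 plus the inbreeding coefficient (half the parents' relationship).
        both_known = dam is not None and sire is not None
        A[i][i] = 1.0 + (0.5 * A[dam][sire] if both_known else 0.0)
        for j in range(i):
            # Off-diagonal: average relationship of tree j with the parents of tree i.
            rel = 0.0
            if dam is not None:
                rel += 0.5 * A[j][dam]
            if sire is not None:
                rel += 0.5 * A[j][sire]
            A[i][j] = A[j][i] = rel
    return A

# Trees 1 and 2 are founders; trees 3 and 5 share mother 1; tree 4 has mother 2.
# Pollen parents are treated as unknown (an assumption).
pedigree = [(None, None), (None, None), (0, None), (1, None), (0, None)]
A = additive_relationship_matrix(pedigree)
for row in A:
    print(row)
```

The printed matrix has 0.25 between trees 3 and 5, 0.5 between each progeny and its mother, and 0 between unrelated trees, matching the values in the narration.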

  • As noted previously,

  • the additive genetic relationship matrix is the backbone of parental

  • and individual tree models. Variances and means are seldom constant;

  • they change from one generation to another because we improve the population.

  • The A matrix,

  • as it is called, is typically constructed from known pedigrees.

  • The introduction of numerous, low cost markers may eventually

  • improve our ability to define the level of kinship

  • between trees by looking at the proportions of shared alleles.

  • As we shall see in a few slides,

  • using markers to calculate the proportion of alleles that are identical by descent

  • (IBD)

  • between parents and progeny is a powerful tool for improving breeding value estimates.

  • We now shift attention back to the general combining ability

  • model first introduced on slide 5.

  • You may recall that the GCA,

  • or parental, model may be used to predict parental breeding values.

  • When a GCA model is fit, the objective

  • is to obtain GCA values of parents which will be used for backward selection

  • (selection of the best parents).

  • Such selection might be useful, say, for thinning out seed orchards to the best parents.

  • Within family selection is not considered here.

  • Note that the equation shown on this slide

  • is the same equation first presented on slide 10.

  • The b part of the vector in the mixed model

  • equation includes the solutions for fixed effects,

  • such as the intercept, sites, age, and so forth.

  • The solutions for fixed effects are best linear unbiased estimates

  • (BLUE).

  • The solutions for the a vector are the best linear unbiased predictions

  • (BLUP) for random effects. We sometimes call the large matrix

  • (marked with the red square) a C matrix.

  • The diagonal elements of the inverse of the C matrix provide the variances associated with the predictions.

  • Parents with a large number of progeny

  • would have small variances (that is, they would be estimated

  • quite accurately). The equation given in this slide

  • shows how the mixed model equations are solved for fixed effect

  • (intercept) and random effects (e.g. GCA or BV)

  • in the linear mixed model given in slide 3.

  • We shall now shift our attention to the individual tree model first introduced in slide 5.

  • Recall that when a breeding value

  • model for individual trees is fit,

  • the objective is to obtain the breeding value of grandparents, genetic groups,

  • parents, and offspring simultaneously.

  • The a vector now contains values for all genetic groups of interest.

  • The BLUP solutions of mixed model equations for the parents,

  • offspring, or grandparents are breeding values.

  • There is no need for extra calculation to obtain breeding values.

  • Selection is subsequently completed,

  • typically, by taking into consideration the performance of both the family

  • and the individual offspring within family,

  • with carefully defined weightings for each.

  • This is called an index selection strategy,

  • and it is used to create advanced breeding populations.

  • This is defined as forward selection.

  • In order to obtain genetic values,

  • we need to have full-sib family structure or cloned progeny.

  • Since GCA and breeding values are predictions they are associated with an error.

  • Breeders are interested in knowing the degree of reliability of these predictions when making decisions.

  • There is more than one statistic used to measure the accuracy or reliability of breeding values.

  • The most common statistic used to measure their reliability is the accuracy value.

  • The correlation between true and predicted breeding values is called the accuracy of breeding values (BV).

  • We can use the standard error of a prediction to calculate accuracy.

  • In the formula shown on the slide,

  • SE is the standard error of the prediction,

  • F is the inbreeding coefficient,

  • and sigma squared sub A is the additive genetic variance.

  • The additive genetic variance is the variance of the true breeding values (BV).

  • For non-related trees, F is zero.
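
The slide formula is not reproduced in this transcript. A commonly used form, assumed here, is accuracy = sqrt(1 − SE² / ((1 + F) σ²A)), which uses exactly the three quantities named above. The function name and input values below are hypothetical.

```python
import math

def bv_accuracy(se, var_a, f=0.0):
    """Accuracy r = sqrt(1 - SE^2 / ((1 + F) * sigma2_A)) (assumed form)."""
    ratio = se ** 2 / ((1.0 + f) * var_a)
    return math.sqrt(max(0.0, 1.0 - ratio))   # clamp tiny negatives from rounding

# Hypothetical inputs: standard error 0.6, additive variance 1.0, no inbreeding.
r = bv_accuracy(se=0.6, var_a=1.0, f=0.0)
print(round(r, 3))
```

A standard error of zero gives an accuracy of 1, and the accuracy falls toward 0 as the squared standard error approaches the additive genetic variance.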

  • During this portion of the module, we focus on marker aided selection (MAS).

  • In previous modules we introduced the concepts of Linkage Equilibrium (LE) MAS,

  • Linkage Disequilibrium (LD) MAS,

  • and Gene MAS to describe the distinction between the proximity,

  • or relationship between a marker

  • and the causal mutation affecting a trait phenotype.

  • Here we review the concept and illustrate the expectations at the population level.

  • The optimal condition for marker based selection would be Gene MAS,

  • where the marker is actually the causal mutation.

  • Unfortunately, such markers are rarely found,

  • and targeting them would be time-consuming and costly.

  • In the population, there is only one option under Gene MAS -

  • the allele (genotype) that you see reflects the phenotype you will get all the time.

  • With association genetic approaches to selection

  • we are realistically targeting markers that are in tight linkage disequilibrium with the causal mutation.

  • In the diagram shown here you can see that the marker,

  • depicted by the uppercase M and lowercase m alleles,

  • is always in the same allelic configuration

  • with the causal mutation depicted by the uppercase Q and lower case q alleles.

  • Depending on the extent of LD,

  • the presence of the marker allele will, more or less, always reflect the actual genotype.

  • As can be seen above,

  • with Linkage Equilibrium (LE) MAS,

  • the correlation between marker and causal mutation alleles is poor at the population level.

  • While the desired correlation may be excellent within a family,

  • it is likely to disintegrate across families and have no utility for selection at the population level.

  • Finally, we introduce the concept of including markers in the selection process.

  • The linear statistical model to describe the variance in a

  • trait of interest (say, y)

  • is depicted in its simplest form where a phenotype is described by the population mean,

  • a marker effect (M), and the residual error.

  • For the remainder of this module

  • let us assume that we are dealing with markers that are in strong LD with causal mutations.

  • An LD marker will have a measurable effect on the trait,

  • though it may be very small. In the above DNA sequence,

  • all the nucleotides are the same, except at the base position highlighted with red letters

  • (A/C) and a rectangle.

  • By calculating the mean performance of individuals grouped by their SNP genotype at this locus,

  • we can estimate the effects of allelic substitution.

  • In this case, there appears to be a completely additive genetic effect of

  • substituting the nucleotide base C with the base A.

  • That is, for each A allele there is a plus

  • five increase in the trait being measured.

  • The relevance of the difference depends on the scale of measurement,

  • the proportionate increase, and the value of each point of increase.

  • Several possibilities exist for how the marker genotype

  • may be fit in the linear model to predict breeding values.

  • We assume that markers in question are in LD with the trait.

  • In the first approach, markers can be treated as covariates

  • (fixed effects) in the linear mixed model.

  • In the second approach, markers are considered as random effects in the linear mixed model.

  • In the third approach, markers are not fit as factors

  • in the model, but they are used to generate genomic relationships between individuals.

  • The matrix of relationships based on markers

  • is called the G matrix.

  • The A matrix based on pedigree that we discussed previously is

  • replaced by the G matrix in linear mixed models to obtain BVs.

  • The third approach is fundamentally different from the first two approaches.

  • We shall explain all three methods in the following slides.

  • When marker effects are considered fixed,

  • the marker can be fitted as a covariate.

  • Alternatively, the marker alleles may be fit as a class variable,

  • allowing for different means for each genotype.

  • This latter approach also allows one to model for dominance effects and

  • can be extended to fitting haplotypes

  • (that is, multi-locus SNP genotypes).

  • We draw your attention to the

  • marker effect m.

  • The m vector for a given marker has the elements of -1,

  • 0, or 1,

  • which designate the number of copies of a specific allele, say

  • allele “A”, carried by an individual.

  • If the individual is homozygous for the “A” allele,

  • that is, it has two copies of the A allele,

  • then it would receive a value of “one”.

  • A heterozygous individual would have a single copy of the “A” allele,

  • and would be assigned a value of “zero”.

  • A homozygote for an alternative “C” allele

  • would have no allele “A”, and would be scored as a “minus one”.

  • Now let’s look at the allelic substitution effect

  • more closely in the next slide.

  • The average effect of allelic substitution represents the average change in

  • phenotypic value when the A allele is

  • randomly substituted for the C allele. For example,

  • substituting C with A may increase the phenotype

  • mean by 5 units.

  • The additive effect can be estimated as the difference between the two homozygous means

  • divided by two.
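
The calculation can be sketched from genotype class means. The phenotypes below are hypothetical, chosen so that each A allele adds 5 units as in the example; the dominance deviation is included only to show that it is zero when the effect is completely additive.

```python
from statistics import mean

# Hypothetical phenotypes grouped by SNP genotype (each A allele adds 5 units).
phenotypes = {
    "CC": [9.0, 10.0, 11.0],
    "AC": [14.0, 15.0, 16.0],
    "AA": [19.0, 20.0, 21.0],
}

def additive_and_dominance(by_genotype):
    """Return (additive effect, dominance deviation) from genotype means."""
    m_cc, m_ac, m_aa = (mean(by_genotype[g]) for g in ("CC", "AC", "AA"))
    additive = (m_aa - m_cc) / 2.0                 # half the homozygote difference
    dominance = m_ac - (m_aa + m_cc) / 2.0         # heterozygote vs. homozygote midpoint
    return additive, dominance

a, d = additive_and_dominance(phenotypes)
print(a, d)
```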

  • If the marker-QTL effects are treated as fixed effects,

  • there is a strong tendency to overestimate them.

  • Fitting markers as fixed effects tends to lead to larger estimation errors.

  • For example,

  • SNPs with small minor allele frequency are

  • susceptible to this estimation error.

  • For this reason, and others, we may want to fit markers as random effects.

  • In this slide, the marker effect

  • (m) is fitted as a random effect in a mixed model.

  • We can shrink the estimate of marker effects

  • according to the amount of data used by treating markers

  • as a random effect. The less data there is to estimate m,

  • the more the estimate will shrink towards the mean.

  • In other words, fitting markers as random effects regresses,

  • or shrinks, estimates back to zero to account for the lack of information.
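
A small sketch of this shrinkage for a single marker. It solves the two mixed model equations for the model y = mean + x·m + e, where lam is the ratio of error variance to marker variance. The data, the value of lam, and the function name are hypothetical; setting lam to zero gives the fixed-effect (least squares) estimate for comparison.

```python
def marker_effect(x, y, lam):
    """Estimate (mean, marker effect) from y = mu + x*m + e.
    lam = sigma2_e / sigma2_m; lam = 0 gives the fixed-effect estimate."""
    n = len(y)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(v * w for v, w in zip(x, y))
    # Mixed model equations: [[n, sx], [sx, sxx + lam]] [mu, m]' = [sy, sxy]'
    det = n * (sxx + lam) - sx * sx
    mu = (sy * (sxx + lam) - sx * sxy) / det
    m = (n * sxy - sx * sy) / det
    return mu, m

# Hypothetical data: genotypes coded -1, 0, 1 and phenotypes with +5 per A allele.
x = [-1, 0, 1, -1, 0, 1]
y = [10.0, 15.0, 20.0, 10.0, 15.0, 20.0]

_, m_fixed = marker_effect(x, y, lam=0.0)                 # no shrinkage
_, m_random = marker_effect(x, y, lam=4.0)                # six trees
_, m_random_small = marker_effect(x[:3], y[:3], lam=4.0)  # only three trees
print(m_fixed, m_random, m_random_small)
```

With the same variance ratio, the estimate from three trees is pulled closer to zero than the estimate from six trees, which is the behavior described above.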

  • If the choice of variance explained by markers is appropriate,

  • then the resulting estimates are BLUPs.

  • Differences between random and fixed effects are small if the amount of data is large

  • (small error variance).

  • Furthermore, treating markers as random allows calculation of percent of phenotypic variation explained by markers.

  • In general, trees are genotyped using a large number of markers.

  • Using more markers provides more information to predict genetic merit of individuals.

  • We will have more confidence to rank them and make selections.

  • This is of particular importance because most of the traits we are interested in,

  • such as growth, are controlled by many genes.

  • Even if there is a large quantitative trait locus (QTL)

  • that is explaining a large proportion of phenotypic variance,

  • we still would like to use more markers to genotype trees for that particular QTL.

  • In this way, more than one marker is associated with a large QTL.

  • They all can be significant in association testing

  • (explaining the same variance).

  • The question is how to account for multiple markers in the prediction of genetic merit?

  • The simplest way is the multiple regression approach,

  • fitting all the markers simultaneously as given in this slide.

  • In the example, tree 1 has marker 2 but not

  • marker 1.

  • p is the number of significant markers from the association study

  • and g is the vector of markers.

  • Linear regression for multiple markers can be written in matrix form as we saw in slide 20.
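
The multiple regression on p markers can be written as follows. The symbol x with subscripts i and j, for the genotype code of tree i at marker j, is an assumption, since the slide notation is not reproduced in this transcript.

```latex
\begin{align}
y_i &= \mu + \sum_{j=1}^{p} x_{ij}\, g_j + e_i \\
\mathbf{y} &= \mathbf{1}\mu + \mathbf{X}\mathbf{g} + \mathbf{e}
\end{align}
```

Here X is the n by p matrix of genotype codes and g is the p by 1 vector of marker effects.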

  • For the sake of simplicity,

  • most of the models introduced in this module have been simplistic

  • in that they model the effects of only one or a few markers at a time.

  • In forest trees, application of a single marker in prediction of breeding values is not likely to occur.

  • Instead, with ever decreasing genotyping costs and

  • marker availability, breeding programs

  • are moving to utilize large numbers of markers

  • for genome-wide selection. Unfortunately,

  • as the number of markers increases, the number of missing data points also tends to increase.

  • It rapidly becomes economically and

  • logistically difficult to fill the missing data points

  • with new genotyping runs.

  • Removing records (trees in the test, for instance, or specific SNPs)

  • that do not have complete data sets will

  • likely result in a dramatic reduction in power of association of markers and traits -

  • so much so that the analyses are virtually meaningless.

  • Yet, predictions of the genetic merit of trees across markers require complete genotyping information or gene content.

  • It is therefore important to use efficient statistical methods to accurately infer missing genotypes.

  • The term “imputation” refers to

  • using a reference panel of haplotypes to replace

  • missing genotypes of a subsample of individuals.

  • We will discuss the concept and process of imputing missing data points

  • in the next few slides.

  • Human geneticists rely on genetic maps

  • and linkage disequilibrium information from nearby markers to replace missing genotypes.

  • Many software programs exist to impute missing data,

  • which rely on haplotype reference datasets,

  • such as HapMap, for population haplotypes or reference panels.

  • Most programs employ maximum likelihood or Bayesian methods to

  • predict missing genotypes.

  • They can be computationally intensive

  • and convergence can be a problem for large data sets.

  • Reference data sets for forest trees generally do not exist

  • since we do not yet have a complete sequenced genome.

  • Plus we do not know the order and location of markers on the genome of trees.

  • Consequently, methods developed by human geneticists do not work well for trees.

  • Thus, alternative approaches have been developed by animal breeders,

  • which can be employed in tree breeding. One of them is to use

  • pedigrees, that is, expected additive genetic relationships between trees.

  • We explain this concept in the next slide.

  • The minor allele is the allele with a frequency less than 0.5 in the population.

  • Geneticists believe that minor alleles have a greater effect on the phenotype than major alleles.

  • We calculate minor allele frequency for each locus across all the trees genotyped,

  • and convert genotypes to the number of minor alleles each individual has.

  • The number of minor alleles is sometimes called “gene content”.

  • Each tree would have 0, 1, or 2

  • depending on the number of minor alleles it has.

  • These counts are sometimes converted to

  • -1, 0, and 1, for ease of the matrix calculations.
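
The conversion from genotypes to gene content can be sketched for a single biallelic locus. The genotype strings and the function name are hypothetical, and missing genotypes are not handled in this sketch.

```python
def gene_content(genotypes):
    """Convert genotype strings at one biallelic locus to minor-allele counts.
    Returns (minor allele, its frequency, counts 0/1/2, recoded -1/0/1)."""
    alleles = "".join(genotypes)
    counts = {allele: alleles.count(allele) for allele in sorted(set(alleles))}
    minor = min(counts, key=counts.get)            # ties resolved alphabetically
    frequency = counts[minor] / len(alleles)
    content = [g.count(minor) for g in genotypes]  # 0, 1, or 2 minor alleles
    recoded = [c - 1 for c in content]             # -1, 0, or 1
    return minor, frequency, content, recoded

# Hypothetical genotypes for five trees at one SNP.
genotypes = ["CC", "AC", "CC", "AA", "CC"]
minor, maf, content, recoded = gene_content(genotypes)
print(minor, maf, content, recoded)
```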

  • Minor allele frequencies are used to impute missing genotypes, as we shall see in the next few slides.

  • Minor allele frequency can also be used to estimate allele additive and dominance substitution effects on a phenotype,

  • genetic merit of trees across all markers,

  • or genetic merit based on additive, dominance, and total genetic marker effects.

  • We will follow the method of Gengler and colleagues from 2007 for imputation of missing genotypes.

  • This method follows the logic of genetic covariance among relatives.

  • In other words, the covariance between gene content is proportional to the additive relationship between animals, or in our case, trees.

  • Genetic covariance arises because two related individuals have alleles that are identical by descent.

  • In other words, both copies of an allele can be traced back to a single copy in a recent common ancestor

  • because the mother receives half of its genes from its parents.

  • If the maternal genotype is not known, but the maternal grand-sire has been genotyped,

  • the expected value of lowercase q for the mother

  • is the population mean plus one half of d subscript mgs.
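
The expectation for the mother can be worked through with made-up numbers. This sketch covers only the single-relative case described above; the allele frequency and the grand-sire genotype are hypothetical, and the function name is not from the cited method.

```python
def expected_gene_content(population_mean, relative_deviation, relationship=0.5):
    """Expected gene content of an ungenotyped tree from one genotyped relative.
    relationship = 0.5 for a parent-offspring pair."""
    return population_mean + relationship * relative_deviation

# Hypothetical numbers: minor allele frequency 0.3, so the population mean
# gene content is 2 * 0.3 = 0.6; the maternal grand-sire carries two minor alleles.
mu = 2 * 0.3
d_mgs = 2 - mu                      # grand-sire deviation from the population mean
q_mother = expected_gene_content(mu, d_mgs)
print(round(q_mother, 2))
```

The expected value falls between the population mean and the grand-sire's own gene content, because the mother inherits only half of her genes from him.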

  • In the previous slide, we described how relatedness

  • among trees can be used to come up with gene content number of trees with missing genotypes.

  • The calculations in the previous slide are done by solving mixed model equations as shown in the above slide

  • to predict missing genotypes.

  • These predicted imputed genotypes are BLUPs.

  • y is a row vector of allele content of trees like this:

  • y = [0 1 0 . 1 2] for seven trees.

  • For the fourth individual, the genotype is missing.

  • But it is going to be estimated using information coming from parents and other relatives, for example, sibs.

  • M is the design matrix connecting trees to the allele content number vector y

  • e is the error variance; M’ (M prime) is the transpose of the M matrix

  • 1 is a vector of 1s, and 1’1 (one prime one) is used to calculate the mean allele content

  • d sub y is the vector of allele content deviations for trees with genotype records

  • d sub x is the vector of gene content deviations for unobserved trees

  • A is the additive relationship matrix derived from pedigrees.

  • The A matrix is composed of four different sub-matrices as follows:

  • A sub yy is the sub-matrix of covariances among trees with known genotypes

  • A sub xx is the sub-matrix of covariances among trees with unknown genotypes.

  • A sub xy and A sub yx are the covariances between trees with known genotypes and trees with unknown genotypes.

  • Epsilon (ε) is the shrinkage factor, which is the ratio of the error variance,

  • sigma-squared sub e, to the genetic variance explained by the markers, sigma-squared sub a.
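
  • The terms defined above can be put together in a minimal sketch of the mixed model equations. This is not the exact system from the slide: the function name, the four-tree pedigree (two unrelated parents and two full-sib offspring), the gene content values, and ε = 0.01 are all assumptions chosen for illustration. M here has one row per genotyped tree and one column per tree in the pedigree.

```python
import numpy as np

def impute_gene_content(y_obs, obs_idx, A, eps):
    """Solve the mixed model equations and return mean + deviation for every tree.

    y_obs   : gene content of the genotyped trees
    obs_idx : positions of the genotyped trees in the pedigree order of A
    A       : additive relationship matrix for all trees
    eps     : shrinkage factor (error variance / genetic variance)
    """
    n_all = A.shape[0]
    n_obs = len(y_obs)

    # Design matrix M links each genotype record to its tree.
    M = np.zeros((n_obs, n_all))
    M[np.arange(n_obs), obs_idx] = 1.0

    ones = np.ones(n_obs)
    A_inv = np.linalg.inv(A)

    # Left-hand side: [[1'1, 1'M], [M'1, M'M + eps * A^-1]]
    lhs = np.block([
        [np.array([[ones @ ones]]), (ones @ M)[None, :]],
        [(M.T @ ones)[:, None], M.T @ M + eps * A_inv],
    ])
    # Right-hand side: [1'y, M'y]
    rhs = np.concatenate(([ones @ y_obs], M.T @ y_obs))

    solution = np.linalg.solve(lhs, rhs)
    mu, d = solution[0], solution[1:]
    return mu + d  # predicted gene content for genotyped and ungenotyped trees

# Hypothetical pedigree: trees 1 and 2 are unrelated parents, trees 3 and 4 are full-sibs.
A = np.array([
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
])
y_obs = np.array([0.0, 1.0, 1.0])  # tree 4 has no genotype
predicted = impute_gene_content(y_obs, [0, 1, 2], A, eps=0.01)
print(predicted)
```

  • In this toy case the prediction for the ungenotyped tree falls close to the average of its two parents, and it is a continuous value rather than 0, 1, or 2.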

  • When we solve the mixed model equations (as shown previously),

  • we obtain allele content. The predictions

  • shown in the slide are predicted allele content number for each tree.

  • Normally the allele content number would be

  • 0 (homozygous, no minor allele),

  • 1 (heterozygous, or one copy of minor allele),

  • 2 (homozygous, or two copies of minor allele).

  • However, solutions from the mixed models are continuous

  • because of the nature of predictions from solving mixed models.

  • For example, the predicted gene content number for treeID 2

  • is 0.3504.

  • Continuous gene content numbers are not a problem for most software used to predict breeding values of trees.

  • This method is efficient when the population

  • has relatedness among the members because

  • information from relatives is

  • used to predict marker genotype for trees.

  • When we plot the predicted gene content

  • number we see the frequency of trees having gene content number

  • close to 0, 1, or 2 as shown in this slide.

  • After imputing missing genotypes,

  • we would have a complete table of trees and marker genotypes. Rows would be treeID

  • and columns would be genotype or gene content number.

  • Now we are ready to use the new predicted genotypes in

  • predictions of breeding values in mixed models as described in earlier slides.

  • We are now switching to a fundamentally different way of using markers to predict breeding values.

  • The method is based on using markers to construct genetic relationships between individuals and using

  • the genomic relationship matrix in a BLUP approach to predict breeding values.

  • When we use pedigree relationships to make genetic evaluations, we assume that full-sibs on average share 50% of their DNA.

  • In reality,

  • they may share 45% or 55%,

  • or a proportion further from 50% in either direction,

  • because each sib inherits a different mixture of alleles from the parents.

  • This is because of the Mendelian segregation effect.
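
  • A short simulation can illustrate this spread around 50%. It is only a sketch under a strong simplifying assumption: 30 loci that segregate independently, whereas real chromosomes are inherited in linked segments. The number of sib pairs and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_pairs, n_loci = 1000, 30

# For every locus and each of the two parents, a sib inherits one of the
# parent's two alleles at random (coded 0 or 1).
sib1 = rng.integers(0, 2, size=(n_pairs, n_loci, 2))
sib2 = rng.integers(0, 2, size=(n_pairs, n_loci, 2))

# Proportion of parental alleles that the two sibs inherited in common.
shared = (sib1 == sib2).mean(axis=(1, 2))

print(shared.mean())               # close to the pedigree expectation of 0.5
print(shared.min(), shared.max())  # individual pairs deviate from 0.5
```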

  • With the large number of markers now available,

  • we can estimate the number of alleles shared by individuals instead of assuming constants

  • (for example, a constant of 0.5 for full-sibs

  • or 0.25 for half-sibs).

  • Determining genomic relationships from markers is based on the proportion of chromosome segments shared by individuals.

  • This includes identification of genes that are identical by state.

  • In the next few slides, we will introduce genomic BLUP (G-BLUP) and cover

  • the concept of genetic similarity.

  • Before tackling genomic relationship

  • derived from marker genotypes, it is useful to first introduce the concept of genetic similarity between individuals using markers.

  • In the above picture, we expect that the two mice on the right have more alleles

  • that are identical (either by descent or state)

  • than the mice on the left. Due to the sharing of more, or

  • fewer, alleles that are identical,

  • individuals are phenotypically similar or dissimilar.

  • There are many methods proposed to obtain

  • genetic similarities between individuals using markers.

  • They can be categorized as follows:

  • the frequency method, the regression method,

  • and the normalized method.

  • For ease of illustration we give an example of the frequency method.

  • In this slide, the M matrix is

  • the gene content number of trees. Trees are in the rows and

  • gene content for each marker is in the columns.

  • tree1 has 2 homozygous loci

  • (locus 1 and 3),

  • tree2 has no homozygous loci, and

  • tree3 has 3 homozygous loci.

  • When we multiply the M matrix with its transpose,

  • we obtain the MM prime matrix.

  • The MM prime matrix is important because we obtain the number

  • (frequency) of shared alleles between individuals from this matrix.

  • These frequencies are then used to calculate the genomic relationship matrix, G.

  • The G matrix is then used to predict breeding values

  • of trees with no phenotype.

  • The elements of MM prime are:

  • 1. Tree1 and Tree2 share no alleles

  • because the off-diagonal element between the two trees is 0

  • 2. Tree1 and Tree3 share 2 alleles

  • 3. Tree2 and Tree3 share 0 alleles in common.
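
  • The MM prime calculation can be sketched as follows. The M matrix here is hypothetical: it was chosen to match the description above (tree1 homozygous at loci 1 and 3, tree2 heterozygous everywhere, tree3 homozygous at all three loci) and uses the -1, 0, 1 coding, but the actual values on the slide may differ.

```python
import numpy as np

# Hypothetical gene content matrix: rows are trees, columns are loci,
# coded -1 / 0 / 1 (homozygote / heterozygote / other homozygote).
M = np.array([
    [1,  0, 1],   # tree1: homozygous at loci 1 and 3
    [0,  0, 0],   # tree2: no homozygous loci
    [1, -1, 1],   # tree3: homozygous at all three loci
])

MMt = M @ M.T
print(MMt)
```

  • With this coding the diagonal elements count the homozygous loci of each tree, and the off-diagonal elements give the sharing between pairs of trees.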

  • The frequencies of alleles shared by relatives in the MM prime matrix are

  • weighted by the allele frequencies in the

  • population to obtain final coefficients

  • of genomic relationships as we shall demonstrate in the next slide.

  • To estimate genomic relationships between trees based on markers

  • (G matrix), we need to have allele frequencies

  • in addition to the number of alleles shared by trees

  • (MM prime matrix), as shown in the previous slide.

  • The P matrix written in the first step

  • is the matrix of allele frequencies.

  • The elements of the P matrix in the slide

  • are allele frequencies for three loci.

  • Allele frequencies are expressed as 2

  • (p sub i - 0.5)

  • where p sub i is the minor allele frequency.

  • The second step to obtain the G matrix is to

  • subtract the P matrix from the M matrix to obtain the Z matrix.

  • Subtraction of P from M has certain advantages:

  • First, it sets mean values of the allele effects to 0

  • Secondly,

  • subtraction of P gives more credit to rare alleles than to

  • common alleles when calculating genomic relationships.

  • Thirdly, the genomic inbreeding coefficient is greater

  • if the individual is homozygous for rare alleles

  • than if homozygous for common alleles.

  • In the third step, the ZZ

  • prime matrix is divided by 2

  • times the sum over all loci of p sub i (1 - p sub i)

  • to scale G to be analogous to the numerator relationship matrix (A).

  • p sub i are the observed minor allele frequencies of all genotyped subjects regardless of inbreeding and selection.
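
  • The three steps can be sketched together in Python. The function name and the M matrix are hypothetical, M uses the -1, 0, 1 coding, and the allele frequencies are estimated from the three example trees, taking p sub i as the frequency of the allele that M counts (an assumption made so that the columns of Z average to zero).

```python
import numpy as np

def genomic_relationship(M, p):
    """G = ZZ' / (2 * sum of p_i * (1 - p_i)), with Z = M - P and P = 2 * (p_i - 0.5)."""
    P = 2.0 * (p - 0.5)                  # step 1: one value per locus
    Z = M - P                            # step 2: subtract P from every row of M
    scale = 2.0 * np.sum(p * (1.0 - p))  # step 3: scale to be analogous to A
    return Z @ Z.T / scale

# Hypothetical gene content matrix (rows: trees, columns: loci).
M = np.array([
    [1.0,  0.0, 1.0],
    [0.0,  0.0, 0.0],
    [1.0, -1.0, 1.0],
])

# With -1/0/1 coding the column mean equals 2p - 1, so p = (mean + 1) / 2.
p = (M.mean(axis=0) + 1.0) / 2.0

G = genomic_relationship(M, p)
print(G)
```

  • Because the frequencies come from the same trees, each row of G sums to zero in this toy example. In practice the frequencies would be taken from all genotyped trees.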

  • Genetic relationships

  • at individual loci are important to quantify covariances between

  • individuals for effects of a QTL at that locus

  • in order to incorporate marker data in genetic evaluation

  • by marker-assisted BLUP.

  • With these new tools we can incorporate non-additive genetic effects

  • in BLUPs to estimate BVs.


Module 14 - Using Markers to Predict Breeding Values

  • Posted by Morris Du on January 14, 2021