Title: | Evolutionary Algorithm |
---|---|
Description: | Runs a genetic algorithm using the 'AlphaSimR' machinery <doi:10.1093/g3journal/jkaa017> and the coalescent simulator 'MaCS' <doi:10.1101/gr.083634.108>. |
Authors: | Giovanny Covarrubias-Pazaran [aut, cre] |
Maintainer: | Giovanny Covarrubias-Pazaran <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.5 |
Built: | 2025-01-24 02:51:56 UTC |
Source: | https://github.com/covaruber/evola |
The evola package is nice wrapper of the AlphaSimR package that enables the use of the evolutionary algorithm to solve complex questions in a simple form.
The evolafit
function is the core function of the package which allows the user to specify the problem and constraints to find a close-to-optimal solution using the evolutionary forces.
The evola package is updated on CRAN every 4-months due to CRAN policies but you can find the latest source at https://github.com/covaruber/evola. This can be easily installed typing the following in the R console:
library(devtools)
install_github("covaruber/evola")
This is recommended if you reported a bug, was fixed and was immediately pushed to GitHub but not in CRAN until the next update.
For tutorials on how to perform different analysis with evola please look at the vignettes by typing in the terminal:
vignette("evola.intro")
The package has been equiped with a couple of datasets to learn how to use the evola package:
* DT_technow
dataset to perform optimal cross selection.
* DT_wheat
dataset to perform optimal training population selection.
* DT_cpdata
dataset to perform optimal individual.
The machinery behind the scenes is AlphaSimR.
If you have any questions or suggestions please post it in https://stackoverflow.com or https://stats.stackexchange.com
I'll be glad to help or answer any question. I have spent a valuable amount of time developing this package. Please cite this package in your publication. Type 'citation("evola")' to know how to cite it.
Giovanny Covarrubias-Pazaran
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142. http://genome.cshlp.org/content/19/1/136.
Calculates the realized additive relationship matrix.
A.mat(X,min.MAF=NULL)
A.mat(X,min.MAF=NULL)
X |
Matrix ( |
min.MAF |
Minimum minor allele frequency. The A matrix is not sensitive to rare alleles, so by default only monomorphic markers are removed. |
For vanraden method: the marker matrix is centered by subtracting column means where ms is the coumn means. Then
, where
, the mean value of the diagonal values of the
portion.
If return.imputed = FALSE, the additive relationship matrix is returned.
If return.imputed = TRUE, the function returns a list containing
the A matrix
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
## random population of 200 lines with 1000 markers X <- matrix(rep(0,200*1000),200,1000) for (i in 1:200) { X[i,] <- ifelse(runif(1000)<0.5,-1,1) } A <- A.mat(X) ## take a look at the Genomic relationship matrix colfunc <- colorRampPalette(c("steelblue4","springgreen","yellow")) hv <- heatmap(A[1:15,1:15], col = colfunc(100),Colv = "Rowv") str(hv)
## random population of 200 lines with 1000 markers X <- matrix(rep(0,200*1000),200,1000) for (i in 1:200) { X[i,] <- ifelse(runif(1000)<0.5,-1,1) } A <- A.mat(X) ## take a look at the Genomic relationship matrix colfunc <- colorRampPalette(c("steelblue4","springgreen","yellow")) hv <- heatmap(A[1:15,1:15], col = colfunc(100),Colv = "Rowv") str(hv)
Extracts the index of the best solution for all traits under the constraints specified.
bestSol(object, selectTop=TRUE)
bestSol(object, selectTop=TRUE)
object |
A resulting object from the function evolafit. |
selectTop |
Selects highest values for the fitness value if TRUE. Selects lowest values if FALSE. |
A simple apply function looking at the fitness value of all the solution in the last generation to find the maximum value.
the vector of best solutions in M for each trait in the problem
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. res0<-evolafit(cbind(Weight,Value)~Color, dt= Gems, # constraints: if greater than this ignore constraintsUB = c(10,Inf), # constraints: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, recombGens = 1, # coancestry parameters D=NULL, lambda=c(0,0), nQTLperInd = 1, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) bestSol(res0$pop)
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. res0<-evolafit(cbind(Weight,Value)~Color, dt= Gems, # constraints: if greater than this ignore constraintsUB = c(10,Inf), # constraints: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, recombGens = 1, # coancestry parameters D=NULL, lambda=c(0,0), nQTLperInd = 1, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) bestSol(res0$pop)
A CP population or F1 cross is the designation for a cross between 2 highly heterozygote individuals; i.e. humans, fruit crops, bredding populations in recurrent selection.
This dataset contains phenotpic data for 363 siblings for an F1 cross. These are averages over 2 environments evaluated for 4 traits; color, yield, fruit average weight, and firmness. The columns in the CPgeno file are the markers whereas the rows are the individuals. The CPpheno data frame contains the measurements for the 363 siblings, and as mentioned before are averages over 2 environments.
data("DT_cpdata")
data("DT_cpdata")
The format is: chr "DT_cpdata"
This data was simulated for fruit breeding applications.
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142. http://genome.cshlp.org/content/19/1/136.
data(DT_cpdata) DT <- DT_cpdata # get best 20 individuals weighting variance by ~0.5=(30*pi)/180 res<-evolafit(formula=cbind(Yield, occ)~id, dt= DT, # constraints: if sum is greater than this ignore constraintsUB = c(Inf,20), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=A, lambda= (30*pi)/180 , nQTLperInd = 2, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 40, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best = bestSol(res$pop)["Yield"];best qa = (Q %*% DT$Yield)[best,]; qa qDq = Q[best,] %*% A %*% Q[best,]; qDq sum(Q[best,]) # total # of inds selected pmonitor(res) plot(DT$Yield, col=as.factor(Q[best,]), pch=(Q[best,]*19)+1) pareto(res)
data(DT_cpdata) DT <- DT_cpdata # get best 20 individuals weighting variance by ~0.5=(30*pi)/180 res<-evolafit(formula=cbind(Yield, occ)~id, dt= DT, # constraints: if sum is greater than this ignore constraintsUB = c(Inf,20), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=A, lambda= (30*pi)/180 , nQTLperInd = 2, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 40, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best = bestSol(res$pop)["Yield"];best qa = (Q %*% DT$Yield)[best,]; qa qDq = Q[best,] %*% A %*% Q[best,]; qDq sum(Q[best,]) # total # of inds selected pmonitor(res) plot(DT$Yield, col=as.factor(Q[best,]), pch=(Q[best,]*19)+1) pareto(res)
This dataset contains phenotpic data for 2 traits measured in 1254 single cross hybrids coming from the cross of Flint x Dent heterotic groups. In addition contains the genotipic data (35,478 markers) for each of the 123 Dent lines and 86 Flint lines. The purpose of this data is to demosntrate the prediction of unrealized crosses (9324 unrealized crosses, 1254 evaluated, total 10578 single crosses). We have added the additive relationship matrix (A) but can be easily obtained using the A.mat function on the marker data. Please if using this data for your own research cite Technow et al. (2014) publication (see References).
data("DT_technow")
data("DT_technow")
The format is: chr "DT_technow"
This data was extracted from Technow et al. (2014).
If using this data for your own research please cite:
Technow et al. 2014. Genome properties and prospects of genomic predictions of hybrid performance in a Breeding program of maize. Genetics 197:1343-1355.
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142. http://genome.cshlp.org/content/19/1/136.
data(DT_technow) DT <- DT_technow DT$occ <- 1; DT$occ[1]=0 M <- M_technow A <- A.mat(M) # run the genetic algorithm # we assig a weight to x'Dx of (20*pi)/180=0.34 res<-evolafit(formula = c(GY, occ)~hy, dt= DT, # constraints: if sum is greater than this ignore constraintsUB = c(Inf,100), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=A, lambda= (20*pi)/180 , nQTLperInd = 70, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 20, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best = bestSol(res$pop)["GY"] qa = (Q %*% DT$GY)[best,]; qa qAq = Q[best,] %*% A %*% Q[best,]; qAq sum(Q[best,]) # total # of inds selected pmonitor(res) plot(DT$GY, col=as.factor(Q[best,]), pch=(Q[best,]*19)+1) pareto(res)
data(DT_technow) DT <- DT_technow DT$occ <- 1; DT$occ[1]=0 M <- M_technow A <- A.mat(M) # run the genetic algorithm # we assig a weight to x'Dx of (20*pi)/180=0.34 res<-evolafit(formula = c(GY, occ)~hy, dt= DT, # constraints: if sum is greater than this ignore constraintsUB = c(Inf,100), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=A, lambda= (20*pi)/180 , nQTLperInd = 70, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 20, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best = bestSol(res$pop)["GY"] qa = (Q %*% DT$GY)[best,]; qa qAq = Q[best,] %*% A %*% Q[best,]; qAq sum(Q[best,]) # total # of inds selected pmonitor(res) plot(DT$GY, col=as.factor(Q[best,]), pch=(Q[best,]*19)+1) pareto(res)
Information from a collection of 599 historical CIMMYT wheat lines. The wheat data set is from CIMMYT's Global Wheat Program. Historically, this program has conducted numerous international trials across a wide variety of wheat-producing environments. The environments represented in these trials were grouped into four basic target sets of environments comprising four main agroclimatic regions previously defined and widely used by CIMMYT's Global Wheat Breeding Program. The phenotypic trait considered here was the average grain yield (GY) of the 599 wheat lines evaluated in each of these four mega-environments.
A pedigree tracing back many generations was available, and the Browse application of the International Crop Information System (ICIS), as described in (McLaren et al. 2000, 2005) was used for deriving the relationship matrix A among the 599 lines; it accounts for selection and inbreeding.
Wheat lines were recently genotyped using 1447 Diversity Array Technology (DArT) generated by
Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). The DArT markers
may take on two values, denoted by their presence or absence. Markers with a minor allele frequency
lower than 0.05 were removed, and missing genotypes were imputed with samples from the marginal
distribution of marker genotypes, that is, , where
is the estimated allele frequency computed from the non-missing genotypes. The number of DArT
MMs after edition was 1279.
data(DT_wheat)
data(DT_wheat)
Matrix Y contains the average grain yield, column 1: Grain yield for environment 1 and so on.
International Maize and Wheat Improvement Center (CIMMYT), Mexico.
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142. http://genome.cshlp.org/content/19/1/136.
McLaren, C. G., L. Ramos, C. Lopez, and W. Eusebio. 2000. “Applications of the geneaology manegment system.” In International Crop Information System. Technical Development Manual, version VI, edited by McLaren, C. G., J.W. White and P.N. Fox. pp. 5.8-5.13. CIMMyT, Mexico: CIMMyT and IRRI.
McLaren, C. G., R. Bruskiewich, A.M. Portugal, and A.B. Cosico. 2005. The International Rice Information System. A platform for meta-analysis of rice crop data. Plant Physiology 139: 637-642.
# example to optimize a training pop for a validation pop data(DT_wheat) DT <- as.data.frame(DT_wheat) DT$id <- rownames(DT) # IDs DT$occ <- 1; DT$occ[1]=0 # to track occurrences DT$dummy <- 1; DT$dummy[1]=0 # dummy trait # if genomic GT <- GT_wheat + 1; rownames(GT) <- rownames(DT) G <- GT%*%t(GT) G <- G/mean(diag(G)) # if pedigree A <- A_wheat A[1:4,1:4] ##Perform eigenvalue decomposition for clustering ##And select cluster 5 as target set to predict pcNum=25 svdWheat <- svd(A, nu = pcNum, nv = pcNum) PCWheat <- A %*% svdWheat$v rownames(PCWheat) <- rownames(A) DistWheat <- dist(PCWheat) TreeWheat <- cutree(hclust(DistWheat), k = 5 ) plot(PCWheat[,1], PCWheat[,2], col = TreeWheat, pch = as.character(TreeWheat), xlab = "pc1", ylab = "pc2") vp <- rownames(PCWheat)[TreeWheat == 3]; length(vp) tp <- setdiff(rownames(PCWheat),vp) As <- A[tp,tp] DT2 <- DT[rownames(As),] DT2$cov <- apply(A[tp,vp],1,mean) head(DT2) # we assign a weight to x'Dx of (60*pi)/180=1 res<-evolafit(formula=cbind(cov, occ)~id, dt= DT2, # constraints: if sum is greater than this ignore constraintsUB = c(Inf, 100), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf, -Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=As, lambda= (60*pi)/180 , nQTLperInd = 80, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 30, verbose = TRUE, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best <- bestSol(res$pop)["cov"] sum(Q[best,]) # total # of inds selected pareto(res)
# example to optimize a training pop for a validation pop data(DT_wheat) DT <- as.data.frame(DT_wheat) DT$id <- rownames(DT) # IDs DT$occ <- 1; DT$occ[1]=0 # to track occurrences DT$dummy <- 1; DT$dummy[1]=0 # dummy trait # if genomic GT <- GT_wheat + 1; rownames(GT) <- rownames(DT) G <- GT%*%t(GT) G <- G/mean(diag(G)) # if pedigree A <- A_wheat A[1:4,1:4] ##Perform eigenvalue decomposition for clustering ##And select cluster 5 as target set to predict pcNum=25 svdWheat <- svd(A, nu = pcNum, nv = pcNum) PCWheat <- A %*% svdWheat$v rownames(PCWheat) <- rownames(A) DistWheat <- dist(PCWheat) TreeWheat <- cutree(hclust(DistWheat), k = 5 ) plot(PCWheat[,1], PCWheat[,2], col = TreeWheat, pch = as.character(TreeWheat), xlab = "pc1", ylab = "pc2") vp <- rownames(PCWheat)[TreeWheat == 3]; length(vp) tp <- setdiff(rownames(PCWheat),vp) As <- A[tp,tp] DT2 <- DT[rownames(As),] DT2$cov <- apply(A[tp,vp],1,mean) head(DT2) # we assign a weight to x'Dx of (60*pi)/180=1 res<-evolafit(formula=cbind(cov, occ)~id, dt= DT2, # constraints: if sum is greater than this ignore constraintsUB = c(Inf, 100), # constraints: if sum is smaller than this ignore constraintsLB= c(-Inf, -Inf), # weight the traits for the selection b = c(1,0), # population parameters nCrosses = 100, nProgeny = 10, recombGens=1, nChr=1, mutRate=0, # coancestry parameters D=As, lambda= (60*pi)/180 , nQTLperInd = 80, # selection parameters propSelBetween = 0.5, propSelWithin =0.5, nGenerations = 30, verbose = TRUE, keepBest=FALSE) Q <- pullQtlGeno(res$pop, simParam = res$simParam, trait=1); Q <- Q/2 best <- bestSol(res$pop)["cov"] sum(Q[best,]) # total # of inds selected pareto(res)
Using the AlphaSimR machinery it recreates the evolutionary forces applied to a problem where possible solutions replace individuals and combinations of variables to optimize in the problem replace the genes or QTLs. Then evolutionary forces (mutation, selection and drift) are applied to find a close-to-optimal solution. Although multiple traits are enabled it is assumed that same QTLs are behind all the traits, differing only in their average allelic effects.
evolafit(formula, dt, constraintsUB, constraintsLB, b, nCrosses=50, nProgeny=20,nGenerations=20, recombGens=1, nChr=1, mutRate=0, nQTLperInd=NULL, D=NULL, lambda=NULL, propSelBetween=1,propSelWithin=0.5, fitnessf=NULL, verbose=TRUE, dateWarning=TRUE, selectTop=TRUE, tolVarG=1e-6, keepBest=FALSE, fixQTLperInd=FALSE, initPop=NULL, simParam=NULL, ...)
evolafit(formula, dt, constraintsUB, constraintsLB, b, nCrosses=50, nProgeny=20,nGenerations=20, recombGens=1, nChr=1, mutRate=0, nQTLperInd=NULL, D=NULL, lambda=NULL, propSelBetween=1,propSelWithin=0.5, fitnessf=NULL, verbose=TRUE, dateWarning=TRUE, selectTop=TRUE, tolVarG=1e-6, keepBest=FALSE, fixQTLperInd=FALSE, initPop=NULL, simParam=NULL, ...)
formula |
Formula of the form y~x where |
dt |
A dataset containing the average allelic effects (a) and classifiers/genes/QTLs. |
constraintsUB |
A numeric vector specifying the upper bound constraints for the breeding values applied at each trait. The length is equal to the number of traits/features in the formula. If missing is assume an infinite value for all traits. Solutions (individuals in the population) with higher value than the upper bound are assigned a -infinite value if the argument |
constraintsLB |
A numeric vector specifying the lower bound constraints for the breeding values applied at each trait. The length is equal to the number of traits/features in the formula. If missing is assume a -infinite value for all traits. Solutions with lower value than the lower bound are assigned a +infinite value if the argument |
b |
A numeric vector specifying the weights that each trait has in the fitness function (i.e., a selection index). The length should be equal to the number of traits/features. If missing is assumed equal weight (1) for all traits. |
nCrosses |
A numeric value indicating how many crosses should occur in the population of solutions at every generation. |
nProgeny |
A numeric value indicating how many progeny (solutions) each cross should generate in the population of solutions at every generation. |
nGenerations |
The number of generations that the evolutionary process should run for. |
recombGens |
The number of recombination generations that should occur before selection is applied. This is in case the user wants to allow for more recombination before selection operates. The default is 1. |
nChr |
The number of chromosomes where features/genes should be allocated to. The default value is 1 but this number can be increased to mimic more recombination events at every generation and avoid linkage disequilibrium. |
mutRate |
A value between 0 and 1 to indicate the proportion of random QTLs that should mutate in each individual. For example, a value of 0.1 means that a random 10% of the QTLs will mutate in each individual randomly taking values of 0 or 1. Is important to notice that this implies that the stopping criteria based in variance will never be reached because we keep introducing variance through random mutation. |
nQTLperInd |
The number of QTLs/genes (classifier |
D |
A relationship matrix between the QTLs (a kind of linkage disequilibrium) specified in the right side of the formula (levels of the |
lambda |
A numeric value indicating the weight assigned to the relationship between QTLs in the fitness function. If not specified is assumed to be 0. This can be used or ignored in your customized fitness function. |
propSelBetween |
A numeric value between 0 and 1 indicating the proportion of families/crosses of solutions/individuals that should be selected. The default is 1, meaning all crosses are selected or passed to the next generation. |
propSelWithin |
A numeric value between 0 and 1 indicating the proportion of individuals/solutions within families/crosses that should be selected. The default value is 0.5, meaning that 50% of the top individuals are selected. |
fitnessf |
An alternative fitness function to be applied at the level of individuals or solutions. It could be a linear combination of the trait breeding values. The available variables internally are: Y: matrix of trait breeding values for the individuals/solutions. Of dimensions s x t, s soultions and t traits. b: vector of trait weights, specified in the 'b' argument. Of dimensions t x 1, t traits by 1 Q: Matrix with QTLs for the individuals/solutions. Of dimensions s x p, s solutions and p QTL columns. Although multiple traits are enabled it is assumed that same QTLs are behind all the traits, differing only in their average allelic effects. D: matrix of relationship between the QTLs, specified in the 'D' argument. Of dimensions p x p, for p QTL columns d: vector of inbreeding contribution for each individual/solution calculated as diag(Q'DQ). Of dimensions s x 1, s solutions by 1 a: list of vectors with average allelic effects for a given trait. Of dimensions s x 1, s solutions by 1 column If
where If you provide your own fitness function please keep in mind that the variables Y, b, Q, D, a, and d are already reserved and these variables should always be added to your function (even if you do not use them) in addition to your new variables so the machinery runs. And additional fitness function available for regression problems is |
verbose |
A logical value indicating if we should print logs. |
dateWarning |
A logical value indicating if you should be warned when there is a new version on |
selectTop |
Selects highest values for the fitness value if |
tolVarG |
A stopping criteria (tolerance for genetic variance) when the variance across traits is smaller than this value, which is equivalent to assume that all solutions having the same QTL profile (depleted variance). The default value is |
keepBest |
A |
fixQTLperInd |
A |
initPop |
an object of Pop-class. |
simParam |
an object of SimParam. |
... |
Further arguments to be passed to the fitness function if required. |
Using the AlphaSimR
machinery (runMacs) it recreates the evolutionary forces applied to a problem where possible solutions replace individuals and combinations of variables in the problem replace the genes. Then evolutionary forces are applied to find a close-to-optimal solution. The number of solutions are controlled with the nCrosses and nProgeny parameters, whereas the number of initial QTLs activated in a solution is controlled by the nQTLperInd parameter. The number of activated QTLs of course will increase if has a positive effect in the fitness of the solutions. The drift force can be controlled by the recombGens parameter. The mutation rate can be controlled with the mutRate parameter. The recombination rate can be controlled with the nChr argument.
The indivPerformance
output slot contains the columns q'a, q'Dq, deltaC and nQTLs. These mean the following:
score : 'q' represents the fitness function value of a solution.
In deltaC : it represents the change in coancestry (e.g., inbreeding), it can be thought as the rate of coancestry.
In q'Dq : 'q' represents the contribution vector, 'D' is the linkage disequilibrium matrix between QTNs (whatever the QTNs represent for your specific problem). In practice we do QAQ' and extract the diagonal values.
In generation : it represents the generation at which this solution appeared.
In nQTNs : it represent the final number of QTNs that are activated in homozygote state for the positive effect.
During the run the columns printed in the console mean the following:
generation: generation of reproduction
constrainedUB: number of solutions constrained by the upper bound specified
constrainedLB: number of solutions constrained by the lower bound specified
total: total number of solutions/progeny/rows available in a given generation
varG: genetic variance present in the population due to the QTNs
the matrix of q'a (score), deltaC, q'Dq, generation, nQTNs per solution per generation. See details section above.
if the argument keepBest=TRUE this contains the pedigree of the selected solutions across iterations.
a matrix with scores for different metrics across n generations of evolution.
the matrix of phenotypes of individuals/solutions present in the last generation.
the matrix of phenotypes of top (parents) individuals/solutions present in the last generation.
AlphaSimR object used for the evolutionary algorithm at the last iteration.
AlphaSimR object corresponding to the best parental haplotypes/solutions selected for new crosses across all cycles.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution.
a character vector corresponding to the name of the variables used in the fitness function.
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
Gaynor, R. Chris, Gregor Gorjanc, and John M. Hickey. 2021. AlphaSimR: an R package for breeding program simulations. G3 Gene|Genomes|Genetics 11(2):jkaa017. https://doi.org/10.1093/g3journal/jkaa017.
Chen GK, Marjoram P, Wall JD (2009). Fast and Flexible Simulation of DNA Sequence Data. Genome Research, 19, 136-142. http://genome.cshlp.org/content/19/1/136.
evolafit
– the information of the package
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Color Weight Value # 1 Red 4.88 9.95 # 2 Blue 1.43 2.73 # 3 Purple 1.52 2.60 # 4 Orange 3.11 0.61 # 5 Green 2.49 0.77 # 6 Pink 3.53 1.99 # 7 White 0.62 9.64 # 8 Black 2.59 1.14 # 9 Yellow 1.77 10.21 # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. # simple specification res00<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), nGenerations = 10 ) best = bestSol(res00$pop)["Value"] Q <- pullQtlGeno(res00$pop, simParam = res00$simParam, trait=1); Q <- Q/2 qa = Q[best,] %*% as.matrix(Gems[,c("Weight","Value")]); qa # more complete specification res0<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), # constraints on traits: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection (fitness function) b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, # genome parameters recombGens = 1, nChr=1, mutRate=0, nQTLperInd = 1, # coancestry parameters D=NULL, lambda=0, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) Q <- pullQtlGeno(res0$pop, simParam = res0$simParam, trait=2); Q <- Q/2 best = bestSol(res0$pop)["Value"] qa = Q[best,] %*% as.matrix(Gems[,c("Weight","Value")]); qa Q[best,] # $`Genes` # Red Blue Purple Orange Green Pink White Black Yellow # 1 1 0 0 1 0 0 1 0 # # $Result # Weight Value # 8.74 32.10 pmonitor(res0) pareto(res0)
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Color Weight Value # 1 Red 4.88 9.95 # 2 Blue 1.43 2.73 # 3 Purple 1.52 2.60 # 4 Orange 3.11 0.61 # 5 Green 2.49 0.77 # 6 Pink 3.53 1.99 # 7 White 0.62 9.64 # 8 Black 2.59 1.14 # 9 Yellow 1.77 10.21 # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. # simple specification res00<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), nGenerations = 10 ) best = bestSol(res00$pop)["Value"] Q <- pullQtlGeno(res00$pop, simParam = res00$simParam, trait=1); Q <- Q/2 qa = Q[best,] %*% as.matrix(Gems[,c("Weight","Value")]); qa # more complete specification res0<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), # constraints on traits: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection (fitness function) b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, # genome parameters recombGens = 1, nChr=1, mutRate=0, nQTLperInd = 1, # coancestry parameters D=NULL, lambda=0, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) Q <- pullQtlGeno(res0$pop, simParam = res0$simParam, trait=2); Q <- Q/2 best = bestSol(res0$pop)["Value"] qa = Q[best,] %*% as.matrix(Gems[,c("Weight","Value")]); qa Q[best,] # $`Genes` # Red Blue Purple Orange Green Pink White Black Yellow # 1 1 0 0 1 0 0 1 0 # # $Result # Weight Value # 8.74 32.10 pmonitor(res0) pareto(res0)
A genetic algorithm fit by evolaFitMod
.
This class extends class "Pop"
class and includes some
additional slots.
Objects are created by calls to the
evolafit
function.
the matrix of q'a (score), deltaC, q'Dq, generation, nQTNs per solution per generation. See details section above. All other slots are inherited from class "Pop"
.
if the argument keepBest=TRUE this contains the pedigree of the selected solutions across iterations. All other slots are inherited from class "Pop"
.
a matrix with scores for different metrics across n generations of evolution. All other slots are inherited from class "Pop"
.
the matrix of phenotypes of individuals/solutions present in the last generation. All other slots are inherited from class "Pop"
.
the matrix of phenotypes of top (parents) individuals/solutions present in the last generation. All other slots are inherited from class "Pop"
.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution. All other slots are inherited from class "Pop"
.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution. All other slots are inherited from class "Pop"
.
a character vector corresponding to the name of the variables used in the fitness function. All other slots are inherited from class "Pop"
.
Class "Pop"
, directly.
signature(object = "evolaFitMod")
: also a
non-method for the same reason as update
evolaFitMod
showClass("evolaFitMod")
showClass("evolaFitMod")
A genetic algorithm fit by evolaPop
.
This class extends class "Pop"
class and includes some
additional slots.
Objects are created by calls to the
evolafit
function.
the matrix of q'a (score), deltaC, q'Dq, generation, nQTNs per solution per generation. See details section above. All other slots are inherited from class "Pop"
.
if the argument keepBest=TRUE this contains the pedigree of the selected solutions across iterations. All other slots are inherited from class "Pop"
.
a matrix with scores for different metrics across n generations of evolution. All other slots are inherited from class "Pop"
.
the matrix of phenotypes of individuals/solutions present in the last generation. All other slots are inherited from class "Pop"
.
the matrix of phenotypes of top (parents) individuals/solutions present in the last generation. All other slots are inherited from class "Pop"
.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution. All other slots are inherited from class "Pop"
.
A matrix with as many rows as solutions and columns as traits to be constrained. 0s indicate that such trait went beyond the bound in that particular solution. All other slots are inherited from class "Pop"
.
a character vector corresponding to the name of the variables used in the fitness function. All other slots are inherited from class "Pop"
.
Class "Pop"
, directly.
signature(object = "evolaPop")
: also a
non-method for the same reason as update
evolaPop
showClass("evolaPop")
showClass("evolaPop")
Makes a matrix of ones with a single row and nc columns.
Jc(nc)
Jc(nc)
nc |
Number of columns to create. |
A simple apply function to make a matrix of one row and nc columns.
a matrix
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
Jc(5)
Jc(5)
Makes a matrix of ones with a single column and nr rows.
Jr(nr)
Jr(nr)
nr |
Number of rows to create. |
A simple apply function to make a matrix of one column and nr rows.
a matrix
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
Jr(5)
Jr(5)
Makes a matrix indicating how many QTLs were activated for each solution.
nQtl(object)
nQtl(object)
object |
Object returned by the evolafit function. |
A simple apply function to count the number of active QTLs per solution (row) per trait (columns).
a matrix
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. # simple specification res00<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), nGenerations = 10 ) nQtl(res00)
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. # simple specification res00<-evolafit(formula=cbind(Weight,Value)~Color, dt= Gems, # constraints on traits: if greater than this ignore constraintsUB = c(10,Inf), nGenerations = 10 ) nQtl(res00)
Simple function for fitness where an index of traits is weighted by the group relationship.
ocsFun(Y,b,d,Q,a)
ocsFun(Y,b,d,Q,a)
Y |
A matrix of trait values. See details. |
b |
A vector of trait weights. See details. |
d |
A vector of group relationships. See details. |
Q |
A QTL matrix. See details. |
a |
A named list with vectors of average allelic effects per trait. See details. |
A simple apply function of a regular index weighted by a vector of relationships.
Y%*%b - d
Internally, we use this function in the following way:
The Y matrix is the matrix of trait-GEBVs and b is the user-specified trait weights.
d = qtDq * lambda; where qtDq is equal to Matrix::diag(Q%*%Matrix::tcrossprod(A,Q)) of dimensions n x n
Notice that Q represents the marker of QTLs (columns) for all solutions (rows) and A the LD between QTLs. The user can modify this function as needed and provide it to the evolafit function along with other arguments.
Notice that a is a list with elements named as the traits specified in your formula.
a vector of values
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
Y=matrix(1:12,4,3) # 4 solutions with 3 traits b=c(1,2,3) # weights for each trait d=c(1,1,1,1) # coancestry for each solution ocsFun(Y,b,d) # Yb - d where d is QAQ' and A is the LD between QTNs
Y=matrix(1:12,4,3) # 4 solutions with 3 traits b=c(1,2,3) # weights for each trait d=c(1,1,1,1) # coancestry for each solution ocsFun(Y,b,d) # Yb - d where d is QAQ' and A is the LD between QTNs
'overlay' adds r times the design matrix for model term t to the existing design matrix. Specifically, if the model up to this point has p effects and t has a effects, the a columns of the design matrix for t are multiplied by the scalar r (default value 1.0). This can be used to force a correlation of 1 between two terms as in a diallel analysis.
overlay(..., rlist=NULL, prefix=NULL, sparse=FALSE)
overlay(..., rlist=NULL, prefix=NULL, sparse=FALSE)
... |
as many vectors as desired to overlay. |
rlist |
a list of scalar values indicating the times that each incidence matrix overlayed should be multiplied by. By default r=1. |
prefix |
a character name to be added before the column names of the final overlay matrix. This may be useful if you have entries with names starting with numbers which programs such as asreml doesn't like, or for posterior extraction of parameters, that way 'grep'ing is easier. |
sparse |
a TRUE/FALSE statement specifying if the matrices should be built as sparse or regular matrices. |
an incidence matrix with as many columns levels in the vectors provided to build the incidence matrix.
Giovanny Covarrubias-Pazaran
Fikret Isik. 2009. Analysis of Diallel Mating Designs. North Carolina State University, Raleigh, USA.
Covarrubias-Pazaran G (2016) Genome assisted prediction of quantitative traits using the R package soevolafit. PLoS ONE 11(6): doi:10.1371/journal.pone.0156744
The core functions of the package evolafit
.
####=========================================#### #### For CRAN time limitations most lines in the #### examples are silenced with one '#' mark, #### remove them and run the examples ####=========================================#### data("DT_technow") DT <- DT_technow head(DT) DT$dentf <- as.factor(DT$dent) DT$flintf <- as.factor(DT$flint) with(DT, overlay(dentf,flintf, sparse = TRUE)) with(DT, overlay(dentf,flintf, sparse = FALSE))
####=========================================#### #### For CRAN time limitations most lines in the #### examples are silenced with one '#' mark, #### remove them and run the examples ####=========================================#### data("DT_technow") DT <- DT_technow head(DT) DT$dentf <- as.factor(DT$dent) DT$flintf <- as.factor(DT$flint) with(DT, overlay(dentf,flintf, sparse = TRUE)) with(DT, overlay(dentf,flintf, sparse = FALSE))
plot
for monitoring.
pareto(object, scaled=TRUE,pch=20, xlim, ...)
pareto(object, scaled=TRUE,pch=20, xlim, ...)
object |
model object returned by |
scaled |
a logical value to specify the scale of the y-axis (gain in merit). |
pch |
symbol for plotting points as desribed in par |
xlim |
upper and lower bound in the x-axis |
... |
Further arguments to be passed to the plot function. |
vector of plot
Giovanny Covarrubias
plot
for monitoring.
pmonitor(object, kind, ...)
pmonitor(object, kind, ...)
object |
model object of class |
kind |
a numeric value indicating what to plot according to the following values: 1: Average and best q'a (contribution) 2. Average q'Dq and deltaC 3. Number of QTLs activated |
... |
Further arguments to be passed to the plot function. |
trace plot
Giovanny Covarrubias
Simple function for fitness where the mean squared error is computed when the user provides y and X and b are the average allelic effects of the population in the genetic algorithm.
regFun(Y,b,d,Q,a,X,y)
regFun(Y,b,d,Q,a,X,y)
Y |
A matrix of trait values. See details. |
b |
A vector of trait weights. See details. |
d |
A vector of group relationships. See details. |
Q |
A QTL matrix. See details. |
a |
A named list with vectors of average allelic effects per trait. See details. |
X |
A matrix of covariates or explanatory variables. See details. |
y |
A vector of the response variable. See details. |
A simple apply function of a regular mean squared error.
( y - X%*%b ) ^ 2
Internally, we use this function in the following way:
The y vector and X matrix are provided by the user and are fixed values that do not change. The evolutionary algorithm optimizes the b values which are the QTLs and associated average allelic effects that are evolving.
a vector of values
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
y <- rnorm(40) # 4 responses X=matrix(rnorm(120),40,3) # covariates Q=matrix(0,40,30) # QTL matrix with 30 QTLs for(i in 1:nrow(Q)){Q[i,sample(1:ncol(Q),3)]=1} a <- list(trait1=rnorm(30)) # 30 average allelic effects in trait 1 mse = regFun(y=y, X=X, Q=Q, a=a, # used # ignored, Y is normally available in the evolafit routine Y=X, b=NA, d=NA)
y <- rnorm(40) # 4 responses X=matrix(rnorm(120),40,3) # covariates Q=matrix(0,40,30) # QTL matrix with 30 QTLs for(i in 1:nrow(Q)){Q[i,sample(1:ncol(Q),3)]=1} a <- list(trait1=rnorm(30)) # 30 average allelic effects in trait 1 mse = regFun(y=y, X=X, Q=Q, a=a, # used # ignored, Y is normally available in the evolafit routine Y=X, b=NA, d=NA)
Simple function to map a vector of values to the range of 0 and 1 values to have a better behavior of the algorithm.
stan(x)
stan(x)
x |
A vector of numeric values. |
Simple function to map a vector of values to the range of 0 and 1 values to have a better behavior of the algorithm.
new values in range 0 to 1
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
x <- rnorm(20, 10, 3);x stan(x)
x <- rnorm(20, 10, 3);x stan(x)
Extracts the variance found across the M element of the resulting object of the evolafit() function which contains the different solution and somehow represents the genome of the population.
varQ(object)
varQ(object)
object |
A resulting object from the function evolafit. |
A simple apply function looking at the variance in each column of the M element of the resulting object of the evolafit function.
a value of variance
Giovanny Covarrubias-Pazaran (2024). evola: a simple evolutionary algorithm for complex problems. To be submitted to Bioinformatics.
evolafit
– the core function of the package
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. res0<-evolafit(cbind(Weight,Value)~Color, dt= Gems, # constraints: if greater than this ignore constraintsUB = c(10,Inf), # constraints: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, recombGens = 1, # coancestry parameters D=NULL, lambda=c(0,0), nQTLperInd = 1, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) varQ(res0$pop)
set.seed(1) # Data Gems <- data.frame( Color = c("Red", "Blue", "Purple", "Orange", "Green", "Pink", "White", "Black", "Yellow"), Weight = round(runif(9,0.5,5),2), Value = round(abs(rnorm(9,0,5))+0.5,2), Times=c(rep(1,8),0) ) head(Gems) # Task: Gem selection. # Aim: Get highest combined value. # Restriction: Max weight of the gem combined = 10. res0<-evolafit(cbind(Weight,Value)~Color, dt= Gems, # constraints: if greater than this ignore constraintsUB = c(10,Inf), # constraints: if smaller than this ignore constraintsLB= c(-Inf,-Inf), # weight the traits for the selection b = c(0,1), # population parameters nCrosses = 100, nProgeny = 20, recombGens = 1, # coancestry parameters D=NULL, lambda=c(0,0), nQTLperInd = 1, # selection parameters propSelBetween = .9, propSelWithin =0.9, nGenerations = 50 ) varQ(res0$pop)