StackGP.evolve#

Evolve is the core search engine within StackGP and is responsible for running the evolutionary search.

In the minimal use case, you need to supply an input dataset and a response vector to fit.
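
In code, the minimal call looks like the following sketch (assuming StackGP has been imported as sgp, as in the examples below, and that inputData and response are NumPy arrays):

#Minimal use case: all options left at their defaults
models=sgp.evolve(inputData,response)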

There are many optional arguments:

  • generations: The number of generations to run the search for. Default=100.

  • ops: The math operators to allow in the search space. To use the default set ops=defaultOps().

  • const: The constants to be used in the search space. The default allows \(\pi\), \(e\), random integers from -3 to 3, and random reals from -10 to 10. To use the default set const=defaultConst().

  • variableNames: Takes a list of names to be assigned to the input variables. The list size should be the same as the number of input terms. If no list is supplied, variables will be labelled \(x_1\), \(x_2\), …

  • mutationRate: Controls the percentage of models which will undergo mutation in each generation. The default is 79%.

  • crossoverRate: Controls the percentage of models that will be produced via crossover in each generation. The default is 11%.

  • spawnRate: Controls the number of random individuals to be injected in each generation. The default is 10% of the population size.

  • extinction: Determines if models not on the Pareto front will be wiped out every extinctionRate generations. The default is set to false.

  • extinctionRate: Controls the rate that models not on the Pareto front will be wiped out if extinction is set to true. The default is every 10 generations if extinctions are enabled.

  • elitismRate: Controls the percentage of elite models that will be preserved across generations. The default is set to 50%.

  • popSize: Controls the population size during search. The default is 300.

  • maxComplexity: Sets the maximum complexity of models allowed during search. Any models that exceed this will be removed from the population. The default is set to 100.

  • align: Binary setting to control if the models will undergo linear scaling at the end of search. The default is set to true.

  • initialPop: Takes a list of models to seed the search space. By default, it is set to an empty list.

  • timeLimit: Controls the maximum time allowed for search, in seconds. If the set number of generations has not completed by timeLimit, the search terminates and returns the current population. capTime must be set to true to enable this. By default timeLimit is set to 300 seconds, but it is not active.

  • capTime: Binary setting to determine if search should be time constrained by the set timeLimit. By default this is set to false.

  • tourneySize: The size of tournaments used during search. The default is 5.

  • tracking: Binary setting to determine if a plot of the search trace should be displayed after the search completes. By default this is set to false.

  • liveTracking: Binary setting to determine if a plot of the search trace should be displayed live during the search. By default this is set to false. Currently this is supported only in Jupyter notebooks.

  • liveTrackingInterval: The interval in seconds between refreshes of the liveTracking plot. Setting this value to 0 will update the figure every generation.

  • modelEvaluationMetrics: Controls which objectives are used during search. It can take a list of arbitrary length. Pareto tournaments will determine winners from each tournament across all objectives. The default is set to [fitness, stackGPModelComplexity].

  • dataSubsample: Binary setting to determine if the training data should be subsampled in each generation.

  • samplingMethod: Controls which sampling method is used if dataSubsample=True. Options are randomSubsample, generationProportionalSample, or any user-defined function that takes the full input data, the response vector, the current generation, and the total number of generations and returns a subsampled input and response (see the samplingMethod section below).

  • sharpnessAware: (Coming soon!) Binary setting to determine if sharpness-aware minimization will be used during search to promote smoother model development. By default this is set to false.
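
These options can be freely combined in a single call; for example, the sketch below (with illustrative values only) caps runtime, enlarges the population, and displays the search trace:

#Illustrative: combining several options in one call
models=sgp.evolve(inputData,response,generations=500,popSize=500,capTime=True,timeLimit=120,tracking=True)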


First we need to load the necessary packages.

import StackGP as sgp
import numpy as np

Overview#

Define Function#

Here we define the function which we will try to rediscover.

#Define demo function to generate data
def demoFunc(x,y):
    return x**2/y

Data Generation#

Now we can generate some data that we will use in the search.

#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
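
Model Generation#

With the data in hand, we run the evolutionary search; a call with default settings like the one below produces the population of models used in the rest of this overview.

#Generate models
models=sgp.evolve(inputData,response)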

Results#

Now we can visualize the Pareto front plot of the evolved models to see how they perform with respect to accuracy and complexity.

#View model population quality
sgp.plotModels(models)
../_images/5a8578b860ea1086808df7c15c2ad7021c3b73f5e6697e17d3786e5264a25707.png

We can use the printGPModel function to print the best model in a readable format. We can see that the correct model form was found.

#View best model
sgp.printGPModel(models[0])
\[\displaystyle \frac{1.0 x_{0}^{2}}{x_{1}} + 1.84213875231584 \cdot 10^{-15}\]

We can also plot a comparison between predicted and observed values. Since the model fits the data perfectly, we only see the predicted points in the plot, as they cover the observed points.

#Compare best model prediction to true data
sgp.plotModelResponseComparison(models[0],inputData,response)
../_images/1e0557a98773256f3d626c25bb68dfe83d27b74443b08dc20f4d049693fc3745.png


Options#

This section showcases how each of the different option settings can be used with the evolve function.


generations#

generations can be used to control how many generations are used in search. The default is to use 100, but if the problem is difficult it may be beneficial to increase this.

First let's set up a somewhat challenging problem with nonlinear patterns. We also won't include cosine, sine, or log in the search space, so the search will have to find an approximate model.

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.sin(x) * np.cos(y) + np.log(x + y)
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Now let's try fitting with the default settings. We will turn on tracking so we can see the progress.

#Generate models
models=sgp.evolve(inputData,response,tracking=True)
../_images/1dff2b5d8f97940f3f5415da16aa0c7894b4f139f74eeb12fe2afc40b41ce484.png

Now let's crank up the number of generations to see if we can get closer.

#Generate models
models=sgp.evolve(inputData,response,generations=1000,tracking=True)
../_images/d587931dacd3cdd6097268b737dd6e973fddc8ec56c532d5b7121437e5e88214.png

We can see that the extra generations allowed us to get a much better fitness score.


ops#

By default ops is set to defaultOps(), which contains: div, add, sub, mult, exp, sqrd, sqrt, and inv. These are generally good to try as a starting point since including too many terms or including highly nonlinear terms when they aren’t needed can lead to overfitting.
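
To make the default explicit, passing ops=sgp.defaultOps() is equivalent to omitting the option entirely:

#Generate models with the default operator set (same as omitting ops)
models=sgp.evolve(inputData,response,ops=sgp.defaultOps())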

First let's demonstrate the default case on a problem where the operator set is not sufficient to find a perfect solution.

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.sin(x) * y
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True)
../_images/94ac757d3a2e9c19c316c2420d2572f64d314571b34528c63521615ad49600e2.png
#Plot models
sgp.plotModels(models)
../_images/1278017ea15372be8c228418a7469c72e673f4fbcd5cbae9e8ba27961acb0fdd.png
#View best model
sgp.printGPModel(models[0])
\[\displaystyle -1.28467623935421 + \frac{3.73315459033536 \left(- 0.510402102903892 x_{0} + 0.255201051451946 x_{1} + 1\right)^{2}}{x_{0}}\]
#Compare best model prediction to true data
sgp.plotModelResponseComparison(models[0],inputData,response)
../_images/b84d8faab9f2ce325e203d501e9aeaddb2a29b2ad82c394c6887117b63040fe7.png

Now we set ops to allOps(), which includes the default set plus trig functions and log. This makes it possible to find an exact solution.

#Generate models
models=sgp.evolve(inputData,response,ops=sgp.allOps(), tracking=True)
../_images/7abfb857820de3fea95b0630024473eeb900c319e5cbdfff3160987c57f3ca3a.png
#View best model
sgp.printGPModel(models[0])
\[\displaystyle 1.0 x_{1} \sin{\left(x_{0} \right)} + 3.92730806292441 \cdot 10^{-16}\]

We can see now that the solution was discovered.


const#

The const option allows for control over which constants are used when generating random models. The default setting, defaultConst(), captures many commonly used constants, but there may be instances where these are not sufficient or where domain knowledge can provide a better set.

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.pi*x / (np.pi/2 + y)
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True,ops=sgp.allOps(),const=sgp.defaultConst())
../_images/34faeb8e694140bb9df093dc6e4096296a325e55f0cb99557fbf0a72397dd3ce.png
#View best model
sgp.printGPModel(models[0])
\[\displaystyle - \frac{2.26179265560533 x_{0}}{- 0.83630867262685 x_{1} - 0.993221474868728} + 0.186790630712822\]

In the above case, we can see that while we did get a good fitness, the search struggled to find the correct constants and did its best to approximate the data.

Below, we demonstrate that supplying the search with good constants can make the search much easier. This example outlines an extreme scenario where we happen to know the correct constants. This is unlikely to occur in practice, but it provides good insight into how the constant set can impact search.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,ops=sgp.allOps(),const=[np.pi, np.pi/2])
../_images/c3b2028a528034b2a3ca00746f66762e11a5e9e59e1555bbce8e096464137096.png
#View best model
sgp.printGPModel(models[0])
\[\displaystyle \frac{3.14159265358979 x_{0}}{x_{1} + 1.5707963267949} + 3.39225006206524 \cdot 10^{-16}\]

We can see that we discovered the correct model in this case by supplying the correct constants.


variableNames#

By default, variables are named \(x_1\), \(x_2\), … You will likely know the names of the features in your dataset, so we can supply those to the search to build models with the supplied names and improve their interpretability.

#Define a function to generate data
def pressureFunc(temperature,volume):
    return temperature/volume
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=pressureFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True)
../_images/cb27e1a174caf831d4d94fcd6bd3b77f24ae803411d4cb5f795449e7bf8362fa.png
#View best model
from sympy import symbols
sgp.printGPModel(models[0],symbols(['temperature','volume']))
\[\displaystyle \frac{1.0 temperature}{volume} - 4.90163564682915 \cdot 10^{-16}\]

While the default case does find a model that perfectly fits the data, there is no reason to leave it to the user to translate the default variable names into something informative. Rather, we can supply the names directly to the search. (Note: Embedding the feature names in evolution is not fully supported yet; this is a placeholder for when it is. For now we can still see the interpretable form by supplying the names to printGPModel.)

#Generate models
from sympy import symbols
models=sgp.evolve(inputData,response,tracking=True,ops=sgp.allOps(),variableNames=symbols(['temperature','volume']))
#View best model
sgp.printGPModel(models[0],symbols(['temperature','volume']))
\[\displaystyle \frac{1.0 temperature}{volume} - 4.90163564682915 \cdot 10^{-16}\]

mutationRate#

We can control the percentage of models created in each generation by mutation by modifying the mutationRate. Mutation may be better or worse suited to the problem than crossover or random generation of models, so this can be tuned. The default setting should be reasonably robust, so it is not expected that you will need to modify it; evolutionary search also tends to be robust to a wide range of settings here, so little is likely to change when moving these parameters around. (Note that in the examples below, mutationRate, crossoverRate, spawnRate, and elitismRate sum to 100.)

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.sin(x) * y
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True,mutationRate=79, crossoverRate=11, spawnRate=5, elitismRate=5)
../_images/b71a32fffa11177d0b015fe75a95bd37791f39dfe431f15e7b5063c8183b49f4.png
#Generate models
models=sgp.evolve(inputData,response,tracking=True,mutationRate=10,spawnRate=10,crossoverRate=70,elitismRate=10)
../_images/bbe74e45905a6981ce6bec1c56bb533c76b7cf109f7388488c1a6adca10caf3f.png

crossoverRate#

We can control the percentage of models created in each generation by crossover by modifying the crossoverRate. Crossover may be better or worse suited to the problem than mutation or random generation of models, so this can be tuned. The default setting should be reasonably robust, so it is not expected that you will need to modify it; evolutionary search also tends to be robust to a wide range of settings here, so little is likely to change when moving these parameters around.

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.sin(x) * y
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True, crossoverRate=11, mutationRate=79, spawnRate=5, elitismRate=5)
../_images/c1e0450575b9cc9f1e4228b3ac92594b12f485f3070413881ddb50c5e54c7c95.png
#Generate models
models=sgp.evolve(inputData,response,tracking=True,crossoverRate=70,mutationRate=10,spawnRate=10,elitismRate=10)
../_images/2c651e96b5cb81751754369e6279ff063f81b8b5043e6c39a1295028d9ed2a81.png

spawnRate#

We can control the percentage of random models introduced each generation by modifying the spawnRate. Some problems may be especially difficult and prone to premature convergence; if this is the case, we may want to introduce novelty into the search space by injecting new random models. Note that if this value is too high, the search becomes equivalent to random search.

#Define a challenging function to generate data
def demoFunc(x,y):
    return np.sin(x) * y
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True, crossoverRate=11, mutationRate=79, spawnRate=5, elitismRate=5)
../_images/66af3c3cf13a8f9957daff8da9c562481855ded4875c56764185665b52cdc2ef.png

Below, we can see that significantly increasing the spawn rate made the search less efficient.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,crossoverRate=0,mutationRate=0,spawnRate=90,elitismRate=10)
../_images/d5b7e604632afff85cf3527d26355655268d78f1e39e28e310b39efba85b0486.png

extinction#

We can introduce occasional extinction events by setting extinction to true. When an extinction occurs, all non-elite models are removed and replaced with new random models. Extinction events occur at the frequency set by extinctionRate.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x) * y
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

By default, no extinction events occur.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, extinction=False)
../_images/529d682e12176e8f270c579da76d844fbf5699b32514ad71579b466bf13c734f.png

We can set extinction=True and extinctionRate=10 to make it so an extinction event occurs every 10th generation.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, extinction=True, extinctionRate=10)
../_images/6761668c453718025afc9f7683ef70f62984ea36522c635decc579231f43946b.png

We can see what happens if we set extinctionRate to 1, meaning an extinction event occurs every generation.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, extinction=True, extinctionRate=1)
../_images/ea9643ea8d24e7aa2c7713457cb3dad7ab6edfa137b6613070cefe31623771b8.png

extinctionRate#

When extinction is set to true, we can control the rate at which extinction events occur by setting the extinctionRate. When an extinction occurs, all non-elite models will be removed and replaced with new random models.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

We can set extinction=True and extinctionRate=10 to make it so an extinction event occurs every 10th generation.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, extinction=True, extinctionRate=10)
../_images/6fc2ebff3f571ffa05e3f80f637e17b10fa5a62c38351d42b42c0dbc68d3078a.png

We can set extinction=True and extinctionRate=5 to make it so an extinction event occurs every 5th generation.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, extinction=True, extinctionRate=5)
../_images/0802771c36db756053dc2b6fba07c17b72c68dab6b381a8620f3710b4e794e25.png

elitismRate#

We can control the percentage of elite individuals that are preserved across generations by setting the elitismRate.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Here we set elitismRate=10 so the top 10% of the population will be preserved across generations.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, elitismRate=10)
../_images/5d4a24b1700ac4aa52434852d58d1354a69df6ceb2c4bf3623facd4a276325f1.png

Setting elitismRate=0 will make it so no individuals are preserved across generations, so progress is not guaranteed. This can be observed by seeing that the evolution trace is no longer monotonic.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, elitismRate=0)
../_images/3391e7b383f14a3237dbb557f1f8bf64858ac677b14e00ef414f9b6c2c1ad918.png

popSize#

We can control the population size using the popSize option. By default we target a population of 300 models. It may be the case that we want to increase the size to allow for more diverse models to exist, or we may want to make it easier to progress deeper into evolution by using smaller populations and making it less computationally expensive to evaluate each generation.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Below we set a very small population size of 10 models. We can see that it makes it through the set number of generations very quickly, but due to a lack of diversity, we don't end up with very good models.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, popSize=10)
../_images/d244f8f468c4e4c7b8c5e0e90a304c7cdd6b8978a729e65228c6fd8fc9ff5441.png

Looking at the population, we can see there are very few models.

sgp.plotModels(models)
../_images/1e09de1148f900d66e91658bb29cb5deef8b10d54164251d64168e548d8db880.png

Using a population size of 1000 makes the evolution much slower, but we can see we arrived at much better models since the population contained more potentially useful genetic material to use.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, popSize=1000)
../_images/bcb8e4f8a8298772a95f8e41953224fc5b820cf404e76a7412a8fb68bb304469.png

Looking at the population, we can see that there are many more models.

sgp.plotModels(models)
../_images/ee9b6c654b21cb8c4f898b5b9c89b87bf4dc933535831fd2250427aa66cb59fd.png

maxComplexity#

We can control the max complexity allowed in the search space by setting the maxComplexity option. Complexity is determined by counting the total combined lengths of the operator and data stacks.
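
For intuition, the complexity measure is just the combined stack length; a minimal sketch, assuming a model exposes its operator and data stacks as sequences (the actual StackGP representation may differ):

#Illustrative only: complexity as combined stack length
def complexitySketch(operatorStack,dataStack):
    return len(operatorStack)+len(dataStack)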

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])
#Generate models
models=sgp.evolve(inputData,response,tracking=True, maxComplexity=10)
../_images/fea47e3d41356d1dcc7fc3c8a269f170f08246e5641e30811477f34094a20373.png

After search, some post-processing steps may occur that can increase model size slightly, so some models may end up a bit over the max complexity of 10.

sgp.plotModels(models)
../_images/d6bfbda7af8a130a1891cd9a9a70efdaeb2e83df0da2e544daf4bdd191fc4988.png

We can crank the complexity limit up to 1000 to see what happens.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, maxComplexity=1000)
../_images/7b2d84f4e61474788da1bd68786a47061491808675207e76079d3766545c501c.png

Setting a max complexity does not mean we will get models up to that complexity. The search will prioritize simpler accurate models, so unless it is necessary, you won’t see many models near a very high complexity limit. As we can see below, the models are hardly more complex than when we set the max complexity to 10.

sgp.plotModels(models)
../_images/0297a4adffe426c284634eee11de54f323c95cad96f82e6e03ec92d470e059ae.png

align#

The align option is a binary setting which determines whether models will undergo linear scaling at the end of evolution. Since the default accuracy metric is \(R^2\), there is no guarantee that the model will have the best scaling and translation coefficients, so this scaling step is necessary to align the model with the training data. By default this option is set to True, but if a different accuracy metric is being used, such as RMSE, the alignment may not be necessary.
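
Conceptually, the alignment fits a slope a and intercept b so that a·prediction + b best matches the response in a least-squares sense. A minimal sketch of this idea (not StackGP's internal code):

#Illustrative only: least-squares fit of a scale and offset for a model's predictions
def alignSketch(predictions,response):
    A=np.vstack([predictions,np.ones_like(predictions)]).T
    a,b=np.linalg.lstsq(A,response,rcond=None)[0]
    return a,b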

#Define a function to generate data
def demoFunc(x,y):
    return 5*x*y+2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Let's first show what happens when we set align=False. We can see that we get to a model with essentially 0 error.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, align=False)
../_images/8087584702e96022c84733da05c7d227a2b2825435643c9f429f32725982df1d.png

When we look at the model though, we can see that it is clearly not the same as the function used to generate the data. It has a similar form, but the constants are wrong.

sgp.printGPModel(models[0])
\[\displaystyle 3.14159265358979 x_{0} x_{1}\]

Now when we set align=True we again get to a model with essentially 0 error.

#Generate models
models=sgp.evolve(inputData,response,tracking=True, align=True)
../_images/b090b03bfb81a0b8398ec9befa8260dd0f95ec28d22e12a283c6137c0a1c3ba1.png

When we look at the best model though, we can see that now it has the same form and constants as the formula used to generate the data.

sgp.printGPModel(models[0])
\[\displaystyle 5.0 x_{0} x_{1} + 2.0\]

initialPop#

We can seed the evolution by providing an initial population. Assuming we have previously generated models from other searches, this can allow us to transfer previously learned patterns into our current search to kick start the process.
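
If you want to reuse models across sessions, standard Python serialization is one way to persist a population for later seeding (this assumes StackGP models are picklable, which is an assumption rather than documented behavior):

#Illustrative only: persist an evolved population and seed a later search with it
import pickle
with open('models.pkl','wb') as f:
    pickle.dump(models,f)
with open('models.pkl','rb') as f:
    seedModels=pickle.load(f)
models=sgp.evolve(inputData,response,initialPop=seedModels)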

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

We first need to run an initial evolution to get a population to feed into another search.

#Generate models
models=sgp.evolve(inputData,response,tracking=True)
../_images/8f3229385557beed9679b00a4ec1918ad71e478408eab015eb70aa5bf51e5526.png

We can see that the error trace begins where the previous search left off since the new search started with the previously evolved models.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,initialPop=models)
../_images/68f7b8300bc45a31148c1bee8d3a4d62f7cd845644ec13627f09dc3ebf44ffdf.png

We can do this process as many times as we want to continue iteratively improving our models.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,initialPop=models)
../_images/3fa96e6bdbca9543b7a37febde3f1593e70182239052c7c549751a90b490ee3b.png

timeLimit#

Often we are constrained by time when performing analysis or modeling, so we care about controlling the runtime of the search. We can set the max runtime by setting the timeLimit option to the maximum number of seconds and setting capTime to True.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

By default, the search will run until the set number of generations has been completed. Depending on the problem, the actual runtime to complete a set number of generations can vary significantly and can be difficult to predict in advance, so a runtime limit is often more practical. The example below has no time limit and runs until it completes 10,000 generations; this took about 12 minutes.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,generations=10000)
../_images/29d64dd62a133e1385aeff891eb1b6614c42457f8f82f64c6729d185f9e8c14c.png

Below we set timeLimit=30 and capTime=True to stop the search when we reach 30 seconds of runtime.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,generations=10000, timeLimit=30, capTime=True)
../_images/151232380484e93d2a6900350efe8df0366e6eab1f7c7be78e8eae6c4a1a4094.png

If the set number of generations completes before the time limit, the search will terminate at that point. Below we can see that the 100 generations complete before the time limit of 60 seconds.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,generations=100, timeLimit=60, capTime=True)
../_images/c682ec68b4ffdc81e92f953ab5a3cc251b324e051c13c10478f26d3e54784509.png

tourneySize#

We can control the tournament size used in selection by setting the tourneySize option. Larger tournament sizes make the search greedier, while smaller sizes make it less greedy.
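
To see why larger tournaments are greedier, note that the more contenders are sampled, the more likely the population's best models are among them. A simplified single-objective sketch of tournament selection (StackGP actually runs Pareto tournaments across all objectives):

#Illustrative only: single-objective tournament selection
import random
def tournamentSelectSketch(population,fitnesses,tourneySize=5):
    contenders=random.sample(range(len(population)),tourneySize)
    winner=min(contenders,key=lambda i:fitnesses[i]) #lower fitness is better here
    return population[winner]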

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

The below shows an example where we set the tournament size to 5 models.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,tourneySize=5)
../_images/d6ef030ece347a3d7c8dc4cdcc8698000dbe0a95f41f45ae2d43fcbc88829ddb.png

Below we increase the tournament size to 30 models.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,tourneySize=30)
../_images/fa04bba0ef0c8d7cb8b140feef13d29f6787bf3e0f41a5a6aa20d30011eacd1a.png

tracking#

The tracking option is a binary setting that allows us to visualize the evolution trace to see how the model search progressed. It has been set to True throughout most of the examples above to show the search progress. If you don't want the trace plot cluttering up your notebook, or if you are calling the function headless, you may want to set it to False.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Below we can see that by setting tracking=False, nothing is displayed when running evolve.

#Generate models
models=sgp.evolve(inputData,response,tracking=False)

Setting tracking=True below makes it so when we run evolve we see the evolution trace.

#Generate models
models=sgp.evolve(inputData,response,tracking=True)
../_images/59aaf0928dcd4f8b7b9aff4d5d3408753406101b648d55ba930ae60a8e29da38.png

modelEvaluationMetrics#

The choice of fitness objectives can have a significant impact on the success of search, so it is useful to be able to change these. These can be changed by providing them as a list to modelEvaluationMetrics. By default, modelEvaluationMetrics=[fitness,stackGPModelComplexity] is used, where fitness is \(R^2\) and stackGPModelComplexity is combined stack length. A list of any length can be supplied. Each function in the list will be given as arguments the model, input data, and response vector in that order.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,10),np.random.randint(1,10,10)])
response=demoFunc(inputData[0],inputData[1])

Below we define RMSE as a fitness function.

#Define new fitness objective (RMSE)
def rmse(model, inputData, response):
    predictions = sgp.evaluateGPModel(model, inputData)
    return np.sqrt(np.mean((predictions - response) ** 2))

Below we use rmse and stackGPModelComplexity as our fitness objectives.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,modelEvaluationMetrics=[rmse,sgp.stackGPModelComplexity])
../_images/5343acfa3c018bf6e20acdc7716a1d197da866ccd5d1bce72287097bdf493342.png

We can see the form of the best model found.

sgp.printGPModel(models[0])
\[\displaystyle - \frac{17.8294775841462 x_{1}^{2}}{- 0.448036177969462 x_{0}^{2} + 0.448036177969462 x_{1}} - 26.8110354570213\]

Now we can compare the best model’s predictions and the true response data.

sgp.plotModelResponseComparison(models[0],inputData,response)
../_images/6ade68506705a6e99615cb8ebcf73e8b3051df1d6e9eee0ae2b066a7e888924f.png

Now we change to use the default settings which uses fitness and stackGPModelComplexity.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,modelEvaluationMetrics=[sgp.fitness,sgp.stackGPModelComplexity])
../_images/0bb96f5795be0d9a28325210a94049495a7184627342fac8d572fc58dde2b240.png

Now we can see the form of the best model found during search.

sgp.printGPModel(models[0])
\[\displaystyle -119.349053125999 + \frac{335.724423637999}{- x_{1} + 1.39299130764398 \sqrt{x_{1} \left(x_{0} + 5.23601810601064 \sqrt{\frac{1}{x_{0} x_{1}}}\right)}}\]

Now we compare the model’s predictions to the true response values.

sgp.plotModelResponseComparison(models[0],inputData,response)
../_images/53285bd91f2f6d612d8bfd6fa99d4d7917894c95b13a51868af3a81b1d8f6b5d.png

If we only want a single objective search, we can supply just one objective. Notice that the search takes longer since there is no constraint on model size, so models will become larger and more expensive to evaluate.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,modelEvaluationMetrics=[sgp.fitness])
../_images/f91e584a346aa376a95b607b37d48e73d76012796c3d0d9973dde2a8b3ff49d9.png

We can confirm this by looking at the sizes of the models. We can see that many are fairly large.

# Update the model quality using the standard objectives so we can create the standard model quality plot
for model in models:
    sgp.setModelQuality(model, inputData, response)
sgp.plotModels(models)
../_images/14c904867ee986a9752aecc79e4903001c619be14513f3fc8d5ab23946e03344.png

We may also be interested in optimizing on a 3D front. In this case, we can supply a list of 3 objectives.

#Generate models
models=sgp.evolve(inputData,response,tracking=True,modelEvaluationMetrics=[sgp.fitness,sgp.stackGPModelComplexity, rmse])
../_images/7b1017befc6929b78fb67fa6726bcb47637ba3b82513bef330480ca262ac76c4.png

The plotModels function will only plot with respect to the first two objectives.

sgp.plotModels(models)
../_images/26f4fcaf5e3541b2b8699ffeb21961ae7c4b5c1d65349ffc30dfe6c7a21b9516.png

We can see the embedded quality of a model by inspecting its last element. The objectives appear in the same order as they were supplied; in this case fitness, stackGPModelComplexity, and rmse.

models[0][-1]
[0.14417745021973816, 13, 63.54150112096759]

dataSubsample#

In scenarios where a data set is very large, training may be prohibitively slow. You could subsample the data set before passing it to model training, but then the models are underinformed, since they never get the opportunity to learn from all the data. By setting dataSubsample=True you can supply the entire training set, and during evolution the set will be randomly sampled in each generation. The constant shuffling of data allows the models to learn from the whole data set while keeping training efficient, since only a small subset is used in each generation.
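
A per-generation random subsample amounts to drawing a fresh set of row indices each generation; a minimal sketch of the idea (the library's randomSubsample may differ in details such as the sample size used):

#Illustrative only: draw a fresh random subsample of the training data
def randomSubsampleSketch(fullInput,fullResponse,sampleSize=1000):
    indices=np.random.choice(len(fullResponse),size=sampleSize,replace=False)
    return fullInput[:,indices],fullResponse[indices]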

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,1000000),np.random.randint(1,10,1000000)])
response=demoFunc(inputData[0],inputData[1])

Running the code below, we can see it takes quite a while to train when using 1,000,000 training points.

models=sgp.evolve(inputData,response,tracking=True)
../_images/6abc3906607b6d6e126b89f39840893721c69fc8431868ad620d612bfd4901ac.png

When we enable data subsampling, we can see the training completes much faster. We can also notice that the training fitness no longer monotonically decreases since the training points used to determine fitness are constantly being resampled.

samplingModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=True)
../_images/1d1d9cd7200411a0440e8ae30290581b787b8730750a66fbb0077dffa1e7c1a3.png

In a scenario where you are constrained by time, the subsampling can make a huge difference. For example, below we constrain the search to 10 seconds and do not subsample. We can see that we make it through very few generations and the model quality does not improve very much.

timeConstrainedModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=False,capTime=True,timeLimit=10)
../_images/8b1dd6b4c9053c646be8801ce3bfe8310ca5bcd86410c8ed8fb07c17cbbc05e5.png

Below is the plot showing the quality of the models in the population.

sgp.plotModels(timeConstrainedModels)
../_images/b444f2fd3ebe7ad33411bde406da48a21934e292deeaa1da48b2a1008af6fd51.png

Now if we have the same time constraint but instead allow subsampling, we can make it through many more generations in that time. This also allows us to potentially find better models since the search is deeper.

timeConstrainedSampledModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=True,capTime=True,timeLimit=10)
../_images/2f45b821ecccb3a1d502c16e28a99860becdf01d7f2c138418baaf6dbbf76627.png

Below is a plot showing the quality of models in the population.

sgp.plotModels(timeConstrainedSampledModels)
../_images/f357b81f5f05c1c9c192ba7e32ba96d2bf0004e0ef18cdfcbba02e070fcf4f4a.png

samplingMethod#

As described under dataSubsample, a very large data set can be randomly subsampled in each generation, which keeps training efficient while still letting the models learn from the whole data set. By default, when data sampling is active, the randomSubsample strategy is used. This strategy selects a new random sample every generation, and the sample size remains the same across all generations.

Note: Regardless of the subsampling strategy, before models are returned, all models are evaluated on the full training set so that the returned fitness values are representative of the model quality on the entire training set.

#Define a function to generate data
def demoFunc(x,y):
    return np.sin(x)/np.cos(y) * y**2
#Generate data
inputData=np.array([np.random.randint(1,10,1000000),np.random.randint(1,10,1000000)])
response=demoFunc(inputData[0],inputData[1])

Here we use the randomSubsample strategy that randomly chooses a subset of the full training set each generation.

samplingModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=True,samplingMethod=sgp.randomSubsample,capTime=True,timeLimit=10)
../_images/88e499deca0a48c8145cfa278d2306bb22b4cbbbfce0c54cd8b1640857b62c74.png

Here we use generationProportionalSample to start with a very small sample size and increase it as we progress through generations. Note that the big jump in fitness is a result of starting with just 3 training points, which is very easy to overfit, hence a near-perfect fitness on the 3-point training set.

samplingModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=True,samplingMethod=sgp.generationProportionalSample,capTime=True,timeLimit=10)
../_images/61207fc852926a7fe1f74c7ef6474bc8466d1ca1178779759c91522e334aadac.png

It is also possible to use user-defined sampling methods. For example, here we implement an inverse generation-proportional sampler. While this particular strategy likely wouldn't be useful in practice, it illustrates how a custom method can be plugged in.

def inverseGenerationProportionalSample(fullInput,fullResponse,generation,generations):
    #Sample proportion shrinks as the search progresses
    prop=(generations-generation)/generations
    sampleSize=int(prop*len(fullResponse))
    #Always keep at least 3 training points
    if sampleSize<3:
        sampleSize=3
    #Randomly select sampleSize samples (columns) without replacement
    indices=np.random.choice(len(fullResponse),size=sampleSize,replace=False)
    return fullInput[:,indices],fullResponse[indices]

We will make the training set a bit smaller to better showcase the impact here.

inputData=np.array([np.random.randint(1,10,10000),np.random.randint(1,10,10000)])
response=demoFunc(inputData[0],inputData[1])
inverseSamplingModels=sgp.evolve(inputData,response,tracking=True,dataSubsample=True,samplingMethod=inverseGenerationProportionalSample,capTime=True)
../_images/4e5d3e8f8729d4779074e69f8e36a169bac2caf4bf718420df373e0095ab53ed.png