Step 5: Encode the tree ensembles E1, E2, ..., ER as the initial solutions using a binary scheme, where “1” stands for a tree being included in the ensemble and “0” stands for a tree being excluded, and let t = 0;
Step 6: Perform non-dominated sorting on the R solutions;
Step 7: Select the R2 best solutions, and apply crossover and mutation to generate R2 new solutions;
Step 14: Repeat Step 6 to Step 13 until the MSE and PT of the regression trees in the ensembles that correspond to the best solutions remain constant for 50 iterations.
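The binary encoding of Step 5 and the non-dominated sorting of Step 6 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example masks, and the objective values are hypothetical, and it assumes that both objectives are expressed so that smaller is better (here the diversity value is negated for that purpose).

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(objectives: List[Tuple[float, ...]]) -> List[List[int]]:
    """Split solution indices into successive Pareto fronts; front 0 is the best."""
    remaining = set(range(len(objectives)))
    fronts: List[List[int]] = []
    while remaining:
        # A solution belongs to the current front if no remaining solution dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Binary encoding of four candidate ensembles over a pool of four trees:
# 1 = tree included in the ensemble, 0 = tree excluded.
masks = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]]

# Hypothetical objective values per ensemble: (MSE, negated diversity).
objs = [(0.30, -0.50), (0.25, -0.40), (0.35, -0.45), (0.40, -0.60)]

fronts = non_dominated_sort(objs)
print(fronts)
```

In this toy example the third ensemble is dominated by the first one (higher MSE and lower diversity), so it falls into the second front while the other three form the first front.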
Note that plain averaging is used to obtain the final output of the regression ensembles. We did not use weighted averaging in this work because our empirical studies indicate that weighted averaging may increase the risk of overfitting.
The evolutionary optimization process will create a set of non-dominated solutions, representing ensembles which are the trade-offs between MSE and diversity. In order to choose the best one, we compute the MSE of these ensembles and rank them according to their MSE values. The MSE of the ensemble is computed as
$$\mathrm{MSE} = \frac{1}{n_s}\sum_{j=0}^{n_s-1}\left(y_j - \hat{y}_j\right)^2.$$
The best solution corresponds to the ensemble with the minimum MSE.
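As an illustration of plain averaging and of the final selection by minimum MSE, the following sketch scores a few candidate ensembles and picks the best one. The tree predictions, target values, and masks are made-up numbers, and the helper names are hypothetical; it is only meant to show the arithmetic, not the authors' code.

```python
from typing import List, Sequence


def ensemble_predict(tree_preds: Sequence[Sequence[float]],
                     mask: Sequence[int]) -> List[float]:
    """Plain (unweighted) average of the predictions of the trees whose bit is 1."""
    selected = [preds for preds, bit in zip(tree_preds, mask) if bit == 1]
    if not selected:
        raise ValueError("the ensemble must contain at least one tree")
    return [sum(column) / len(selected) for column in zip(*selected)]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error over all samples."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions of three trees on four validation samples.
tree_preds = [
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [8.0, 25.0, 27.0, 35.0],
]
y_true = [11.0, 19.0, 31.0, 40.0]

# Hypothetical non-dominated solutions, each a binary mask over the three trees.
pareto_masks = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]

scores = [mse(y_true, ensemble_predict(tree_preds, m)) for m in pareto_masks]
best = min(range(len(scores)), key=scores.__getitem__)
print(scores, pareto_masks[best])
```

Here the ensemble made of the first two trees has the smallest MSE and would be returned as the final solution.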
4. Experimental results
4.1. Data preparation
Data preparation plays a significant role in data mining, because good results can only be obtained from experiments on datasets of good quality. After the data are acquired, preprocessing is needed to transform the raw data into an appropriate format for subsequent use, which accounts for nearly 80% of the effort of the data mining process. Data acquisition and data preprocessing for the experiments are described below.
The data used in this work are the colorectal cancer data acquired from the SEER (Surveillance, Epidemiology, and End Results) Program of the National Cancer Institute, which is a primary source of cancer data in the United States. SEER cancer datasets are collected from cancer registries in various regions of the US and cover about 28 percent of the population. The data contain 134 variables with a complete data record description and can be obtained and used free of charge after signing a Research Data Agreement. Colon and rectum is one of the 8 major cancer sites included in the SEER cancer data files; the others are breast, other digestive, female genital, lymphoma and leukemia, male genital, respiratory, and urinary.
The original datasets are TXT files in which the variables are hard to tell apart. Therefore, a C program was written to transform the TXT files into comma-separated files according to the length of each variable. The files were then imported into a database for further processing. The variables of the datasets can be divided into seven categories, as shown in Table 3.
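The conversion from fixed-length fields to comma-separated values can be sketched as follows. The original work used a C program that is not shown here; this Python version is only an illustration, and the field names and widths in FIELDS are hypothetical placeholders rather than the actual SEER record layout.

```python
import csv
import io
from typing import Iterable, List, Sequence, Tuple

# Hypothetical layout: (field name, field width in characters).
# The real widths must be taken from the SEER data record description.
FIELDS: List[Tuple[str, int]] = [
    ("patient_id", 8),
    ("registry", 2),
    ("survival_months", 4),
]


def split_fixed_width(line: str, fields: Sequence[Tuple[str, int]]) -> List[str]:
    """Cut one fixed-width record into its field values using the field widths."""
    values, pos = [], 0
    for _, width in fields:
        values.append(line[pos:pos + width].strip())
        pos += width
    return values


def fixed_width_to_csv(lines: Iterable[str],
                       fields: Sequence[Tuple[str, int]]) -> str:
    """Convert fixed-width records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _ in fields])
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip empty lines
            writer.writerow(split_fixed_width(line, fields))
    return out.getvalue()


# Two made-up records that follow the hypothetical layout above.
raw = ["00000001020061\n", "00000002150007\n"]
csv_text = fixed_width_to_csv(raw, FIELDS)
print(csv_text)
```

The resulting text can be written to a file and loaded with the bulk import facility of whichever database is used.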
To implement the proposed algorithm, all records whose survival time is more than 59 months since diagnosis were labeled “1” (alive) and the other records were labeled “0” (dead). This new variable is used as the target variable for classification, and the target variable for regression is the survival months data item.
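The labeling rule can be written as a small sketch. The record structure and the names survival_months and alive_label are hypothetical; only the 59-month threshold comes from the text.

```python
from typing import Dict, List


def survival_label(survival_months: int, threshold: int = 59) -> int:
    """Return 1 (alive) if survival exceeds the threshold in months, else 0 (dead)."""
    return 1 if survival_months > threshold else 0


# Made-up records; a record with exactly 59 months is labeled 0.
records: List[Dict[str, int]] = [
    {"survival_months": 12},
    {"survival_months": 59},
    {"survival_months": 60},
    {"survival_months": 120},
]

for record in records:
    # Classification target; the regression target stays survival_months itself.
    record["alive_label"] = survival_label(record["survival_months"])

labels = [record["alive_label"] for record in records]
print(labels)
```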
Not all the variables and records in the colorectal cancer dataset were used in this study. In order to improve accuracy and efficiency, data preprocessing of the original colorectal cancer dataset was carried out with respect to both variables and records.
Several variables were removed from the models. Some variables were recoded because of updates to the encoding standard, which results in repeated attributes that should be excluded. Variables directly related to the patient’s critical state, such as the cause of death and vital status, were not used as model inputs because they are linked to the
Table 3. Classification of variables.
Category number Category of variables
I Record identification