April 5, 2021
Handling Non-numerical Values in the Design Space Exploration and Predictive Modeling Studies
Introduction
A simulation model can return a non-numerical value while an optimization problem is being solved or a design of experiments plan is being evaluated, for example because of an unphysical model configuration or unstable behavior of a numerical method. These values carry important information about the model behavior, and treating them correctly makes it possible to train a reliable approximation model and to obtain the best solution of the optimization problem. Additionally, response values can be empty for various reasons, such as corrupted data or parsing issues. For such cases, special types of values are implemented in pSeven:
- NaN: response evaluation failed for the corresponding design point;
- None: response was not evaluated for the corresponding design point.
In other words, NaN is the output value at a point that was evaluated but for which the model could not provide a response, for example, due to an error during a CAD model update or a general inconsistency in the input parameter configuration. None, in contrast, denotes missing data, for example when the response was never calculated at some point.
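The distinction can be illustrated with a minimal plain-Python sketch. It does not use the pSeven API; the function name `classify_response` and the sample data are hypothetical and serve only to show how the two kinds of values differ.

```python
import math

def classify_response(value):
    """Return 'not evaluated', 'failed', or 'ok' for a single response value."""
    if value is None:
        # The response is missing: the point was never evaluated.
        return "not evaluated"
    if isinstance(value, float) and math.isnan(value):
        # The point was evaluated, but the model could not produce a result.
        return "failed"
    return "ok"

# Hypothetical sample: pairs of (design point, response).
sample = [
    ((0.1, 0.2), 3.5),
    ((0.4, 0.9), float("nan")),
    ((0.7, 0.3), None),
]

statuses = [classify_response(f) for _, f in sample]
print(statuses)
```

Note that NaN has to be detected with `math.isnan`, because a NaN value does not compare equal to itself.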
In this tech tip, we explain what situations may result in NaN and None values, and how the Design Space Exploration and Predictive Modeling toolkits handle such points.
1. NaN values
1.1 Design space exploration study
Both design of experiments (DoE) studies and optimization problems are solved by the Design Space Exploration (DSE) block. An optimization or DoE study requires a simulation model (blackbox) to be connected.
1.1.1 Blackbox
If the output of the simulation model cannot be evaluated during a DoE study and the model returns NaN, the DSE block treats such a point as infeasible. Learn more in the pSeven documentation.
During surrogate-based optimization (SBO), a special option (GTOpt/DetectNaNClusters), enabled by default, helps avoid design space areas where blackbox responses are NaN. The optimization algorithm assumes that a response that failed to evaluate is a function that is undefined in some specific regions of the design space. It tries to detect and avoid such regions in order to reduce the number of evaluation failures. In this way, the NaN regions are treated as “implicitly constrained”. You can find details in the pSeven documentation.
NaN regions are estimated from the distances in the design space between neighboring feasible and infeasible points (see Fig. 1). When a NaN point appears (blue cross), the optimizer internally introduces an implicit constraint based on the distance to the nearest points.
Let’s consider two possible scenarios:
- The left branch of Fig. 1: the algorithm adds points close to the boundary of the NaN area, and they turn out to be feasible (not NaN). They become the new closest-to-NaN points and reduce the region of the implicit constraint.
- The right branch of Fig. 1: the algorithm places additional points close to the boundary of the NaN area, and they turn out to be infeasible (NaN). As a result, the constrained region is extended. If a group of NaN points is located next to each other, SBO applies a clustering algorithm to group them into a single region.
Fig. 1. Area around NaN values
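The two scenarios can be mimicked with a toy sketch. This is not the algorithm implemented in pSeven: the rule "excluded radius equals half the distance to the nearest feasible point" is an assumption made for illustration only, the names `nan_radius` and `is_excluded` are hypothetical, and clustering of NaN points is not modeled.

```python
import math

def nan_radius(nan_point, feasible_points, factor=0.5):
    """Radius of the excluded ball around a NaN point (toy rule, an assumption)."""
    nearest = min(math.dist(nan_point, p) for p in feasible_points)
    return factor * nearest

def is_excluded(candidate, nan_points, feasible_points):
    """True if the candidate falls inside the excluded ball of any NaN point."""
    return any(
        math.dist(candidate, q) < nan_radius(q, feasible_points)
        for q in nan_points
    )

feasible = [(0.0, 0.0), (1.0, 0.0)]
nans = [(0.5, 0.5)]
candidate = (0.5, 0.8)

r_before = nan_radius(nans[0], feasible)

# Left branch: a new point near the boundary turns out to be feasible,
# so the excluded region around the NaN point shrinks.
feasible.append((0.5, 0.3))
r_after = nan_radius(nans[0], feasible)
excluded_before = is_excluded(candidate, nans, feasible)

# Right branch: a new point near the boundary turns out to be NaN,
# so the total excluded area grows.
nans.append((0.5, 0.7))
excluded_after = is_excluded(candidate, nans, feasible)

print(r_before, r_after, excluded_before, excluded_after)
```

In this sketch the candidate point is admissible while only one NaN point is known and becomes excluded once the second NaN point is found next to it.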
If the GTOpt/DetectNaNClusters option is disabled, the optimization technique assumes that the response evaluation failures are independent and do not belong to a certain region. For example, if the blackbox returns a NaN response for some input point, the optimization technique can still request evaluations in the adjacent regions.
The default setting is suitable for many real tasks, including those where all responses are well defined: leaving the option enabled has no drawbacks in that case.
In SBO and SBO-based techniques, such as the Adaptive design of experiments, NaNs are handled as described above.
The gradient-based optimization (GBO) technique does not handle NaNs explicitly. It avoids directions that lead to NaN points and finishes with the error message “NaN/Inf problems” if all possible directions are exhausted.
1.1.2 Initial sample
For both DoE and optimization problems, existing data can be used as the initial sample of designs. The initial sample cannot contain NaN values for variables. At the same time, NaN values in responses are processed as described above.
This means that SBO treats NaNs based on the option GTOpt/DetectNaNClusters: it avoids directions where values of responses are NaN if the option is enabled (by default), and continues exploring such directions in the design space if the option is switched off. See the pSeven documentation for details.
The gradient-based optimization (GBO) technique cannot treat NaN values and, where possible, ignores directions that lead to NaN responses.
1.2 Predictive modeling
The Model builder tool offers a wide range of hints that provide additional information about the training data. The “Enable NaN prediction” hint makes it possible to train a model that predicts NaN output values in the areas near the NaN points of the training sample. See the documentation chapter describing model requirements to find out how to set this hint.
For example, let us consider a training sample that contains a NaN point (Fig. 2) and build an approximation model with the “Enable NaN prediction” option. When we make predictions with this surrogate model, a white area appears around the NaN training point; it marks the input points for which the model predicts NaN (Fig. 3).
Fig. 2. Training dataset with NaN point
Fig. 3. NaNs prediction
If we gradually increase the number of NaN points in the training dataset, we can see how the areas around the NaNs are formed (Fig. 4).
Fig. 4. NaNs prediction
The plots above illustrate that each NaN value is surrounded by a region whose size depends on the distance to the nearest point, as mentioned in the previous part of this tech tip. No clustering of NaN points is applied in the Model builder toolkit. See the documentation for more information.
If non-numeric values appear in the input part of the training sample, the GTApprox/InputNanMode option specifies how to handle them. Since the approximation builder cannot obtain any information from NaN values of variables, two scenarios are possible:
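The idea of a NaN region bounded by the nearest training points can be shown with a deliberately simple one-dimensional sketch. It is a nearest-neighbor toy predictor, not the approximation technique used by the Model builder; the function `predict` and the training data are hypothetical.

```python
import math

def predict(x, train_x, train_f):
    """Toy predictor: return the response of the closest training point.

    If the closest training point has a NaN response, the prediction is NaN,
    so every NaN training point is surrounded by a region of NaN predictions.
    """
    nearest = min(range(len(train_x)), key=lambda i: abs(x - train_x[i]))
    return train_f[nearest]

# Hypothetical training sample with one NaN response at x = 2.0.
train_x = [0.0, 1.0, 2.0, 3.0]
train_f = [0.0, 1.0, float("nan"), 9.0]

queries = (0.2, 1.4, 1.8, 2.4, 2.9)
predictions = {x: predict(x, train_x, train_f) for x in queries}

for x, f in predictions.items():
    print(x, "NaN" if math.isnan(f) else f)
```

Here the NaN region extends halfway to the neighboring valid points, so adding valid training points closer to x = 2.0 would narrow it.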
- “Raise” (default): raise an exception and cancel training
- “Ignore”: exclude data points with non-numeric values from the sample and continue training.
Full information is available in the documentation.
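The effect of the two modes can be emulated in plain Python. The sketch below does not call pSeven; the function `filter_inputs` and its `input_nan_mode` parameter are hypothetical stand-ins, and only NaN values (not other non-numeric values) are checked.

```python
import math

def filter_inputs(x_rows, f_values, input_nan_mode="raise"):
    """Emulate the two described behaviors for NaN values in inputs."""
    def has_nan(row):
        return any(math.isnan(v) for v in row)

    if input_nan_mode == "raise":
        # Cancel training as soon as any input row contains NaN.
        if any(has_nan(row) for row in x_rows):
            raise ValueError("training sample contains NaN in inputs")
        return list(x_rows), list(f_values)
    if input_nan_mode == "ignore":
        # Drop the affected points and keep the rest of the sample.
        kept = [(x, f) for x, f in zip(x_rows, f_values) if not has_nan(x)]
        return [x for x, _ in kept], [f for _, f in kept]
    raise ValueError(f"unknown mode: {input_nan_mode!r}")

# Hypothetical training sample: the second point has a NaN input.
x_rows = [(0.0, 1.0), (float("nan"), 2.0), (3.0, 4.0)]
f_values = [1.0, 2.0, 3.0]

kept_x, kept_f = filter_inputs(x_rows, f_values, "ignore")
print(kept_x, kept_f)

try:
    filter_inputs(x_rows, f_values, "raise")
    raised = False
except ValueError:
    raised = True
print("raise mode stopped training:", raised)
```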
2. None values
2.1 Design space exploration study
2.1.1 Blackbox
Since an empty value does not contain any information about the simulation model, None values in a blackbox response cannot be handled by any optimization or design of experiments method.
2.1.2 Initial sample
Same as for NaNs, the initial sample should not contain missing values (Nones) in the input part of the dataset.
However, the responses part of the initial sample can contain empty values. This follows from the nature of the initial sample: an existing sample can contain values of variables only, or values of variables with partially missing responses, because including those particular points in the design of experiments plan can be useful. So, the output part of the initial sample can be fully or partially empty. In the first case, a special technique, “Initial sample evaluation”, is implemented to evaluate the responses for the given values of variables. In the second case, if None values are in the responses part of the initial sample, the DSE block considers such points “not evaluated” and runs the simulation for them first, before the optimization or other parametric study, in order to fill the initial sample. Then it processes the full sample of initial points according to the current technique; for example, it evaluates the objectives in an optimization study. See the documentation for detailed information.
The DSE block accepts the initial sample not only as a whole table with inputs and outputs, but also as separate samples for x.Initial sample (inputs), f.Initial sample (outputs) and c.Initial sample (constraints). In this case, if some of these ports are not connected, the corresponding values are treated as None (Fig. 5).
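The "fill the initial sample first" step can be sketched as follows. This is an illustration in plain Python rather than the DSE block itself; `fill_initial_sample` and the `blackbox` function are hypothetical.

```python
def fill_initial_sample(inputs, responses, blackbox):
    """Evaluate the blackbox only for points whose response is None."""
    filled = []
    evaluated = 0
    for x, f in zip(inputs, responses):
        if f is None:
            # The point is "not evaluated": run the simulation for it.
            f = blackbox(x)
            evaluated += 1
        filled.append(f)
    return filled, evaluated

def blackbox(x):
    """Hypothetical stand-in for a simulation model."""
    return x[0] ** 2 + x[1]

# Initial sample with partially missing responses.
inputs = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.0)]
responses = [3.0, None, None]

filled, n_runs = fill_initial_sample(inputs, responses, blackbox)
print(filled, n_runs)
```

Responses that are already known are reused as they are, so the simulation runs only for the points that were marked with None.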
Fig. 5. Initial sample in DSE block
2.2 Predictive Modeling
A None value means that a response was not evaluated for the corresponding design point. For this reason, pSeven does not accept datasets with empty (None) points as a training sample for building an approximation model. The data sample should therefore contain numerical values in both the variables and responses parts to train the model.
Conclusions
In this tech tip, we considered how the Design Space Exploration and Predictive Modeling tools behave when a simulation model returns non-numerical values, regardless of the reason. Special NaN and None value types are implemented in pSeven to handle such points. All cases are summarized in the table below.
| Value | DoE / ADoE / SBO: in initial sample | DoE / ADoE / SBO: in blackbox response | GBO: in initial sample | GBO: in blackbox response | Approximation builder: in training sample | Approximation builder: during model evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| NaN | In inputs: raises exception. In outputs: can be effectively used to avoid the NaN region | Can be handled to create an internal implicit constraint | In inputs: raises exception. In outputs: ignored | Used to avoid the NaN direction | In inputs: raises exception. In outputs: can be predicted, ignored, or trigger an exception | Predicted if the hint is set |
| None | In inputs: raises exception. In outputs: marks points that require recalculation | Cannot be handled (exception) | In inputs: raises exception. In outputs: marks points that require recalculation | Cannot be handled (exception) | In inputs: raises exception. In outputs: exception | Not supported |
By Yulia Bogdanova, Application Engineer, DATADVANCE