Optimizing Interval Data Classification Through Two-Stage Mixed Integer Programming: A Discriminant Analysis Perspective
Abstract:
Efficient classification of interval data presents considerable challenges, particularly when group overlaps and data uncertainty are prevalent. This study introduces an innovative two-stage Mixed Integer Programming (MIP) framework for discriminant analysis (DA), which is designed to minimize misclassification of vertices while effectively addressing the problem of overlapping groups. By incorporating interval data structures, the proposed model captures both the shared characteristics within groups and the distinct separations between them. The first stage of the model focuses on the identification of group-specific boundaries, while the second stage refines classification by incorporating probabilistic estimates of group memberships. A Monte Carlo simulation is employed to evaluate the robustness of the model under conditions of imprecision and noise, and the results demonstrate its superior capability in handling overlapping data and classifying uncertain observations. Validation through numerical experiments illustrates the model's effectiveness in accurately resolving group overlaps, thereby improving classification performance. The approach offers significant advantages over traditional methods by probabilistically estimating group memberships, thus enhancing decision-making processes in uncertain environments. These findings suggest that the proposed MIP framework holds substantial promise for applications across a range of complex decision-making scenarios, such as those encountered in finance, healthcare, and engineering, where data imprecision is a critical concern.
1. Introduction
Discriminant analysis (DA) is a statistical technique used to predict the group membership of a new observation. A set of known observations is employed to estimate the weights or parameters, as well as to compute the discriminant score, which serves to categorize the observations into two groups, $G_1$ and $G_2$. The method typically involves either minimizing the total deviations of misclassified observations or maximizing the total deviations of correctly classified observations. A new observation is classified into one of the groups by comparing its discriminant score to the evaluation score derived from the discriminant function.
Data envelopment analysis (DEA) evaluates the relative efficiency of a set of homogeneous decision-making units (DMUs) [1]. DEA is particularly effective when multiple inputs and outputs are involved, as traditional methods may struggle to assess efficiency in such complex scenarios [2]. It is especially valuable when conventional approaches, such as ratios (e.g., cost per unit), are insufficient due to the multifaceted nature of the inputs and outputs [3]. Despite their differences, there are notable similarities between DEA and DA. DA is used to differentiate two sets based on prior knowledge, whereas DEA categorizes two sets into efficient and inefficient classifications without the need for prior knowledge [4]. The goal programming method can be applied within each of these frameworks to improve their efficacy [5].
A novel non-parametric discriminant analysis approach has recently been proposed, which is capable of generating a set of weights for a linear discriminant function, thereby producing an assessment score to determine the group membership of observations [6], [7]. This non-parametric approach is referred to as DEA/DA. However, the original DEA/DA method does not accommodate negative data. To address this limitation, an extended version of DEA/DA has been introduced, known as extended DEA/DA, which can handle negative data effectively [8].
The extended DEA/DA approach aims to minimize the overall discrepancies in misclassified observations. In contrast, the primary criterion for evaluating a DA model is the total number of observations that it classifies correctly. Sueyoshi [9] proposed a novel MIP technique to extend the DEA formulation [10], [11], which estimates weights by minimizing the total number of misclassified observations.
However, none of Sueyoshi's models address the classification of imprecise data. Jahanshahloo et al. [12] introduced a method that overcomes a limitation of Sueyoshi's models. Furthermore, Jahanshahloo et al. [13] proposed a method for classifying interval data using DA models by minimizing the sum of the external deviations of all misclassified vertices in the interval data. In recent years, numerous scholars have developed and implemented innovative models and applications of DEA, extending its use to a variety of fields, including energy efficiency, healthcare management, and sustainability assessment [14], [15], [16], [17], [18], [19], [20], [21]. These advancements underscore the versatility of DEA in addressing complex decision-making and performance evaluation challenges across multiple industries.
This research presents an alternative formulation of MIP aimed at reducing the overall number of misclassified vertices in interval data, extending the work mentioned above.
The structure of the study is as follows: Section 2 presents the concepts of interval data and a method for classifying them into two groups. Sueyoshi’s two-stage model is also reviewed at the end of this section. Section 3 develops a reformulation of Jahanshahloo's model using the MIP approach. Section 4 provides a numerical example to demonstrate the application of the model. Finally, Section 5 concludes with a summary of the findings and remarks on the analyzed model.
2. Background
Suppose that there are $n$ observations, $\left(X_j, Y_j\right),(j=1, \ldots, n)$, in which $X_j=\left(x_{1 j}, \ldots, x_{m j}\right)$ is the $j^{\text {th }}$ input vector and $Y_j=\left(y_{1 j}, \ldots, y_{s j}\right)$ is the $j^{\text {th }}$ output vector. Each observation has $k$ independent factors, $\left(x_{1 j}, \ldots, x_{m j}, y_{1 j}, \ldots, y_{s j}\right)$, with $m+s=k$. Also, assume that all observations $\left(X_j, Y_j\right),(j=1, \ldots, n)$ are interval data, i.e., $x_{i j} \in\left[x_{i j}^L, x_{i j}^U\right],(i=1, \ldots, m),(j=1, \ldots, n)$ and $y_{r j} \in\left[y_{r j}^L, y_{r j}^U\right],(r=1, \ldots, s),(j=1, \ldots, n)$, where the lower and upper bounds of each interval are positive constants. These observations can be classified into groups $G_1$ and $G_2$, containing $n_1$ and $n_2$ observations, respectively. In addition, we assume $n_1+n_2=n$ and $G_1 \cup G_2=G$.
Let $J_1=\left\{j \mid\left(X_j, Y_j\right) \in G_1\right\}$ and $J_2=\left\{j \mid\left(X_j, Y_j\right) \in G_2\right\}$. The goal is to find a hyperplane $(\alpha, \beta)^t(X, Y)=d$ such that $(\alpha, \beta)^t(X, Y) \geq d$ for $(X, Y) \in G_1$ and $(\alpha, \beta)^t(X, Y) \leq d-\varepsilon$ for $(X, Y) \in G_2$. Note that $(\alpha, \beta)^t(X, Y)$ is a linear discriminant function, and $d$ and $d-\varepsilon$ are the first and second groups' discriminant scores, respectively. The small positive number $\varepsilon$ prevents the trivial solution in which all weights equal zero. Consequently, we have:
We assume the vector $(\alpha, \beta)$ is normalized. Note that all $\left(x_{i j}, y_{r j}\right)(i=1, \ldots, m, r=1, \ldots, s, j=1, \ldots, n)$ in system (1) take exact (crisp) values. Otherwise, they take the interval form $x_{i j} \in\left[x_{i j}^{L_{i j}}, x_{i j}^{U_{i j}}\right],(i=1, \ldots, m, j=1, \ldots, n)$ and $y_{r j} \in\left[y_{r j}^{L_{r j}}, y_{r j}^{U_{r j}}\right],(r=1, \ldots, s, j=1, \ldots, n)$. Therefore, we have:
Satisfaction of the preceding system is evident whenever the conditions of system (3) are met. Let us first define
$\Gamma_j^{-}=\left\{t_{i j} \mid t_{i j} \in\left\{L_{i j}, U_{i j}\right\}, i=1, \ldots, m\right\}$
$\Gamma_j^{+}=\left\{t_{r j} \mid t_{r j} \in\left\{L_{r j}, U_{r j}\right\}, r=1, \ldots, s\right\}$
Therefore, system (2) can be rewritten as
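Concretely, every interval observation contributes one constraint per combination of interval endpoints, i.e., $2^k$ vertices per observation. A minimal sketch of this enumeration in Python (the helper name `vertices` is ours):

```python
from itertools import product

def vertices(x_bounds, y_bounds):
    """Enumerate the 2^(m+s) vertices of one interval observation.

    x_bounds and y_bounds are lists of (lower, upper) pairs for the
    m inputs and s outputs; each vertex picks one endpoint per factor,
    mirroring the endpoint-index sets Gamma_j^- and Gamma_j^+ above.
    """
    return list(product(*(list(x_bounds) + list(y_bounds))))

# One input in [2, 3] and one output in [5, 6] yield four vertices:
print(vertices([(2, 3)], [(5, 6)]))
# [(2, 5), (2, 6), (3, 5), (3, 6)]
```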
The primary procedures for identifying overlaps in interval data are as follows:
Stage 1. Classifying and Identifying Overlap
The identification of overlap concerning system (3) is shown as follows:
In model (4), $s_{1 j}^{+t_{1 j}, \ldots, t_{m j}}$ and $s_{1 j}^{-t_{1 j}, \ldots, t_{s j}}$ are, respectively, the positive and negative deviations of the linear discriminant function $\sum_{i=1}^m \alpha_i x_{i j}+\sum_{r=1}^s \beta_r y_{r j}$ from the discriminant score $d$ in $G_1$. The positive deviation $\left(s_{1 j}^{+t_{1 j}, \ldots, t_{m j}}>0, j \in G_1,\left(t_{1 j}, \ldots, t_{m j}\right) \in \Gamma_j^{-}\right)$ measures the incorrect classification of the $j^{\text {th }}$ observation in $G_1$ and is minimized, while the negative deviation $\left(s_{1 j}^{-t_{1 j}, \ldots, t_{s j}}>0, j \in G_1,\left(t_{1 j}, \ldots, t_{s j}\right) \in \Gamma_j^{+}\right)$ indicates the correct classification of the $j^{\text {th }}$ observation in $G_1$. Furthermore, $s_{2 j}^{+t_{1 j}, \ldots, t_{m j}}$ and $s_{2 j}^{-t_{1 j}, \ldots, t_{s j}}$ are, respectively, the positive and negative deviations of the same linear discriminant function $\sum_{i=1}^m \alpha_i x_{i j}+\sum_{r=1}^s \beta_r y_{r j}$ from the discriminant score $(d-\varepsilon)$ in $G_2$; the small positive number $\varepsilon$ again rules out the trivial solution in which all weights equal zero. In this case, the negative deviation $\left(s_{2 j}^{-t_{1 j}, \ldots, t_{s j}}, j \in G_2,\left(t_{1 j}, \ldots, t_{s j}\right) \in \Gamma_j^{+}\right)$ indicates an incorrect group classification, while the positive deviation $\left(s_{2 j}^{+t_{1 j}, \ldots, t_{m j}}, j \in G_2,\left(t_{1 j}, \ldots, t_{m j}\right) \in \Gamma_j^{-}\right)$ indicates a correct one.
Here, let $\alpha^*, \beta^*$ and $d^*$ be the optimal solution of the above model. Accordingly, a new observation $\left(X_p, Y_p\right) \in\left(\left[X_p^{L_p}, X_p^{U_p}\right],\left[Y_p^{L_p}, Y_p^{U_p}\right]\right)$ can be classified as follows:
then part of the new observation belongs to $G_1$ and the rest belongs to $G_2$.
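A sketch of this vertex-based rule (the function and argument names are ours; `bounds` collects the $m+s$ interval endpoint pairs of the new observation):

```python
from itertools import product

def classify_interval(alpha, beta, d, eps, bounds):
    """Classify an interval observation via its vertex scores.

    bounds: list of (lower, upper) pairs for all m + s factors;
    alpha + beta concatenated gives the k optimal weights.
    """
    weights = list(alpha) + list(beta)
    scores = [sum(w * v for w, v in zip(weights, vtx))
              for vtx in product(*bounds)]
    if all(sc >= d for sc in scores):
        return "G1"           # every vertex lies on the G1 side
    if all(sc <= d - eps for sc in scores):
        return "G2"           # every vertex lies on the G2 side
    return "split"            # vertices fall on both sides
```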
Theorem 1. Model (4) always has a bounded optimal solution.
Refer to the study [4] for the proof.
Theorem 2. Let $C\left(G_1\right)$ and $C\left(G_2\right)$ be the convex hulls of $G_1$ and $G_2$, respectively. Then $C\left(G_1\right) \cap C\left(G_2\right)=\emptyset$ if $\phi^*=0$, where $\phi^*$ denotes the optimal objective value of model (4).
Refer to the study [4] for the proof.
Stage 2. Identifying Overlap Using the Monte Carlo Approach
Now, we describe a technique for computing the overlap using the Hit-and-Miss Monte Carlo algorithm [3]. This method quantifies the measure of the overlapping zone for any interval observation. Consider the $P^{\text {th }}$ observation, $\left(X_p, Y_p\right) \in\left(\left[X_p^{L_p}, X_p^{U_p}\right],\left[Y_p^{L_p}, Y_p^{U_p}\right]\right)$, as an interval observation satisfying part (3) of (5), with $x_{i p} \in\left[x_{i p}^{L_{i p}}, x_{i p}^{U_{i p}}\right],(i=1, \ldots, m)$ and $y_{r p} \in\left[y_{r p}^{L_{r p}}, y_{r p}^{U_{r p}}\right],(r=1, \ldots, s)$. Accordingly, the measure of the whole region of the $P^{\text {th }}$ observation is $\prod_{i=1}^m\left(x_{i p}^{U_{i p}}-x_{i p}^{L_{i p}}\right) \prod_{r=1}^s\left(y_{r p}^{U_{r p}}-y_{r p}^{L_{r p}}\right)$. A random vector $\left(\bar{X}_p, \bar{Y}_p\right)=\left(\bar{x}_{1 p}, \ldots, \bar{x}_{m p}, \bar{y}_{1 p}, \ldots, \bar{y}_{s p}\right)$ is then generated, where for each $i,(i=1, \ldots, m)$, $\bar{x}_{i p} \sim U\left(x_{i p}^{L_{i p}}, x_{i p}^{U_{i p}}\right)$ indicates that $\bar{x}_{i p}$ is distributed uniformly on $\left(x_{i p}^{L_{i p}}, x_{i p}^{U_{i p}}\right)$, generated as $\bar{x}_{i p}=x_{i p}^{L_{i p}}+u_i\left(x_{i p}^{U_{i p}}-x_{i p}^{L_{i p}}\right)$ with $u_i \sim U(0,1)$. Similarly, for each $r,(r=1, \ldots, s)$, $\bar{y}_{r p} \sim U\left(y_{r p}^{L_{r p}}, y_{r p}^{U_{r p}}\right)$ indicates that $\bar{y}_{r p}$ is distributed uniformly on $\left(y_{r p}^{L_{r p}}, y_{r p}^{U_{r p}}\right)$, generated as $\bar{y}_{r p}=y_{r p}^{L_{r p}}+u_r\left(y_{r p}^{U_{r p}}-y_{r p}^{L_{r p}}\right)$ with $u_r \sim U(0,1)$.
Using inequalities (6) and (7) and counting the number of Hits, we can estimate the measure of the overlapped region of the $P^{\mathrm{th}}$ observation with the relation below:
where $N_H$ indicates the number of vectors $\left(\bar{X}_p, \bar{Y}_p\right)$ satisfying conditions (6) or (7), and $N$ stands for the total number of generated vectors.
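A minimal Hit-and-Miss sketch of this estimator, assuming a user-supplied predicate `in_overlap` encoding conditions (6) and (7); the names and the default sample size are ours:

```python
import random

def overlap_fraction(bounds, in_overlap, n_samples=100_000):
    """Estimate the overlapped share of one interval observation.

    bounds: (lower, upper) pairs for all m + s factors; in_overlap:
    predicate returning True when a sampled point satisfies conditions
    (6) or (7). Coordinates are drawn uniformly from their intervals,
    so hits / n_samples estimates N_H / N.
    """
    hits = 0
    for _ in range(n_samples):
        point = [lo + random.random() * (up - lo) for lo, up in bounds]
        if in_overlap(point):
            hits += 1
    return hits / n_samples
```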
Sueyoshi [9] introduced a novel two-stage MIP formulation as an extension of the extended DEA/DA model aimed at decreasing the overall count of misclassified observations:
Stage 1: Classifying and Identifying Overlap
The initial phase of extended DEA/DA is redefined as follows:
where, $Z=(X, Y)$.
Now, let $\lambda_i^*\left(=\lambda_i^{+*}-\lambda_i^{-*}\right)$, $d^*$ and $s^*$ be the optimal solution of model (9).
(a) If $s^* \leq 0$ (no overlap), the MIP computation terminates, since the entire observation set is clearly classified into $G_1$ and $G_2$ by $\sum_{i=1}^k \lambda_i^* Z_i=d^*$.
(b) If $s^*>0$ (an overlap), the subsequent phase can commence.
Stage 2: Handling Overlap
Assume that:
and
Since the observations' group membership in the overlap $\left\{\left(X_j, Y_j\right) \mid j \in D_1 \cup D_2\right\}$ is unknown, we need to reformulate stage two as follows:
where $M$ represents a specified large number. A key characteristic of stage two is that its objective function minimizes the number of misclassified observations. The binary variable $y_j$ signals the presence of such misclassifications. A newly sampled observation $Z_r=\left(z_{1 r}, \ldots, z_{k r}\right)$ is classified as follows:
First, we check the following cases.
(a-1) If $\sum_{i=1}^k \lambda_i^* z_{i r}>d^*+s^*$, then the observation belongs to $G_1$.
(b-1) If $\sum_{i=1}^k \lambda_i^* z_{i r} \lt d^*-s^*$, then the observation belongs to $G_2$.
(c-1) If $d^*-s^* \leq \sum_{i=1}^k \lambda_i^* z_{i r} \leq d^*+s^*$, the group membership of the observation is unknown because it lies in the overlapped region; we then check the two cases below:
(a-2) If $\sum_{i=1}^k \lambda_i^* z_{i r}>c^*$, then the observation belongs to $G_1$.
(b-2) If $\sum_{i=1}^k \lambda_i^* z_{i r} \leq c^*-\varepsilon$, then the observation belongs to $G_2$.
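For concreteness, a sketch of this two-stage decision rule in Python (function and argument names are ours; $\lambda^*, d^*, s^*$ come from stage one and $c^*$ from stage two):

```python
def two_stage_classify(lam, d, s, c, z, eps=1e-6):
    """Apply cases (a-1) through (b-2) to a newly sampled point z.

    lam: optimal weights lambda*; d, s: stage-one score and overlap
    half-width; c: stage-two cutoff; eps: the small separation constant.
    """
    score = sum(l * zi for l, zi in zip(lam, z))
    if score > d + s:            # (a-1): clearly on the G1 side
        return "G1"
    if score < d - s:            # (b-1): clearly on the G2 side
        return "G2"
    # (c-1): inside the overlapped band, resolve with stage two
    if score > c:                # (a-2)
        return "G1"
    return "G2"                  # (b-2): score <= c - eps, eps negligible
```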
3. Modified MIP Approach
Regarding models (9) and (13) presented above, it can be seen that the constraint sets $\xi_i^{+}+\xi_i^{-} \leq 1,(i=1, \ldots, k)$ and $\sum_{i=1}^k\left(\xi_i^{+}+\xi_i^{-}\right)=k$ are together equivalent to the constraints $\xi_i^{+}+\xi_i^{-}=1,(i=1, \ldots, k)$: $k$ terms, each at most 1, can sum to $k$ only if every term equals 1.
Therefore, models (9) and (13) are reformulated as follows, respectively:
$\begin{array}{ll}\text { Min } s \\ \text { s.t. } \sum_{i=1}^k\left(\lambda_i^{+}-\lambda_i^{-}\right) z_{i j}-d+s \geq 0, \quad j \in J_1 \\ \sum_{i=1}^k\left(\lambda_i^{+}-\lambda_i^{-}\right) z_{i j}-d-s \leq 0, \quad j \in J_2 \\ \sum_{i=1}^k\left(\lambda_i^{+}+\lambda_i^{-}\right)=1-2 u, \\ \text{NLC}: \xi_i^{+} \geq \lambda_i^{+} \geq \varepsilon \xi_i^{+} \quad \& \quad \xi_i^{-} \geq \lambda_i^{-} \geq \varepsilon \xi_i^{-}, \quad i=1, \ldots, k \\ \xi_i^{+}+\xi_i^{-}=1, \quad i=1, \ldots, k, \quad \xi_i^{+}, \xi_i^{-}, u \in\{0,1\} \\ d, s: \text { unrestricted; all other variables } \geq 0\end{array}$
and
$ \begin{array}{ll} \text { Min } \sum_{j \in D_1} y_j+\sum_{j \in D_2} y_j \\ \text { s.t. } \sum_{i=1}^k\left(\lambda_i^{+}-\lambda_i^{-}\right) z_{i j}-c+M y_j \geq 0, \quad j \in D_1 \\ \sum_{i=1}^k\left(\lambda_i^{+}-\lambda_i^{-}\right) z_{i j}-c-M y_j \leq -\varepsilon, \quad j \in D_2 \\ \sum_{i=1}^k\left(\lambda_i^{+}+\lambda_i^{-}\right)=1-2 u, \\ \text{NLC}: \xi_i^{+} \geq \lambda_i^{+} \geq \varepsilon \xi_i^{+} \quad \& \quad \xi_i^{-} \geq \lambda_i^{-} \geq \varepsilon \xi_i^{-}, \quad i=1, \ldots, k \\ \xi_i^{+}+\xi_i^{-}=1, \quad i=1, \ldots, k, \quad \xi_i^{+}, \xi_i^{-}, u, y_j \in\{0,1\}\\ c: \text { unrestricted; all other variables } \geq 0 \end{array} $
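To make the two reformulated models concrete, the following sketch encodes them with the open-source PuLP modeler (an illustrative choice; any MIP solver would serve). The function names `stage1`/`stage2`, the argument conventions ($Z_1, Z_2$ as lists of $k$-dimensional observation vectors, $D_1, D_2$ as the doubtful overlap observations), and the default values of `eps` and `M` are our assumptions, not part of the original formulation:

```python
import pulp  # open-source MIP modeler, an illustrative choice

def stage1(Z1, Z2, eps=1e-3):
    """Sketch of the reformulated stage-one model: minimize s."""
    k = len(Z1[0])
    prob = pulp.LpProblem("COI", pulp.LpMinimize)
    lam_p = [pulp.LpVariable(f"lp{i}", lowBound=0) for i in range(k)]
    lam_m = [pulp.LpVariable(f"lm{i}", lowBound=0) for i in range(k)]
    xi_p = [pulp.LpVariable(f"xp{i}", cat="Binary") for i in range(k)]
    xi_m = [pulp.LpVariable(f"xm{i}", cat="Binary") for i in range(k)]
    u = pulp.LpVariable("u", cat="Binary")
    d, s = pulp.LpVariable("d"), pulp.LpVariable("s")  # unrestricted
    score = lambda z: pulp.lpSum((lam_p[i] - lam_m[i]) * z[i] for i in range(k))
    prob += s  # objective: width of the overlap band
    for z in Z1:
        prob += score(z) - d + s >= 0
    for z in Z2:
        prob += score(z) - d - s <= 0
    prob += pulp.lpSum(lam_p[i] + lam_m[i] for i in range(k)) == 1 - 2 * u
    for i in range(k):  # NLC: each weight takes exactly one sign
        prob += lam_p[i] <= xi_p[i]
        prob += lam_p[i] >= eps * xi_p[i]
        prob += lam_m[i] <= xi_m[i]
        prob += lam_m[i] >= eps * xi_m[i]
        prob += xi_p[i] + xi_m[i] == 1
    prob.solve()
    lam = [pulp.value(lam_p[i]) - pulp.value(lam_m[i]) for i in range(k)]
    return lam, pulp.value(d), pulp.value(s)

def stage2(D1, D2, M=1e4, eps=1e-3):
    """Sketch of the reformulated stage-two model: count misclassifications."""
    k = len((D1 + D2)[0])
    prob = pulp.LpProblem("HO", pulp.LpMinimize)
    lam_p = [pulp.LpVariable(f"lp{i}", lowBound=0) for i in range(k)]
    lam_m = [pulp.LpVariable(f"lm{i}", lowBound=0) for i in range(k)]
    xi_p = [pulp.LpVariable(f"xp{i}", cat="Binary") for i in range(k)]
    xi_m = [pulp.LpVariable(f"xm{i}", cat="Binary") for i in range(k)]
    u = pulp.LpVariable("u", cat="Binary")
    c = pulp.LpVariable("c")  # unrestricted cutoff
    y = [pulp.LpVariable(f"y{j}", cat="Binary")
         for j in range(len(D1) + len(D2))]
    score = lambda z: pulp.lpSum((lam_p[i] - lam_m[i]) * z[i] for i in range(k))
    prob += pulp.lpSum(y)  # objective: number of misclassified observations
    for j, z in enumerate(D1):
        prob += score(z) - c + M * y[j] >= 0
    for j, z in enumerate(D2, start=len(D1)):
        prob += score(z) - c - M * y[j] <= -eps
    prob += pulp.lpSum(lam_p[i] + lam_m[i] for i in range(k)) == 1 - 2 * u
    for i in range(k):
        prob += lam_p[i] <= xi_p[i]
        prob += lam_p[i] >= eps * xi_p[i]
        prob += lam_m[i] <= xi_m[i]
        prob += lam_m[i] >= eps * xi_m[i]
        prob += xi_p[i] + xi_m[i] == 1
    prob.solve()
    lam = [pulp.value(lam_p[i]) - pulp.value(lam_m[i]) for i in range(k)]
    return lam, pulp.value(c), [int(pulp.value(v)) for v in y]
```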
4. The Proposed Algorithm
Consider a decision problem with two groups ($G_1$ and $G_2$) of interval observations, containing $n_1$ and $n_2$ observations, respectively, so that the total number of observations is $n$. Every observation is characterized by $m$ independent input factors and $s$ independent output factors, denoted by $\left(x_{i j}, y_{r j}\right)$ for $i=1, \ldots, m$, $r=1, \ldots, s$ and $j=1, \ldots, n$.
Mathematically, the MIP approach for the interval data has the following two-stage computational process.
1: Classifying and Identifying Overlap.
2: Handling Overlap.
To classify the observations and identify the presence or absence of an overlapped region between them, we solve the following mathematical programming (MP) problem:
Now, let $\alpha^*, \beta^*, d^*$ and $s^*$ be the optimal solution of the previous MP.
(a) If $s^* \leq 0$ (no overlap), the MIP computation stops, since the observations are completely classified into $G_1$ or $G_2$ by $\alpha^* X+\beta^* Y=d^*$.
(b) If $s^*>0$ (an overlap), the following step starts.
At this stage, it is clear that all or some vertices of some interval observations belong to the overlapped region. To handle the classification of such observations, we first define $V_1$ and $V_2$ as the sets of vertices of the observations belonging to $G_1$ and $G_2$, respectively.
Also, suppose that
to be the sets of vertices that are correctly classified into $G_1$ or $G_2$. Now, assume that:
Thus, the handling-overlap (HO) stage is formulated as follows:
A significant characteristic of model (18) is that its objective minimizes the number of misclassified vertices. The binary variable $y_j$ signals the presence of misclassified vertices.
Subsequently, consider the newly sampled $P^{\text {th }}$ observation $\left(X_p, Y_p\right)$ with $X_p \in\left[X_p^{L_p}, X_p^{U_p}\right]$ and $Y_p \in\left[Y_p^{L_p}, Y_p^{U_p}\right]$. Going through stage one, we assume $\alpha^*, \beta^*, d^*$ and $s^*$ to be the optimal solution.
Otherwise, all or parts of the $P^{\text {th }}$ observation pertain to the overlapped region. Therefore, stage two starts to decide whether this observation would pertain to $G_1$ or $G_2$.
In this case,
Here, since $\varepsilon$ is a very small number, we omit it.
(3) Otherwise, some vertices of the arbitrary interval data $\left(X_p, Y_p\right)$ pertain to $G_1$ and the rest pertain to $G_2$.
To overcome this, let $A_1$ be the part of $\left(X_p, Y_p\right)$ that pertains to $G_1$ and $A_2$ the part that belongs to $G_2$.
Now, we use the Monte Carlo method to measure how much of the observation ($X_p, Y_p$) pertains to $G_1$ and $G_2$ by counting the number of Hits ($N_{H_1}$ or $N_{H_2}$) of all the arbitrary vectors $(N)$ that pertain to one of our Groups ($G_1$ or $G_2$).
Let $V_{P_1}=\frac{N_{H_1}}{N}$ be the estimated fraction of the overlapped observation that belongs to $G_1$.
where $\Delta$ indicates the error by which $\left(X_p, Y_p\right)$ can be assumed to belong to one of the groups. In other words, we measure the probability with which the interval observation $\left(X_p, Y_p\right)$ belongs to either group. Also, since $N$ is a given large number, it is practically impossible for $N_{H_1}$ to equal exactly half of $N$. Therefore, we can disregard the case $V_{P_1}=\frac{1}{2}$ and do not consider it.
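As a complement, a small sketch of this assignment rule; the function name and the way the tolerance $\Delta$ is applied are our illustrative reading of the text:

```python
def assign_group(n_h1, n, delta=0.05):
    """Assign an overlapped observation from its estimated G1 share.

    V_p1 = n_h1 / n estimates the part of (X_p, Y_p) lying in G1. The
    observation goes to the group with the larger share; the smaller
    share is reported as the assignment error and compared against the
    tolerance Delta (illustrative default, not from the original text).
    """
    v_p1 = n_h1 / n
    group, error = ("G1", 1 - v_p1) if v_p1 > 0.5 else ("G2", v_p1)
    return group, error, error <= delta

# Example mirroring Section 5: N_H1 = 247_500 of N = 500_000 hits gives
# V_p1 = 0.495, so the observation goes to G2 with error 0.495.
print(assign_group(247_500, 500_000))
```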
5. Examples
Suppose $G_1$ and $G_2$ are two interval data groups, each observation having one input and one output, as listed in Table 1.
| Groups | $\boldsymbol{DMU_j}$ | $\boldsymbol{X_j^{L_j}}$ | $\boldsymbol{X_j^{U_j}}$ | $\boldsymbol{Y_j^{L_j}}$ | $\boldsymbol{Y_j^{U_j}}$ |
|---|---|---|---|---|---|
| Group 1 $\left(G_1\right)$ | $P_1$ | 4 | 5 | 4 | 5 |
| | $P_2$ | 2 | 3 | 5 | 6 |
| Group 2 $\left(G_2\right)$ | $P_3$ | 1 | 2 | 3 | 4 |
| | $P_4$ | 4 | 5 | 1 | 3 |
Using model (15) for the classification of the observations $P_1, P_2, P_3$ and $P_4$, we obtain the following:
$ \alpha^*=0.25, \beta^*=0.75, d^*=3.75, s^*=-0.25 . $
Since $s^*(=-0.25)$ is negative, the two groups are completely separated, with no overlap between them. The strongly separating hyperplane is $0.25 \mathrm{X}+0.75 \mathrm{Y}=3.75$, as shown in Figure 1.
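As a quick standalone check of this separation over the Table 1 data (variable names are ours):

```python
# Vertex scores under the optimal hyperplane 0.25*X + 0.75*Y = 3.75.
alpha, beta, d, s = 0.25, 0.75, 3.75, -0.25
table1 = {"P1": ((4, 5), (4, 5)), "P2": ((2, 3), (5, 6)),   # group G1
          "P3": ((1, 2), (3, 4)), "P4": ((4, 5), (1, 3))}   # group G2
for name, ((xl, xu), (yl, yu)) in table1.items():
    scores = [alpha * x + beta * y for x in (xl, xu) for y in (yl, yu)]
    print(name, "min:", min(scores), "max:", max(scores))
# P1 and P2: every vertex scores >= d - s = 4.0 (the G1 side);
# P3 and P4: every vertex scores <= d + s = 3.5 (the G2 side).
```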
Now consider the new observation $P_5$ in Table 2.
| New Observation | $\boldsymbol{X^L}$ | $\boldsymbol{X^U}$ | $\boldsymbol{Y^L}$ | $\boldsymbol{Y^U}$ |
|---|---|---|---|---|
| $P_5$ | 2 | 3 | 1 | 2 |
So, we have:
$ \begin{aligned} \alpha^* X^L+\beta^* Y^L & =(0.25)(2)+(0.75)(1)=1.25<3.75=d^* \\ \alpha^* X^L+\beta^* Y^U & =(0.25)(2)+(0.75)(2)=2<3.75=d^* \\ \alpha^* X^U+\beta^* Y^L & =(0.25)(3)+(0.75)(1)=1.5<3.75=d^* \\ \alpha^* X^U+\beta^* Y^U & =(0.25)(3)+(0.75)(2)=2.25<3.75=d^* \end{aligned} $
As can be seen, all the vertices belonging to $P_5$ are located below the line $0.25 X+0.75 Y=3.75$ and, therefore, belong to the group $G_2$.
Consider the new observation $P_6$ that has some of its vertices in $G_1$ and the others in $G_2$, as shown in Table 3.
| New Observation | $\boldsymbol{X^L}$ | $\boldsymbol{X^U}$ | $\boldsymbol{Y^L}$ | $\boldsymbol{Y^U}$ |
|---|---|---|---|---|
| $P_6$ | 6 | 7 | 2 | 4 |
Using the model's optimal solution and the lower and upper bounds of the input and output factors, we obtain:
$ \begin{aligned} & \alpha^* X^L+\beta^* Y^L=(0.25)(6)+(0.75)(2)=3<3.75=d^* \\ & \alpha^* X^L+\beta^* Y^U=(0.25)(6)+(0.75)(4)=4.5>3.75=d^* \\ & \alpha^* X^U+\beta^* Y^L=(0.25)(7)+(0.75)(2)=3.25<3.75=d^* \\ & \alpha^* X^U+\beta^* Y^U=(0.25)(7)+(0.75)(4)=4.75>3.75=d^* \end{aligned} $
It is clear that two of the vertices belong to $G_1$ and the rest belong to $G_2$.
To handle the case, the second stage of the MIP approach to interval data starts to classify the incorrectly classified vertices.
By solving model (18) for the overlap handling of the misclassified vertices, we obtain:
$ \alpha^*=0.25, \quad \beta^*=0.75, \quad c^*=-3.49 $
Therefore:
$ \begin{aligned} & (0.25)(6)+(0.75)(2)=3.00>-3.49=c^* \\ & (0.25)(6)+(0.75)(4)=4.5>-3.49=c^* \\ & (0.25)(7)+(0.75)(2)=3.25>-3.49=c^* \\ & (0.25)(7)+(0.75)(4)=4.75>-3.49=c^* \end{aligned} $
So we can conclude that $P_6$ belongs to $G_1$.
Consider the two-stage MIP method for classification of 10 observations where $P_1, P_2, P_3, P_4$ and $P_5$ belong to group $G_1$ and $P_6, P_7, P_8, P_9$ and $P_{10}$ belong to group $G_2$. The observations are interval data, each characterized by one independent input factor and one independent output factor. The input and output values of the observations are listed in Table 4.
| Groups | $\boldsymbol{DMU_j}$ | $\boldsymbol{X_j^{L_j}}$ | $\boldsymbol{X_j^{U_j}}$ | $\boldsymbol{Y_j^{L_j}}$ | $\boldsymbol{Y_j^{U_j}}$ |
|---|---|---|---|---|---|
| Group 1 $\left(G_1\right)$ | $P_1$ | 4 | 6 | 7 | 9 |
| | $P_2$ | 9 | 10 | 9 | 10 |
| | $P_3$ | 4 | 6 | 13 | 14 |
| | $P_4$ | 8 | 11 | 12 | 13 |
| | $P_5$ | 12 | 13 | 9 | 10 |
| Group 2 $\left(G_2\right)$ | $P_6$ | 2 | 3 | 4 | 5 |
| | $P_7$ | 2 | 3 | 6 | 8 |
| | $P_8$ | 1 | 3 | 1 | 2 |
| | $P_9$ | 7 | 9 | 6 | 8 |
| | $P_{10}$ | 4 | 5 | 2 | 4 |
The optimal solution for classifying these observations using model (14) is as follows: $\alpha^*=0, \beta^*=1, d^*=7.5, s^*=0.5$.
Since $s^*(=0.5)$ is positive, we can conclude that an overlap exists between the groups. At the COI (classifying and identifying overlap) stage, we therefore form the subsets $K_1$, consisting of all correctly classified vertices of $G_1$; $K_2$, consisting of all correctly classified vertices of $G_2$; $\bar{Q}_1$, consisting of all incorrectly classified vertices of $G_1$; and $\bar{Q}_2$, consisting of all incorrectly classified vertices of $G_2$.
It is obvious that the overlap area includes all the vertices belonging to $\bar{Q}_1 \cup \bar{Q}_2$ and is restricted between the lines $Y=7$ and $Y=8$, as shown in Figure 2.
As it is visually depicted in Figure 2, $\bar{Q}_1$ and $\bar{Q}_2$ are recognized as:
$ \begin{aligned} & \bar{Q}_1=\left\{\left(X_{p_1}^{L_1}, Y_{p_1}^{L_1}\right),\left(X_{p_1}^{U_1}, Y_{p_1}^{L_1}\right)\right\} \\ & \bar{Q}_2=\left\{\left(X_{p_7}^{L_7}, Y_{p_7}^{U_7}\right),\left(X_{p_7}^{U_7}, Y_{p_7}^{U_7}\right),\left(X_{p_9}^{L_9}, Y_{p_9}^{U_9}\right),\left(X_{p_9}^{U_9}, Y_{p_9}^{U_9}\right)\right\} \\ & K_1=V_1-\bar{Q}_1 \\ & K_2=V_2-\bar{Q}_2 \end{aligned} $
Here, $V_1$ and $V_2$ are the sets of all vertices belonging to $G_1$ and $G_2$, respectively.
It is also observed from Figure 2 that the observations $P_2, P_3, P_4$ and $P_5$ belong entirely to $G_1$ and the observations $P_6, P_8$, and $P_{10}$ belong entirely to $G_2$. The observations $P_1, P_7$ and $P_9$ have parts of them in the overlap area.
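The following standalone check reproduces this picture from the Table 4 data by listing, for each observation, the vertices that fall inside the overlap band $7 \leq Y \leq 8$ (script layout and names are ours):

```python
# Vertices of Table 4 inside the overlap band under the stage-one
# solution alpha* = 0, beta* = 1, d* = 7.5, s* = 0.5.
data = {                       # DMU: (X bounds, Y bounds)
    "P1": ((4, 6), (7, 9)),    "P2": ((9, 10), (9, 10)),
    "P3": ((4, 6), (13, 14)),  "P4": ((8, 11), (12, 13)),
    "P5": ((12, 13), (9, 10)), "P6": ((2, 3), (4, 5)),
    "P7": ((2, 3), (6, 8)),    "P8": ((1, 3), (1, 2)),
    "P9": ((7, 9), (6, 8)),    "P10": ((4, 5), (2, 4)),
}
d, s = 7.5, 0.5
for name, (bx, by) in data.items():
    in_band = [(x, y) for x in bx for y in by if d - s <= y <= d + s]
    if in_band:
        print(name, "overlap vertices:", in_band)
# Prints the overlap vertices of P1 ((4,7), (6,7)), P7 and P9 (Y = 8),
# matching the sets Q1-bar and Q2-bar identified above.
```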
Because the overlap exists, the next stage starts to handle observations $P_1, P_7$ and $P_9$. Let $y_1$ be the binary variable enumerating the misclassified vertices of $P_1$ in $\bar{Q}_1$, $y_2$ the binary variable for the misclassified vertices of $P_7$ in $\bar{Q}_2$, and $y_3$ the binary variable for the misclassified vertices of $P_9$ in $\bar{Q}_2$. Solving model (18), we obtain the following optimal solution for the HO stage:
$ \alpha^*=0, \beta^*=1, \mathrm{c}^*=8.01, \mathrm{y}_1^*=1, \mathrm{y}_2^*=0, \mathrm{y}_3^*=0 $
Therefore, the discriminant function determined at this stage is $\mathrm{Y}=8.01$, which assigns the misclassified vertices of $P_7$ and $P_9$ to $V_2$ (the set of vertices belonging to $G_2$). Also, since the observation $P_1$ still has part of it in $G_1$ and the rest in $G_2$, we use the Monte Carlo method to handle it.
Let $R_1$ be the part of $P_1$ that belongs to $G_1$, $N$ the total number of random vectors generated by the Monte Carlo method, and $N_{H_1}$ the number of those vectors that fall in $G_1$. Running the Monte Carlo method with $N=500{,}000$, we obtain the following:
$ {V}_{{R}_1}=\frac{{N}_{{H}_1}}{{N}}=0.495<0.5 $
So it is concluded that $P_1$ belongs to $G_2$, because the probability that $P_1$ belongs to $G_2$ $(=0.505)$ exceeds the probability that it belongs to $G_1$ $(=0.495)$. The error associated with assigning $P_1$ to $G_2$ is therefore 0.495.
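This estimate can be checked directly: since $\alpha^*=0$, the share of $P_1$ ($X \in [4,6]$, $Y \in [7,9]$) above the line $Y=8.01$ is exactly $(9-8.01)/(9-7)=0.495$. A short sketch reproducing the Monte Carlo figure:

```python
import random

# Hit-and-Miss estimate of V_R1 for P1 against the stage-two boundary
# Y = 8.01; the exact G1 share is (9 - 8.01) / (9 - 7) = 0.495.
N, hits = 500_000, 0
for _ in range(N):
    y = 7 + 2 * random.random()   # Y ~ U(7, 9); X is irrelevant (alpha* = 0)
    if y > 8.01:                  # the part of P1 lying in G1
        hits += 1
print(hits / N)                   # approximately 0.495
```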
6. Conclusions
This study introduced an innovative two-stage MIP framework for discriminant analysis of interval data, addressing key challenges such as misclassification and group overlap. The proposed methodology effectively minimized the total number of misclassified vertices and identified overlap zones, providing probabilistic insights into group memberships. Through comprehensive examples, the model demonstrated its ability to classify interval data, manage overlaps, and identify structural similarities and differences within datasets. Despite its strengths, the performance of the method is influenced by the choice of the parameters $\varepsilon$ and $M$, which can lead to variability in classification outcomes. Future research should focus on standardizing parameter selection and extending the framework to accommodate data in fuzzy or other non-traditional formats. The proposed approach represents a significant advancement in interval data classification, offering a robust and adaptable tool for complex decision-making scenarios.
The data supporting our research results are included within the article.
The authors declare no conflict of interest.