# Combinatorial Optimization Augmented Machine Learning: Technical Foundations, Applications, and Research Frontiers

Maximilian Schiffer<sup>1,2</sup>, Heiko Hoppe<sup>1,2</sup>, Yue Su<sup>3</sup>, Louis Bouvier<sup>4</sup>, and Axel Parmentier<sup>4</sup>

$^{1}$ School of Management, Technical University of Munich, Germany
$^{2}$ Munich Data Science Institute, Technical University of Munich, Germany
$^{3}$ Inria, Université de Lille, France
$^{4}$ CERMICS, Ecole Nationale des Ponts et Chaussées, Institut Polytechnique Paris, France

January 16, 2026

# Abstract

Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
Keywords: combinatorial optimization; machine learning; decision-focused learning

# 1. Introduction

The convergence of machine learning (ML) and operations research (OR) stands as a central development in modern decision science. While ML excels at extracting patterns from high-dimensional data to forecast uncertain parameters, OR provides the rigorous machinery to navigate complex, discrete feasible regions and identify optimal decisions. Traditionally, these capabilities have been deployed sequentially—predicting parameters first, then optimizing deterministic models. However, this separation often fails to align predictive accuracy with downstream decision quality. Combinatorial optimization augmented ML (COAML) addresses this gap by integrating combinatorial algorithms directly into the learning loop, creating end-to-end architectures that learn to optimize, see Figure 1. To formalize this paradigm and its necessity, we begin by examining the standard framework of contextual stochastic optimization (CSO), as it yields a natural starting point for data-driven approaches in OR.

Figure 1: Conceptual illustration of the COAML paradigm. A neural network (NN) parameterizes a surrogate combinatorial optimization (CO) problem. We propagate gradients backwards through the CO layer, enabling end-to-end training of the policy $\pi_w$ to minimize a downstream decision loss.

Consider a setting where a decision maker observes a context $x \in \mathcal{X}$ , containing all available information required to make a decision $y$ . This decision must belong to a (combinatorial) set of feasible decisions $\mathcal{Y}(x)$ . Once the decision is implemented, an uncertainty $\xi$ realizes, incurring a cost $c(x,y,\xi)$ . We assume a joint but unknown probability distribution $\mathbb{P}$ over $(\pmb{x},\pmb{\xi})$ , with $\pmb{x},\pmb{\xi}$ being random variables.
Throughout this survey, we adhere to this convention, denoting random variables by bold symbols, e.g., $\pmb{x}$ , and using regular symbols, e.g., $x$ , for their realizations. Our objective is to determine a policy $\pi$ that maps contexts to decisions. To accommodate nondeterministic policies, we generally define $\pi(\cdot|x)$ as a conditional distribution over $y$ given $x$ . Since the decision maker cannot observe $\xi$ beforehand, the decision $\pmb{y}$ is conditionally independent of $\pmb{\xi}$ given $\pmb{x}$ . In this setting, our goal is to minimize the expected cost $$ \min _ {\pi} \mathbb {E} _ {(\boldsymbol {x}, \boldsymbol {\xi}) \sim \mathbb {P}, \boldsymbol {y} \sim \pi (\cdot | \boldsymbol {x})} \left[ c (\boldsymbol {x}, \boldsymbol {y}, \boldsymbol {\xi}) \right]. \tag {1} $$ An optimal policy maps a context $x$ to a decision $y \in \mathcal{Y}(x)$ that solves the stochastic optimization problem defined by the conditional distribution of $\xi$ given $x$ : $$ x \longmapsto y \in \underset {y \in \mathcal {Y} (x)} {\arg \min } \mathbb {E} \left[ c (x, y, \boldsymbol {\xi}) \mid x \right]. \tag {2} $$ However, deriving such a policy is generally intractable: statistically, because the distribution over $(\pmb{x},\pmb{\xi})$ is unknown, and computationally, because the resulting problem is both stochastic and combinatorial. In the COAML framework, we address this intractability by focusing on policies of the form $$ \pi_ {w}: x \longmapsto \hat {y} (\theta) \in \underset {y \in \mathcal {Y} (x)} {\arg \max } \, \kappa (\theta , y) \quad \text {where} \quad \theta = \varphi_ {w} (x). \tag {3} $$ Here, $\max \kappa(\theta, y)$ represents a surrogate CO problem parameterized by $\theta$ , and $\varphi_w$ is a statistical model parameterized by $w \in \mathcal{W}$ , typically a NN. In many cases, $\kappa(\theta, y)$ takes the form of a linear function, such as $\theta^\top y$ or $\theta^\top \phi(y)$ where $\phi(y)$ is an embedding.
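To make pipeline (3) concrete, the following minimal sketch instantiates $\varphi_w$ as a plain linear model and the CO layer as a top-$k$ selection oracle, a CO problem solvable exactly by sorting. All function names and the toy dimensions are illustrative and not tied to any specific COAML implementation.

```python
import random

def phi_w(x, w):
    # Statistical model: maps each item's context features to a score theta_i.
    # A plain linear model here; in practice, phi_w is a neural network.
    return [sum(wj * xj for wj, xj in zip(w, xi)) for xi in x]

def topk_oracle(theta, k):
    # CO layer: solves max_y theta^T y over Y = {y in {0,1}^d : sum(y) = k}
    # exactly, by sorting (i.e., selecting the k largest entries of theta).
    top = set(sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k])
    return [1.0 if i in top else 0.0 for i in range(len(theta))]

def policy(x, w, k=2):
    # COAML policy pi_w: context x -> theta = phi_w(x) -> decision y, cf. (3).
    return topk_oracle(phi_w(x, w), k)

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # 4 items, 3 features
w = [random.gauss(0, 1) for _ in range(3)]                      # learnable weights
y = policy(x, w)
print(y)  # a feasible decision: exactly k = 2 entries equal to 1.0
```

Whatever the weights $w$, the returned decision is feasible by construction, illustrating the feasibility-preserving nature of CO layers.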
We assume the oracle $\hat{y}$ to be efficient and scalable, representing any standard CO solver—such as mixed-integer programming, shortest-path algorithms, or sorting algorithms. These solvers leverage decades of advancements in mathematical programming, making them broadly applicable across problems. Note that while this general formulation allows for stochastic policies, we describe $\pi_w$ as a deterministic mapping for the sake of clarity in this introduction. We will generalize this surrogate problem to define a distribution $\pi_w(\cdot|x)$ over $\mathcal{Y}(x)$ in later sections. Our learning problem is to find parameters $w$ that yield a high-quality policy, solving: $$ \min _ {w \in \mathcal {W}} \mathbb {E} _ {(\boldsymbol {x}, \boldsymbol {\xi}) \sim \mathbb {P}} \left[ c (\boldsymbol {x}, \pi_ {w} (\boldsymbol {x}), \boldsymbol {\xi}) \right]. \tag {4} $$ In practice, we lack access to $\mathbb{P}$ and must rely on a training set. The nature of this training set depends on the application, dictating the specific learning methodology. The simplest setting in CSO assumes a training set $(x_{1},\xi_{1}),\ldots ,(x_{n},\xi_{n})$ of historical contexts and uncertainties. Full access to $\xi_{i}$ is common in internal operations optimization, though less common in settings like sales. Here, it is natural to frame learning as empirical cost minimization (ECM): $$ \min _ {w} \frac {1}{n} \sum_ {i = 1} ^ {n} c (x _ {i}, \pi_ {w} (x _ {i}), \xi_ {i}). \tag {5} $$ In some applications, we do not observe $\xi_{i}$ directly but only the cost $c(x_{i},\pi_{w}(x_{i}),\xi_{i})$ , which may result from applying the policy in the real world or within a simulator. In such cases, the training set consists only of $x_{1},\ldots ,x_{n}$ , along with the ability to sample from $c(x_{i},\pi_{w}(x_{i}),\pmb {\xi})$ , where $\pmb{\xi}$ is distributed according to $\mathbb{P}(\pmb {\xi}|x_i)$ .
This limited information setting is referred to as single-step reinforcement learning (RL) in the context of ECM. In other applications, an expert policy may provide target decisions $(x_{1},\bar{y}_{1}),\ldots ,(x_{n},\bar{y}_{n})$ for imitation. We then formulate the problem as supervised learning (SL): $$ \min _ {w} \frac {1}{n} \sum_ {i = 1} ^ {n} \mathcal {L} \left(\pi_ {w} \left(x _ {i}\right), \bar {y} _ {i}\right), \tag {6} $$ where $\mathcal{L}(y,\bar{y})$ quantifies the divergence between $y$ and $\bar{y}$ . While other learning settings exist, a key property of the settings presented above is that they are decision-focused: the learning objective evaluates the quality of the solution $y$ derived from $\theta$ . This contrasts with decision-agnostic approaches, which typically minimize a prediction loss over a training set $(x_{1},\theta_{1}),\ldots ,(x_{n},\theta_{n})$ : $$ \min _ {w} \frac {1}{n} \sum_ {i = 1} ^ {n} \ell \left(\varphi_ {w} \left(x _ {i}\right), \theta_ {i}\right). \tag {7} $$ Decision-agnostic learning ignores that the ultimate goal is not maximizing the accuracy of $\theta$ , the output of $\varphi_w$ , but rather the quality of the decision $y \in \arg \max \kappa(\theta, y)$ for the original problem (2). Decision-focused learning is critical for COAML's performance but introduces a practical challenge: the surrogate problem (3) acts as the final layer of the NN. Since NNs are trained via first-order methods, a significant portion of COAML methodology focuses on making differentiation through CO layers both meaningful and practical.

# 1.1. The Imperative of COAML in Modern OR

The emergence of COAML represents a fundamental evolution in how OR approaches the integration of data and decisions. While the CSO framework introduced above provides a rigorous mathematical basis for this paradigm, the practical imperative for COAML arises from systemic limitations in traditional decision-making architectures.
Modern industrial applications demand solutions that are not only performant in theory but also robust to high-dimensional uncertainty and scalable to real-world instances. Conventional methods, which typically segregate prediction from optimization, increasingly struggle to meet these dual requirements. In the following, we examine the structural pressures—ranging from computational scalability to the complexity of stochastic environments—that motivate the shift toward end-to-end learning policies. Scalability in CO. OR optimizes industrial processes through efficient resource allocation. Often, resources are indivisible—e.g., a machine is either assigned to a job or not—giving rise to CO problems. Scaling these problems is essential to unlocking flexibility and economies of scale. For instance, in bin packing, larger volumes allow for more efficient container usage, reducing marginal costs. Similarly, in warehouse order picking, consolidating orders optimizes routes and reduces costs. Realizing these benefits requires algorithms capable of solving large-scale instances efficiently. The complexity of uncertainty. Modern decision-making is characterized by pervasive uncertainty. Disruptions like the COVID-19 pandemic or the variability of renewable energy highlight the necessity of accounting for uncertainty in planning. Historically, OR has addressed these challenges independently, employing CO for static NP-hard problems, stochastic/robust optimization for strategic uncertainty, and control-theoretic methods for sequential decisions. This division frequently requires simplifying assumptions. For instance, navigating vast combinatorial spaces typically depends on simplified objectives, which are common in mixed-integer linear programming (MILP).
While MILP solvers have achieved massive speedups (over 500 billion times between 1991 and 2015), stochastic and robust optimization techniques remain computationally intractable for large-scale real-world applications like supply chain planning or vehicle routing. Limitations of separated architectures. The scalability issues of classic uncertainty-handling methods hinder their broad adoption. Since OR gains often stem from economies of scale, restricting optimization to small instances to accommodate stochastic complexity is counterproductive. Consequently, practitioners often favor predict-then-optimize (PTO) architectures, where statistical models first generate deterministic approximations, e.g., expected values, which are then solved by mature combinatorial algorithms. While computationally efficient, this separation creates a critical misalignment: the statistical model is typically trained to minimize a symmetric prediction loss, e.g., the mean squared error (MSE), without awareness of the downstream optimization landscape. For example, consider a vehicle routing problem (VRP) that is based on predicted customer demand. For a standard regression model, a $5\%$ underestimation of demand is equivalent to a $5\%$ overestimation. However, in practice, underestimating demand may result in a truck arriving without sufficient capacity, necessitating a highly costly recourse action—such as dispatching a second vehicle solely for the overflow. In contrast, a slight overestimation merely results in minor unutilized space. By decoupling prediction accuracy from decision quality, a standard prediction loss fails to penalize the specific errors that trigger such expensive operational repairs. The methodological pivot to structured learning. The ML community encountered analogous challenges in the early 2000s with structured output prediction tasks, such as image segmentation or natural language parsing.
Unlike simple regression or classification where predictions are independent scalars, these tasks require predicting a complex configuration of interdependent variables, effectively forming a combinatorial object. Initially, the dominant architecture treated prediction and optimization separately. The standard approach was to first estimate the full joint probability distribution $\mathbb{P}(\pmb {\xi}|\pmb {x})$ and then solve an optimization problem—often termed inference—over this distribution. However, this strategy frequently failed because learning a high-dimensional joint distribution is statistically intractable; it requires exponentially large datasets to approximate the complex dependencies between variables. To overcome this shortcoming, the ML community pivoted to structured learning, a paradigm that bypasses the estimation of the full distribution. Instead, it trains models to directly minimize a loss defined on the output structure itself, effectively teaching the model to optimize. These algorithms consist of a statistical model and a solver, or inference engine, integrated into a single differentiable architecture. While the specific combinatorial problems in computer vision—such as finding the most likely pixel grid—differ from OR challenges like resource allocation, the mathematical foundations are remarkably transferable. For instance, the "smart PTO plus" (SPO+) loss used in modern OR is mathematically rooted in the Structured Hinge Loss developed for support vector machines two decades prior. With the advent of deep learning, this paradigm has been revitalized: the linear models of the past have been replaced by deep NNs, and the inference step is now framed as a differentiable CO-layer (3) within an end-to-end policy. Alignment with industrial realities. In this context, the focus shifts from solving isolated instances to a learning perspective. 
We assume that instances are drawn from an unknown distribution $\mathbb{P}$ , and we use a training set to learn a policy $\pi_w$ mapping instances $x$ to solutions $y \in \mathcal{Y}(x)$ . This aligns with industrial practice, where instances often follow a distribution with fixed support, e.g., daily aircraft routing with constant flights but varying loads. Traditional optimization ignores this structure, solving each instance from scratch. A learning-based perspective exploits historical patterns to minimize expected costs across the distribution. Furthermore, this framework often confines combinatorial complexity to a linear optimization step, leveraging efficient linear solvers. This shifts the computational burden to the offline learning phase, leaving a simple linear problem on-the-fly, which is ideal for industrial settings where rapid decisions are crucial.

# 1.2. COAML within the Decision-Making Landscape

To position COAML relative to other decision paradigms, we use CSO (1) as a common baseline and describe alternatives as policies within this framework. A key distinction lies in the division between on-the-fly computation, i.e., when a new $x$ arrives, and offline computation. For COAML policies, on-the-fly computation involves evaluating the NN $\varphi_w(x)$ and solving the CO layer (3). Offline computation, which is far more intensive, involves solving learning problems like (5) or (6). Most CSO policies involve both online CO solving and offline computation, but differ in key characteristics, see Table 1. Deterministic CO. Here, a deterministic approximation of (1) is built and solved on the fly. This generally involves estimating a single nominal scenario $\bar{\xi}$ , e.g., by using the mean, and solving: $$ \min_{y\in \mathcal{Y}(x)}c(x,y,\bar{\xi}). $$ In the terminology of our surrogate framework, the parameters $\theta$ are fixed to $\bar{\xi}$ .
While computationally efficient, this approach ignores the dispersion of $\mathbb{P}$ , often leading to brittle solutions. Stochastic CO. This approach addresses the flaws of deterministic approximations by directly solving the stochastic optimization problem (2). Since $\mathbb{P}$ is unknown, a generative model $p(x,\xi)$ or discriminative model $p(\xi |x)$ is typically learned offline. Given that $c$ is often non-linear or the feasible set is complex, the problem is generally intractable. Practitioners therefore resort to sample average approximation (SAA): $$ \min _ {y \in \mathcal {Y} (x)} \frac {1}{N} \sum_ {i = 1} ^ {N} c (x, y, \xi_ {i}), $$ where $\xi_{i}$ are i.i.d. samples from $p(\cdot |x)$ . However, as the decision space grows, SAA remains computationally prohibitive for real-time applications. PTO. This paradigm decouples learning and optimization into two sequential stages and is closely related to the previous two paradigms. In fact, it can be seen as a special case of deterministic CO in which a point forecast supplies the nominal scenario, which better captures the context and uncertainty dynamics. Specifically, a predictive model $\varphi_w$ is trained to minimize a decision-agnostic loss, e.g., an MSE, against the ground truth $\xi$ , purely focusing on prediction accuracy in the first stage. In the second stage, the predicted value $\hat{\xi} = \varphi_w(x)$ is plugged into the deterministic optimization problem $\min_{y \in \mathcal{Y}(x)} c(x, y, \hat{\xi})$ . While this retains the computational efficiency of deterministic approaches, it suffers from a fundamental loss-cost mismatch: the training objective ignores the downstream optimization landscape; accordingly, small prediction errors can result in arbitrarily large decision costs.

Table 1: Comparative analysis of decision-making paradigms.

| Paradigm | Online Computation | Primary Constraint | Coincidence | Learning | Loss Target |
|---|---|---|---|---|---|
| Deterministic CO | Solve nominal instance | Ignores uncertainty | Yes | None | N/A |
| Stochastic CO | Solve SAA of stochastic problem | Intractable for large $\mathcal{Y}(x)$ | Yes | Gen. | N/A |
| PTO | Solve predicted instance | Loss-cost mismatch | No | DA | Prediction Error |
| Smart PTO | Solve predicted instance | Restricted to linear objectives | Yes† | DF | Regret / Cost |
| Structured Prediction | MAP Inference | Limited to tractable structures | No | DF | Imitation / Cost |
| COAML | Solve learned surrogate | Non-convex / Hard training | No | DF | Imitation / Cost |

Abbreviations: DA: Decision-Agnostic Learning; DF: Decision-Focused Learning; Gen.: Generative Learning; MAP: Maximum a Posteriori. Coincidence: denotes whether the online surrogate problem is mathematically equivalent to the underlying stochastic objective (or its SAA). †Coincidence holds strictly only for linear objectives, where $\mathbb{E}[c(x,y,\pmb{\xi})] = c(x,y,\mathbb{E}[\pmb{\xi}])$ .

Smart PTO. When the objective is linear in the uncertainty, i.e., $c(x,y,\xi) = \xi^{\top} y$ , the deterministic approximation (using the conditional expectation) and the stochastic problem coincide mathematically: $$ \min _ {y \in \mathcal {Y} (x)} \mathbb {E} \big [ c (x, y, \pmb {\xi}) \mid x \big ] = \min _ {y \in \mathcal {Y} (x)} \underbrace {\mathbb {E} \big [ \pmb {\xi} \mid x \big ] ^ {\top}} _ {\bar {\xi} ^ {\top}} y = \min _ {y \in \mathcal {Y} (x)} c (x, y, \bar {\xi}).
$$ The smart PTO literature exploits this property, training a NN $\varphi_w$ to predict $\theta = \bar{\xi} = \varphi_w(x)$ . Unlike standard PTO, it uses a decision-focused loss such as SPO+ rather than a prediction loss, cf. Elmachtoub and Grigas [2022]. This ensures that even if the prediction $\varphi_w(x)$ is imperfect, it induces a near-optimal decision $y$ . Structured prediction. Structured learning applications from the early 2000s share the mathematical form of CSO but differ in intent. Often, the true cost $c(x,y,\xi)$ is unknown or hard to specify, e.g., the "cost" of a wrong pixel in image segmentation, but expert demonstrations are available. The surrogate problem (3) is typically a maximum a posteriori (MAP) inference problem over a graphical model. While these methods pioneered decision-focused learning when following an imitation learning (IL) setting, the underlying combinatorial structures were often simpler than those found in industrial OR. Neural CO. Neural CO has gained significant traction as an alternative paradigm. In the CSO context, its policies function as constructive heuristics rather than solvers. For example, in the traveling salesman problem (TSP), a sequence-to-sequence network predicts the next city to visit given a partial tour. The surrogate problem is implicit: there is no explicit optimization instance solved on the fly, but rather the execution of a parameterized policy trained via RL. We exclude Neural CO from this review as it discards the exact solvers that COAML aims to leverage. Synthesis: the COAML paradigm. COAML emerges as a generalization of these approaches, designed to handle the complexity of OR problems where Smart PTO's linearity assumptions do not hold and Stochastic CO's computational costs are prohibitive. By treating the solver as a parameterized surrogate layer $\max \kappa (\theta ,y)$ , COAML decouples the learned parameters $\theta$ from the physical predictions $\bar{\xi}$ . 
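To make the decision-focused losses mentioned above tangible, the sketch below implements the SPO+ loss of Elmachtoub and Grigas [2022] for a toy minimization problem, using selection of the $k$ cheapest items as the exact CO oracle. The instance and function names are illustrative; only the loss formula follows the cited paper.

```python
# Sketch of the SPO+ loss (Elmachtoub and Grigas [2022]) for a minimization
# problem min_{y in Y} c^T y with Y = {y in {0,1}^d : sum(y) = k}, i.e.,
# select the k cheapest items -- a CO problem solved exactly by sorting.
def oracle(c, k=2):
    # Exact CO solver: pick the indices of the k smallest entries of c.
    idx = set(sorted(range(len(c)), key=lambda i: c[i])[:k])
    return [1.0 if i in idx else 0.0 for i in range(len(c))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def spo_plus(c_hat, c, k=2):
    # l_SPO+(c_hat, c) = -min_y (2 c_hat - c)^T y + 2 c_hat^T y*(c) - z*(c),
    # a convex upper bound on the decision regret of predicting c_hat.
    # A subgradient with respect to c_hat is 2 (y*(c) - y*(2 c_hat - c)).
    y_star = oracle(c, k)                               # y*(c), with z*(c) = c^T y*(c)
    c_pert = [2 * ch - ci for ch, ci in zip(c_hat, c)]  # perturbed costs 2 c_hat - c
    y_pert = oracle(c_pert, k)
    return -dot(c_pert, y_pert) + 2 * dot(c_hat, y_star) - dot(c, y_star)

c_true = [1.0, 2.0, 3.0, 4.0]
print(spo_plus(c_true, c_true))                 # perfect prediction: loss 0.0
print(spo_plus([4.0, 3.0, 2.0, 1.0], c_true))   # cost order reversed: positive loss
```

Note that the loss only queries the solver, never its derivatives, which is what makes it compatible with black-box CO oracles.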
The parameters $\theta$ do not need to represent expected demands or costs; rather, they are latent "knobs" tuned by the network to steer the solver toward a low-cost configuration. This allows COAML to utilize the speed of deterministic solvers while accounting for complex stochastic interactions through decision-focused training, effectively closing the loop between data and decision.

# 1.3. Scope and Contributions

Recent literature has extensively charted the intersection of ML and OR. Comprehensive surveys, such as those by Sadana et al. [2024] and Mandi et al. [2024], map the broad landscape of data-driven optimization, covering a vast array of modeling approaches and theoretical guarantees. Complementing these, tutorials by Misić and Perakis [2020] and Kotary et al. [2021] provide technical deep dives into specific methods for accelerating constrained optimization. In contrast, this paper adopts a distinct "tutorial-survey" perspective. Rather than attempting to cover the entire spectrum of contextual decision-making, we provide a focused, comprehensive primer on a specific, high-impact paradigm: Combinatorial Optimization Augmented Machine Learning (COAML). The COAML paradigm fundamentally rethinks the interface between prediction and optimization. Although we introduced COAML through the theoretical lens of CSO, its applicability extends far beyond this setting. The core innovation—embedding combinatorial oracles as differentiable layers within learning architectures—allows for end-to-end training in diverse environments. This ranges from learning surrogate costs for static NP-hard problems to enabling structured RL (SRL) policies in multi-stage environments. Accordingly, this paper is designed to serve as both a tutorial for newcomers and a structured reference for experts. Our contributions are organized as follows:

- Formalization: We rigorously define COAML architectures and distinguish them from related PTO and neural CO approaches.
- Applications: We illustrate the paradigm's versatility through key application domains, demonstrating its value in both industrial operations and scientific computing.
- Methodology: We outline the essential building blocks, with a focus on the techniques required to differentiate through discrete optimization layers.
- Taxonomy: We review existing work and best practices, organizing the literature into a structured taxonomy that categorizes methods by their learning signal and solver integration.
- Guidelines: Based on the taxonomy, we provide guidelines on how to leverage the available tools to design successful COAML approaches.
- Open challenges: We identify critical gaps in theory and practice, proposing directions for future research.

Beyond our primary scope, we acknowledge a related line of research focusing on continuous optimization layers. While these layers share challenges with COAML, such as gradient estimation and implicit differentiation, they require distinct mathematical techniques. We briefly outline their relationship to COAML in the conclusion to provide a complete picture. Finally, we note that this paper makes use of various acronyms. To aid readability, we provide an overview of these acronyms in the appendix.

# 2. Formalization & Problem Classes

As introduced earlier, COAML architectures are NNs that terminate with a CO-layer. These architectures define policies $\pi_w$ that map a context $x \in \mathcal{X}$ to a decision $y \in \mathcal{Y}(x)$ , or potentially a conditional distribution over $\mathcal{Y}(x)$ , parameterized by weights $w$ . While we initially presented these architectures in the context of CSO, where the goal is to minimize an expected cost $\mathbb{E}[c(x,y,\pmb{\xi})]$ given a context realization $x$ and random uncertainty $\pmb{\xi}$ , they are in fact much more general. They provide a flexible mechanism for parameterizing policies over combinatorial spaces that can be learned from data.
This versatility allows COAML to be applied to a wide range of problem classes, ranging from hard deterministic problems to multi-stage stochastic optimization. In the following, we give an overview of problem classes where the COAML framework can be useful to derive efficient algorithms. These problem classes will serve as running examples throughout the paper, and we will return to them in Section 5 when mapping algorithms and applications to our taxonomy of COAML approaches.

# 2.1. Learning Heuristics for Hard CO Problems

Industries today have access to vast amounts of data, which they leverage to enhance the performance and resilience of their operational processes. Achieving this potential requires algorithms that not only optimize these processes but also scale effectively. The primary improvement potential of optimization algorithms stems from their ability to reduce marginal costs. Consider, for instance, a routing application. Even the most advanced routing algorithm incurs high per-request costs if only three delivery requests exist, as the vehicle must still complete a tour starting and returning to a depot. In contrast, with one thousand delivery requests, the vehicle can consolidate deliveries within the same neighborhood, significantly lowering the marginal cost per request. However, to realize these gains, the underlying algorithm must scale efficiently to handle such large instances. The CO community has therefore devoted substantial efforts to designing algorithms that scale well for large instances, often accepting simplified objective functions to maintain tractability. As a result, problems with linear objectives are pervasive in CO. Decades of research have yielded practically efficient algorithms for a wide range of applications. However, these methods frequently become intractable when the objective function incorporates complex phenomena or resilience considerations, which introduce nonlinearities or stochastic elements.
In such contexts, we encounter a fundamental discrepancy: while the problem requires scalable algorithms for complex combinatorial structures, existing methods typically only achieve true scalability for simpler linear approximations. This setting can be formally cast as a CSO problem. Here, the context $x$ represents the instance of the hard problem, e.g., the set of requests. In many cases, the difficult problem is stochastic, and the uncertainty $\pmb{\xi}$ represents the complex or stochastic elements, e.g., delays or disruptions, that make the true cost $\mathbb{E}_{\pmb{\xi}}[c(x,y,\pmb{\xi})]$ difficult to optimize directly. If the problem is deterministic, then $\xi$ can be skipped. The COAML policy uses a simpler, scalable combinatorial layer, e.g., a deterministic linear solver, to generate solutions $y$ , while the learning process adjusts the parameters of this layer to minimize the complex true cost $\mathbb{E}_{\pmb{\xi}}[c(x,y,\pmb{\xi})]$ . Complexity theory indicates that we cannot expect a policy $\pi_w$ to perform well for every possible instance $x$ . Nevertheless, practical instances often exhibit specific structures captured by the distribution $\mathbb{P}$ of $x$ , and possibly $\xi$ . In many applications, this structure allows us to achieve excellent performance in practice, which translates into a low risk, i.e., effectively solving (5). Example 1. Stochastic vehicle scheduling. The vehicle scheduling problem (VSP) is a classical CO problem in which a fleet of vehicles must serve a set of timed requests. The objective is to construct sequences of requests for each vehicle that collectively cover all requests at minimal cost. A common application of the VSP arises in aircraft routing, where airlines must assign aircraft to scheduled flights. When the objective function is simply the sum of the arc costs, the VSP can be formulated and solved efficiently as a minimum cost flow problem. 
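As a simplified illustration of this flow-based view, the sketch below computes a cheapest single-vehicle route over a DAG of requests via dynamic programming. The instance, node names, and arc costs $\theta_a$ are purely illustrative (in a COAML pipeline, the $\theta_a$ would be predicted by a statistical model), and the full VSP layer additionally requires covering all requests, e.g., via a minimum cost flow rather than a single shortest path.

```python
# Simplified CO layer for the VSP: requests sorted by start time form a DAG;
# arcs (i, j) connect requests that a single vehicle can serve consecutively.
# Given arc costs theta, the layer returns the cheapest source-to-sink route
# by dynamic programming (shortest path in a DAG in topological order).
def cheapest_route(requests, theta, source="s", sink="t"):
    order = [source] + requests + [sink]       # topological order of nodes
    best = {source: (0.0, None)}               # node -> (cost, predecessor)
    for v in order[1:]:
        cands = [(best[u][0] + theta[(u, v)], u)
                 for u in order if (u, v) in theta and u in best]
        if cands:
            best[v] = min(cands)
    path, v = [], sink                         # backtrack the optimal route
    while v is not None:
        path.append(v)
        v = best[v][1]
    return list(reversed(path)), best[sink][0]

# Toy instance: three requests r1, r2, r3 and illustrative learned arc costs.
theta = {("s", "r1"): 1.0, ("s", "r2"): 2.0, ("r1", "r2"): 0.5,
         ("r1", "t"): 3.0, ("r2", "r3"): 0.5, ("r3", "t"): 0.2,
         ("r2", "t"): 2.0}
path, cost = cheapest_route(["r1", "r2", "r3"], theta)
print(path, cost)  # cheapest route: s -> r1 -> r2 -> r3 -> t
```

Because the layer is a shortest-path computation, changing only the arc costs $\theta$ changes which route is returned, which is precisely the lever that the learning step exploits.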
In this formulation, the graph's vertices correspond to requests, and the arcs $a \in A$ represent pairs of requests that can be consecutively served by a vehicle within a feasible route. The rich literature and efficient algorithms for minimum cost flow problems make this deterministic formulation well-suited for practical applications. However, practitioners often encounter stochastic variants of the VSP. For example, delays in flights can propagate to subsequent legs of an aircraft's route. Airlines therefore aim to find aircraft routing solutions that minimize the expected total delay cost, taking into account the randomness inherent in operational disruptions. These stochastic variants are considerably more challenging from a computational standpoint, and there is generally no algorithm that scales well enough for industrial applications in these settings. To address this challenge, Parmentier [2021] proposed a COAML-based approach to approximate the stochastic VSP (SVSP) by reparametrizing the deterministic VSP. As illustrated in Figure 2, a statistical model $\varphi_w$ predicts linear arc costs $\theta = (\theta_{a})_{a}$ that are tailored to yield good solutions for the stochastic variant of the problem. A deterministic VSP solver then computes a solution $y$ using these learned arc costs. This approach has a significant advantage in industrial practice: it leverages the existing deterministic solver, including all the specialized industrial constraints already incorporated into it. By learning to adjust only the arc costs $\theta$ while keeping the solver unchanged, COAML enables practitioners to integrate data-driven adaptations to uncertainty without having to overhaul their existing optimization infrastructure.

Figure 2: Neural network with a CO-layer for the stochastic vehicle scheduling problem. Vertices represent requests. Dotted arrows represent arcs $a$ , which are pairs of requests that can be operated in a sequence.
The colored paths give vehicle routes in the solution returned. Image adapted from Dalle et al. [2022]. # 2.2. Structured Learning The roots of modern COAML can be traced back to the structured learning community. In the early 2010s, before the widespread adoption of end-to-end deep learning, CO-layers became the state of the art for tasks involving complex output dependencies, such as image segmentation or natural language parsing [cf. Nowozin and Lampert, 2011]. Unlike standard classification, where predictions are treated as independent scalars, these tasks require predicting a configuration of variables that must respect structural constraints—effectively solving a combinatorial problem to ensure the output is valid and coherent. This problem class can be naturally viewed through the lens of CSO, albeit with a specific interpretation of the variables. In this case, the context $x$ represents the raw input, e.g., an image or sentence, while the "uncertainty" $\xi$ degenerates to the unknown ground truth structured output $\bar{y}$, e.g., the correct segmentation map. The cost function is effectively a supervised loss function $c(x,y,\xi) = \mathcal{L}(y,\bar{y})$ that measures the discrepancy between the predicted structure $y$ and the target $\bar{y}$. Consequently, this setup corresponds to the SL setting defined earlier, where we observe pairs $(x,\bar{y})$ and aim to minimize the expected loss over the data distribution. Beyond its historical significance, this field has provided the algorithmic bedrock for COAML. Many of the differentiation techniques used in OR today, e.g., perturbation-based differentiation and Fenchel-Young (FY) losses, originated here [Berthet et al., 2020b, Blondel et al., 2020]. The community remains vibrant, with applications expanding into domains like information retrieval, where ranking problems are modeled as optimization over the permutahedron.
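The perturbation idea can be made concrete with a small sketch: replacing a piecewise-constant argmax layer by its Gaussian-perturbed expectation yields a smooth surrogate whose Monte-Carlo estimate is a convex combination of feasible vertices. The toy oracle and all function names below are illustrative, in the spirit of Berthet et al. [2020b], not code from any cited work.

```python
import numpy as np

def perturbed_oracle(theta, oracle, sigma=0.5, n_samples=2000, seed=0):
    """Monte-Carlo estimate of E[oracle(theta + sigma * Z)], Z ~ N(0, I).

    The unperturbed oracle is piecewise constant in theta (zero gradients
    almost everywhere); its perturbed expectation is smooth in theta."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, theta.shape[0]))
    return np.mean([oracle(theta + sigma * z) for z in Z], axis=0)

def argmax_oracle(theta):
    """Toy CO-layer: argmax over the vertices of the unit simplex."""
    y = np.zeros_like(theta)
    y[np.argmax(theta)] = 1.0
    return y

theta = np.array([1.0, 0.9, -0.5])
y_smooth = perturbed_oracle(theta, argmax_oracle)
# y_smooth is an average of one-hot vertices: nonnegative, sums to one,
# with the most mass on the highest-scoring coordinate.
```

The same recipe applies to any CO oracle maximizing a linear objective (shortest paths, matchings, flows); only the oracle call changes.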
For an up-to-date and comprehensive introduction to these foundational methods, we refer to Blondel and Roulet [2024]. # 2.3. Contextual Stochastic Optimization Although the introductory section established the basic mechanics of CSO, its full scope warrants further elaboration to appreciate its versatility. In this framework, the context $x$ plays a pivotal, dual role. First, it defines the structural parameters of the optimization instance itself, determining the feasible set $\mathcal{Y}(x)$. This dependence allows the framework to accommodate dynamic environments where the constraints change from instance to instance, a standard requirement in operational problems. Second, the context $x$ serves as a carrier of predictive information regarding the uncertainty $\xi$. Unlike traditional stochastic optimization approaches, which often struggle to condition decisions on high-dimensional or complex feature spaces, COAML seamlessly integrates this contextual information into the decision pipeline. By leveraging the representational power of NNs, the policy can exploit subtle correlations between the context and the uncertainty to minimize expected costs. Example 1 (continued). Returning to the SVSP of Example 1, consider that the context $x$ comprises the weather forecast, a signal highly correlated with the flight delays $\xi$. Furthermore, the set of flights to be covered—and consequently the feasible solution set $\mathcal{Y}(x)$—varies naturally from week to week, making the optimization domain itself dependent on the instance $x$. # 2.4. Two-Stage Stochastic Optimization The COAML framework readily extends to two-stage stochastic optimization, a setting where decisions are partitioned into immediate "here-and-now" actions and subsequent "wait-and-see" recourse actions.
This problem class can be modeled as: $$ \min_{y \in \mathcal{Y}} \mathbb{E}\Big[\tilde{c}(x, y, \boldsymbol{\xi})\Big] \quad \text{where} \quad \tilde{c}(x, y, \boldsymbol{\xi}) = \min_{z \in \mathcal{Z}(x, y, \boldsymbol{\xi})} c(x, y, z, \boldsymbol{\xi}). $$ In this formulation, the cost function $\tilde{c}$ represents the value of the optimal second-stage recourse $z$, which is determined after the uncertainty $\xi$ is realized. The variable $x$ acts as the context, encoding both the instance definition and all information available for the first-stage decision. Structurally, these problems are a specific instance of the general CSO framework where the immediate cost is augmented by the outcome of a downstream optimization problem. Applying COAML in this setting typically involves learning a policy for the first stage that implicitly anticipates the expected recourse. Consequently, the combinatorial layer often involves solving a linear problem solely on the first-stage decision, effectively compressing the two-stage complexity into a single learned representation [cf. Dalle et al., 2022]. Alternatively, if the second-stage feasible set $\mathcal{Z}(x,y,\pmb{\xi})$ is independent of the uncertainty $\pmb{\xi}$, the layer can be constructed as a linear problem over both decision stages [Parmentier, 2025]. # 2.5. Data-Driven Optimization from Raw Inputs A particularly powerful application of COAML lies in data-driven optimization where the parameters of the optimization problem must be inferred from high-dimensional, unstructured data. In this setting, the context $x$ typically represents raw observations, such as satellite imagery, textual descriptions, or complex sensor streams. The uncertainty $\xi$ corresponds to the underlying ground-truth parameters of the optimization problem, such as edge costs, capacities, or utilities, which are not directly observable.
The cost function $c(x,y,\xi)$ evaluates the objective value of the solution $y$ under these true parameters. Traditional OR architectures require a manual feature engineering step to convert raw data into structured coefficients. COAML, by contrast, leverages deep NNs $\varphi_w$ to learn a direct mapping from raw perception to combinatorial reasoning. The network acts as a predictive model that extracts the latent parameters $\theta \approx \xi$ from the raw context $x$, which are then immediately processed by the combinatorial layer. This end-to-end approach ensures that the feature extraction is optimized specifically for the downstream decision quality, rather than for generic reconstruction accuracy. Example 2. Perception-based Shortest Paths. Vlastelica et al. [2020] introduce the Warcraft Shortest Path benchmark, a task that has become a standard reference for evaluating differentiable solvers. The problem involves computing shortest paths across terrain maps represented solely as pixel images. Here, the context $x$ is the raw image of the map, and the neural network $\varphi_w$, typically a convolutional architecture, must visually infer the movement cost $\xi$ associated with each terrain type (e.g., water, forest, grass). The oracle $\hat{y}$ corresponds to Dijkstra's algorithm, which takes these predicted costs and outputs the optimal path. Crucially, the supervision signal is often defined on the final path rather than the individual tile costs, forcing the network to learn the cost structure of the terrain purely through the optimization loop. # 2.6. Multistage Stochastic Optimization Let us conclude this section with a flagship application of COAML in OR: multistage stochastic optimization problems in a combinatorial setting. Unlike the previous examples, this setting is best described as a Markov decision process (MDP) rather than a static CSO problem. The system evolves in discrete time steps $t = 0,1,\ldots ,T$.
At each time step $t$, the system resides in a state $x_{t}\in \mathcal{X}_{t}$, and the decision maker must select a decision $y_{t}\in \mathcal{Y}_{t}$. The system then transitions to a new state $x_{t + 1}$ according to a (possibly unknown) transition probability $p(x_{t + 1}\mid x_t,y_t)$. The decision maker incurs a cost $c_{t}(x_{t},y_{t})$ that depends on the current state and the chosen decision. A solution to this problem is a policy $\pi :x_{t}\in \mathcal{X}_{t}\longmapsto y_{t}\in \mathcal{Y}_{t}$ that minimizes the expected cumulative cost: $$ \min_{\pi} \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{T} c_{t}\left(\boldsymbol{x}_{t}, \boldsymbol{y}_{t}\right) \mid \pi \right]. \tag{8} $$ Analogously to previous settings, policies do not need to be deterministic, and can be turned into conditional probability distributions. In this setting, classical dynamic programming provides exact solutions when the state space $\mathcal{X}_t$ is sufficiently small to permit enumeration and the optimization over $\mathcal{Y}_t$ remains tractable. Outside of this regime, existing methodologies typically compromise on one dimension of complexity. The multistage stochastic optimization literature offers robust methods for handling large, complex action spaces $\mathcal{Y}$, e.g., via stochastic dual dynamic programming, but generally assumes the state space $\mathcal{X}$ has moderate dimension. Conversely, deep RL excels at processing high-dimensional state spaces $\mathcal{X}$, e.g., images, but often struggles when the action space $\mathcal{Y}$ is large, discrete, or combinatorial. A significant gap therefore remains for problem settings where both $\mathcal{X}$ and $\mathcal{Y}$ are large and structured. COAML architectures naturally address this dual complexity by assigning the burden of state representation to the neural network and the burden of action feasibility to the combinatorial layer. Example 3.
Dynamic vehicle routing. The 2022 EURO meets NeurIPS vehicle routing competition [Kool et al., 2022] centered on a dynamic VRP (DVRP). The task involved a rapid delivery service using capacitated vehicles to serve customer requests originating from a depot. Each request had to be served within a specified time window. Requests arrived dynamically, and vehicles were dispatched in waves to serve them. At each wave's decision time $t$, the system state $x_{t}$ consists of the set of requests that have not yet been served. The decision $y_{t}$ involves selecting the subset of requests to be served by the vehicles dispatched at time $t$, as well as the corresponding routing plan. The objective is to find a policy $\pi$ that minimizes the expected total routing cost. Baty et al. [2024] observe that the decision problem over the feasible set $\mathcal{Y}_t(x_t)$ at each decision epoch $t$ can be formulated as a prize-collecting capacitated VRP (CVRP). In this variant of the CVRP, serving each request $v$ yields a prize $\theta_v$, and the goal is to maximize the total profit, defined as the sum of the collected prizes minus the routing cost. However, in the DVRP considered here, there is no explicit notion of prizes—only routing costs. To address this, the authors construct a policy $\pi_w$ as illustrated in Figure 3, in which a neural network $\varphi_w$ predicts the prize values $(\theta_v)_v$. The prize-collecting CVRP is then solved using these predicted prizes to produce the routing solution $y_t$. This approach demonstrated a significant performance improvement and ultimately won the competition by a considerable margin. # 3. Methodology Having introduced the formal setting and problem classes, we now turn to the methodological building blocks of COAML pipelines: architectures and learning algorithms. Crucially, the training methodology is not dictated by the problem class, but by the nature of the learning signal available during the offline training phase.
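Before turning to the individual paradigms, it helps to fix the shape of a COAML policy in code: a statistical model $\varphi_w$ maps the context to parameters $\theta$, and a CO-oracle maps $\theta$ to a feasible decision. The sketch below uses a deliberately simple GLM encoder and a toy cardinality-constrained selection oracle; all names are illustrative rather than taken from any specific pipeline in the literature.

```python
import numpy as np

def phi_w(w, features):
    """Statistical model: a GLM assigning one score per decision dimension,
    theta_i = w^T phi(i, x)."""
    return features @ w

def co_oracle(theta, capacity):
    """Toy CO-layer: pick the `capacity` highest-scoring items, i.e. solve
    max_y theta^T y  s.t.  sum(y) = capacity, y binary (solvable exactly)."""
    y = np.zeros_like(theta)
    y[np.argsort(theta)[-capacity:]] = 1.0
    return y

def policy(w, features, capacity):
    """COAML policy pi_w: compose the encoder with the CO-oracle, so every
    returned decision is feasible by construction."""
    return co_oracle(phi_w(w, features), capacity)

# Example: 5 candidate items, 3 features each, select exactly 2.
rng = np.random.default_rng(0)
features = rng.standard_normal((5, 3))
w = np.array([1.0, -0.5, 0.2])
y = policy(w, features, capacity=2)
```

Whatever the training paradigm, learning only adjusts $w$; feasibility is always enforced by the oracle, never by the statistical model.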
We distinguish three primary learning paradigms:

Supervised learning: When a dataset of expert decisions or ground-truth optima $(x_{i},\bar{y}_{i})$ is available, the policy can be trained to replicate these targets by minimizing a divergence measure or loss $\mathcal{L}(\pi_w(x_i),\bar{y}_i)$, a process commonly referred to as imitation learning (IL).

Empirical cost minimization (ECM): When explicit targets are unavailable but the decision maker has access to the cost function $c$ and historical realizations $\xi_{i}$ of the uncertainty $\pmb{\xi}$, the policy can be trained to directly minimize the empirical cost $\sum_i c(x_{i},\pi_{w}(x_{i}),\xi_{i})$. This approach is particularly suitable for white-box settings where the system physics are known.

Reinforcement learning: When neither targets nor an explicit cost function are available and the cost can only be observed as a scalar reward signal following interaction with an environment, the policy can be trained using RL techniques. This is the standard approach for sequential or black-box environments.

Figure 3: COAML policy for the EURO-NeurIPS DVRP. Yellow nodes represent requests, the grey square the depot, and arrows the dispatched routes. The illustration is a courtesy of Léo Baty.

When considering multistage stochastic optimization problems, these learning paradigms adapt to the sequential nature of the problem:

Imitation learning: If expert trajectories or optimal policy demonstrations are available, the policy can be trained to clone this behavior using SL, effectively reducing the sequential problem to a series of static classification or regression tasks.

Reinforcement learning: When the transition dynamics are unknown, complex, or black-box, the policy must be trained via interaction. The agent observes the state $x_{t}$, generates a combinatorial action $y_{t}$, and updates parameters to minimize cumulative costs based on the sparse reward signal.

Direct policy search: This is the dynamic equivalent of ECM.
Given a differentiable simulator or a fixed dataset of historical trajectories, we can optimize the policy parameters to directly minimize the total empirical cost accumulated over the horizon. The rest of the section introduces the architectures and the learning paradigms. Among these, ECM plays a particularly central role in COAML, serving as both the conceptual foundation and the source of many open challenges. We therefore devote more space to ECM than to the other components. # 3.1. Architectures When applying COAML to a specific problem, the first task is to design the architecture of the policy $\pi_w$. This implies choosing the CO-oracle $\hat{y}$ and the statistical model $\varphi_w$. The choice of $\hat{y}$ is application dependent and dictated by the structure of the feasible set $\mathcal{Y}(x)$ as well as the algorithms that scale well for this class of problems. In contrast, the choice of $\varphi_w$ is more generic: it determines how contextual information and problem features are encoded into parameters $\theta$ that interact with the combinatorial oracle, see Figure 4. The design challenge is therefore not only predictive accuracy but the alignment of the architecture with the downstream CO-layer. Figure 4: Schematic COAML pipeline. An input instance $x$ is encoded by a statistical model $\varphi_w$, which parametrizes a CO oracle $\hat{y}$ to produce a decision $\pi_w(x)$. Different encoder architectures can be used depending on the problem class: independent scoring models, graph neural networks (GNNs), sequential/constructive models, or domain-informed encoders. Independent scoring models. A first family of approaches relies on assigning independent scores to the dimensions of the vector space where $\mathcal{Y}(x)$ lies. Each dimension $i$ is represented by a feature vector $\phi (i,x)$, and a simple parametric model assigns it a score $\theta_{i}$. A generalized linear model (GLM) is the simplest such choice: $$ \varphi_ {w} (x) = \left(\theta_ {i}\right) _ {i \in \{1, \dots , d (x) \}} \quad \text {where} \quad \theta_ {i} = w ^ {\top} \phi (i, x). \tag {9} $$ These scores parametrize the CO-layer, which then computes a feasible solution by optimizing over $\mathcal{Y}(x)$. This approach is highly scalable and interpretable, but it ignores structural dependencies between elements. Example 1 (continued). In the SVSP, the dimensions of the flow polytope correspond to the arcs $a$ of the digraph. A vector of features $\phi(a,x)$ summarizes the characteristics of each prospective connection $a$ in the instance $x$, and a GLM predicts arc costs as $\theta_a = w^\top \phi(a,x)$ [Parmentier, 2021]. The deterministic VSP solver then uses these predicted arc costs to generate routing solutions. A natural extension of this idea is to replace the linear model by a feed-forward NN $\psi_w$, applied in parallel to each dimension: $$ \varphi_ {w} (x) = \left(\theta_ {i}\right) _ {i \in \{1, \dots , d (x) \}} \quad \text {where} \quad \theta_ {i} = \psi_ {w} \circ \phi (i, x). \tag {10} $$ This increases expressive power while retaining scalability, and is often used in CSO tasks. Graph neural networks. Many combinatorial problems exhibit strong relational structure. GNNs exploit this by encoding elements as vertices of a graph $G$, with edges capturing feasible interactions such as temporal or spatial adjacency. Message passing aggregates local information into context-aware embeddings, which are then mapped to scores $\theta_{i}$: $$ \varphi_ {w} (x) = \psi_ {w} (\tilde {\phi}) \quad \text {where} \quad \tilde {\phi} = \left(\phi (i, x)\right) _ {i