In a nutshell, SeaPearl is an intelligent, generic solver guided by Reinforcement Learning (RL) to solve NP-hard discrete optimization problems using Constraint Programming.

At this point, you may feel a little confused by all these concepts if you've never heard of them before. No worries, let's start from the beginning :).
								 
								 
								
What are discrete optimization problems?
									 
									 
									
Discrete optimization problems are problems for which an outcome is to be optimized and for which there is a finite number of solutions. For example, consider the following problem:

A deliveryman has to deliver products from a factory to all the customers, located at varying distances from each other as detailed below (left). In practice, we represent the problem as a graph with values on the edges (right). The goal for the deliveryman is to minimise the total distance travelled to serve all the customers; this is the objective function of the problem, under the constraint that every customer must be delivered.
										 
										 
										 
									 
									
										 
										 
Of course, there are several solutions to this problem and some are better than others: here, solution 1 is better than solution 2 because its route has a shorter total distance. For each problem, there may exist one or more optimal solutions, but proving optimality (i.e. solving the problem exactly) requires, in the worst case, having enumerated all possible solutions.
										 
										 
									 
									
										 
									 
									
										 
										 
In this small problem it is possible to list all possible solutions and choose the best one, but this is no longer feasible for larger problems with more nodes: the TSP belongs to the class of NP-hard problems, which cannot be solved in polynomial time (unless P = NP).

That said, it may not be necessary to prove the optimality of a solution; simply finding good enough solutions is often sufficient. The proof of optimality is then sacrificed for the benefit of execution time.
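To make the combinatorial explosion concrete, here is a small, purely illustrative brute-force enumeration in Python (the distance matrix is made up for the example). With n customers there are n! orderings of the customers to try, so this approach stops being viable very quickly (10! is already 3,628,800 tours).

```python
from itertools import permutations

# Toy symmetric distance matrix: node 0 is the factory, nodes 1..4 are customers.
# The distances are invented for the sake of the example.
dist = [
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]

def tour_length(tour):
    """Total distance of the route factory -> customers -> factory."""
    path = (0, *tour, 0)
    return sum(dist[a][b] for a, b in zip(path, path[1:]))

# Brute force: try every ordering of the customers and keep the shortest tour.
best = min(permutations(range(1, 5)), key=tour_length)
print(best, tour_length(best))
```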
									 
								 
								 
								
									Constraint Programming
									 
									 
									
Now that we have introduced the notion of discrete optimisation problems, we will discuss a method used for searching for solutions: constraint programming (CP). CP is a paradigm for solving discrete optimization problems. It provides methods for declaring the problem and solving it with a solver that automates the constraint propagation mechanism (the process of inferring the possible values of the remaining variables given the values of the variables already assigned; this is the process everyone uses to solve a Sudoku, for example). For more information about it, please look at this presentation from Enric Rodríguez-Carbonell.

The strength of CP comes from the diversity of the problems it is able to solve. It can handle, among other things, non-linear constraints and variables of different types (boolean, integer, ...).

To understand how the search works, let's go back to our problem above. Let's consider that the deliveryman starts his delivery tour at the factory.

Each node of the graph can be considered as a variable of the problem. Each variable can take as values the nodes it is connected to; this defines its domain of definition. Finally, constraints govern the values taken simultaneously by the variables (e.g. you can't go through the same node twice).
										 
										 
Below is how a search is conducted in a CP solver, which alternates decision phases and fix-point phases (constraint propagation):
										
- The red assignments are the ones obtained after a variable/value assignment
- The green assignments are the ones inferred directly by the constraint propagation mechanism
 
										 
									 
									
										 
									 
									As can be seen, constraint propagation is a very powerful mechanism that allowed us
									to automatically infer the only possible values for 3 of the 5 variables in the
									problem.
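To give a feel for this alternation between decisions and propagation, here is a minimal, self-contained sketch of a CP-style search in Python. It is not SeaPearl's implementation: the toy problem below only enforces an all-different constraint over set-based domains, which is enough to see the decision and fix-point phases at work.

```python
# Minimal, illustrative CP search: alternate decisions and propagation (fix-point).
# This is a sketch, not SeaPearl's API; domains are plain sets and the only
# constraint enforced is "all variables take different values".

def propagate(domains):
    """Fix-point: remove assigned values from the other domains until nothing changes."""
    changed = True
    while changed:
        changed = False
        for var, dom in domains.items():
            if len(dom) == 1:                      # var is fixed
                value = next(iter(dom))
                for other, other_dom in domains.items():
                    if other != var and value in other_dom:
                        other_dom.discard(value)   # inferred ("green") removal
                        changed = True
    return all(domains.values())                   # False if some domain became empty

def search(domains):
    """Depth-first search: decide, propagate, backtrack on failure."""
    if not propagate(domains):
        return None
    if all(len(d) == 1 for d in domains.values()):
        return {v: next(iter(d)) for v, d in domains.items()}
    # Decision phase: pick an unfixed variable and try each value ("red" assignment).
    var = next(v for v, d in domains.items() if len(d) > 1)
    for value in sorted(domains[var]):
        copy = {v: set(d) for v, d in domains.items()}
        copy[var] = {value}
        solution = search(copy)
        if solution is not None:
            return solution
    return None

print(search({"x1": {1, 2, 3}, "x2": {1, 2}, "x3": {1, 2}}))  # {'x1': 3, 'x2': 1, 'x3': 2}
```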
									 
									 
During the decision phase, two decision heuristics are to be considered:

- Variable Selection Heuristic: select the next variable to assign
- Value Selection Heuristic: select the value to assign to the previously selected variable
									 
The design of these heuristics can have a tremendous impact on search performance; it is what makes the difference between an advanced Sudoku player, who will find a solution on the first try, and a beginner, who makes multiple errors and changes before finding a solution. Moreover, for some problems, we don't even know heuristics that work well in the general case.
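For illustration, here is what simple hand-crafted heuristics can look like in code; the function names and the dictionary-of-sets domain representation are ours, not SeaPearl's. The variable heuristic is the classic "first-fail" rule (smallest domain first) and the value heuristic simply tries the smallest value.

```python
# Illustrative hand-crafted heuristics (names and data layout are ours, not SeaPearl's).
# Domains are represented as a dict mapping variable names to sets of candidate values.

def select_variable(domains):
    """First-fail heuristic: branch on the unfixed variable with the smallest domain."""
    unfixed = {v: d for v, d in domains.items() if len(d) > 1}
    return min(unfixed, key=lambda v: len(unfixed[v]))

def select_value(domains, variable):
    """Value heuristic: try the smallest value first."""
    return min(domains[variable])

domains = {"x1": {1, 2, 3}, "x2": {2, 3}, "x3": {1, 3}}
var = select_variable(domains)          # "x2" (smallest domain, ties broken by order)
print(var, select_value(domains, var))  # x2 2
```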
									 
This is where we bring in Reinforcement Learning!
									 
									 
		
	 
	
How to learn a decision-making process?
		
			 
			 
The idea is to train a model using RL to learn a smart Value Selection Heuristic by successively solving thousands of similar problems drawn from a specific distribution, and then to evaluate the performance of this heuristic on problems never seen during training. The resolution process still relies on CP, which guarantees the validity of the returned solutions, while the agent is in charge of branching during the search. We propose a generic approach where a single agent can learn problems of a different nature; for this purpose, we present a generic graph representation of any problem, built from the characteristics of its CP definition.
			 
			 
The RL algorithm used is Deep Q-Learning (DQN). The associated reward is based on the objective function, i.e. the quality of the solution returned.
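For reference, DQN learns $\hat{Q}$ by regressing towards the standard one-step Bellman target; here $r_t$ denotes the reward and $\gamma$ the discount factor, which are standard DQN quantities rather than values specified in this post:
\begin{equation}
y_t = r_t + \gamma \max_{a'} \hat{Q}(S_{t+1}, a')
\end{equation}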
Generic Graph representation:
Every CP problem is defined by a set of variables, the values that can be taken by these variables, and a set of constraints on these variables. The idea is to encode each of these entities as a node in a graph and to connect these nodes according to whether:
			
				- A Value is part of a Variable's domain of definition
 
				- A Variable is involved in a Constraint
 
			 
This graph is dynamically updated throughout the resolution of instances during the training process. Here is a little example, just to get a sense of things:
		 
		 
		 
		
			
		 
		 
		
The advantage of this method is that it allows the entire information to be encoded in a structure (a graph) that can be processed by a neural network. Each node comes with node features allowing us to identify, among other things, the type of constraint of a Constraint node or the value of a Value node.
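As a quick illustration, such a tripartite representation can be sketched in plain Python as below; the adjacency-list layout, node naming and the toy instance are our own illustrative choices, not SeaPearl's internal data structures.

```python
# Illustrative tripartite graph for a tiny CP instance (not SeaPearl's internal structure).
# Node types: variables, values and constraints. Edges follow the two rules above:
# value <-> variable when the value belongs to the variable's domain, and
# variable <-> constraint when the variable is involved in the constraint.

domains = {"x1": {1, 2}, "x2": {2, 3}}
constraints = {"c_alldiff": ["x1", "x2"]}   # e.g. x1 != x2

nodes = (
    [("var", v) for v in domains]
    + [("val", u) for u in sorted(set().union(*domains.values()))]
    + [("con", c) for c in constraints]
)
edges = (
    [(("var", v), ("val", u)) for v, dom in domains.items() for u in dom]
    + [(("con", c), ("var", v)) for c, scope in constraints.items() for v in scope]
)

print(len(nodes), "nodes,", len(edges), "edges")   # 6 nodes, 6 edges
# When propagation removes a value from a domain, the corresponding variable-value edge
# disappears, which is how the graph is dynamically updated during the search.
```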
Neural Pipeline for Variable/Value assignment:
Now that we have defined our input, recall that our goal is to infer a variable/value assignment. Let's consider for the moment that the variable selection heuristic is deterministic, so that the input is the graph representation and a variable on which to branch. This is where we are:
		 
		 
		 
		
			
		 
		 
		
Given a state, the RL agent is parameterized by a DQN network that outputs the Q-values associated with every possible value selection in the domain of the selected variable $v$. The tripartite state is fed into a neural approximator, the learner $\hat{Q}$, which consists of two parts: an encoder that learns contextual embeddings of the nodes of the graph representation, and a decoder which, given these node embeddings, estimates the optimal policy in order to obtain a powerful heuristic.
Graph neural network encoder:
Graph convolutional networks (GCN) constitute a very convenient way to learn contextual node embeddings, and have been largely used for this purpose in reinforcement learning for discrete optimization. Due to the heterogeneous nature of our representation, we opted for a heterogeneous GCN composed of several graph convolutional layers.
			 
			 
Considering a variable $v_i$, a constraint $c_j$ and a value $u_k$, they are respectively described by raw features $V_i, C_j, U_k$, of respective dimensions $d_v, d_c, d_u$. First, a type-wise linear combination of the raw features computes the input features of dimension $d$ for the GNN, such that $\mathbf{h}_{v_{i}}^{0} = V_{i}w_v$, $\mathbf{h}_{c_{j}}^{0} = C_{j}w_c$ and $\mathbf{h}_{u_{k}}^{0} = U_{k}w_u$.
Then, we recursively perform $N$ graph convolution operations on the nodes of the graph representation. At step $t$, a convolution can be formulated as:

\begin{align*}
\mathbf{h}_{v_{i}}^{t+1}&=\phi_{v}\left(\mathbf{V}_{i}: \mathbf{h}_{v_{i}}^{t}: \bigoplus_{c_{j} \in \mathcal{N}_{c}\left(v_{i}\right)} \mathbf{h}_{c_{j}}^{t}: \bigoplus_{u_{k} \in \mathcal{N}_{u}\left(v_{i}\right)} \mathbf{h}_{u_{k}}^{t}\right)\\
\mathbf{h}_{c_{j}}^{t+1}&=\phi_{c}\left(\mathbf{C}_{j}: \mathbf{h}_{c_{j}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(c_{j}\right)} \mathbf{h}_{v_{i}}^{t}\right)\\
\mathbf{h}_{u_{k}}^{t+1}&=\phi_{u}\left(\mathbf{U}_{k}: \mathbf{h}_{u_{k}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(u_{k}\right)} \mathbf{h}_{v_{i}}^{t}\right)
\end{align*}
where $\phi_{v}, \phi_{c}, \phi_{u}$ are one-layer perceptrons (an affine transformation followed by an activation function), $:$ denotes the concatenation operation, $\mathcal{N}_{v},\mathcal{N}_{c},\mathcal{N}_{u}$ represent the type-specific neighborhoods, and $\bigoplus$ is a feature-wise aggregation function.
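As a rough illustration, one such heterogeneous convolution step can be written with numpy as below. The sum aggregation, ReLU activation, random weights and toy shapes are assumptions made for the sketch; this is not SeaPearl's actual encoder.

```python
import numpy as np

# One heterogeneous graph convolution step, as a rough numpy sketch (not SeaPearl's
# actual encoder). Assumptions for the example: sum aggregation for the neighborhood
# operator, ReLU activation, random weights and toy dimensions (d = 4).

rng = np.random.default_rng(0)
d = 4
n_var, n_con, n_val = 3, 2, 5

# Current embeddings h^t and raw features for each node type.
h_var = rng.normal(size=(n_var, d))
h_con = rng.normal(size=(n_con, d))
h_val = rng.normal(size=(n_val, d))
feat_var = rng.normal(size=(n_var, 2))   # V_i
feat_con = rng.normal(size=(n_con, 2))   # C_j
feat_val = rng.normal(size=(n_val, 2))   # U_k

# Adjacency matrices of the tripartite graph: variable-constraint and variable-value edges.
A_vc = rng.integers(0, 2, size=(n_var, n_con)).astype(float)
A_vu = rng.integers(0, 2, size=(n_var, n_val)).astype(float)

def perceptron(x, out_dim, key):
    """One-layer perceptron: affine transformation followed by a ReLU activation."""
    k = np.random.default_rng(key)
    w, b = k.normal(size=(x.shape[1], out_dim)), k.normal(size=out_dim)
    return np.maximum(x @ w + b, 0.0)

# h_v^{t+1} = phi_v([V_i : h_v^t : sum over neighboring constraints : sum over neighboring values])
h_var_next = perceptron(np.concatenate([feat_var, h_var, A_vc @ h_con, A_vu @ h_val], axis=1), d, key=1)
# h_c^{t+1} = phi_c([C_j : h_c^t : sum over neighboring variables])
h_con_next = perceptron(np.concatenate([feat_con, h_con, A_vc.T @ h_var], axis=1), d, key=2)
# h_u^{t+1} = phi_u([U_k : h_u^t : sum over neighboring variables])
h_val_next = perceptron(np.concatenate([feat_val, h_val, A_vu.T @ h_var], axis=1), d, key=3)

print(h_var_next.shape, h_con_next.shape, h_val_next.shape)  # (3, 4) (2, 4) (5, 4)
```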
			 
			 
Downstream neural network decoder:
Once the contextual node embeddings are computed, a decoder is used to convert them into an actionable policy.

Our architecture consists in first transforming the embedding of the variable $v$ currently branched on and the embeddings of all the possible values by feeding them into two different multi-layer perceptrons. The transformed embedding of each value $u \in \mathcal{D}_v$, concatenated with the transformed embedding of the branching variable $v$, then passes through a last multi-layer perceptron that outputs the approximated Q-value for each pair ($v$, $u$).
			 
			 
\begin{equation}
\hat{Q}(S_t,a) = \hat{Q}\left(\{s_t,v\},u\right) = \phi_q\left(\phi_v(\mathbf{h}_{v}^{N}):\phi_u(\mathbf{h}_{u}^{N})\right)
\end{equation}
where $\phi_{q}, \phi_{u}, \phi_{v}$ are multi-layer perceptrons.
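Here is a self-contained numpy sketch of this decoding step; the shapes, random weights and the single affine layers standing in for the multi-layer perceptrons are assumptions for the example, not SeaPearl's actual code.

```python
import numpy as np

# Illustrative decoder sketch (not SeaPearl's actual code): given the final embedding of
# the branching variable v and the embeddings of the values in its domain D_v, output one
# approximated Q-value per pair (v, u). Shapes and random weights are assumptions.

rng = np.random.default_rng(0)
d = 4
h_v = rng.normal(size=(1, d))   # h_v^N: contextual embedding of the branching variable
h_u = rng.normal(size=(3, d))   # h_u^N: embeddings of the 3 values in the domain of v

def perceptron(x, out_dim, key):
    """Stand-in for a multi-layer perceptron: one affine layer followed by a ReLU."""
    k = np.random.default_rng(key)
    w, b = k.normal(size=(x.shape[1], out_dim)), k.normal(size=out_dim)
    return np.maximum(x @ w + b, 0.0)

phi_v_out = perceptron(h_v, d, key=1)                      # phi_v(h_v^N)
phi_u_out = perceptron(h_u, d, key=2)                      # phi_u(h_u^N) for each value
pairs = np.concatenate([np.repeat(phi_v_out, len(h_u), axis=0), phi_u_out], axis=1)
q_values = (pairs @ rng.normal(size=(2 * d, 1))).ravel()   # phi_q: final scoring layer
print(q_values)                                            # one Q-value per candidate value
```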
			 
			 
Once the Q-values are approximated, the explorer can exploit the learned values and greedily choose the best action, or decide otherwise (for example, take a random action with probability $\epsilon$). This tradeoff between exploitation and exploration is necessary in early learning, when the estimate of the Q-values is still very poor and many states have never been visited before.
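A minimal sketch of this $\epsilon$-greedy trade-off (the variable names are ours):

```python
import random

# Minimal epsilon-greedy action selection (standard DQN exploration; variable names
# are ours). q_values maps each candidate value u to its approximated Q-value.

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random value, otherwise exploit the best one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))       # exploration
    return max(q_values, key=q_values.get)         # exploitation (greedy choice)

print(epsilon_greedy({2: 0.7, 3: 1.4, 5: -0.2}))   # usually 3, sometimes a random value
```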
			 
			 
			Here is a summary of the value decision pipeline:
		 
		 
		 
		
			
		 
		 
		
		 
		
	 
	