Papers by Shalabh Bhatnagar

arXiv (Cornell University), Jan 26, 2017
In this paper, we analyze the behavior of stochastic approximation schemes with set-valued maps i... more In this paper, we analyze the behavior of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee. We prove that after a large number of iterations if the stochastic approximation process enters the domain of attraction of an attracting set it gets locked into the attracting set with high probability. We demonstrate that the above result is an effective instrument for analyzing stochastic approximation schemes in the absence of a stability guarantee, by using it obtain an alternate criteria for convergence in the presence of a locally attracting set for the mean field and by using it to show that a feedback mechanism, which involves resetting the iterates at regular time intervals, stabilizes the scheme when the mean field possesses a globally attracting set, thereby guaranteeing convergence. The results in this paper build on the works of V.S. Borkar, C. Andrieu and H. F. Chen , by allowing for the presence of set-valued drift functions.
arXiv (Cornell University), Feb 6, 2016
The Random early detection (RED) active queue management (AQM) scheme uses the average queue size... more The Random early detection (RED) active queue management (AQM) scheme uses the average queue size to calculate the dropping probability in terms of minimum and maximum thresholds. The effect of heavy load enhances the frequency of crossing the maximum threshold value resulting in frequent dropping of the packets. An adaptive queue management with random dropping (AQMRD) algorithm is proposed which incorporates information not just about the average queue size but also the rate of change of the same. Introducing an adaptively changing threshold level that falls in between lower and upper thresholds, our algorithm demonstrates that these additional features significantly improve the system performance in terms of throughput, average queue size, utilization and queuing delay in relation to the existing AQM algorithms.

arXiv (Cornell University), 2022
In this paper, we use concepts from supervisory control theory of discrete event systems to propo... more In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe are given in terms of a finite-state automaton; and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequence are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian actionsequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.

Mathematics of Operations Research, Aug 1, 2022
In this paper, we consider the stochastic iterative counterpart of the value iteration scheme whe... more In this paper, we consider the stochastic iterative counterpart of the value iteration scheme wherein only noisy and possibly biased approximations of the Bellman operator are available. We call this counterpart as the approximate value iteration (AVI) scheme. Neural networks are often used as function approximators, in order to counter Bellman's curse of dimensionality. In this paper, they are used to approximate the Bellman operator. Since neural networks are typically trained using sample data, errors and biases may be introduced. The design of AVI accounts for implementations with biased approximations of the Bellman operator and sampling errors. We present verifiable sufficient conditions under which AVI is stable (almost surely bounded) and converges to a fixed point of the approximate Bellman operator. To ensure the stability of AVI, we present three different yet related sets of sufficient conditions that are based on the existence of an appropriate Lyapunov function. These Lyapunov function based conditions are easily verifiable and new to the literature. The verifiability is enhanced by the fact that a recipe for the construction of the necessary Lyapunov function is also provided. We also show that the stability analysis of AVI can be readily extended to the general case of set-valued stochastic approximations. Finally, we show that AVI can also be used in more general circumstances, i.e., for finding fixed points of contractive set-valued maps.

arXiv (Cornell University), Jan 18, 2006
The measure-theoretic definition of Kullback-Leibler relative-entropy (KL-entropy) plays a basic ... more The measure-theoretic definition of Kullback-Leibler relative-entropy (KL-entropy) plays a basic role in the definitions of classical information measures. Entropy, mutual information and conditional forms of entropy can be expressed in terms of KL-entropy and hence properties of their measure-theoretic analogs will follow from those of measure-theoretic KL-entropy. These measure-theoretic definitions are key to extending the ergodic theorems of information theory to non-discrete cases. A fundamental theorem in this respect is the Gelfand-Yaglom-Perez (GYP) Theorem (Pinsker, 1960, Theorem. 2.4.2) which states that measure-theoretic relative-entropy equals the supremum of relative-entropies over all measurable partitions. This paper states and proves the GYP-theorem for Rényi relative-entropy of order greater than one. Consequently, the result can be easily extended to Tsallis relative-entropy.

2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), 2018
This paper considers two important problems -on the supply-side and demand-side respectively and ... more This paper considers two important problems -on the supply-side and demand-side respectively and studies both in a unified framework. On the supply side, we study the problem of energy sharing among microgrids with the goal of maximizing profit obtained from selling power while at the same time not deviating much from the customer demand. On the other hand, under shortage of power, this problem becomes one of deciding the amount of power to be bought with dynamically varying prices. On the demand side, we consider the problem of optimally scheduling the time-adjustable demand -i.e., of loads with flexible time windows in which they can be scheduled. While previous works have treated these two problems in isolation, we combine these problems together and provide a unified Markov decision process (MDP) framework for these problems. We then apply the Q-learning algorithm, a popular model-free reinforcement learning technique, to obtain the optimal policy. Through simulations, we show that the policy obtained by solving our MDP model provides more profit to the microgrids.
Telecommunication Systems, 2016
The Random early detection (RED) active queue management (AQM) scheme uses the average queue size... more The Random early detection (RED) active queue management (AQM) scheme uses the average queue size to calculate the dropping probability in terms of minimum and maximum thresholds. The effect of heavy load enhances the frequency of crossing the maximum threshold value resulting in frequent dropping of the packets. An adaptive queue management with random dropping (AQMRD) algorithm is proposed which incorporates information not just about the average queue size but also the rate of change of the same. Introducing an adaptively changing threshold level that falls in between lower and upper thresholds, our algorithm demonstrates that these additional features significantly improve the system performance in terms of throughput, average queue size, utilization and queuing delay in relation to the existing AQM algorithms.

arXiv (Cornell University), Oct 19, 2021
Learning optimal behavior from existing data is one of the most important problems in Reinforceme... more Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL). This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy). As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the "off-policy" setting compared to the "on-policy" setting where new data from the policy updates will be utilized in learning. This work proposes an off-policy natural actorcritic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. The existing natural gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both policy and value functions. This often leads to sub-optimal learning in many RL applications. On the other hand, our proposed algorithm utilizes compatible features that enable one to use arbitrary neural networks to approximate the policy and the value function and guarantee convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.
arXiv (Cornell University), Nov 8, 2018

IEEE Wireless Communications Letters, Oct 1, 2018
We consider the problem of tracking an intruder using a network of wireless sensors. For tracking... more We consider the problem of tracking an intruder using a network of wireless sensors. For tracking the intruder at each instant, the optimal number and the right configuration of sensors has to be powered. As powering the sensors consumes energy, there is a trade off between accurately tracking the position of the intruder at each instant and the energy consumption of sensors. This problem has been formulated in the framework of Partially Observable Markov Decision Process (POMDP). Even for the state-of-the-art algorithm in the literature, the curse of dimensionality renders the problem intractable. In this paper, we formulate the Intrusion Detection (ID) problem with a suitable state-action space in the framework of POMDP and develop a Reinforcement Learning (RL) algorithm utilizing the Upper Confidence Tree Search (UCT) method to solve the ID problem. Through simulations, we show that our algorithm performs and scales well with the increasing state and action spaces.
arXiv (Cornell University), Oct 9, 2020

arXiv (Cornell University), Jun 16, 2019
We consider the problem of two-player zero-sum game. In this setting, there are two agents workin... more We consider the problem of two-player zero-sum game. In this setting, there are two agents working against each other. Both the agents observe the same state and the objective of the agents is to compute a strategy profile that maximizes their rewards. However, the reward of the second agent is negative of reward obtained by the first agent. Therefore, the objective of the second agent is to minimize the total reward obtained by the first agent. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the max-min reward (of first player), starting from a given state is called the equilibrium value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation. Successive relaxation has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the two-player zero-sum games. We prove that, under a special structure, this technique computes the optimal solution faster than the techniques in the literature. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm.

arXiv (Cornell University), Aug 25, 2017
We consider the problem of minimizing the difference in the demand and the supply of power using ... more We consider the problem of minimizing the difference in the demand and the supply of power using microgrids. We setup multiple microgrids, that provide electricity to a village. They have access to the batteries that can store renewable power and also the electrical lines from the main grid. During each time period, these microgrids need to take decision on the amount of renewable power to be used from the batteries as well as the amount of power needed from the main grid. We formulate this problem in the framework of Markov Decision Process (MDP), similar to the one discussed in . The power allotment to the village from the main grid is fixed and bounded, whereas the renewable energy generation is uncertain in nature. Therefore we adapt a distributed version of the popular Reinforcement learning technique, Multi-Agent Q-Learning to the problem. Finally, we also consider a variant of this problem where the cost of power production at the main site is taken into consideration. In this scenario the microgrids need to minimize the demand-supply deficit, while maintaining the desired average cost of the power production.
2019 5th International Conference on Control, Automation and Robotics (ICCAR), 2019
In this paper, we present a complete description of the hardware design and control architecture ... more In this paper, we present a complete description of the hardware design and control architecture of our custom built quadruped robot, called the Stoch. Our goal is to realize a robust, modular, and a reliable quadrupedal platform, using which various locomotion behaviors are explored. This platform enables us to explore different research problems in legged locomotion, which use both traditional and learning based techniques. We discuss the merits and limitations of the platform in terms of exploitation of available behaviours, fast rapid prototyping, reproduction and repair. Towards the end, we will demonstrate trotting, bounding behaviors, and preliminary results in turning. In addition, we will also show various gait transitions i.e., trot-to-turn and trot-to-bound behaviors.

2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2020
With the research into development of quadruped robots picking up pace, learning based techniques... more With the research into development of quadruped robots picking up pace, learning based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radius. These policies are then augmented using a higher level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network based filter that takes in target velocity, radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires much less expert demonstrations compared to standard neural network architectures. Finally, we demonstrate experimentally these results in the in-house quadruped Stoch 2.
IEEE Transactions on Automatic Control, Apr 1, 2004
Mathematics of Operations Research, Nov 1, 2020
With 12,500 members from nearly 90 countries, INFORMS is the largest international association of... more With 12,500 members from nearly 90 countries, INFORMS is the largest international association of operations research (O.R.) and analytics professionals and students. INFORMS provides unique networking and learning opportunities for individual professionals, and organizations of all types and sizes, to better understand and use O.R. and analytics tools and methods to transform strategic visions and achieve better outcomes. For more information on INFORMS, its publications, membership, or meetings visit

arXiv (Cornell University), Apr 1, 2016
The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gra... more The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish, asymptotically. In particular, sufficient conditions are presented for both stability (almost sure boundedness of the iterates) and convergence of GD with bounded, (possibly) non-diminishing gradient errors. In addition to ensuring stability, such an algorithm is shown to converge to a small neighborhood of the minimum set, which depends on the gradient errors. It is worth noting that the main result of this paper can be used to show that GD with asymptotically vanishing errors indeed converges to the minimum set. The results presented herein are not only more general when compared to previous results, but our analysis of GD with errors is new to the literature to the best of our knowledge. Our work extends the contributions of Mangasarian & Solodov, Bertsekas & Tsitsiklis and Tadić & Doucet. Using our framework, a simple yet effective implementation of GD using simultaneous perturbation stochastic approximations (SP SA), with constant sensitivity parameters, is presented. Another important improvement over many previous results is that there are no 'additional' restrictions imposed on the step-sizes. In machine learning applications where step-sizes are related to learning rates, our assumptions, unlike those of other papers, do not affect these learning rates. Finally, we present experimental results to validate our theory.
arXiv (Cornell University), Oct 27, 2021
Q-learning is a popular reinforcement learning algorithm. This algorithm has however been studied... more Q-learning is a popular reinforcement learning algorithm. This algorithm has however been studied and analysed mainly in the infinite horizon setting. There are several important applications which can be modeled in the framework of finite horizon Markov decision processes. We develop a version of Q-learning algorithm for finite horizon Markov decision processes (MDP) and provide a full proof of its stability and convergence. Our analysis of stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equations (O.D.E) method. We also demonstrate the performance of our algorithm on a setting of random MDP along with an application on smart Grids.

IEEE Control Systems Letters, Jul 1, 2020
In this paper, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was propo... more In this paper, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was proposed in the Reinforcement Learning (RL) literature to handle slow convergence of Watkins' Q-learning. In most RL algorithms such as Qlearning, the Bellman equation and the Bellman operator play an important role. It is possible to generalize the Bellman operator using the technique of successive relaxation. We use the generalized Bellman operator to derive a simple and efficient family of algorithms called Generalized Speedy Q-learning (GSQL-w) and analyze its finite time performance. We show that GSQLw has an improved finite time performance bound compared to SQL for the case when the relaxation parameter w is greater than 1. This improvement is a consequence of the contraction factor of the generalized Bellman operator being less than that of the standard Bellman operator. Numerical experiments are provided to demonstrate the empirical performance of the GSQLw algorithm.
Uploads
Papers by Shalabh Bhatnagar