(Nanowerk News) Children who are first learning to walk may walk too fast and fall, or bump into furniture. However, that element of cause and effect teaches them invaluable information about how their bodies move through space so they can avoid falling in the future.
Machines learn in much the same way humans do, including from their mistakes. For many machines, however, such as self-driving cars and power systems, learning on the job at the expense of human safety is a problem. As machine learning matures and spreads, there is growing interest in applying it to highly complex, safety-critical autonomous systems. But the promise of this technology is held back by the safety risks inherent in the training process and beyond.
A new research paper challenges the idea that an unlimited number of trials is needed to learn safe actions in unfamiliar environments. The paper, published recently in the journal IEEE Transactions on Automatic Control (“Learning to Act Safely With Limited Exposure and Almost Sure Certainty”), presents a new approach that guarantees safe actions are learned with near-complete certainty, while balancing optimality, exposure to hazardous situations, and how quickly unsafe actions are identified.
“Typically, machine learning seeks the optimal solution, which can generate more errors along the way. That is a problem when a mistake could mean hitting a wall,” explains Juan Andres Bazerque, assistant professor of electrical and computer engineering at the University of Pittsburgh's Swanson School of Engineering, who led the research with Enrique Mallada, associate professor at Johns Hopkins University. “In this research, we show that learning safe policies is fundamentally different from learning optimal policies, and that it can be done in a discrete and efficient way.”
The team studied two scenarios to illustrate the concept. In the first, under reasonable assumptions about exploration, they designed an algorithm that detects all unsafe actions within a finite number of rounds. In the second, they tackled the problem of finding optimal policies for Markov decision processes (MDPs) subject to near-certain (almost-sure) constraints.
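To make "detecting unsafe actions within a finite number of rounds" concrete, here is a minimal Python sketch built on hypothetical assumptions that do not come from the paper: a toy table (`HAZARD`) giving each state-action pair a made-up probability of triggering an unsafe event, and an explorer (`detect_unsafe`) that tries every unflagged pair once per round and flags a pair the first time it misbehaves. If every unsafe pair triggers an event with probability at least `p_min`, a given unsafe pair survives `n` rounds undetected with probability `(1 - p_min)**n`, so the rounds needed for a target miss probability `delta` can be computed in advance (`rounds_needed`). This is a textbook illustration, not the authors' algorithm, and all names are illustrative.

```python
import math
import random

# Hypothetical hazard table (invented for illustration, not from the paper):
# probability that executing a (state, action) pair triggers an unsafe event.
# A pair counts as unsafe if this probability is positive.
HAZARD = {
    ("s0", "left"): 0.0,
    ("s0", "right"): 0.3,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.0,
}


def detect_unsafe(hazard, rounds, seed=0):
    """Try each not-yet-flagged pair once per round, for a fixed number of rounds.

    A pair is flagged the first time it triggers an unsafe event and is never
    tried again, so every flagged pair costs exactly one exposure here.
    """
    rng = random.Random(seed)
    flagged, exposures = set(), 0
    for _ in range(rounds):
        for pair, p in hazard.items():
            # rng.random() is in [0, 1), so a pair with p == 0 is never flagged.
            if pair not in flagged and rng.random() < p:
                flagged.add(pair)
                exposures += 1
    return flagged, exposures


def miss_probability(p_min, rounds):
    """Chance that a pair with hazard p_min is still unflagged after `rounds` tries."""
    return (1.0 - p_min) ** rounds


def rounds_needed(p_min, delta):
    """Smallest number of rounds whose miss probability is at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p_min))


flagged, exposures = detect_unsafe(HAZARD, rounds=20)
print(sorted(flagged), exposures)
print(rounds_needed(0.3, 0.01))  # 13
```

In this naive scheme a safe pair can never be flagged, and each detected unsafe pair costs exactly one unsafe event, so the sketch only shows how detection confidence grows with the number of rounds; a more cautious explorer would have to trade slower detection against fewer risky trials.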
Their analysis highlights the trade-off between the time required to detect unsafe actions in the underlying MDP and the level of exposure to unsafe events. An MDP is useful here because it provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision maker.
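An MDP can be written down as a small data structure: the agent chooses an action, and chance chooses the outcome. The Python sketch below uses a hypothetical three-state model (the states `start`, `mid`, `goal`, `crash`, the rewards, and the discount factor `gamma` are all invented for illustration). It treats any action with a nonzero chance of reaching `crash` as unsafe, removes those actions, and then runs standard value iteration over the actions that remain. This simplified one-step safety filter is chosen for clarity; it is not the method from the paper.

```python
# Hypothetical toy MDP (invented for illustration, not taken from the paper).
# P[state][action] is a list of (probability, next_state, reward) outcomes:
# the agent picks the action, chance picks the outcome.
P = {
    "start": {
        "fast": [(0.8, "goal", 10.0), (0.2, "crash", 0.0)],
        "slow": [(1.0, "mid", 1.0)],
    },
    "mid": {
        "fast": [(0.9, "goal", 10.0), (0.1, "crash", 0.0)],
        "slow": [(1.0, "goal", 5.0)],
    },
    "goal": {"stay": [(1.0, "goal", 0.0)]},
    "crash": {"stay": [(1.0, "crash", 0.0)]},
}
UNSAFE_STATES = {"crash"}


def safe_actions(P, unsafe):
    """Keep only actions with zero probability of landing in an unsafe state.

    One-step check: it assumes every safe next state itself has at least one
    safe action, which holds in this toy model.
    """
    return {
        s: [a for a, outcomes in acts.items()
            if all(ns not in unsafe for p, ns, _ in outcomes if p > 0)]
        for s, acts in P.items()
        if s not in unsafe
    }


def q_value(P, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])


def value_iteration(P, allowed, gamma=0.9, iters=200):
    """Standard value iteration restricted to the actions listed in `allowed`."""
    V = {s: 0.0 for s in P}  # states absent from `allowed` keep value 0
    for _ in range(iters):
        for s, acts in allowed.items():
            V[s] = max(q_value(P, V, s, a, gamma) for a in acts)
    policy = {
        s: max(acts, key=lambda a: q_value(P, V, s, a, gamma))
        for s, acts in allowed.items()
    }
    return V, policy


V_safe, pi_safe = value_iteration(P, safe_actions(P, UNSAFE_STATES))
V_all, pi_all = value_iteration(P, {s: list(acts) for s, acts in P.items()})
print(pi_safe)  # {'start': 'slow', 'mid': 'slow', 'goal': 'stay'}
print(pi_all)   # {'start': 'slow', 'mid': 'fast', 'goal': 'stay', 'crash': 'stay'}
```

With the safety filter, both non-terminal states choose `slow` and the value of `start` is 5.5; without it, `mid` switches to the riskier `fast` action and the value of `start` rises to about 9.1. In this toy model, that gap is the price of ruling out any chance of a crash.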
To validate their theoretical findings, the researchers ran simulations that confirmed the identified trade-offs. The simulations also suggest that incorporating safety constraints can speed up the learning process.
“This research challenges the common belief that learning safe behavior requires an infinite number of trials,” Bazerque said. “Our results show that by effectively managing the trade-off between optimality, exposure to unsafe events, and detection time, we can achieve guaranteed safety without an unlimited amount of exploration. This has significant implications for robotics, autonomous systems, artificial intelligence, and more.”