Anthropic unveils new method to understand and control artificial neural networks

Anthropic, an artificial intelligence startup, has announced a key breakthrough in understanding and controlling the behavior of artificial neural networks, the mathematical models that power AI algorithms. The research, published in a blog post last week, could have significant implications for the safety and reliability of next-generation AI, giving researchers and developers more insight and influence over the actions their models take.

The challenge of neural network understanding

Artificial neural networks are inspired by the human brain, mimicking the way that biological neurons signal to one another. They are trained on data, rather than programmed to follow explicit rules, and they produce AI models that can display a wide range of behaviors. Although the math behind these neural networks is well understood, the reason why it results in particular behaviors is not. This makes it very difficult to control AI models and prevent so-called “hallucinations”, where AI models generate fake or inaccurate answers.

Anthropic explains that neuroscientists face a similar challenge in trying to understand the biological basis for human behavior. They know that the neurons firing in a person’s brain must somehow implement their thoughts, feelings and decision-making, but they can’t identify how it all works.

“Individual neurons do not have consistent relationships to network behavior,” Anthropic wrote. “For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.”

The discovery of features within neurons

To get a better grasp of what neural networks are doing, Anthropic’s researchers looked beyond individual neurons and identified units they call features, each of which corresponds to a pattern (a linear combination) of neuron activations rather than to any single neuron. By studying these features, the researchers believe they can get a firmer grip on how neural networks behave.

In an experiment, Anthropic studied a small transformer language model, decomposing 512 artificial neurons into more than 4,000 features that represent contexts such as DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements and more. They found that the behavior of the individual features was significantly more interpretable than that of the neurons.
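The article does not spell out how the decomposition works, so the following is only a minimal sketch of the general idea, assuming each feature is a direction in neuron-activation space and that a feature's activation is a rectified projection onto that direction. The toy sizes (2 neurons, 3 features), the feature names and the functions `encode` and `decode` are hypothetical and chosen for illustration; in a real system the directions would be learned from data rather than written by hand.

```python
import math

# Hypothetical toy dictionary: 3 feature directions packed into only 2 neurons.
# The directions are unit vectors spaced 120 degrees apart (an assumption made
# so that the example can be checked by hand).
FEATURES = {
    "dna": (1.0, 0.0),
    "legal": (-0.5, math.sqrt(3) / 2),
    "http": (-0.5, -math.sqrt(3) / 2),
}


def encode(neurons):
    """Return feature activations: ReLU of the projection onto each direction."""
    acts = {}
    for name, direction in FEATURES.items():
        projection = sum(d * x for d, x in zip(direction, neurons))
        acts[name] = max(0.0, projection)
    return acts


def decode(feature_acts):
    """Rebuild neuron activations as a weighted sum of feature directions."""
    size = len(next(iter(FEATURES.values())))
    neurons = [0.0] * size
    for name, act in feature_acts.items():
        for i, d in enumerate(FEATURES[name]):
            neurons[i] += act * d
    return neurons


# Neuron activations produced by the "legal" feature alone, at strength 2.
neurons = [-1.0, math.sqrt(3)]
acts = encode(neurons)
rebuilt = decode(acts)

print(acts)     # only the "legal" feature is active
print(rebuilt)  # matches the original neuron activations
```

In this toy setup the first neuron takes a nonzero value whenever any of the three features is active, so reading that neuron alone says little, while each feature activation points to a single context.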

To validate the approach, Anthropic had a blinded human evaluator score the interpretability of individual features and neurons. The features received much higher scores than the neurons, which Anthropic presents as strong evidence that features can serve as a basis for understanding neural networks.

The potential for manipulating neural network behavior

With additional research, Anthropic believes that it may be able to manipulate these features to control the behavior of neural networks in a more predictable way. Ultimately, this could prove critical in overcoming the challenge of understanding why language models behave as they do.
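One way to picture what manipulating a feature might mean is to shift a model's internal activations along that feature's direction. The sketch below is an assumption-laden illustration, not Anthropic's procedure: the two-neuron vectors, the direction `feature_direction` and the helper `steer` are all hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(neurons, direction, strength):
    """Shift neuron activations along a feature direction (hypothetical intervention)."""
    return [x + strength * d for x, d in zip(neurons, direction)]


feature_direction = [0.6, 0.8]  # hypothetical unit-length feature direction
before = [0.2, -0.1]            # hypothetical neuron activations
after = steer(before, feature_direction, strength=3.0)

# How strongly the activations align with the feature, before and after.
print(dot(before, feature_direction))
print(dot(after, feature_direction))
```

Because the direction has unit length, the alignment with the feature rises by exactly the chosen strength; whether such a shift changes a real model's behavior in a predictable way is the open question the researchers describe.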

Anthropic also discovered a surprising universality among the features: each feature was largely consistent across different AI models. This means that the same feature could be used to understand and influence different types of neural networks.

“This suggests that there may be some underlying structure to neural network behavior that is independent of architecture or training data,” Anthropic wrote. “This could have profound implications for transfer learning and generalization across domains.”

Anthropic’s research is part of its broader mission to create scalable, reliable and aligned AI systems that can benefit humanity. The startup was founded in 2021 by a group of former OpenAI employees, including siblings Dario Amodei and Daniela Amodei. Its Series A round raised $124 million from investors including Jaan Tallinn and Dustin Moskovitz.

Rian Lord
