Eureqa! How a Bored Undergrad’s Algorithm Achieved 3,000 Academic Citations
In 2005, I was a bored undergrad who hated his major in electrical and computer engineering.
I had started school to learn about hardware and CPU design. But instead, I had become infatuated with developing AI and machine learning algorithms. As my personal experiments with algorithm development escalated, it led to introductions to professors at Cornell University – where I was studying – who were researching AI and evolutionary algorithms.
At the time, nearly all AI research was focused on generating the most accurate predictions – especially around images and text. But I started to wonder if AI could help the scientific process itself instead. Could we devise algorithms that bootstrap discoveries? Could they discover answers that were not just accurate but concise and elegant? What would they find if we unleashed them on new experimental data? These questions became my obsession in graduate school, and they ultimately led me to work on a new algorithm and application called Eureqa to answer them.
I knew that developing AIs to think like scientists would be a challenging problem (the clue is in the name). But I hadn’t expected to learn so much about how we – people – formulate and communicate our expectations. And why we so often get the unexpected back in return.
Eureqa and Genetic Algorithms
Genetic or evolutionary algorithms mimic natural selection by eliminating weaker solutions to a given problem and allowing the stronger ones to be developed into future generations of possible solutions.
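As a rough illustration of that loop (a generic toy sketch, not Eureqa's actual code), the example below evolves bit strings toward a simple goal. The names `evolve` and `count_ones`, the population size, and the mutation rate are all assumptions chosen for the example.

```python
import random


def evolve(fitness, genome_length=20, population_size=30, generations=40,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over bit strings (illustrative only)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = []

    for _ in range(generations):
        # Selection: rank by fitness and eliminate the weaker half.
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        survivors = population[:population_size // 2]

        # Reproduction: recombine two survivors, then mutate the child.
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)

        # Survivors carry over unchanged, so the best score never gets worse.
        population = survivors + children

    population.sort(key=fitness, reverse=True)
    best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


def count_ones(genome):
    """Hypothetical objective: the more 1s in the genome, the fitter it is."""
    return sum(genome)


best, history = evolve(count_ones)
print(history[0], "->", history[-1])
```

Because the stronger half survives each round untouched, the best score in `history` can only stay level or improve from one generation to the next.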
Eureqa uses this approach to mimic the scientific process. Fundamentally, it proposes new features and designs the right experiments to test for mathematical relationships in a given dataset. Its objective is to find the most explainable model for those relationships with the fewest assumptions; that’s a totally open-ended problem. (“Explainable” matters. Newton’s law is beautiful – and useful – because it’s elegant.)
This enables Eureqa and other genetic algorithms to outperform many machine learning techniques in highly complex but real-world situations. For example, many time series forecasting problems suffer from unpredictable or spurious events in the data, and many machine learning models build extremely complex structures in an attempt to fit them. That extra complexity can come at a large cost, obscuring the true signals and leading to rapid model drift when used in production. Some other approaches, like reinforcement learning, are interesting because they can continue to adapt in production. However, they too suffer from this same risk of generating highly complex explanations for simple behavior, while also being risky to control in production.
That usually is not a problem for Eureqa, which instead searches for the simplest possible relationships, and uses the data that is available. It can also be given more prior knowledge than other learning algorithms, guiding it towards a particular structure or asking it only to improve specific aspects of the model.
Upping the Stakes
During the course of my research leading to Eureqa, I joined the Creative Machines AI Lab at Cornell University, where I eventually went to grad school.
In that setting, I had the opportunity to add some degree of intelligence to robots, which are not really very smart, by modeling how their physical parts interact and how they themselves interact with their environment.
But during this time I was becoming more and more fascinated with Eureqa’s potential for automating and accelerating scientific discovery in physics. Nearly every law in physics is based on some sort of symmetry or conserved quantity (i.e. conservation laws). And identifying these laws is the basis for most of physics’ biggest discoveries. I was convinced Eureqa could help scientists to find them more quickly and more often.
The difficulty with algorithmic discovery of conservation laws, however, is that you’re liable to get a lot of trivial conservations. For example, f(t)=0 is conserved over time, but it’s a super boring conservation that explains almost nothing. So, the real breakthrough is developing an algorithm that can search for and find meaningful physical relations, while avoiding trivial solutions.
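To make the distinction concrete, here is a simplified check on a simulated harmonic oscillator. It is only a sketch of the idea and not the criterion Eureqa uses: a candidate counts as "meaningful" here if it stays constant along each trajectory but takes different values on different trajectories. The function names, the three amplitudes, and the tolerance are assumptions made for the example.

```python
import math


def trajectory(amplitude, steps=200):
    """States (x, v) of a unit-frequency harmonic oscillator."""
    return [(amplitude * math.cos(0.05 * i), -amplitude * math.sin(0.05 * i))
            for i in range(steps)]


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def classify(candidate, tolerance=1e-9):
    """Label a candidate f(x, v) as not conserved, trivial, or meaningful."""
    trajectories = [trajectory(a) for a in (1.0, 2.0, 3.0)]
    per_trajectory = [[candidate(x, v) for x, v in states]
                      for states in trajectories]

    # Conserved: the value does not change along any single trajectory.
    if not all(spread(values) < tolerance for values in per_trajectory):
        return "not conserved"

    # Trivial: the value is the same on every trajectory, so it says
    # nothing about the state of the system (f = 0 is the classic case).
    levels = [values[0] for values in per_trajectory]
    return "trivial" if spread(levels) < tolerance else "meaningful"


print(classify(lambda x, v: 0.0))            # trivial
print(classify(lambda x, v: x * x + v * v))  # meaningful (energy-like)
print(classify(lambda x, v: x + v))          # not conserved
```

The constant candidate passes the "conserved" test perfectly, which is exactly why a search that only rewards constancy is drawn toward it.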
And the success in optimizing Eureqa to detect these laws led to the publication in 2009 of a research paper in Science, the prestigious – and awesome – academic journal.
Navigating the Dark Arts of Feature Engineering
Part of the reason Eureqa excels in the search for conservation laws and other mathematical relationships is that in its efforts to discover explainable models, it also automatically figures out the optimal feature transformations from scratch and without human intervention.
On its journey to provide the highest accuracy with the least complexity, it can filter through the almost limitless potential transformations you might make on data. It tests ever simpler ways to achieve similar or improved accuracy, billions of times. In one sense it’s a flashlight for the dark arts of feature engineering, the attribute that differentiates the most skilled data scientists.
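One common way to formalize "highest accuracy with the least complexity" is to keep only the candidates that no other candidate beats on both counts, often called a Pareto front. The sketch below assumes that framing; the `Candidate` type, the complexity scores, and the error values are made up for illustration and do not come from Eureqa.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    expression: str
    complexity: int   # hypothetical size measure, e.g. number of terms
    error: float      # hypothetical fit error; lower is better


def pareto_front(candidates):
    """Keep candidates that no other candidate dominates."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity <= c.complexity and o.error <= c.error
            and (o.complexity < c.complexity or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.complexity)


candidates = [
    Candidate("y = 3", 1, 0.90),
    Candidate("y = 2*x", 3, 0.20),
    Candidate("y = x*x", 3, 0.60),            # same size, worse fit
    Candidate("y = 2*x + sin(x)", 6, 0.05),
    Candidate("y = x + x + 0*x", 7, 0.20),    # bigger, no more accurate
]

for c in pareto_front(candidates):
    print(c.complexity, c.error, c.expression)
```

Each surviving expression is the most accurate one at its level of complexity, which leaves a short menu of trade-offs rather than a single answer.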
Eureqa’s automatic feature transformations have turned out to be particularly powerful in working with time series datasets to forecast the future. Time series often require heavy feature engineering to figure out the right lags and interactions – which is where Eureqa thrives.
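For readers who have not built lag features by hand, this is roughly what the manual step looks like. It is a plain illustration of the chore being described, not of how Eureqa performs it; `make_lag_features`, the sample series, and the chosen lags are hypothetical.

```python
def make_lag_features(series, lags):
    """Pair each value with the earlier values at the given lags."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = {f"lag_{k}": series[t - k] for k in lags}
        rows.append((features, series[t]))
    return rows


series = [10, 12, 15, 11, 13, 16]
rows = make_lag_features(series, lags=[1, 3])

for features, target in rows:
    print(features, "->", target)
```

Choosing which lags to include, and which of them to combine, is the part that normally takes trial and error.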
Through the Wormhole and into the Public Consciousness
I didn’t appreciate at the time the significance of the Science publication and how pivotal it would turn out to be.
But as a result of that paper, the work was covered in the science section of The New York Times, and I was invited onto NPR’s Radiolab and the Science Channel’s Through the Wormhole with Morgan Freeman. On those shows, we put Eureqa through its paces, challenging it to offer an underlying law of physics for how different physical systems behave (e.g. double pendulums, moving cars, etc.).
The experience was a sharp lesson in how scientific work can catch the public imagination. The media latched on to the original publication with the idea that AI can help automate scientific discoveries – exactly what had captured my imagination years before – even dubbing it a “robot scientist”.
The resulting interest prompted me to develop Eureqa as software that people could use. I got out of school, moved in with my girlfriend in Detroit, and coded for about 12 months straight.
Incredibly, in that first year, we had 30,000 users, which allowed me to bootstrap a company around Eureqa that was eventually acquired by DataRobot six years later.
As Relevant as Ever
In the years since the Science publication, Eureqa has been mentioned or cited in almost 3,000 academic studies.
And recently, there’s been a surge in Eureqa-inspired academic research into evolutionary algorithms that compete with Eureqa (which, I have to add, still performs extremely well).
The reason behind this interest may be society’s need to keep scaling our ability to explain complex systems. World events have emerged to create a perfect inflection point for genetic algorithms and global-scale problem solving. Certainly, they perform well solving hard optimization problems with limited info or in rapidly changing environments – like, say, a global pandemic.
COVID and the sudden changes in behavior it prompted have broken many models that were trained pre-pandemic. Retraining models to mitigate this is difficult because even after almost two years, there’s not a lot of training data available. And so, compared to machine learning and deep learning models that need months of training, Eureqa’s principle of finding the simplest model to explain a phenomenon makes it more robust to things like overfitting and noise. It’s also able to use smaller datasets more effectively, which is great for when you have less history or scenarios in which things are changing quickly.
What Algorithm Development Can Teach Us About Ourselves
My experience taking Eureqa from my dorm room to the lab, into corporate America and beyond, revealed a lot to me about what you might call the “soft” skills needed to build a useful algorithm. Curiosity and the ability to get excited about its potential impact are essential complements to sound development fundamentals and a good math foundation. Coming up with ideas, testing them systematically and brainstorming new approaches is extremely creative work.
At a deeper level, I’ve worked on Eureqa for so long and become so familiar with it that there are analogies with close personal relationships and all the frustration they can sometimes throw up. You can spend a lot of time building and nurturing an algorithm. But the crazy thing about developing new genetic algorithms is that you give them an objective function (e.g., the highest accuracy with the lowest complexity), and they’ll find the most creative ways of doing that. You let them loose and they achieve their goal, but in exactly the opposite way that you hoped they would. A lot of experimentation goes into defining the right rewards and operations within these algorithms to encode what it is that you really want, as is now encoded in Eureqa.
It’s a constant fight to adequately communicate your goals to the algorithm, which I think forces us to also confront how the way we articulate our requirements – to both machines and people – is often loaded with assumptions about context.
For me, that was an illuminating experience, and it’s one that’s inherent to most machine learning endeavors. You’re going to learn a lot about what your objectives really are when developing and refining algorithms. I know I did.