Klein Maetschke, GPLv3 <http://www.gnu.org/licenses/gpl-3.0.html>, via Wikimedia Commons

Science of Chess: The AI Revolution Hasn't Made Chess Better (yet!)

3 Jun 2026


The best chess engines can crush the best human players, but has their existence helped advance the frontiers of the game?

It can be easy to think about chess improvement as something that only happens on a personal scale. If you're reading this article, I imagine you'd love to be better at the game and you may spend some amount of time on opening courses, puzzle training, and other resources designed to make your understanding of the game more complete.

We can think about chess improvement at a grander scale, however. The best players of each era push the game forward into new directions, sometimes by offering specific opening novelties that force others to explore uncharted territory, sometimes by introducing new strategic approaches to the game that represent alternative ways of thinking (e.g., Nimzowitsch and the hypermodern approach or Steinitz's contributions to positional play), and sometimes by refuting old ideas that end up tossed in the dustbin of history. Chess is always changing and each innovative idea, no matter how large or small, teaches us more about how the game works and makes the games of our era look different than the ones that came before.

Of course, the way that change happens at this historical scale has itself been subject to change! For the vast majority of the game's history, advances came from individual players (or sometimes "schools" of players, both formal and informal) studying the game, experimenting with new ideas on their own or with a set of analysis partners, and occasionally debuting their innovations in dramatic fashion. This could sometimes backfire: Capablanca famously refuted the Marshall Gambit over-the-board in 1918, for example (though further study led to its adoption into grandmaster praxis, where it remains to this day).

Capablanca's annotations on his 1918 game with Marshall in which an attempted innovation (the Marshall Gambit) was not enough to win. In time, Marshall's idea was re-examined and better continuations than he attempted in this game led to the adoption of the gambit in master-level play.

Presently, however, this process looks much different, thanks to the availability of chess engines that can evaluate arbitrary positions much more effectively than the best human players. Engines like Stockfish have an "understanding" of the game (yes, I think the scare quotes are appropriate here - you can debate me in the comments if you like!) that is sufficiently deeper than ours that the phrase "computer move" has become common parlance to describe moves that may be optimal, but are so difficult to understand that it's unlikely a human player would ever come up with them. Though playing against the strongest instantiation of Stockfish feels a lot like being slowly (and sometimes quickly) crushed to death, it's obvious that it offers players at every level an unparalleled tool for exploration and improvement.

The kind of computer analysis that's incredibly easy to come by these days. This is produced by DroidFish, an app based on the Stockfish engine. Peter Österlund, Tord Romstad, Marco Costalba, Joona Kiiski, GPLv3 <http://www.gnu.org/licenses/gpl-3.0.html>, via Wikimedia Commons

But what exactly have chess engines done for the game? You can find a number of quotes from modern players like Magnus and Gukesh describing how their play has evolved as a result of training with Stockfish, but without some real detail this is what I would call "anecdata." Besides, even if we accept that specific players have changed their game due to working with a powerful engine, can we say something more general about whether chess engines have influenced chess in a measurable way?

That's the goal of the target article I'll discuss with you here. Specifically, the authors set out to determine if there was a measurable effect of what they call the two "AI Chess revolutions" that occurred (1) in the late 1990s (corresponding to the commercial availability of more powerful desktop machines) and (2) in the late 2010s (corresponding to the arrival of deep-learning approaches to chess engine development). To put it plainly, did the quality of chess rapidly change in response to these technological developments?

A quick proof-of-concept: AlphaZero and the sudden onset of better moves in Go

Here's a bit of context I found really fascinating in this paper: There is existing evidence from what we'd call "natural experiments" that this kind of sudden change in gameplay has happened before in response to tech advances. To stick with chess for a moment, West German players in the 1980s who had access to chess engines (albeit not spectacular ones) improved their play more than players in the Soviet Union who weren't able to incorporate engines into their training regimens.

Similarly, the impact of AlphaZero (a deep-learning model capable of improving its gameplay by self-play) on Go gameplay was also revealed by the same kind of natural stratification of players. In this case, Shin et al. (2021) observed that Korean Go players serving in the Army (and who thus couldn't access AlphaGo's games or the engine itself) showed less gameplay improvement compared to Korean players who did have access. Moreover, a granular analysis of Go move quality and novelty over time (Shin et al., 2023) also demonstrated a clear uptick in these metrics that corresponds to players having access to AlphaGo's play (see figure below).

Adapted from Figure 3 in Shin et al. (2023) - The novelty of human moves was declining steadily for decades, then took a mighty (and statistically meaningful!) leap after AlphaGo defeated the reigning World Champion.

Together these results make a compelling case for examining chess play at a similar level of granularity: Did chess players start making better moves once they had access to strong chess engines? Can we see the impact of sophisticated computer analysis on the quality of play?

Using Change Point Analysis to see if (and when) something changed

This kind of quantitative question - can we identify a moment in time when things changed? - turns out to be an interesting and challenging thing to think about statistically. As a cognitive scientist, I'm often very interested in quantifying the effects of change over time (I study the development of the visual system during childhood) and sometimes use data that's densely-sampled in time (like eyetracking or EEG data) to measure responses at critical moments. The thing is, I'm almost always using tools that assume I either know how to identify a time range I care about, or that rely on talking about the shape of the entire trajectory of performance across a span of time. What do you do if you don't know when (or if) a critical moment might be evident in your data?

Consider the data I'm showing you below, which is a plot of the volume of the Nile River year-to-year across a long-ish span of time. How would you go about finding out if something had ever happened during this span that marked a pivotal change in the river's output? If I had a good idea about when that specific point in time happened (maybe I have a hunch that the dashed vertical line marks an important historical moment), I could split my data into two groups, one before that moment and one after, and use any number of simple tests that allow you to compare two samples to see if they differ. There are cases where you might know this, but if we're talking about chess and the impact of computers, something like the availability and use of desktop PCs that can handle stronger chess engines is tough to localize in time. When exactly did that happen? How do we pick a moment **a priori** to guide our statistical testing? To be honest, if we really care about finding the crucial moment for change, we shouldn't try to pick it out ourselves - we should let the data tell us if it's there.

van den Burg, Gerrit JJ, and Christopher KI Williams. "An Evaluation of Change Point Detection Algorithms." arXiv preprint arXiv:2003.06222 (2020).
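If you'd like to see the "split the data where you think the change happened" approach in concrete form, here is a minimal Python sketch of one such two-sample comparison, a permutation test on the difference in means. Everything in it is an assumption for illustration: the numbers are made up (they are not the Nile record), and `permutation_test` and `split` are names I've invented here. Note that the split point is supplied by hand, which is exactly the weakness described above.

```python
import random

def permutation_test(before, after, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means between two samples."""
    rng = random.Random(seed)
    n_before = len(before)
    observed = abs(sum(before) / n_before - sum(after) / len(after))
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the labels: if nothing changed, any split is as good as ours
        rng.shuffle(pooled)
        left, right = pooled[:n_before], pooled[n_before:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value away from exactly zero
    return (hits + 1) / (n_perm + 1)

# Synthetic yearly values (NOT real data): the level drops after the 10th year
series = [1100, 1150, 1080, 1120, 1160, 1090, 1130, 1110, 1140, 1105,
          860, 900, 840, 880, 910, 850, 870, 890, 865, 855]
split = 10  # hand-picked "dashed line"; the test cannot tell us if this is the right place
p_value = permutation_test(series[:split], series[split:])
print(f"p = {p_value:.4f}")
```

A small p-value here only says the two halves differ given the split we chose; it says nothing about whether a different split would have been a better description of the data.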

To get closer to the kind of answer we want, you might try fitting some kind of function to a graph like this one so you have a structural description of how your variable changed over time. This is also fairly easy to do and there are plenty of tools for either fitting one curve to this data or finding the best "mixture" of curves that accounts for the shape of the graph. The problem is that these approaches also make it difficult to talk about punctate moments in time due to the fact that usually the curves we fit to data like this don't incorporate sharp "spikes" (or to be more technical, non-linearities like a delta function) in their functional form. This means that we might get a nice description of our data in terms of a combination of gradually unfolding contributing processes, but these won't be great for working out when something suddenly pushed the data around at one critical moment.

Luckily for us, statisticians spend a lot of time developing new methods to answer specific quantitative questions like this. Tools for identifying "change points" in time-series data have been around for some time and I found it a sort of neat rabbit hole to explore for a bit as I was preparing to write this piece. One of the more compelling demonstrations I could find of identifying a critical timepoint where a dataset changed was an investigation of coal mining accidents from the mid-19th century to the mid-20th century. I grew up near coal country in Pennsylvania and there's a family story involving my great-grandfather having to dig men out of a collapsed mine he owned, so I have a little bit of generational familiarity with the dangers associated with digging chunks of coal out of the ground. But how did those dangers change historically? Here's a cumulative graph of coal-mining accidents over time - because it's cumulative it will keep going up (flat is the best we could expect to see!), but our question is whether it looks like its rate of climb starts to slow down all of a sudden somewhere along the line. Do you feel like you see anything? Did coal-mining ever get suddenly safer?

Figure 1 from Jarrett (1979) depicting the cumulative count of coal-mine explosions over the course of about a century. Is there a critical moment when the curve begins to flatten? If so, when does it happen and can we work out what may have caused it?

Of course I don't expect you to have a perfect answer and even if you did we can't just defer to an eyeball test! Instead, the key statistical idea is to identify two independent functions: One that we'd like to use to account for the data before some critical moment and another that's only intended to account for what happens afterward. The important step from a stats standpoint is to include that critical moment (which we can just call "t") as something we're going to estimate from the data along with the other parameters that describe our "before-t" and "after-t" functions. In the paper that analyzed this coal-mining data, the authors end up with a new graph that indicates the likelihood that "t" happened at each timepoint in the original series.

Figure 1 from Raftery & Akman (1986). This plot shows you the probability density that a critical change-point in the cumulative graph I showed you up above happened at each year in the original data. That tallest spike suggests something important happened around 1889 that made that curve start to flatten meaningfully.

While there are a few serious-looking spikes in this graph, there is also clearly one that is head-and-shoulders above the rest. What happened between 1888 and 1892? This data about coal-mine explosions suggests that something changed that affected worker safety profoundly, so what's in the historical record? In a word: Unions. The Miners' Federation was formed in 1889, advocating strongly for improved working conditions in coal mines and apparently meeting with reasonable success! I realize this is a bit of a tangent, but I hope this gives you a sense of how this version of statistical analysis makes it possible to try and link specific moments in history to data that we've measured over its sweep.
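For the curious, the "estimate t along with everything else" idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the procedure from the coal-mining paper: I treat yearly counts as Poisson with one rate before the change and another after, put a Gamma(a, b) prior on each rate (a conjugate choice that lets the rates be integrated out), and put a uniform prior on where the change happens. The counts are invented, and the function names are hypothetical.

```python
import math

def log_marginal(total, n_years, a=1.0, b=1.0):
    """Log marginal likelihood of a run of Poisson counts under a Gamma(a, b) rate prior.
    Factors that do not depend on where the series is split are dropped."""
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + total) - (a + total) * math.log(b + n_years))

def change_point_posterior(counts):
    """Posterior probability that the rate changes after the first t observations,
    for t = 1 .. len(counts) - 1, assuming a uniform prior over t."""
    n = len(counts)
    grand_total = sum(counts)
    log_post = []
    running = 0
    for t in range(1, n):
        running += counts[t - 1]  # total count in the "before-t" segment
        log_post.append(log_marginal(running, t)
                        + log_marginal(grand_total - running, n - t))
    # Normalise in a numerically stable way (subtract the peak before exponentiating)
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return {t: w / norm for t, w in zip(range(1, n), weights)}

# Synthetic yearly accident counts (NOT real data): things get safer after year 15
counts = [4, 3, 2, 5, 3, 4, 2, 3, 4, 3, 5, 2, 3, 4, 3,
          1, 0, 1, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1]
posterior = change_point_posterior(counts)
most_likely_t = max(posterior, key=posterior.get)
print(most_likely_t, round(posterior[most_likely_t], 3))
```

Plotting `posterior` against t gives you the same kind of spiky picture as the figure above: most of the probability piles up around the point where the made-up accident rate drops.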

Have the two AI revolutions in chess impacted the quality of play (Bilalic et al., 2024)?

So what about the history of chess? Is there statistical evidence for "change points" associated with the presumed revolutions in the late 90s and late 2010s that the authors propose? To answer this question, the authors developed a dataset obtained from ChessBase's "Mega Database 2023" - a repository with 10 million games spanning the years 1985-2021. First, they identified the 20 best players in each year overall (a cast of characters that can and does rotate, mind you!) and also put together separate Junior (under 20) and Senior samples (over 65). Next, for each year they collected all of the games the top players took part in (only including tournament play - no blitz, no bullet) and examined all of the moves they made (excluding the first 10 moves and everything after move 60).

From here, the big goal is to treat the average quality of these moves in each year like time-series data that provides a historical record of how good humanity is at playing chess: How accurate are players' moves over this time span and do we see evidence of rapid change in the neighborhoods we might expect given the technological developments we know about? The authors use a few different metrics to characterize move quality, including centipawn loss and accuracy, both of which are measures of performance you may have heard of before. If you're not familiar with them, Accuracy is quite simple: How often did your move match the optimal move? As for centipawn loss, a centipawn is a unit of measure (equal to 1/100th of a pawn) for estimating the advantage one player has over another in a position. Having one fewer pawn than your opponent costs you a full point, but a difference in position even when material is equal can mean one player is better off than another by a smaller amount. For example, a +0.1 evaluation in your favor means that the arrangement of pieces on the board is better for you, to the tune of 1/10 as good as being a pawn up.

By extension, when we talk about centipawn loss as a measure of accuracy, we're referring to the difference between your advantage if you had made the best move (determined here by Stockfish 16) and your advantage after making a different move. The figure below shows you what these various measures of accuracy look like when we calculate the average value over time for the three groups of top players (Centipawn Loss is at the bottom left, with Accuracy and Optimality in the top two panels, left to right).
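To make those two definitions concrete, here is a small Python sketch of how average centipawn loss and the rate of matching the engine's top move could be computed for one game, using the same move window described above (skip the first 10 moves and everything after move 60). The `MoveEval` record and its fields are hypothetical names I've made up, the evaluations are invented, and the paper's exact definitions may differ in their details.

```python
from dataclasses import dataclass

@dataclass
class MoveEval:
    move_number: int    # full-move number within the game
    best_cp: int        # engine evaluation (centipawns, mover's point of view) after the engine's top move
    played_cp: int      # engine evaluation (centipawns, mover's point of view) after the move actually played
    matched_best: bool  # True if the played move was the engine's top choice

def in_window(move, first=11, last=60):
    """Keep only moves 11 through 60, mirroring the selection described in the text."""
    return first <= move.move_number <= last

def average_centipawn_loss(moves):
    """Mean gap between the best available evaluation and the one actually reached."""
    kept = [m for m in moves if in_window(m)]
    losses = [max(0, m.best_cp - m.played_cp) for m in kept]
    return sum(losses) / len(losses)

def best_move_rate(moves):
    """Fraction of moves in the window that matched the engine's top choice."""
    kept = [m for m in moves if in_window(m)]
    return sum(m.matched_best for m in kept) / len(kept)

# Invented evaluations for illustration only
moves = [
    MoveEval(5, 30, -200, False),   # ignored: inside the first 10 moves
    MoveEval(12, 30, 30, True),     # loss 0
    MoveEval(20, 50, 10, False),    # loss 40
    MoveEval(35, -20, -20, True),   # loss 0
    MoveEval(48, 100, 40, False),   # loss 60
    MoveEval(61, 0, -500, False),   # ignored: after move 60
]
print(average_centipawn_loss(moves))  # 25.0
print(best_move_rate(moves))          # 0.5
```

Averaging numbers like these over every qualifying game in a year is what turns individual games into one point on a yearly time series.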

Adapted from Figure 2 in Bilalic et al. (2024). A range of performance metrics including Accuracy, Move Optimality and Centipawn Loss plotted over time in the critical period for the Top 20 players overall (in Red), the Top 20 Junior players (in Blue) and the Top 20 Senior players (in Green). The authors model these trajectories with a mixture of continuous functions, but that approach does not allow inferences about critical time-points corresponding to sudden change.

Do you think you see anything? Is there a spike or two somewhere in there that signals either the onset of home computers with solid chess engines or the availability of deep neural-net models? If you'd like a visual aid to help you get a sense of what could be lurking in there, the graph below shows you how the strongest chess engine rating changed during this same time span. This pretty obviously has a marked change point right where you think it should - but is that reflected in the human data at all?

Adapted from Figure 3 in Bilalic et al. (2024) - A plot of how the Elo of the best chess engines changed during the critical period examined in this study. The authors suggest that one critical point for potential human improvement corresponds to increased access to these models in the late 1990s (even while AI Elo was improving slowly) and that the second corresponds to the sudden rise in Elo subsequent to advances in applying deep networks to chess in the late 2010s.

The graph below shows you that the answer is more complicated than you'd think. What you're looking at here is a plot much like the spiky figure I showed you above to indicate when the critical moment happened in the coal-mine disaster data, only now with a change-point analysis for each of our three player groups (Overall Top 20 at left, Senior Top 20 in the middle, and Junior Top 20 at the right). What you're looking for here is the thin blue line underneath those densely packed black dots. Peaks in that blue line indicate strong evidence for a change point at that moment in time.

Adapted from Figure 6 in Bilalic et al. (2024). Change-point analyses of Accuracy and move Optimality both reveal critical time-points for improvement in the Top 20 Junior and Senior players, but these occur at different times. The Top 20 Overall players give no indication of sudden change in response to either presumed "AI chess revolution."

The fact that you see some blue peaks on the bottom of some of these graphs means that there are some meaningful change points here, but it depends on what player group we're talking about! If we look at the Top 20 players overall (these are the top-flight experts in the database), there isn't really a sharp change in move quality at all, save for something happening right at the beginning of the sample. That is, **neither of the two presumed "AI revolutions" changed the trajectory of improvement for the best players!**

Breaking things down by age, however, reveals something interesting: Junior and Senior players have different critical timepoints in their data. The Top 20 Junior players do show statistical evidence of a sharp jump in accuracy in the late 1990s (remember, this is the arrival of home PCs that could support strong chess engines) but nothing else! On the other hand, the Top 20 Senior players show evidence of an accuracy jump in the 2010s (when deep-learning engines like AlphaZero show up). I don't want to drown you in figures more than I already have, but this pattern of results is stable when you break things down into White and Black moves, too, which is a nice way of looking at the reliability of the results in all three groups.

But what about Magnus? (or Gukesh, or...<insert your favorite GM>)

You might be a little troubled by the finding I highlighted in bold above - what do you mean access to strong engines didn't kick-start improved play at the top levels of chess? I mentioned at the outset that some of the best modern players have explicitly said that access to state-of-the-art chess engines changed their game. Just to underline this point, here's a quote from Magnus Carlsen that the authors of this study included in their paper:

"Yes, I have been influenced by my hero AlphaZero recently. Essentially I have become a very different player in terms of style than I was before, and it's been a great ride."

So what are we to make of this kind of statement in light of the data I just described? Well, what if we took a look at Magnus' play during this critical period of chess history to see what his various trajectories look like? The authors did just this as a sort of "case study" and you can see their plots of his performance over time on several different metrics.

Adapted from Figure 8 in Bilalic et al. (2024). Magnus Carlsen's various performance metrics, including Accuracy, Centipawn Loss, etc., do not carry evidence of a change-point corresponding to the adoption of AlphaZero into his training.

The bottom line? Perhaps Magnus' style of play changed dramatically after working with these engines, but his performance metrics don't show evidence of non-linear improvement anywhere. There are a lot of reasons to not think of this as terribly definitive, but I appreciated the attempt to put one player under the microscope, statistically speaking, to see whether we could find some special point when things changed. In Magnus' case, we can't.

Conclusions and next steps

Of course, saying that engines like Stockfish haven't affected the historical creep of chess improvement doesn't mean that they haven't been incredibly useful tools for the vast majority of players. I obviously benefit from being able to work through post-game analyses that can identify my blunders and show me all the better choices I could have made throughout a game. Still, what I think the data described in this paper show us is that following in the footsteps of these engines is much harder than it seems. I mentioned earlier that we talk about "computer moves" that seem almost alien in their strangeness in contrast to "natural" moves that we know human players will gravitate towards. To me, this suggests that even the best players have perhaps only been able to glean some hints from our computational overlords of how the game might be played differently, but wholly adopting the viewpoint of our strongest engines remains elusive.

Support Science of Chess posts!

Thanks as always for reading! If you're enjoying these Science of Chess posts and would like to send a small donation my way ($1-$5), you can visit my Ko-fi page here: https://ko-fi.com/bjbalas - Never expected, but always appreciated!

References

Bilalić, M., Graf, M., & Vaci, N. (2026). Computers and chess masters: The role of AI in transforming elite human performance. British Journal of Psychology, 117, 585–609. https://doi.org/10.1111/bjop.12750

Choi, S., Kang, H., Kim, N., & Kim, J. (2023). How does artificial intelligence improve human decision-making? Evidence from the AI-powered Go program. arXiv preprint arXiv:2310.08704.

Graf, M., Danek, A. H., Vaci, N., & Bilalić, M. (2023). Tracing cognitive processes in insight problem solving: Using GAMs and change point analysis to uncover restructuring. Journal of Intelligence, 11(5), 86.

Jarrett, R. G. (1979). A Note on the Intervals Between Coal-Mining Disasters. Biometrika, 66(1), 191–193. https://doi.org/10.2307/2335266

Raftery, A. E., & Akman, V. E. (1986). Bayesian analysis of a Poisson process with a change-point. Biometrika, 73(1), 85–89.

Shin, M., Kim, J., van Opheusden, B., & Griffiths, T. L. (2023). Superhuman artificial intelligence can improve human decision-making by increasing novelty. Proceedings of the National Academy of Sciences of the United States of America, 120(12), e2214840120. https://doi.org/10.1073/pnas.2214840120
