How to build a Chess Engine
Celebrating 100k games of Tofiks I want to share how I have been working on my engine.Intro
When I tell people that one of my favorite hobbies is building a chess engine, most people have no idea what a chess engine is. Those who know chess engines and also have some experience with programming often have a bit of awe. I feel a lot of engineers think that chess engines are somehow in a different plane of software. Nobody these days would bat an eye if you told them you programmed your own website or some app.
I do agree in some sense chess engines are a very different type of software than what most engineers build in their day jobs. But at the same time they are the same - they have a source code that is lines of code. They define types, functions, methods. You create releases. You test your code. Nothing is outlandish or alien. Yet there are a lot of differences and I would like to describe them. By the end of the article you should see that while there are huge differences there is nothing that would prevent even a beginner programmer getting into the hobby.
This might be a long and dense read but if you are struggling with engine development it should be a great resource of all the tools and methods to get you started and to ship really strong engines. If you're just a curious developer it might be an interesting window to peek through while passing by. If you are not a software developer I am afraid you will not find much enjoyment or value in this article but I would be happy to be proven wrong.
The tools you'll need
The main tool is going to be the programming language. This is probably the most important choice you will make but also the easiest.
It's very easy to make a strong case that a chess engine should be written in a compiled language like C/C++, rust, Zig, etc... They are king when it comes to performance. There is not much else that can rival them in memory management and compile optimization. I will spare you the details.
But there is nothing wrong picking any language you like - python, Java, JS, C#, Go... Or maybe you want to have those demonic challenges of building it entirely in assembly or postgres. It's a fun hobby. It's yours. Pick your own adventure. My recommendation is pick the one language you love or want to learn.
My choice was Go. I never intended writing an engine I just wanted to play with Go. I sat down to write some Go code and at the time Magnus was fighting Ian in a World Chess Championship match - so naturally the code was chess related.
Go is not the best choice for an engine but it is a very good one. Go is a compiled language and while it has a small run time and is garbage collected it can still produce programs that can trade blows with the languages in the S-Tier. Go has a very simple and pleasant syntax. It has a test and benchmark framework, debugger, profiler, dependency management all packaged with the language.
The tools you'll build
Some of the earliest versions of Tofiks had a off-by-one bug in the move generation of rooks. They could only move a maximum of 6 squares instead of 7 which would cover the full board distance. I observed some games where the A-file would open up and two unprotected rooks would stare each other down without being able to capture each other. This was the first time my professional software developer skills had failed me. I of course had written methods to generate rook moves. I had written tests where I set up board positions and generated rook moves and compared that the generated moves are the ones I expect. I had positions where the rook was in the middle of the board unopposed. Positions where it could capture pieces or the movement was blocked by other pieces. But there are so many different positions in chess testing them all is impossible. And those tests did not help me catch that bug.
perft
perft stands for performance test. It's a command for your chess engine perft <depth> and it counts all the moves and positions that can be reached at depth.
Example 1: Start position perft 1
You have 8 pawns. Each pawn has two moves - push one or two squares - 16 moves. You also have two knights each can move either to the edge of the board or towards the center. That is 4 more moves. In total perft 1 would give you the answer 20.
Example 2: Start position perft 2
Very simple white has 20 moves we calculated in perft 1 and now black has also the same mirrored 20 moves. So for every one of white's 20 moves black has a choice of 20 moves. So 20 * 20 = 400. So perft 2 is 400
Example 3: Start position perft 6
I have no idea. Even perft 3 would be complicated because the number of moves would depend on your first move and also black's response. The answer is 119,060,324. People have been crunching these numbers for a long long time and have published the perft results.
So when working on your move generation the first thing you should do is implement the perft run it against the known and verified perft results and you will know that your engine has correct implementation of move generation. This will save you a lot of time. I previously wrote a blog about board representations. At some point i had board[8][8] and board[64] before moving towards magic bitboards. The refactor work between each of them was never trivial but with the help of the perft command and the known results it was catching any bugs before Tofiks played a single game of chess and embarrassed himself.
FEN parser
There are perft results in that wiki article for six positions. Some positions have some castling rights considerations, some pins and en passant moves. It's important to test your move generation against all of them. Positions in chess are usually encoded in FEN notation which looks something like this:1r3rk1/pp3ppp/2p1b3/3nNn2/2BPN3/8/PP3PPP/3RR1K1 w - - 6 19
It holds the exact board representation, side to move, castling rights, etc... You can read about it here. This is the standard of sharing a chess position if you go to the board editor or analysis board on lichess you can copy paste that link to share the position.
Your chess engine needs to be able to parse this string and set up a board internally from it.
UCI
It's fun to print out characters on the command line. Add a new line return and render the board with letters or ASCII symbols. Don't.
Engines and GUIs should not live together. My recommendation build one or the other or both but keep them separate. There are alternatives to UCI like XBoard. Pick whatever you fancy but UCI is the golden standard accepted by all chess software. Link to uci protocol specification.
Once your engine speaks UCI it can be linked up with any chess GUI or CLI tool out there and it can play chess. You don't even need to implement all of it:
- uci, uciok
- isready, readyok
- ucinewgame, position
- go
- bestmove
- stop
- quit
I think this is the bare minimum that should let it play some games. Now that you have a working move generator you could pick a random legal move and play it by reporting it back to the GUI via bestmove command. You built a chess engine!
debug builds
When building for performance you should strip down to the bare minimum and anything that is not contributing to speed and strength should be stripped off. Leaving you with a black chess playing box. However, I highly recommend also keeping around some debug builds. Tofiks gathers a lot of stats on the utilization of the transposition table, performance of different techniques in the search - how well Null Move Pruning, Late Move Reduction, Futility Pruning, move ordering, etc... work. This is very valuable info but it contributes nothing to chess playing strength. This is kept as a separate build target only for development.
So far I have been focusing on very basic stuff without any insights into the more advanced techniques. There is no excuse for getting the basics wrong. If your engine produces wrong moves, can't recognize positions as inputs and does not speak any uci - there is very little you can do to progress. This is the bare minimum and needs to be a rock solid foundation before you can advance. I would say everything I described so far could be implemented in a few evenings or couple weekends depending on your experience as a programmer and familiarity of the language you picked.
The tools you'll use
We picked a language and we coded up some basic board representation and move generation code. We also added commands that allow us to interact with our program with a GUI or CLI. We actually have a little silly engine that plays random legal moves now. Not much but it a great start. At this point you would write an evaluation function and implement a search routine so your engine can pick good moves. I will not focus on that. I have other blog posts where I talk about computer chess topics. The rest of this article I will focus on the tools and methods how to support that work.
Version control
No matter how obvious this has to be stated. Writing code without some kind of version control is a nightmare. You need to organize your code. You want to deliver small improvements in separate commits. You want to see the history of your work and have the ability to throw away your current work and revert back to an earlier version. This has nothing to do with chess programming this is just a must writing any code.
Benchmarks
Chess programming is all about performance. You want to write a program that plays chess. Whatever change you make you hope the result will be a program that plays better than before. Often you will implement new more advanced techniques say changing your search from minimax to alpha-beta pruning. The algorithms produce exactly the same output but one can do so at a fraction of the cost. Your engine got smarter and stronger. However, often times you will want to make sure your existing algorithms work as fast as they can. Benchmarks are like tests - they run the same piece of code over and over. And while tests do that to verify results and behavior are as expected. Benchmarks simply measure the speed of execution. Pay attention to the number of operations, deep nested loops, branching if-else statements, types and sizes of variables that you use and allocations stack vs heap.
Optimization is a very deep topic and I will not go into it any deeper. This is just a teaser and to stress the importance of using good benchmarks. If you set out to improve your code, you should first know what and how it is doing now before you start work on improving it. I would highly discourage anyone to argue about code performance looking at source code. Experts could probably make good assertions but even then it's nothing more than a guess. Don't guess write benchmarks.
We already should have one benchmark - perft. Perft computes all the nodes from a position at depth it should also print out a nodes per second (nps) score along the total nodes. For move generation this is a good source of truth - is it generating moves faster?
Profiling
While benchmarks are good at measuring performance of isolated parts of code profiling allows you to gather run-time metrics of your code. A profile can show you where you program spends most CPU time and where it allocates memory. This is very valuable information and before you start writing benchmarks and then optimizing that code a profile should show you what is worth working on. I have spent many hours trying to optimize my move generation and making significant gains only to realize that it barely influences the performance at all because the move generation wasn't the bottleneck it was making the move and then taking it back.
Profile your engine before you set out to do any serious optimization work because without that you will not know what impact it could even make.
cutechess-cli and fastchess-cli
These are the core tools of engine development. They are command line tools that allows engines to play each other. All this prior talk about version control, perft, benchmarks and profiles is just a preamble. At the end of the day when building an engine all you should care about is the strength - their Elo score. Both of these tools allow you to play engines against each other. Record the games and their results, and it will output an Elo differential - how much stronger is one engine than the other.
This is the very root of the development flow. You sit down at your keyboard and you want to write more code to make your engine play better chess. You keep one version of the current engine the latest version as per your main branch on your version control. This is defines the threshold that needs to be beat. You check out a new branch and start working on your ideas of improving it. You compile it and you run a match between the two. After a number of games it will become clear - is your new work an improvement or not? Is your new version able to beat the baseline that you set aside at start? If yes then well done- the test passes. Commit, push and merge. Next time you sit down and want to keep improving you repeat the steps with the new baseline.
Sequential Probability Ratio Test (SPRT)
When you first start building an engine and you have a naive engine that just consists of a move generator that spits out random moves and you implement the most basic minimax algorithm that looks at a fixed depth of 2. It will crush the baseline every single game. The same will happen when you upgrade to alpha-beta - it will crush your minimax engine. The Elo gains on each step will be in the double or even triple digits. Confirming an improvement is obvious and cheap maybe tens of games will show already a huge improvement quickly. Sometimes you will think you made a huge improvement and when you test it, it fails miserably. You probably forgot a sign somewhere or had an obvious bug and once you find and fix it again you will crush the baseline.
This does not continue indefinitely. At some point your engine will start to plateau and all the low hanging fruit are picked and the next change only produces a marginal improvement and it takes thousands of games to verify if you gained a few points of Elo. Often you will attempt to implement a speculative pruning technique and it will regress, maybe there is a tiny bug, maybe the logic is fine but some constant needs adjusting and it just doesn't work your engine plays worse.
Cutechess and fastchess CLIs allow performing this process in a rigorous process. Check out the wiki on SPRT. But in short you want to test a hypothesis that your new changes are at least better by some Elo points within a margin of error. Fastchess will then play as many games as needed to either accept the hypothesis (the change is an improvement) or reject it (the change regressed or didn't improve enough to meet the hypothesis).
Simply it tests rigorously if the changes are working. Early wins come easy and they are clear but the stronger your engine becomes those gains become smaller and are not so obvious. This is the most important tool any serious engine needs to have.
Simultaneous Perturbation Stochastic Approximation (SPSA)
I want to say it again - Simultaneous Perturbation Stochastic Approximation. That's probably not a combination of words that you expected to read / hear when you woke up this morning. Again there is the wiki article for SPSA.
It's an algorithm to tune / optimize constants within your engine. Let me give you a small example. Late Move Reductions in my article about tree search I made a note that in alpha-beta it's important to search the best moves first. Engines put a lot of effort into guessing which moves should be searched first. In contrast, the later a move comes in that order the less confidence the engine has that it is a good move. So LMR is a heuristic that dictates that moves that come late in the ordering are unlikely good ones and we need not invest a full search on them so that sub-tree will be searched at a shallower depth. The question is what should be considered a late move and what should be the reduction of that sub-tree search?
That is a question SPSA answers. You allow those values to be tuneable and the SPSA test will turn the knobs and play the engine against itself with different values and from the results it will make adjustments and attempt to find the most optimal values that result in the strongest play.
Honestly, I ran this a couple of times and so far I have had mixed results with Tofiks. I think I have better results using intuition and tune them by hand and then just SPRT test my guess. The SPSA tune can take a lot of games and has a lot of noise and my feeling is that this method is more suited for already highly optimized engines to squeeze out a few extra Elo here and there. I will need to put in more work to see if I can make it work.
Texel Tuning
NNUE took the chess engine world by storm. These are Neural-Networks that are used for evaluation. Stockfish and all the serious engines exclusively use these now and they brought staggering Elo improvements to engines that already were out of this world. Tofiks still uses a Hand-Crafted Evaluation function - HCE. The way HCE works is you define various weights for your evaluation. The value of a pawn, knight, bishop.... What value should a passed pawn have on the 6th rank? 7th rank? What penalty should an exposed king with no pawn cover have in the middle game? How valuable is a Bishop pair? There are also more abstract Piece-Square-Tables. These are look up tables that say how much should one penalize or reward a white knight on the a2 square vs e5. There are tables like that for every piece and often they have two variants one for middle game and one for endgame. With each entry in those tables counted as one value, Tofiks has 941 weights that are used to evaluate all aspects of a position.
Now you can look at my lichess account and see my rating. I am not a good chess player. I am at best average, but these days I only play games with some friends over some beers at a bar. Whatever my stats are - they are probably overinflated to my current skill. So how would I ever be able or have any authority to set those values? I read what others do. I have some common sense where I know that knights on the rim are dim so when I defined those tables for knights I had some idea. But that's 941 values all needing to work together. Not even a Grandmaster could come up with the best values.
Texel tuning solves that issue. I encourage you to read the article but I'll give you a dumbed-down version. You collect a huge number of games where Tofiks plays against himself. Again, fastchess and cutechess are the tools to achieve this- 50~100k games should do. Then you extract all positions from those games (filtering out opening book moves and mate sequences - evaluation is meaningless when there is mate on the board). You save all the positions together with the outcome of the game. You then pass all the positions to your evaluation function and compare the evaluation to the outcomes. You calculate an error between your evaluation and the outcome of the game. The error means there is a mismatch - why would a position be evaluated positively giving an advantage to white while the actual game result was a victory for black?
The optimization process then goes to work and tunes all of the 941 weights of the evaluation to minimize the mean error- reduce the mismatch of the output of the evaluation function and the outcome of the game. It attempts to create the highest possible correlation between evaluations and outcomes.
This process allowed me to pick better values for all the weights but it also allows to see what doesn't work or is outright broken. For example I extended my evaluation with victim aware threats. It had a value for pawn attacking a minor piece, a minor attacking a rook, a rook attacking a queen, a minor attacking a queen, etc... Logically all of these should be positive - if my pawn attacks their knight that should be good. But after the tune this value turned out to be negative. I was a little confused. I explained it away - it's probably some weird other signal mixed in - maybe my pawns attack their minor pieces when they over extend and then the attacked piece moves away and the pawn is lost. While in fact I had my pawn attacks in that piece of code backwards - so the logic was broken; the weight was not doing what I was expecting. After flipping and re-tuning the value became positive.
This also helps trimming evaluation that does not do much - if values are near zero - maybe they are just bad signals. The tuner really can not correlate the weight meaningfully to outcomes of games. That's often worth investigating if I should maybe just drop that term from my evaluation. Texel tuning alone has guided me towards easy wins that gained Tofiks probably around 100 Elo.
Open Bench
One of the reasons why Stockfish is the king of engines is their amazing testing framework Fishtest. It's a distributed test framework to allow performing the SPRT and SPSA in an efficient and trusted manner. It's an amazing piece of software and community effort. Open Bench is the same system and framework for open source engines. Doing the same type of for open-source engines.
Open Bench is amazing and I set up my own instance at bench.likeawizard.dev. Some of the tests can take several hours. Running fastchess on my machine works fine but interruptions and doing other work at the same time causes inconvenience. My own Open Bench instance allows me to set up test work and run when I need it. GitHub Actions have various triggers to compile and create tests automatically. When a test is created and approved I spin up cloud workers that pick up that work and do the testing while I can already explore other changes locally. While nothing new this set up has such nice developer experience ergonomics and allows iterating much faster.
So my set-up looks like this:
- A single hetzner instance running:
- lichess bot that always runs the latest version of Tofiks from master
- Open Bench instance - bench.likeawizard.dev
- worker manager - that spins up/down cloud worker instances depending on available work
- Temporary hetzner cloud workers picking up work from my OB instance
- GitHub Actions
- Create SPRT tests
- Create SPSA tunes
- Deploy newest version of Tofiks to lichess
- Runs a linter and short list of tests to not ship anything broken
- Locally on my machine:
- one binary of the latest master branch version of Tofiks - baseline
- development version of Tofiks - compiled from my working directory
- Makefile with various targets for fastchess SPRT tests:
- vvstc - very very short time control 0.2+0.02
- vstc - very short time control 0.5+0.05
- stc - short time control 10+0 (all automated tests on OB are run on this)
- ltc - long time control 60+0.6
The local SPRT tests are just quick validation tools for new ideas. The short time controls are just to speed up the validation process of new ideas. Some ideas perform differently based on depth so on very short time control games Tofiks never reaches the depths that would test some features.
In conclusion
This is a long post and if you reached this far, you're a trooper. I hope if you stumbled on this article and read it, it will help you with your engine development. It took me many lessons and failures to come up with a workflow, tools and methods to make chess engine development less complicated than it has to be. If you're also an engine developer I would be happy to hear how you deal with all this nitty-gritty workload.