How to Run a Chess Tournament
Introduction
Why Run a Computer Chess Tournament?
Running a computer chess tournament is a long and thankless task that will tie up your computer for days. Many people feel that such tournaments – often referred to as basement tournaments – are merely a waste of computational resources. Despite this, many computer chess enthusiasts still eagerly spend much of their free time running such tournaments, and their reasons vary. Some want to assist the authors of their favorite programs as beta testers, others want to settle once and for all which program is strongest or to prove a point, and still others simply enjoy watching the computers battle it out and playing through or analyzing the resulting games.
Tournament Design
Choosing Hardware and Software
The first decision you have to make is what software and hardware to use for the tournament. If you only have one computer, then your choice of hardware is already made. The type of programs you wish to test may affect your choice of hardware and software. If the programs you decide to test all use the same interface, e.g. Winboard, then everything becomes simpler. If the programs cannot communicate, things get trickier: you must either transfer the moves manually using the old <Alt-Tab> trick (if on the same computer) or use an adapter. Engine communication through drivers such as auto232 or Polyglot is more complicated to set up and can be difficult to verify for correctness, but many tournament directors have solved these issues and are willing to help newcomers get up to speed.
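As an illustration of the adapter approach, here is a minimal polyglot.ini sketch for wrapping a UCI engine so that it can play under Winboard. The engine name and paths are hypothetical, and the exact options accepted may vary between Polyglot versions:

    [PolyGlot]
    EngineName = SomeUCIEngine
    EngineCommand = C:\Engines\SomeUCIEngine\engine.exe
    EngineDir = C:\Engines\SomeUCIEngine
    Log = true
    LogFile = someuciengine.log

    [Engine]
    Hash = 64
    Ponder = false

Once a wrapper like this has been verified with a few fast test games, the UCI engine behaves like any other Winboard engine as far as a tournament manager is concerned.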
One Computer or Two?
Theoretically speaking, two computers are superior to running two programs on one machine, since we can be sure that all the resources of each computer are used by only one program. If you are using a dual- or multi-core processor, keep in mind that the engines are still fighting for shared resources, e.g. access to EGTBs and to main memory. Using two separate computers also avoids the problem of deciding whether ponder should be ON or OFF. See also the section Ponder ON or OFF? below.
Unfortunately, few people have two identical computers available for a computer chess tournament. To keep the tournament fair, some have proposed giving extra time to the program on the slower computer, based on the difference in processor speed. However, if both programs ponder – and presumably they do, otherwise there would be little point in running them on two separate machines – this adjustment ignores the fact that the engine on the faster machine also gets more out of its pondering time than the engine on the slower machine. Connecting two computers together is also somewhat more complex and technically demanding. Perhaps a better way would be to switch computers every match, or to run matches over ICS, but do the latter only if you can find someone with exactly the same computer configuration as yours.
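To make the proposed adjustment concrete: if, say, computer A runs at 2.0 GHz and computer B at 1.5 GHz (purely hypothetical figures), a G/60 game on A would be scaled to roughly 60 × 2.0 / 1.5 = 80 minutes on B. The objection above is that this simple scaling ignores the extra value A also extracts from pondering during B's longer thinking time.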
Ponder ON or OFF?
Pondering means that after the engine has moved, it continues to “think” (on the assumption that the opponent will play what it considered the best move), just as humans do. It is common wisdom that pondering should be set OFF when testing on one computer. This ensures that when it is program A's turn to move, all the computer's resources are used by A and A alone.
On the other hand, some argue that turning ponder OFF cripples an engine whose author tuned it with ponder ON in mind. Crafty, for example, has time management that is based on the assumption that pondering is ON. If Crafty accurately predicts the opponent's move (usually referred to as the ponder move), the pondering is deemed successful (known as a ponder hit) and Crafty can either think for a shorter time or move immediately when it has the move. If pondering is OFF, this cannot happen – Crafty effectively gets less thinking time. It is also unknown whether setting ponder OFF hurts some engines like Crafty more than others. There have been attempts to show that setting ponder OFF affects all programs equally (notably, tests by Volker Pittlik show that the results are similar for ponder ON and ponder OFF), but neither side remains convinced. (E.g. Volker tested Crafty against a series of strong freeware programs, but it is arguable that the weakness Crafty suffers without pondering would only become apparent against stronger commercial programs.)
In some cases you will have games where a program that doesn't support pondering plays against one that does. If you are running such a match on one computer, it is advisable to turn pondering OFF, since CPU usage would otherwise be extremely uneven. On two computers the program that can ponder should be allowed to ponder: the lack of pondering support in one engine shouldn't handicap the other, unless there are serious reasons against doing so. [Thanks to Severi Salminen for pointing this out.] Also see the discussion in Verifying Fairness.
Choosing a Tournament Format
Round Robins tend to be most popular for a tournament of 5-10 programs. Generally programs of about the same estimated strength are chosen. Unfortunately this makes significant results more difficult to come by. (See Interpreting Chess Results.) Swiss Systems are usually used in a large free-for-all tournament with many programs of different strengths. Knock-out tournaments tend to be less popular among testers. Testers can also let a new program 'run the gauntlet' by testing it against various programs of known strength to gauge its playing ability.
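If you prefer to script your own pairings instead of relying on a tournament manager, the classic “circle method” generates a single round robin. The following Python sketch is only an illustration (the engine names are placeholders, and colour allocation is handled very crudely):

    # Circle-method round-robin pairing sketch; engine names are placeholders.
    def round_robin(players):
        """Yield one list of (white, black) pairings per round."""
        players = list(players)
        if len(players) % 2:
            players.append(None)              # dummy entry = a bye each round
        n = len(players)
        for rnd in range(n - 1):
            pairings = []
            for i in range(n // 2):
                a, b = players[i], players[n - 1 - i]
                if a is not None and b is not None:
                    # crude colour alternation by round parity
                    pairings.append((a, b) if rnd % 2 == 0 else (b, a))
            yield pairings
            # keep the first entry fixed and rotate the rest
            players = [players[0]] + [players[-1]] + players[1:-1]

    for rnd, games in enumerate(round_robin(["EngineA", "EngineB", "EngineC", "EngineD"]), 1):
        print("Round", rnd, games)

A double round robin simply plays each pairing twice with colours reversed; with N engines that comes to N × (N - 1) games in total.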
Choosing A Time Control
Choosing a time control is a tricky task. The first limitation, as mentioned above, is that some chess programs don't support certain time controls. There are also some who feel that Blitz time controls are not real chess, are less interesting, and are less meaningful than longer time controls.
Another disadvantage of extremely short time controls is that many programs cannot handle time trouble and may crash. On the other hand, Blitz games give you the luxury of playing more games in a shorter period of time, and as everyone knows, the more games you play, the more certain you are of your results.
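To see why the number of games matters so much, here is a rough Python sketch of the approximate 95% Elo error bar as a function of the number of games. It is my own illustration, not something from the article; it treats games as independent and ignores the variance-reducing effect of draws, so the interval it prints is slightly too wide:

    from math import sqrt, log10

    def elo_interval(wins, draws, losses):
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n              # score fraction
        se = sqrt(score * (1.0 - score) / n)          # standard error (draws ignored)
        def to_elo(s):
            s = min(max(s, 1e-6), 1.0 - 1e-6)
            return -400.0 * log10(1.0 / s - 1.0)      # logistic rating model
        return to_elo(score), to_elo(score - 2 * se), to_elo(score + 2 * se)

    # The same 55% score, over 40 games and over 400 games:
    print(elo_interval(18, 8, 14))      # wide interval, result barely significant
    print(elo_interval(180, 80, 140))   # roughly three times narrower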
You should also keep in mind that programs are not equally strong at all time controls. Some, like the old Yace, are much better in Blitz than at standard time controls, while others, like Francesa and Amy, are the opposite. Another item to consider is your processor speed: G/30 on a slower computer might be equivalent to G/15 or less on a faster machine. Very slow time controls such as one move per day would give us a glimpse of the strength of current programs on future hardware, but they are impractical.
Selecting the Participants
With so many chess programs around, the computer chess tester has to be selective. In many ways this is the most significant step since the limitations of each program directly affect the tournament. (E.g. some programs cannot handle X moves in Y minutes, others have fixed-size hash tables.) There is no standard way to pick chess programs but I think it's advisable to include at least one program with a well established strength as a benchmark.
Also, it is usually a bad idea to include multiple versions of the same engine. Unless there is really a big difference between versions your test results can get skewed. If your purpose is to test the difference in strengths between the versions it is much better to test the versions of interest against common opponents.
Some testers insist that participating chess engines must be able to recognise draws, be it by the 50-move rule, insufficient material or 3-fold repetition. Without such features, a lot of time is wasted while engines mindlessly shuffle pieces back and forth in drawn positions!
Allocation of Memory For Transposition/Hash Tables
In the old days, when system memory was much smaller than today, the usual recommendation for engine-versus-engine matches was that the total memory allocated to both programs be no more than half of your system's memory. Given that the amount of memory needed by Windows is fairly fixed, if you have a large amount of RAM you do not need to follow the “50% rule” above. [Thanks to Andreas Schwartmann and Mogens Larsen for pointing this out.]
How much RAM you should allocate also depends on the time controls used. At Lightning and Blitz time controls, large amounts of RAM dedicated to hash tables usually hurt playing strength. Large hash tables are useful at long time controls, where a smaller table would quickly fill up and not help the engine much. As CPU speed increases, the meanings of large and small also increase: at identical time controls a 32MB hash table on a 1 GHz CPU is roughly equivalent to a 64MB table on a 2 GHz processor!
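The reasoning behind that rule of thumb is simply that a faster processor fills the table faster. The back-of-the-envelope Python sketch below illustrates this; the nodes-per-second figures and the 16-byte entry size are purely illustrative assumptions, not measurements:

    def fill_factor(hash_mb, nodes_per_second, seconds_per_move, entry_bytes=16):
        entries = hash_mb * 1024 * 1024 // entry_bytes   # table capacity in entries
        nodes = nodes_per_second * seconds_per_move      # nodes searched per move
        return nodes / entries                           # > 1 means the table overflows

    print(fill_factor(32, 1_000_000, 60))   # hypothetical 1 GHz machine, ~1M nodes/s
    print(fill_factor(64, 2_000_000, 60))   # hypothetical 2 GHz machine, ~2M nodes/s
    # Doubling both the speed and the table size leaves the fill factor unchanged.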
In the interest of fair competition you should allocate the same amount of memory to both engines. However, depending on the program, you might not have full control over the allocation. Some programs (e.g. Francesa) use a single fixed hash size that cannot be adjusted at all. Others allow almost any amount of memory to be allocated. Still others, like Crafty, lie in between, letting you adjust the hash size but only in discrete increments.
Therefore it may not be possible to be totally “fair” in the allocation of hash for engine-versus-engine matches.
Another question arises when some engines use endgame tablebases and others don't: should the total memory allocated to each engine include the memory set aside for the endgame caches? Here again we run into situations where some engines let you set the internal EGTB cache size and some don't.
Verifying Fairness
Be wary, because some programs can hog CPU resources. It is imperative to ensure that no program in your tournament uses up all the resources when it runs or otherwise gains an unfair advantage. The choice of interface, for example, can affect results. The old Chessbase Winboard adapter deliberately put all Winboard engines at a disadvantage compared to Chessbase's own products when run under the Fritz interface. Even the new Chessbase UCI adapter can cause problems, especially when it comes to setting non-native hash sizes correctly. Sometimes there are work-arounds for these problems, and you may have to dig deep to find an answer.
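One way to keep an eye on resource usage during a test match is a small monitoring script. The sketch below uses the third-party psutil package and a hypothetical process-name filter; it is only an illustration, not a tool mentioned in this article:

    # Print CPU and memory usage of processes whose name matches the engines
    # under test; press Ctrl-C to stop.  Requires: pip install psutil
    import time
    import psutil

    WATCH = ("engine", "crafty")          # hypothetical name fragments to match

    while True:
        for p in psutil.process_iter(["name", "cpu_percent", "memory_info"]):
            name = (p.info["name"] or "").lower()
            if any(w in name for w in WATCH):
                cpu = p.info["cpu_percent"] or 0.0
                mi = p.info["memory_info"]
                if mi is None:
                    continue
                mem_mb = mi.rss / (1024 * 1024)
                print(f"{name:20s} cpu={cpu:5.1f}%  rss={mem_mb:7.1f} MB")
        print("-" * 50)
        time.sleep(5)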
Computer chess results can be adversely affected by poor opening books. In fact, it has been suggested that some commercial programs have strong “killer” books which account for much of their strength, and that each new version improves because of a better book rather than through a stronger engine. This is, of course, an extreme view. But no one can deny that opening books play a big part in determining the results.
Using Nunn positions for testing is an attempt to avoid the variability in the quality of opening books by starting all the chess programs from the same 16 opening positions. However, just as humans use openings that suit their playing style, opening books are designed to let each chess engine play to its best capability and are an integral part of the chess program. Another problem is that the Nunn test assumes that a program is stronger than another only if it demonstrates its strength over its rival in the 16 opening positions that are played. This is extremely artificial: in your own mind replace “two programs” with “two human GMs” and consider the consequences. Forcing one opening book on all engines, or removing all opening books, runs into exactly the same problems as Nunn testing.
[Speaking as a chess engine author, I can say that Nunn tournaments and common-book tournaments have much less meaning to me than tournaments where my engine is allowed its own book. I have spent many, many hours tuning my opening book and correcting mistakes. It is quite frustrating to see the book I have labored hard over taken away because of flawed reasoning. If opening books are disabled, Nunn positions are used, or common books are employed, I feel it is quite proper to question the tournament director as to what, exactly, he is trying to test. Certainly it will not be a real-world comparison, because people are expected to use the opening book supplied with an engine – it is there for a specific purpose! The questions that come to my mind are these: Does the proposed Nunn test serve any useful purpose whatsoever? Is forcing a common opening book just a way to promote the tournament director's own book? It may be a fine book, I am not arguing this, but what does it have to do with testing the relative strength of engines?]
Similarly, disabling tablebases, or using only a subset of the available ones, will handicap some engines (those that rely on tablebases rather than built-in endgame knowledge) and reward others (those that have such knowledge built in). All of these factors matter when deciding what should be common to all engines and what should not.
There is the question of whether learning should be turned on, especially in a tournament that includes engines without this feature. My personal view is that learning should be turned on, since the lack of a learning feature in one engine shouldn't prevent its use in an engine that has it. On the other hand, you might see a lot of repeated losses by the non-learning engine if the learning engine has aggressive book learning. Of course, all learning files should be purged before a tournament so that learning engines do not start with an unfair advantage.
Some engines hold their opening books in memory while others read theirs on-the-fly from disk. Engines holding the book in memory will always use more memory than those that don't. Some tournament directors seem to be concerned about total memory usage and subtract an engine's book memory from the hash size in an attempt to “equalize”. It seems to me that this is not equalization but an unfair advantage given to the disk-reading engines or to engines with a very compact way of storing the opening moves.
Setting Up the Participants and Calculating the Results
How to Automate the Running of a Tournament
Winboard has no built-in feature for running Round Robins or Nunn tests. However, the /mg option lets you run a match of fixed length between two given engines. The fastest and simplest way to set up a tournament is to download a Winboard tournament manager, configure it, and let it automate the running of your tournament. Options include:
- PSWBTM - Pradu's Simple Winboard Tournament Manager
- Arena has the built-in capability of running automated tournaments
It is best to set up a “throw-away” tournament at a very fast time control to verify that each engine is configured correctly. The quick tourney can verify paths to opening books, endgame tablebases, and engine stability. Monitoring each engine with a task-manager program that shows per-process memory usage will verify correct hash settings. Be prepared to experiment to get some engines to work properly. It is preferable to use Winboard whenever possible because it is the most stable platform. You don't want your GUI to crash in the middle of a tournament! [Thanks to Roger Brown for suggestions and help with this section.]
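For reference, a single automated match can also be launched from the command line without a tournament manager. A sketch of such an invocation is shown below; the engine names and paths are hypothetical, and option spellings can differ slightly between Winboard versions, so check your own version's documentation:

    winboard /debug /mg 20 /tc 5 /inc 2 /ponderNextMove false /sgf tourney.pgn ^
             /fcp "EngineA.exe" /fd "C:\Engines\EngineA" ^
             /scp "EngineB.exe" /sd "C:\Engines\EngineB"

Here /mg sets the number of games, /tc the base time in minutes, /inc the increment in seconds, and /sgf the file where the games are saved; the /debug switch makes Winboard write a winboard.debug log file, which ties in with the next section.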
Tournament Log Files
Because many tournament directors want to help engine authors, they enable the creation of Winboard log files during their tournaments. These files are large and capture every message that passes between each engine and Winboard. They serve no purpose until a problem is encountered; then they become items of prime importance. Many Winboard problems are too difficult to fix unless a log file points out the problem area.
[As an example, my engine once was involved in a small number of time losses. In every case it was the other engine's clock that ran down. I was quite surprised when the Winboard log revealed that my engine was refusing a legal move as illegal. The opposing time losses were my fault! I checked my program's move legality testing and it contained no bugs. I went back to the log file and noticed that in every time loss the other program had offered a draw. It turned out that the true cause of my problem was my engine's total ignorance of draw offers during a game. My engine was attempting to parse “draw” as an actual move and declared it as an illegal move. Winboard, by blindly trusting my engine, kept refusing the legal move until the opponent's clock ran down. Once I “taught” my engine to process the draw offer correctly there were no further opposing time losses. This bug would have been impossible to track down without a Winboard log file.]
Handling Program Crashes
During the course of the tournament there will be times when an engine crashes for no apparent reason. It is at your discretion whether to award a win/loss/draw or to allow a replay. In general, if the problem is due to a failure to recognise the 50-move rule or insufficient material, it is best to award a draw. It gets trickier when a program crashes while it's winning.
Some argue that when a program is clearly winning (say a rook up) and it crashes, you should award the win to the crashing program. However, many (most?) would disagree. Firstly, it's arguable that an engine that crashes should be treated as having forfeited the game, just as a human GM would lose the game if he refused to move. Another problem is that if you strictly follow a rule of awarding wins when a program crashes in a winning position, it's conceivable that a programmer could make his engine crash whenever it reaches a “winning” position to ensure the win!
Having a Winboard log file of a crash event is of extreme importance. Quite often the question of fault is impossible to ascertain without the log. Awarding points through guesswork is not recommended!
Handling Upgrades and Bug Fixes During the Tournament
A great debate arose in the Computer Chess Club (CCC) in March 2001 over whether it is a good idea to upgrade programs while a tournament is still ongoing.
It was argued that by mixing up engine versions, you invalidate the results of the tournament since it would be akin to replacing players in the middle of the tournament. Also, programmers could then pick and choose versions that could do especially well against specific engines.
On the other hand, in official computer chess tournaments like the WCCC, this is exactly what happens! Depending on the opposition, the programmer will select a certain opening, tune parameters, change code, etc. in the hope of getting a version that can best handle the next opponent. Is this “scientific”? Probably not.
Also, human GMs are not exactly the same between rounds anyway since they adjust, learn, and change depending on who they face over the board.
Even if an upgrade does not increase strength but merely fixes a bad bug that causes the program to crash, some still feel that the upgrade shouldn't be applied, since the bug is part of the program and should be evaluated as such. Again, others disagree: they argue that you learn nothing about the strength of a chess program if a stupid bug makes it crash again and again.
There is probably no right answer to this and it all depends on your objective. Whatever way you choose, it would be best to state your policy up front and to apply it without bias.
Calculating and Posting the Results
The most important after-tournament task is verifying the results as reported by Winboard. Some engines have bugs and will erroneously claim a draw or a win, and Winboard will pass on the bad claims without checking their validity. George Lyapko wrote a program called LGPgnVer which detects invalid draw/win claims in PGN files. Another slippery item is handling time losses. Time losses are sometimes caused by Winboard communication bugs in either engine, sometimes by infinite loops, and sometimes they are “natural”. Experienced Winboarders can quite often distinguish between time losses that could not be avoided and those caused by programming bugs. It is in your best interest that all your published results are correct; you don't want engine bugs influencing your results in random ways. If you are uncertain about a result, post it and ask for advice. There are plenty of expert Winboarders out there who will help you out.
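If you cannot run LGPgnVer, a rough sanity check can also be scripted. The Python sketch below uses the third-party python-chess package and a hypothetical PGN file name; it merely flags games for manual review (agreed draws and resignations will be flagged too) and is in no way a replacement for a proper verifier:

    # Flag suspicious results for manual review.  Requires: pip install chess
    import chess.pgn

    with open("tourney.pgn") as f:                      # hypothetical file name
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            for move in game.mainline_moves():
                board.push(move)
            players = game.headers.get("White", "?") + " - " + game.headers.get("Black", "?")
            result = game.headers.get("Result", "*")
            if result == "1/2-1/2":
                ok = (board.is_stalemate() or board.is_insufficient_material()
                      or board.can_claim_draw())
                if not ok:
                    print("Draw without an obvious claim (agreed draw?):", players)
            elif result in ("1-0", "0-1") and not board.is_checkmate():
                # could be a resignation, a time loss, or a bogus win claim
                print("Decisive result without mate:", players)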
If you need to calculate Elo ratings from your results, you can use EloStat by Frank Schubert. An even better rating program, which addresses many of EloStat's shortcomings, is Rémi Coulom's Bayesian Elo Rating (BayesElo).
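As a rough illustration, a typical BayesElo session boils down to a handful of commands typed at its interactive prompt; the PGN file name here is hypothetical and the exact command set is described in the program's own documentation:

    readpgn tourney.pgn
    elo
    mm
    exactdist
    ratings

Roughly speaking, readpgn loads the games, mm estimates the ratings, exactdist computes the confidence intervals, and ratings prints the table.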
Now that you have finished the tournament, you may wish to share the results of your tournament. You can post in various places including the Winboard Forum, Computer Chess Club (CCC) and other forums depending on the type of programs in your tournament.
Before you post remember to provide the following information:
- The speed/type of your CPU (e.g. 2.8 GHz Dual-Core AMD Opteron™, Model 890)
- Whether ponder is turned on. (e.g. Ponder=ON)
- Time Control (e.g. G/60)
- Amount of memory allocated to each engine, listing exceptions if possible (e.g.: 64MB hash, 4MB pawn hash, etc.)
- Whether all or some of the Endgame Tablebases are used (e.g. All 3, 4, and some 5-men tablebases)
Many of the Winboard tournament managers will produce the cross-tables for you. This is the information that most Winboarders want to see. Most Winboarders find it inconvenient when tournament directors merely point to their own web page instead of posting their results. Especially annoying is when the referred page has pop-up ads; that will surely discourage future visitors.
You might also wish to provide brief commentary. For example, you may want to comment on how certain programs performed better or worse than expected, and if so, why. Was it due to a poor opening book? Poor king safety? Constant crashing? Any piece of information about a problem can be very useful to authors in tracking down and correcting bugs or shortcomings.
For any bugs, crashes, and time losses it is very helpful to post the portion of the Winboard log file where the problem occurred. This is usually the tail end of the log. It is best to refer to a Winboard log file before assigning blame. The log is unassailable evidence whereas your opinion is not. Egos can get crushed and people can get unbelievably angry over accusations, so try to phrase accusations of engine blame very carefully. ;)
My advice is that you shouldn't post all your tournament games in the forum. Make them available for download on a web site or offer to email the games to people who are interested. It is customary to compress these files to minimize size and download times. If you really need to post, pick one or two interesting ones and comment on them.
While authors of the engines in your tournament will likely be very interested in the results and bugs you report, don't be disappointed if no one comments on your postings. Unless your results are somewhat unexpected, people will not be likely to comment.
Happy Testing.
Acknowledgements
Help and comments from Tim Mann, Peter Berger, Andreas Schwartmann, Severi Salminen, Roger Brown, and many others. Thanks guys!