Chessy - Episode 2
This is a follow-up about the mini-proyect called Chessy, an app to reproduce chess matches. In the previous post I decided to use a list of board positions to render the UI, and these notes explain how I got them.
The chess matches I use as input are written in PGN Format and contain the movements of the match. So what I need is to generate the board positions, and save them all in a custom-JSON file.
In this post I add write briefly about the PGN format, the notation for the movements whithin it, and the intermediate process (orange box). Finally, I link the main parts to its implementation.
Input
PGN Files
A chess-match recorded in PGN format is mainly divided in two parts: general information (place, date, tournament, players), and the movements played.
The following is an example:
[Event "WCh 2013"]
[Site "Chennai IND"]
[Date "2013.11.16"]
[Round "6"]
[White "Anand, Viswanathan"]
[Black "Carlsen, Magnus"]
[Result "0-1"]
[WhiteTitle "GM"]
[BlackTitle "GM"]
[WhiteElo "2775"]
[BlackElo "2870"]
[ECO "C65"]
[Opening "Ruy Lopez"]
[Variation "Berlin defence"]
[WhiteFideId "5000017"]
[BlackFideId "1503014"]
[EventDate "2013.11.09"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 Nf6 4. d3 Bc5 5. c3 O-O 6. O-O Re8 7. Re1 a6 8. Ba4
b5 9. Bb3 d6 10. Bg5 Be6 11. Nbd2 h6 12. Bh4 Bxb3 13. axb3 Nb8 14. h3 Nbd7 15.
Nh2 Qe7 16. Ndf1 Bb6 17. Ne3 Qe6 18. b4 a5 19. bxa5 Bxa5 20. Nhg4 Bb6 21. Bxf6
Nxf6 22. Nxf6+ Qxf6 23. Qg4 Bxe3 24. fxe3 Qe7 25. Rf1 c5 26. Kh2 c4 27. d4 Rxa1
28. Rxa1 Qb7 29. Rd1 Qc6 30. Qf5 exd4 31. Rxd4 Re5 32. Qf3 Qc7 33. Kh1 Qe7 34.
Qg4 Kh7 35. Qf4 g6 36. Kh2 Kg7 37. Qf3 Re6 38. Qg3 Rxe4 39. Qxd6 Rxe3 40. Qxe7
Rxe7 41. Rd5 Rb7 42. Rd6 f6 43. h4 Kf7 44. h5 gxh5 45. Rd5 Kg6 46. Kg3 Rb6 47.
Rc5 f5 48. Kh4 Re6 49. Rxb5 Re4+ 50. Kh3 Kg5 51. Rb8 h4 52. Rg8+ Kh5 53. Rf8 Rf4
54. Rc8 Rg4 55. Rf8 Rg3+ 56. Kh2 Kg5 57. Rg8+ Kf4 58. Rc8 Ke3 59. Rxc4 f4 60.
Ra4 h3 61. gxh3 Rg6 62. c4 f3 63. Ra3+ Ke2 64. b4 f2 65. Ra2+ Kf3 66. Ra3+ Kf4
67. Ra8 Rg1 0-1
As we can (hopefully) clearly see, the first part tells that this game was between Anand and Carlsen, played in 2013, in the World Championship context, Magnus won 0-1, etc. After reading a couple of games, I noted that this general information part doesn’t follow strict rules: the number of lines vary from match to match, and apparently none of them is mandatory. Only some matches include the player’s ELO (official ranking), Round (if played in a many-rounds context), etc.
The second part, the movements, is present in every file, starts with 1.
, and finishes with the result -0-1
in the example case. Each movement is written using the Algebraic Notation.
Algebraic Notation
Each movement is a bunch of letters with specific and positional meaning, like a digit that means something by itself and something else next to another.
Each complete movement (Whites and Blacks, in that order) starts with an incremental number follwed by a dot, like 1.
, 2.
and 3.
. Then it comes the White movement and, if the game is not yet finished, the Black movement. These two movements are separated by a space. In the example game, e4
is played by Whites while e5
is played by Black, at the very beginning of the game.
The letters a
to h
refer to the File of the Board (vertical lines), being a
to the left looking from Whites side; the numbers from 1
to 8
refer to the Rank (horizontal lines), being 1
the first one in the Whites side. The capital letters refer to the piece that performs the movement: R means Rook, N means kNight, B Bishop, Q Queen, K King, and no-letter means Pawn.
Furthermore, =
means promotion, O-O
and O-O-O
castle, !
check, etc.
One (important) thing to note is that in the notations is explicit about where the piece moves to, no where it moves from. To know that (the from) one should look at the board and deduce what’s the only piece that’s capable of performing such movement (here you should know the movements rules). If more than one piece can move to that square, more letters are introduced to the movement (called “disambiguation”): file, rank, or both. In the match of the example, this happens in 20. Nhg4
, telling that Knight in the file h
-and not the other- is the one moving.
Process
The process is simply about playing the movements in a board as I read and understand the input game. If, while I play, I save all the intermediate board positions, I can query them all in a final step.
To read and understand the input I use a Lexer & Parser, like compilers do (I really wanted to play with it). The role of the Lexer is to detect tokens as it reads the input, and the role of the Parser, who receive the tokens, is to group them in valid semantic constructions.
Expanding on those concepts: The Lexer’s tokens can bee seen as “category of things”; for example, 1.
is a movement number
, Q
, K
and R
are pieces
and 1
, 2
, and 8
are ranks
. The Parser’s semantic constructions are constructed by grouping tokens or other constructions; some examples are: a one-side movement
(made out of piece, file, rank, etc., in a particular order), game result
, match
, etc.
Once a movement is detected by the Parser, I play it in a Board. Of course, the Board needs some logic to do it (move a piece from a square to another) since the movement in algebraic notation doesn’t provide all the information - we know where it moves to, and no where it moves from, as mentioned before. Therefore the Board first inspects itself in order to pick the right one. The Board also saves all the intermediate positions generated by playing each movement. Once the Parser detects that a game is over, the Board provides all the intermediate positions to be written in the JSON output.
NOTE: in this first iteration, I left out the general information part of the PGN file - player names, dte, etc - and focused on the movements part only.
Implementation Notes
Although my initial thoughts were about using lex, yacc, and/or bison in C, I went for Python and the amazing SLY library; it follows the same lex/yacc idea but its syntax is modern, intriguing to learn, and (hopefully) useful. Its documentation is very clear and a good introductory material to Parsers, together with its creator talk.
The Lexer specifies the tokens with regular expressions, and chessy’s ones are at the beginning of this file
The Parser defines the semantic constructions in production rules format, and the ones used in this project are in this section of the same file1. The parser calls a couple of functions when some rules are detected (movement made and game over) that a sort of coordinator links to the Board.
The Board that receives and “plays” each movements, finding the appropiate piece, is implemented in this file. Every intermediate position is stored in an list so they can later be retrieved, converted to FEN format, and be part of the JSON file that will feed the UI.
Everything I mentioned can be used through a mini-CLI; so to process a PGN file and generate a JSON file per match it’s enough to run:
python src/main.py --pgnFile <pgn-file>
That’s all for today, see you soon! ◆
-
When writing the production rules I found that the semantic of a movement is recursive to the left when you add more than 1 disambiguation letters, and so problematic for these kind of parsers (LALR(1)) - can’t be resolved by looking at only 1 token ahead. I thought about two options: invert the input string and all the production rules - so the recursion ends up in the right-hand side - or add these multiple (and not so clean) production rules for that case. ↩