Introduction
Two Sigma recently hosted a programming competition called Halite, held from November 2016 to February 2017. Participants competed in the game-like challenge using a variety of programming languages and artificial intelligence strategies. Rather than building a set of heuristics, I decided to use machine learning to mimic a top player, and achieved diamond rank.
Here’s a snapshot of my bot (in green) going against four other players (full replay):
Machine learning was appealing because it offered the opportunity to compete at a high level without needing to understand the rules in depth or reason through the many different situations a bot might encounter, such as balancing expansion against production or handling combat; all of these things could be learned from other players through a model. Another incentive was that a trained model can be useful for initializing a new model for reinforcement learning, which I look forward to trying in the next iteration of the competition.
Specific challenges
There were several key challenges to building a successful ML bot; overcoming them was essential to finding a good solution and performing well in the competition. Here are a few of those key hurdles:
- Machine learning, and deep learning in particular, is data-hungry. One of the challenges in Halite was that each unique version of a bot had only a small number of available replays, and each version moved in different ways, possibly making them incompatible for learning. This problem was exacerbated further when combining data across players.
- One of the most important challenges was staying within the time-per-move allotment. This was difficult not only because 1 CPU second per move is very limiting for machine learning computation, but also because the available CPU time was not consistent. Depending on unknown server conditions or other bots' unpredictable CPU usage, my bot would sometimes exceed the limit by as much as 50%, causing a timeout and losing the game. Heuristic bots typically compute each component one at a time, often in a loop, making it easy to exit early when time is running short; a machine learning model, by contrast, acts as a single large operation that cannot easily be interrupted before time is up.
- Another consideration in designing my model was having a receptive field large enough to capture long-range dynamics. Although each move is very local (a unit only moves to an adjacent square per turn), units needed to see far enough to decide which direction to head, whether that meant moving toward valuable production or moving units from deep inside my territory toward a border. The receptive field couldn't be too large, however, because creating sufficiently large padding for each frame became expensive, so balancing representational power with computational efficiency became important here, too.
- The distribution of moves over time varied drastically. Each game starts with a single unit per player; depending on map conditions, a player might control anywhere from a handful to a large number of squares by the mid game, and by the end, depending on performance, might own nothing or the entire map. Additionally, the most important moves are those in the early game, because these often dictate whether the game ends in a win or a loss.
Each of these challenges influenced my final solution.
Data
To ensure there was sufficient data to train a model, I made the assumption that successive updates to a player’s bot were only incremental, and that a certain range of bot versions could be safely aggregated together during training. This assumption won’t always hold — if the strategies between replays are too different, this can interfere with the learning process and result in a weaker model — but it seemed to be sufficient for the replays I trained on. The replays used in my training pool came from nmalaguti’s bot, versions 47 through 54. This assumption helped solve challenge #1.
The first step in building the model was getting the data into a learnable format. The raw replays were available as JSON, which I parsed and formatted into 4-dimensional arrays:
[num_frames, y_dim, x_dim, channels]
The first three dimensions are fairly obvious (a stack of images through time). However, the information stored in the channel dimension took a little more thought. Starting with an example frame (again, my bot in green):
I split the data into three channels. The first channel stored the raw strength values (0–255) for each square, regardless of ownership. Originally I used separate channels for enemy, self, and neutral, but found that learning actually improved when ownership information was combined into a single channel. Thus, the second channel stored ownership as follows: +1 for self (shown as white in the figure below), 0 for neutral (black), -1 for enemy (red). I don't have an intuitive explanation for why this worked better, other than that a single, denser layer rather than two or three sparse layers perhaps resulted in better weight sharing across unit formations and eased learning. Also, my first convolutional layer had 256 filters, so using fewer input channels required slightly less computation. The third channel simply stored production information. The first (strength) and third (production) channels were normalized by dividing by the maximum value (255 for strength, 17 for production) and then subtracting 0.5, so each was in the range -0.5 to 0.5. The ownership channel didn't need further normalization and remained as is.
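A minimal sketch of this encoding in numpy, assuming `owners`, `strengths`, and `productions` have already been parsed from a replay frame as 2D arrays and `my_id` is my player id (these names are hypothetical helpers, not part of the original code):

```python
import numpy as np

def encode_frame(owners, strengths, productions, my_id):
    """Encode one frame as a [y_dim, x_dim, 3] array (hypothetical helper)."""
    # Channel 0: strength, scaled from 0..255 down to -0.5..0.5
    strength = strengths.astype(np.float32) / 255.0 - 0.5
    # Channel 1: ownership, +1 self / 0 neutral / -1 enemy
    ownership = np.where(owners == my_id, 1.0,
                         np.where(owners == 0, 0.0, -1.0)).astype(np.float32)
    # Channel 2: production, scaled from 0..17 down to roughly -0.5..0.5
    production = productions.astype(np.float32) / 17.0 - 0.5
    return np.stack([strength, ownership, production], axis=-1)
```

Stacking these encoded frames over time gives the [num_frames, y_dim, x_dim, channels] arrays described above.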
Model Architecture
I tried a number of different design strategies and components, including network-in-network, skip connections, batch normalization, very deep narrow networks, and a variety of kernel sizes and arrangements, but it turned out that although these often improved bot performance, the computational cost outweighed the benefit. I had the most success by stripping away state-of-the-art methods, leaving just a very simple convolutional network with very leaky ReLU activations (slope = 0.3), and ramping up the number of filters per layer to just below the point where timeouts became too frequent. I also found it important to have sufficient representation in the very first and very last layers of the network. Perhaps this is because the right moment to move a square is very precise, e.g., attacking a 51-strength square with 50 strength is very different from attacking with 52 strength, even though those values are numerically close in the overall distribution; and the last layer essentially makes the final decision for each move. The inner design of the network mattered less: since there are no pooling layers, data can flow freely from beginning to end without significant loss (a conv kernel can simply return its input), allowing the network to learn at what depth to abstract information or to keep passing along raw data (with the caveat that it passes through a non-linear activation after each layer). At the end of the network, five 2D output maps are returned, each representing one of the available moves (still, north, east, south, west).
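A minimal sketch of this kind of architecture, written here with TensorFlow/Keras; the filter counts and kernel sizes are illustrative only (chosen so that seven conv layers reach a 23×23 receptive field), not the exact values I shipped:

```python
import tensorflow as tf

def build_model(channels=3):
    """Plain stack of conv layers with leaky ReLU (slope 0.3); no pooling, no batch norm."""
    leaky = lambda x: tf.nn.leaky_relu(x, alpha=0.3)
    inputs = tf.keras.Input(shape=(None, None, channels))
    x = inputs
    # Frames are wrap-padded before being fed in, so no padding is added inside the net.
    for k in (5, 5, 5, 5, 3, 3):        # hypothetical kernel sizes
        x = tf.keras.layers.Conv2D(256, k, padding='valid', activation=leaky)(x)
    # Final layer: 5 raw-logit maps per square (still, north, east, south, west).
    outputs = tf.keras.layers.Conv2D(5, 3, padding='valid')(x)
    return tf.keras.Model(inputs, outputs)
```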
To determine the loss, the output array of shape [batch_size, y_dim, x_dim, 5_outputs] was compared to the correct moves as played by nmalaguti. Cross-entropy was computed between the predicted and ground-truth moves along the last dimension, resulting in a 2D array of losses. Since we are essentially solving a 5-class classification problem at each square, it made sense to use square-wise cross-entropy rather than an alternative but less appropriate loss such as SSE. The loss was then normalized by dividing by the number of "self" squares in the loss map, since this count varies significantly across frames, even within the same game phase. Also, since predictions made for enemy or neutral squares shouldn't contribute to the loss, these were filtered out using a mask.
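A sketch of this masked, per-square loss, assuming the logits come from a network like the one above and the labels are integer move indices (the function and argument names here are illustrative):

```python
import tensorflow as tf

def masked_move_loss(logits, moves, ownership):
    """Square-wise cross-entropy over 5 move classes, restricted to my own squares.

    logits:    [batch, y, x, 5] raw network outputs
    moves:     [batch, y, x] integer labels in {0..4} (still, N, E, S, W)
    ownership: [batch, y, x] with +1 marking my squares (the encoding used earlier)
    """
    per_square = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=moves, logits=logits)                      # [batch, y, x]
    mask = tf.cast(tf.equal(ownership, 1.0), tf.float32)  # keep only my squares
    # Normalize by the number of "self" squares so frames with many units
    # don't dominate frames with few.
    return tf.reduce_sum(per_square * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```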
To account for challenge #3, I made the network sufficiently deep, with a total of seven convolutional layers, giving it an effective receptive field of 23×23. The game map wraps at each edge, so it was important to pad the frames accordingly. Before settling on wrapped padding, I tried simple zero padding; learning was greatly hampered and often resulted in erratic movement for some pieces, particularly near edges. Wrapped padding was straightforward to implement: simply take a strip from the opposite edge and concatenate it along the edge being extended. Here's an example:
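In code, the same wrapping is a one-liner with numpy's `wrap` mode (a sketch; the pad width is the receptive-field radius, 11 squares for a 23×23 field):

```python
import numpy as np

def wrap_pad(frame, pad=11):
    """Pad the spatial dims of a [y, x, channels] frame with wrapped copies of the opposite edges."""
    return np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode='wrap')
```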
Training Details
Once the data was prepared and in the right format, I created three data queues for the early (frames 1-25), mid (26-125), and late game (126+). This allowed frames from multiple replays to be mixed in each mini-batch for increased variety at each gradient step, and also allowed custom sampling or weighting of each game phase. To address challenge #4 and ensure that the early game was adequately represented, I sampled from the early-game queue just as often as from the mid- and late-game queues, even though the early game contains far fewer frames than the rest of a typical game. I also weighted each game phase during the gradient step, with the early, mid, and late phases weighted 1.0, 0.75, and 0.45 respectively, though weighting them equally and adjusting the sampling frequencies instead would likely have had the same effect. I also reduced the weight of still moves by 53%, since these occur far more often than the other moves. To increase the effective dataset size, data augmentation was performed by rotating and mirroring each frame, resulting in an 8x larger dataset.
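A sketch of that 8x augmentation (four rotations, each with and without a mirror) in numpy; note that the move labels must be remapped alongside the board, and the exact permutations below depend on the row/column and direction conventions assumed (north = decreasing row, labels 0-4 = still, north, east, south, west):

```python
import numpy as np

# Label permutations under the assumed conventions (illustrative, not from the original code).
ROT_PERM = np.array([0, 4, 1, 2, 3])   # one np.rot90 (CCW): N->W, E->N, S->E, W->S
FLIP_PERM = np.array([0, 1, 4, 3, 2])  # np.fliplr: east <-> west

def augment(frame, moves):
    """Yield all 8 rotations/mirrors of a frame and its per-square move labels.

    frame: [y, x, channels] float array; moves: [y, x] integer labels in {0..4}.
    """
    for flip in (False, True):
        f = np.fliplr(frame) if flip else frame
        m = FLIP_PERM[np.fliplr(moves)] if flip else moves
        for _ in range(4):
            yield f, m
            f = np.rot90(f)             # rotate the board 90 degrees CCW
            m = ROT_PERM[np.rot90(m)]   # rotate label positions and remap directions
```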
Training used the Adam optimizer with an initial learning rate of 1e-4 (decreased to 3e-5 after 140k gradient steps) on mini-batches of 64 frames, and completed after 180k steps. Multiprocessing was used to prepare data on the CPU (essentially for free) while training proceeded on a single NVIDIA K80 GPU.
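For reference, the step counts and rates above translate directly into a piecewise-constant schedule; a Keras-style sketch (the original pipeline may have implemented this differently):

```python
import tensorflow as tf

# 1e-4 for the first 140k gradient steps, then 3e-5 until training stops at 180k.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[140_000], values=[1e-4, 3e-5])
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```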
Optimizations
One of the most important challenges in utilizing ML in the competition was staying within computation limits. A few tricks that helped minimize compute time in the final bot included:
- Rewrote the get_frame method in hlt.py to remove unused variables and dump map contents directly into a numpy array, minimizing unneeded computation.
- Trimmed each frame to the bounding box of my squares plus the receptive field. This allowed much faster computation in the early game, so that on large maps, where the risk of timing out was higher, I could at least reach the mid-to-late game and perhaps finish in a higher place than if I had timed out on the first frame.
- Removed the softmax when not training. The softmax activation is needed during training so that cross-entropy can be computed, but in production this relatively expensive calculation is unnecessary: since we take an argmax, the chosen move is the same whether we use the raw logits or the softmax output (see the sketch after this list).
- Utilized numpy as much as possible for array manipulation, which is much more efficient than performing equivalent calculations in pure Python.
- Removed extra neural network components (network-in-network, batch norm, etc.) and simply increased the number of filters in each conv layer.
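A tiny sketch of why skipping the softmax is safe: softmax is monotonic within each logit vector, so the argmax over raw logits selects the same move.

```python
import numpy as np

def choose_moves(logits):
    """Pick the move for each square directly from raw logits ([y, x, 5] -> [y, x]).

    Applying softmax first would give identical argmax results, since softmax
    preserves the ordering of each square's five logits.
    """
    return np.argmax(logits, axis=-1)
```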
Potential improvements
Toward the end of the competition there was an explosion of new tactics, and the game meta changed significantly. The replays I had for training were older and did not reflect these recent shifts in strategy. I didn't have the opportunity to retrain on newer data in the final days of the competition, and I'd be interested to see how the same model trained on newer data would perform. I am quite satisfied that my bot still performed as well as it did amid new tactics, particularly the non-aggression pact employed by top players, but beyond retraining on newer data, I'm sure there are many other areas for improvement.
Conclusion
Halite was an interesting and rewarding challenge, and I’m very grateful for the opportunity I had to participate. Congratulations to all the players, and I look forward to the next Halite iteration.