Understanding Q-Learning - A Beginner's Tutorial on Building Your First Game AI with TensorFlow

A good way to understand Q-Learning is to compare the game Catch with chess.

In both games you are given a state S. In chess, this is the position of the pieces on the board. In Catch, this is the position of the fruit and the basket.

The player then takes an action A. In chess, the player moves a piece. In Catch, this means moving the basket left or right, or keeping it where it is. As a result, the player receives some reward R and a new state S'.

One thing Catch and chess have in common is that rewards do not appear immediately after an action.

In Catch, you only earn a reward when the fruit lands in the basket or hits the floor. In chess, you only earn a reward once the whole game is won or lost. In other words, rewards are sparsely distributed: most of the time, R is zero.

A reward is not always the result of the immediately preceding action. An action taken much earlier may have been the key to winning. Working out which action is responsible for the eventual reward is commonly called the credit assignment problem.

Because rewards are delayed, good chess players do not choose their moves based only on the immediate reward. Instead, they consider the expected future reward and choose accordingly. For example, they do not only ask whether the next move can capture an opponent's piece; they also consider moves that pay off in the long run.

In Q-Learning, we choose the action with the highest expected future reward. We compute it with a Q function. This mathematical function takes two arguments: the current state of the game and a given action. We can therefore write it as Q(state, action). In state S, we estimate the return of each possible action A. We assume that after taking action A and moving to the next state S', everything goes perfectly.
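As a minimal sketch of "choose the action with the highest expected future reward", the snippet below picks the best action from a set of Q-value estimates. The numbers and the action order (left, stay, right) are made up for illustration and are not taken from the tutorial's model.

```python
import numpy as np

# Hypothetical Q-value estimates for one state of Catch.
# The action order is an assumption made for this sketch.
ACTIONS = ["left", "stay", "right"]
q_values = np.array([0.1, 0.7, 0.3])


def greedy_action(q_values):
    """Return the index of the action with the highest estimated Q value."""
    return int(np.argmax(q_values))


best = greedy_action(q_values)
print(ACTIONS[best])  # prints "stay"
```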

For a given state S and action A, the expected future reward Q(S, A) is calculated as the immediate reward R plus the expected future reward thereafter, Q(S', A'). We assume the next action A' is optimal.

Because the future is uncertain, we discount Q(S', A') by multiplying it by a factor γ:

Q(S, A) = R + γ * max Q(S', A')
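The formula can be checked with a small worked example. The sketch below assumes γ = 0.9 (the default discount used later in the tutorial's replay class) and made-up Q values for the three actions available in S'; the function name q_target is hypothetical.

```python
def q_target(reward, next_q_values, gamma=0.9):
    """Compute R + gamma * max Q(S', A') for one transition."""
    return reward + gamma * max(next_q_values)


# No immediate reward, and the best next action is estimated at 0.5.
print(q_target(0.0, [0.2, 0.5, 0.1]))  # prints 0.45
```

With R = 0, the whole value of Q(S, A) comes from the discounted estimate of the best next action, which is how a reward that arrives later can still influence earlier choices.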

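Good chess players are very good at estimating future rewards in their heads. In other words, their Q function Q(S, A) is very accurate. Most chess training revolves around developing a better Q function. Players study game records to learn how particular moves play out and how likely a given move is to lead to victory. But how can a machine estimate a good Q function? This is where neural networks come in.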

Regression after all

While playing, we generate many "experiences", each consisting of the following parts:

Initial state, S

Action taken, A

Reward earned, R

Next state, S'
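One way to hold such an experience in code is a named tuple with one field per part. This is only an illustrative sketch: the Experience type, its field names, and the state encoding as a tuple of integers are assumptions and do not appear in the tutorial's own code.

```python
from typing import NamedTuple, Tuple


class Experience(NamedTuple):
    """One transition <S, A, R, S'> (hypothetical container for illustration)."""
    state: Tuple[int, ...]       # S: encoded game state before the action
    action: int                  # A: index of the action taken
    reward: float                # R: reward received
    next_state: Tuple[int, ...]  # S': encoded game state after the action


exp = Experience(state=(3, 5), action=2, reward=0.0, next_state=(4, 6))

# A named tuple still unpacks like a plain 4-tuple.
s, a, r, s_next = exp
```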

These experiences are our training data. We can frame estimating Q(S, A) as a regression problem, and we can use a neural network to solve it. Given an input vector composed of S and A, the neural network should predict a value of Q(S, A) equal to the target: R + γ * max Q(S', A').

If we can predict Q(S, A) well for different states S and actions A, we have a good approximation of the Q function. Note that we estimate Q(S', A') with the same neural network as Q(S, A).

The training process

Given a batch of experiences, the training process looks as follows:

1. For each possible action A' (left, right, stay), use the neural network to predict the expected future reward Q(S', A');

2. Choose the highest of the three predictions as max Q(S', A');

3. Calculate R + γ * max Q(S', A'). This is the target value for the neural network;

4. Train the neural network using a loss function, which measures how far the predicted value is from the target value. Here, we use 0.5 * (predicted_Q(S, A) - target)^2 as the loss function.
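The four steps above can be sketched for a single experience. To keep the example self-contained, a fixed linear map W stands in for the neural network; W, predict_q, train_step, the learning rate, and the toy states are all assumptions made for illustration, not the tutorial's model.

```python
import numpy as np

# Stand-in "network": a linear map from a 3-dimensional state to 3 Q values.
W = np.eye(3)


def predict_q(state):
    """Return one Q value per action for the given state."""
    return state @ W


def train_step(state, action, reward, next_state, gamma=0.9, lr=0.1):
    """Run steps 1 to 4 once and return (target, loss before the update)."""
    next_q = predict_q(next_state)            # step 1: Q(S', A') for every A'
    max_next_q = np.max(next_q)               # step 2: max Q(S', A')
    target = reward + gamma * max_next_q      # step 3: regression target
    predicted = predict_q(state)[action]
    loss = 0.5 * (predicted - target) ** 2    # step 4: squared-error loss
    # Gradient step on the weights of the chosen action only,
    # treating the target as a constant.
    W[:, action] -= lr * (predicted - target) * state
    return target, loss


state = np.array([1.0, 0.0, 0.0])
next_state = np.array([0.0, 2.0, 1.0])

target, loss_before = train_step(state, 0, 1.0, next_state)
_, loss_after = train_step(state, 0, 1.0, next_state)
print(loss_after < loss_before)  # prints True
```

Only the Q value of the action actually taken is pushed toward the target; the other actions are left alone, which mirrors how the replay class in this tutorial fills the remaining target entries with the model's own predictions.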

During gameplay, all experiences are stored in a replay memory. It works like a simple buffer that stores <S, A, R, S'> tuples. The experience replay class also handles preparing the training data. Let's look at the code below:

import numpy as np


class ExperienceReplay(object):
    """
    During gameplay all the experiences < s, a, r, s' > are stored in a replay memory.
    In training, batches of randomly drawn experiences are used to generate the input
    and target for training.
    """

    def __init__(self, max_memory=100, discount=.9):
        """
        Setup
        max_memory: the maximum number of experiences we want to store
        memory: a list of experiences
        discount: the discount factor for future experience

        In the memory the information whether the game ended at the state is stored
        separately in a nested array
        [...
        [experience, game_over]
        [experience, game_over]
        ...]
        """
        self.max_memory = max_memory
        self.memory = list()
        self.discount = discount

    def remember(self, states, game_over):
        # Save a state to memory
        self.memory.append([states, game_over])
        # We don't want to store infinite memories, so if we have too many,
        # we just delete the oldest one
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):
        # How many experiences do we have?
        len_memory = len(self.memory)

        # Calculate the number of actions that can possibly be taken in the game
        num_actions = model.output_shape[-1]

        # Dimensions of the game field
        env_dim = self.memory[0][0][0].shape[1]

        # We want to return an input and target vector with inputs from an observed state...
        inputs = np.zeros((min(len_memory, batch_size), env_dim))

        # ...and the target r + gamma * max Q(s',a')
        # Note that our target is a matrix, with possible fields not only for the action
        # taken but also for the other possible actions. The actions not taken keep the
        # same value as the prediction, so training does not affect them
        targets = np.zeros((inputs.shape[0], num_actions))

        # We draw states to learn from randomly
        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):
            # Here we load one transition <s, a, r, s'> from memory
            #   state_t: initial state s
            #   action_t: action taken a
            #   reward_t: reward earned r
            #   state_tp1: the state that followed s'
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]

            # We also need to know whether the game ended at this state
            game_over = self.memory[idx][1]

            # Add the state s to the input
            inputs[i:i+1] = state_t

            # First we fill the target values with the predictions of the model.
            # They will not be affected by training (since the training loss for them is 0)
            targets[i] = model.predict(state_t)[0]

            # If the game ended, the expected reward Q(s,a) should be the final reward r.
            # Otherwise the target value is r + gamma * max Q(s',a')

            # Here Q_sa is max_a' Q(s', a')
            Q_sa = np.max(model.predict(state_tp1)[0])

            if game_over:
                # If the game ended, the reward is the final reward
                targets[i, action_t] = reward_t
            else:
                # r + gamma * max Q(s',a')
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets
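The "delete the oldest experience when memory is full" behaviour of remember could also be sketched with a bounded deque from the Python standard library. This is an alternative illustration, not the tutorial's implementation; the transitions stored here are placeholder strings.

```python
import random
from collections import deque

# A deque with maxlen drops its oldest item automatically when full.
memory = deque(maxlen=3)

for step in range(5):
    experience = (f"s{step}", 0, 0.0, f"s{step + 1}")  # placeholder <s, a, r, s'>
    game_over = False
    memory.append([experience, game_over])

# Only the three most recent experiences remain.
oldest_state = memory[0][0][0]
print(len(memory), oldest_state)  # prints "3 s2"

# Draw a random batch with replacement, as np.random.randint does in get_batch.
batch = random.choices(list(memory), k=2)
```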