  • Experiment 1: SARSA, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 1152181
  • Experiment 2: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 641052
  • Experiment 3: SARSA, α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 20: 2664318
  • Experiment 4: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 20: 2664659
Compare the results of the first and second experiments to the third and fourth experiments.
  • The difference between the first and second experiments should be much larger than the difference between the third and fourth. This is because when the Greedy Exploit parameter is set to 100%, SARSA should be exactly equivalent to Q-learning: SARSA's update target uses the action actually taken in the next state, and with no exploration that action is always the greedy one that Q-learning's max operator would pick. In the first and second experiments, however, the agent explores 20% of the time, so SARSA's target sometimes uses an exploratory action, and the two methods end up generating different policies, as sketched below.
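
The sketch below contrasts the two update rules to make the equivalence argument concrete. It assumes a tabular Q array indexed as Q[state, action]; the names epsilon_greedy, alpha, gamma, and epsilon are illustrative and are not taken from the assignment's simulator.

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    # With probability epsilon take a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: the target uses the action a_next actually chosen in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: the target uses the best (max) action value in s_next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

With epsilon = 0 (Greedy Exploit = 100%), epsilon_greedy always returns argmax Q[s_next], so Q[s_next, a_next] equals max(Q[s_next]) and both updates compute the same target. With 20% exploration, SARSA's target occasionally uses an exploratory action, which is what drives the two algorithms toward different policies in the first two experiments.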
