So, while implementing Q-Learning I found that the agent would often end up moving back and forth, over and over, between the same two adjacent squares. I thought about this for a bit, then decided this must be what was happening:
0      0     0    Initially on the left square
.25->  0     0    Moves to the right and evaluates the square it was just on
.25    0     0    Searches for the most valuable square near it
.25  <-.25   0    Finds the left square and moves back, updating the one it left
.50  ->.25   0    Sees that the right square now has value and moves again, updating the leftmost as well
ad infinitum.
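For reference, here is roughly the kind of update and action choice I'm describing: a purely greedy, tabular Q-Learning step on a 1-D strip of squares. The grid size, ALPHA, and GAMMA below are placeholder values for illustration, not my actual setup:

```python
import numpy as np

ALPHA, GAMMA = 0.5, 0.9   # placeholder learning rate / discount, not my real settings
N = 5                     # 1-D strip of squares, food on the far right
LEFT, RIGHT = 0, 1

Q = np.zeros((N, 2))      # Q[square, action]
food_reward = {N - 1: 1.0}

def greedy_action(s):
    # The agent simply chases whichever action currently looks best --
    # this is what lets it ping-pong between neighbouring squares.
    return int(np.argmax(Q[s]))

def q_step(s, a):
    # Take the action, then update the value of the square we just left.
    s2 = max(0, min(N - 1, s + (1 if a == RIGHT else -1)))
    r = food_reward.get(s2, 0.0)
    Q[s, a] += ALPHA * (r + GAMMA * Q[s2].max() - Q[s, a])
    return s2
```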
After a bit of investigation I found my suspicions confirmed and proceeded to implement a few solutions. One was a varying epsilon over the course of the episodes, allowing for maximum exploration (completely random movements) in the beginning and controlled exploitation (the Q-Learning algorithm itself chooses what to do, albeit with a small epsilon percent chance of random moves) later on. This strategy did not work very well.
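Something along these lines -- the decay schedule and the numbers here are just an example, not my exact settings:

```python
import random
import numpy as np

EPS_START, EPS_END = 1.0, 0.05   # example values: fully random at first, mostly greedy later
NUM_EPISODES = 500

def epsilon_for(episode):
    # Linear decay over the first 80% of training, then held at EPS_END.
    frac = min(1.0, episode / (NUM_EPISODES * 0.8))
    return EPS_START + frac * (EPS_END - EPS_START)

def choose_action(Q, state, episode, n_actions=4):
    # Epsilon-greedy: random move with probability epsilon, otherwise greedy.
    if random.random() < epsilon_for(episode):
        return random.randrange(n_actions)
    return int(np.argmax(Q[state]))
```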
I also implemented a type of memory, where during a single episode the agent would not get a reward for visiting the same space twice, and a modified version where the agent received a reward the first time it entered the food space from the left or right, but never received it again if it entered from the left or right afterwards, and likewise for up and down. This was to try to cancel out the oscillations. I also created a version where not just food, but all rewards (including negative reinforcement) were taken away after the first visit. All these methods improved the performance of the algorithm drastically! It was great.
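A sketch of the per-episode bookkeeping involved -- this is a reconstruction for illustration, so the function and parameter names are made up, and whether negative rewards also pass through the same gate depends on which variant you pick:

```python
def make_episode_memory():
    # Reset at the start of every episode.
    return set()

def reward_for(square, entry_direction, base_reward, consumed, directional=False):
    """Hand out a square's reward at most once per episode.

    With directional=True the reward is consumed separately for horizontal
    (left/right) and vertical (up/down) entries, so re-entering from the same
    axis gives nothing -- which is what breaks the back-and-forth oscillation.
    """
    horizontal = entry_direction in ("left", "right")
    key = (square, horizontal) if directional else square
    if key in consumed:
        return 0.0          # already collected this episode
    consumed.add(key)
    return base_reward
```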