How about a FIFO queue of pending events, like a line of text to be spoken or a walk for the player to perform? As long as there's something in the queue, player input is ignored. As each entry is processed, it could send messages to entities telling them to play animations (a speaking animation, for example).
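Here's a rough sketch of what I mean, in Python just for illustration. All the names (`Entity`, `Action`, `ActionQueue`, the `"play_anim:..."` message strings) are made up for the example, and it assumes each action simply runs for a fixed duration; a real engine would probably finish a walk when the destination is reached instead.

```python
from collections import deque
from dataclasses import dataclass


class Entity:
    """Hypothetical entity that just records the messages it receives."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def send(self, message):
        self.messages.append(message)


@dataclass
class Action:
    """One queued event, e.g. a spoken line or a walk (assumed fixed duration)."""

    target: Entity
    start_message: str  # sent to the target when the action begins
    end_message: str    # sent to the target when the action finishes
    duration: float     # seconds
    elapsed: float = 0.0
    started: bool = False

    def update(self, dt):
        """Advance the action by dt seconds; return True once it has finished."""
        if not self.started:
            self.started = True
            self.target.send(self.start_message)
        self.elapsed += dt
        if self.elapsed >= self.duration:
            self.target.send(self.end_message)
            return True
        return False


class ActionQueue:
    """FIFO queue of actions; only the front action is processed each frame."""

    def __init__(self):
        self._pending = deque()

    def push(self, action):
        self._pending.append(action)

    def accepts_input(self):
        """Player input is ignored while anything is still queued."""
        return not self._pending

    def update(self, dt):
        # Process the front entry and drop it once it reports completion.
        if self._pending and self._pending[0].update(dt):
            self._pending.popleft()


player = Entity("player")
queue = ActionQueue()
queue.push(Action(player, "play_anim:talk", "stop_anim:talk", duration=1.0))
queue.push(Action(player, "play_anim:walk", "stop_anim:walk", duration=0.5))
```

In the game loop you would call `queue.update(dt)` every frame and only pass input on to the player when `queue.accepts_input()` is true. Because only the front entry is ever updated, the walk doesn't start until the spoken line has finished.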