Really depends on what you want to achieve with it.
You can do some pixel "picking" with a shader, but you probably can't tell from the result which sprite a given pixel belongs to.
You can do everything in memory, but that's rather slow, and you end up with a sort of software renderer, since you have to figure out yourself which pixel is shown where.
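To give an idea of what the in-memory route could look like, here is a rough sketch rather than a finished implementation. It assumes each sprite keeps its pixel data on the CPU as rows of (r, g, b, a) tuples, is not rotated or scaled, and that the list is in draw order. The `Sprite` class and `pick_sprite` function are made-up names for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    # Hypothetical container, not tied to any graphics library.
    name: str
    x: int        # screen position of the sprite's top-left corner
    y: int
    pixels: list  # rows of (r, g, b, a) tuples kept in CPU memory


def pick_sprite(sprites, px, py):
    """Return the topmost sprite with a visible pixel at (px, py), or None.

    `sprites` is assumed to be in draw order, so the last entry is on top.
    """
    for sprite in reversed(sprites):
        # Convert the screen position to the sprite's local pixel coordinates.
        lx, ly = px - sprite.x, py - sprite.y
        if 0 <= ly < len(sprite.pixels) and 0 <= lx < len(sprite.pixels[ly]):
            # Alpha 0 means fully transparent, so the sprite below shows through.
            if sprite.pixels[ly][lx][3] > 0:
                return sprite
    return None
```

This only handles axis-aligned sprites. As soon as rotation or scaling comes into play, you would have to apply the inverse transform before the lookup, which is where it starts to feel like writing a software renderer.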
Depending on the precision you need, you could also keep a sort of transparency mask, which may make it easier to differentiate the layers. It's probably still rather performance hungry, since it's done on the CPU, but it should be better than the second option.
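A transparency mask could be sketched roughly like this, again only as an illustration under my own assumptions. The mask is built once from the pixel data, with one boolean per `cell` x `cell` block, so a larger `cell` trades precision for less memory and less work. `build_mask`, `mask_hit`, `threshold` and `cell` are hypothetical names, not an existing API.

```python
def build_mask(pixels, threshold=0, cell=1):
    """Build a boolean mask from rows of (r, g, b, a) tuples.

    Each mask entry covers a cell x cell block of pixels and is True if any
    pixel in that block has alpha greater than `threshold`.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    mask = []
    for top in range(0, height, cell):
        row = []
        for left in range(0, width, cell):
            row.append(any(
                pixels[y][x][3] > threshold
                for y in range(top, min(top + cell, height))
                for x in range(left, min(left + cell, width))
            ))
        mask.append(row)
    return mask


def mask_hit(mask, lx, ly, cell=1):
    """Check a sprite-local pixel position against a mask built with `cell`."""
    if lx < 0 or ly < 0:
        return False
    my, mx = ly // cell, lx // cell
    return my < len(mask) and mx < len(mask[my]) and mask[my][mx]
```

You would build the mask once per texture and then test each layer from top to bottom with `mask_hit`, so the per-click cost is just a few list lookups instead of reading full pixel data.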
But as I said, it really depends what you want to achieve and what trade-offs you can accept.