Sure, and if they were illustrative of hands, you’d get good hands for output. But they’re random photos from random angles, possibly only showing a few fingers. Or maybe with hands clasped. Or worse, two people holding hands. If you throw all of those into the mix and call them all hands, a mix is what you’re going to get out.
Look at this picture: https://petapixel.com/assets/uploads/2023/03/SD1131497946_two_hands_clasped_together-copy.jpg
You can sort of see where it's coming from. Some parts look like a handshake, some parts look like two people standing side by side holding hands (both with and without fingers interlaced), some parts look like one person's hands on their knee. It all depends on how the image is constructed, and on what your input data and labels are.
Stable Diffusion works by iteratively denoising the image (technically, a compressed latent version of it), nudging local details until they look plausible, rather than reasoning about the macro-scale structure of the whole image. Other methods, like whatever DALL·E 2 uses, seem to handle this better.
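To make that concrete, here's a toy sketch of the kind of iterative denoising loop diffusion models run. This is a plain DDPM-style sampler, not Stable Diffusion's actual scheduler; `DummyDenoiser`, the tiny 8x8 "latent", and the schedule constants are all stand-ins made up for illustration:

```python
import torch

NUM_STEPS = 50
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)  # noise schedule (toy values)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

class DummyDenoiser(torch.nn.Module):
    """Stand-in for the trained U-Net that predicts the noise in a latent.

    A real model would also embed the timestep t and condition on the
    text prompt; this dummy just returns zeros so the sketch runs."""
    def forward(self, x, t):
        return torch.zeros_like(x)

model = DummyDenoiser()
x = torch.randn(1, 1, 8, 8)  # start from pure Gaussian noise

for t in reversed(range(NUM_STEPS)):
    eps = model(x, t)  # model's estimate of the noise present at step t
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]
    # Subtract a fraction of the predicted noise (the DDPM posterior mean).
    x = (x - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t > 0:
        # Re-inject a small amount of noise for every step but the last.
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

print(x.shape)  # the final "latent"; SD would decode this with its VAE
```

The point of the sketch: each step only subtracts a locally estimated bit of noise. Nothing in the loop itself enforces a global constraint like "a hand has five fingers"; any such structure has to come from what the denoiser learned from its training data.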