The photoblog at DrQue.net has over 10,000 images, and search engines are told not to index them. The scripts that run the gallery order the images using random numbers, so a search engine would get a different set of pages each time it tried to spider them. Most search engines respect this. Others ignore robots.txt and try to index anyway. I set up Apache so that it checks the user agent and won’t serve an image to a bot, which took care of most of the rest. However, I’ve been seeing some bots from China that identify themselves as a normal web browser, and they have been wasting bandwidth downloading every image in every size from my site. Seeing this drives me nuts. Who knows why these bots are doing this; I’m sure I’m not the target of whatever they are searching for. Since I’ve already done all I can to block bots that identify as bots, I have no choice this time but to block the IP addresses. Whoever is doing the downloading is big, as I saw a lot of different IP addresses doing downloads. I stopped every one of them by simply firewalling three entire subnets:
que@Sun-Dragon:~$ sudo iptables -I INPUT -s 18.104.22.168/255.255.255.0 -j DROP
que@Sun-Dragon:~$ sudo iptables -I INPUT -s 22.214.171.124/255.255.255.0 -j DROP
que@Sun-Dragon:~$ sudo iptables -I INPUT -s 126.96.36.199/255.255.255.0 -j DROP
That took care of it.
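For anyone wanting to spot this kind of traffic in their own logs, here is a rough sketch of how requests could be tallied by /24 subnet (the same range a 255.255.255.0 mask covers). This is not the exact method I used; the function name `top_subnets` is made up for illustration, and it assumes an access log whose first field is the client's IPv4 address, as in Apache's common and combined log formats.

```shell
#!/bin/sh
# Hypothetical helper: count requests per /24 subnet.
# Reads an access log on stdin and assumes the first field of each line
# is the client IPv4 address. Lines that do not start with a dotted
# IPv4 address (IPv6 clients, for example) are skipped.
# Optional argument: how many subnets to show (default 10).
top_subnets() {
    awk '{
        split($1, o, ".")
        if (o[4] != "") print o[1] "." o[2] "." o[3] ".0/24"
    }' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example usage (the log path is an assumption; adjust it for your server):
# top_subnets 5 < /var/log/apache2/access.log
```

Each output line is a request count followed by a subnet, busiest first. A subnet that dwarfs everything else while claiming to be an ordinary browser is a candidate for the kind of DROP rule shown above.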