
Tech Helproom



Exceeding bandwidth: can a robots.txt file help?


spot the braincell


For the last two months I have been exceeding my bandwidth and having to fork out for more. Nativespace, who host the site, have suggested that it's either
1) naturally attracting more visitors or
2) it's being targeted by robots some of whom may not be welcome.

The question is: is there any way of identifying which of the two scenarios (or any other cause) is responsible?

How can I restrict the robots to 'friendly' ones?

I've listed below (for what it's worth) the robot, the number of hits and the bandwidth for the current month

Robot                                                    Hits       Bandwidth
Unknown robot (identified by 'bot*')                     2114+222   11.26 GB
Googlebot                                                457+42     14.56 GB
Voyager                                                  435        14.83 GB
Unknown robot (identified by empty user agent string)    216+11     1.79 GB
Unknown robot (identified by hit on 'robots.txt')        0+127      33.36 KB
Unknown robot (identified by 'spider')                   19+49      169.00 KB
Unknown robot (identified by 'robot')                    24+44      443.42 MB
Yahoo Slurp                                              0+65       17.08 KB
Unknown robot (identified by 'crawl')                    28+37      388.06 MB
MSNBot                                                   31+27      401.33 KB
Ask                                                      4+7        25.86 KB
MSNBot-media                                             4+2        38.17 KB
Netcraft                                                 2          6.07 KB

cheers in advance

Ansolan


Hi

You could construct a robots.txt file to allow the robots you want and ban the others; the trouble is that a lot of bad robots, scrapers etc. don't obey the directives.
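A minimal sketch of that kind of robots.txt (assuming, for the sake of example, that Googlebot and msnbot are the ones you want to keep, and that the file sits in your site root) would be:

# Allow the named crawlers everywhere, ask everything else to stay out
# (only well-behaved robots obey this)
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /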

It's useful to go right through your server logs and make sure robots really are the issue, rather than, say, someone hotlinking to your images. If you then want to try to establish the identity of a caller, try a reverse DNS lookup somewhere like click here
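If you have shell access, a couple of rough checks on an Apache combined-format access log will show who is doing the fetching and whether images are being pulled from elsewhere (the log path and "yourdomain" below are placeholders - substitute your own):

# Top user agents by request count
awk -F'"' '{print $6}' /path/to/access_log | sort | uniq -c | sort -rn | head -20

# Image requests arriving with a referer that isn't your own site (possible hotlinking)
awk -F'"' '$2 ~ /\.(jpg|jpeg|png|gif)/ && $4 !~ /yourdomain/' /path/to/access_log | head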

Depending on your server and the access you have, you may be able to block by IP; that's quite straightforward on an Apache server using .htaccess, for example click here. It's equally possible to block by domain.
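As a rough sketch, an Apache 2.2-style .htaccess block looks like this (the addresses are placeholders from the documentation ranges, not real offenders):

# Deny a couple of example IPs and a range, allow everyone else
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24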

There are more complex solutions, but you need to be careful not to block any people, or robots, that you don't want restricted. As often as not, blocking by IP is effective in this sort of situation. You may need to keep adding a few for a while, but you should get there in the end.

spot the braincell


Hi Ansolan,

thanks for the tips - I did a sample reverse DNS lookup on a few IPs and got the following:

77.88.42.27 resolves to
"spider50.yandex.ru"
Top Level Domain: "yandex.ru"
Country IP Address: RUSSIAN FEDERATION

119.63.198.13 resolves to
"baiduspider-119-63-198-13.crawl.baidu.jp"
Top Level Domain: "baidu.jp"
Country IP Address: JAPAN

169.237.7.228 resolves to
"alizee.cs.ucdavis.edu"
Top Level Domain: "ucdavis.edu"
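For reference, the same lookups can be done from a shell with host or dig, along these lines:

# Reverse DNS lookups - same information as the web tool
host 77.88.42.27
dig -x 119.63.198.13 +short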

My problem is knowing what constitutes a good or a bad robot. From your previous response I could presumably block these IPs with .htaccess, but how do I know whether I want to?

cheers

Ansolan


Baidu Japan is an arm of China's largest search engine; they should obey robots.txt if you don't want them indexing your site. Yandex is an established Russian/Ukrainian search engine; in theory they will also obey robots.txt, but they deserve closer watching. In either case, if you need to block by IP, bear in mind there will be multiple IPs.

Why the University of California is causing issues, I don't know. They are likely to have bots out and about for research projects, or they could have other forms of link to your site. Equally, it's not unknown for large .edu IPs to be involved in abuse, scraping and so on.

Much will depend on your site/business; it's a case-by-case decision. Of those you mention, Yandex can often be a high-bandwidth user and, if not relevant to you, would be best blocked.
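If you do decide Yandex and Baidu aren't relevant to you, a robots.txt entry along these lines asks them to stay away (both claim to honour it) before you resort to IP blocks:

# Ask Yandex and Baidu not to crawl the site
User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /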

spot the braincell


Many thanks Ansolan. If you don't mind, may I present some of the others that are visiting, as I have no idea how to tell whether they are friendly or not?

spot the braincell


I've had a nosey around and found a sample .htaccess file which apparently blocks all the well-known bad bots, as follows:
RewriteEngine On
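# Return 403 (Forbidden) to any request whose User-Agent matches one of the patterns below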
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

will see if this does the trick
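One quick way to check the rules are working (assuming curl is available, and using example.com as a stand-in for the real domain) is to request a page while claiming to be one of the blocked agents and look for a 403:

# Should return 403 Forbidden if the rewrite rules are active
curl -I -A "Wget" http://www.example.com/
# A normal browser user agent should still get 200
curl -I -A "Mozilla/5.0" http://www.example.com/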



This thread has been locked.


