[PLUG] wget and politeness

Keith Lofstrom keithl at kl-ic.com
Mon Dec 27 15:16:58 UTC 2004


Keith Lofstrom wrote:
> When I manually web to the site, I get:
>
> You are black listed at the TWiki web site due to excessive access or
> suspicious activities. Please contact site administrator
> peter.thoeny at attglobalSTOPSPAM.net if you got on the list by mistake.
> Black listed IP addresses will be submitted to major blacklist databases.

On Mon, Dec 27, 2004 at 08:08:14AM -0800, Brent Rieck wrote:
> It's a not uncommon topic for the newsgroups at my web host; many people
> find that their website is getting hammered by people with wget (or a
> Windows equivalent), causing bandwidth overages.  And since most of these
> websites are just labors of love, they'll block wget and blacklist IP
> addresses so their users don't cost them more money, the theory being
> that humans don't actually look at most of the website they scrape down.
> I'd guess that you won't look at most of what's on twiki.org while
> offline but that it's impossible to predict what you will look at, so
> you want to download it all.  twiki.org could be in this bandwidth
> situation.
> 
> Try emailing Peter to explain what you were trying to do, he might have 
> a solution for your offline needs.

Brent, I am taking the liberty of sending your mail and my reply to
both you and PLUG, because it is an insightful observation expressed
very well, and all of us can learn from it.

I contacted Peter.  His response was positive but terse (he will
un-blacklist me, he will put up a note about scrapers on the index
page, and I should just take the download), and your hypothesis provides
a great explanation of why he might want to restrict wget scrapes of the
whole site.  It is indeed a big site (wikis get that way!), so a few
dozen folks doing what I did could well push it into bandwidth overage.
The fact that he did have *some* page areas set to "no robots" probably
implies that he would want to set "no robots" for much of the rest of
the site, too!

Next time, I should ask before I scrape.  I should also think about
scraping policies for my own wikis.  
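
As a rough illustration of "asking before scraping" in software terms,
here is a minimal sketch that reads a robots.txt policy, drops the URLs
the policy disallows, and picks a delay between requests.  The
robots.txt text, the "example-bot" agent name, the example.org URLs, and
the 5-second fallback delay are all made up for illustration; this is
not twiki.org's actual policy, and it only plans a crawl rather than
fetching anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from any site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, agent, urls, default_delay=5.0):
    """Return (allowed_urls, delay_in_seconds) for the given user agent.

    default_delay is an assumed fallback used when the site publishes
    no Crawl-delay; choose something the site owner can live with.
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, float(delay)


if __name__ == "__main__":
    parser = make_parser(EXAMPLE_ROBOTS_TXT)
    urls = [
        "http://example.org/index.html",
        "http://example.org/cgi-bin/view/Main",
    ]
    allowed, delay = polite_plan(parser, "example-bot", urls)
    print("would fetch:", allowed)
    print("seconds to wait between requests:", delay)
```

A real crawl would sleep for the returned delay between requests.  A
robots.txt file only covers what the owner thought to list, so a short
note to the administrator is still the safer first step.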

Thanks for the clue;  perhaps if I accumulate enough of them, I will
not need LARTing so often.



Keith

-- 
Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs


