InitialForaysIntoScrapingHackerNews

Like many other technically minded programmers, I've spent some time on Paul Graham's <a href="https://news.ycombinator.com/news">Hacker News</a>[0] web site. There are some gems there, although there is also some dross, and I've decided that it's important for me to be more aggressive in my filtering. So I've done what I always do and indulged in some <a href="http://en.wiktionary.org/wiki/yak_shaving">Yak</a> <a href="http://sethgodin.typepad.com/seths_blog/2005/03/dont_shave_that.html">Shaving</a>[1][2].

I've been thinking about automatic and semi-automatic filtering: writing a system that automatically finds stories it thinks I'll find interesting, and then leavening that with some other material to try to help prevent the <a href="http://en.wikipedia.org/wiki/Echo_chamber_%28media%29">"echo chamber"</a> effect[3].

I've done a little reading around to see what might be tolerated. An <a href="https://news.ycombinator.com/item?id=1721997">old comment by Paul Graham</a>[4] seems to indicate that a pull every 30 seconds or so would be OK, and that's supported by the <a href="https://news.ycombinator.com/robots.txt">robots file</a>[5]. So I've written a script that pulls an item with its discussion, extracts the hierarchy and saves it, then waits a minute or so and goes again. (There's a sketch of such a loop after the footnotes.)

Actually, I started by looking at <a href="https://www.hnsearch.com/api">HNSearch</a>[6]. Another comment suggested that we were asked to use that API rather than scraping HN directly. Well, I've done that, and it seems not to have the first million or so items, and the searches I've done are just full of holes. It seems a reasonable first step, but the database I've pulled from it is woeful, with only about 2% to 3% of items being present. I've randomly chosen items that should be covered by a search query, and they're just not there.

So I'm intending to rethink that: to look for other sources, or a better way of interrogating that one. In the meantime I've started up the direct scraping.

And been IP blocked.

OK, so I've backed off, reset my modem to change my IP (just this once - I don't do that very often), changed to pulling only every five minutes, and been blocked again. At a rate of one pull every 5 minutes I expect to get the first million or so entries by late 2017.

So it's time to reconsider. Do I check for failed requests, back off, try later, and hope to get my IP unblocked? (There's a sketch of that below too.) Do I expect to find a sustainable rate? When the *robots.txt* file says 30 seconds, and PG's comment says 30 seconds, and yet I get IP blocked for querying no more often than once every 5 minutes, it seems that there's more going on. And when the officially sanctioned source of material only gives 3% of the data, you start to think that there has to be a better way.

I think I'd better think it out again.

----

[0] https://news.ycombinator.com/news
[1] http://en.wiktionary.org/wiki/yak_shaving
[2] http://sethgodin.typepad.com/seths_blog/2005/03/dont_shave_that.html
[3] http://en.wikipedia.org/wiki/Echo_chamber_%28media%29
[4] https://news.ycombinator.com/item?id=1721997
[5] https://news.ycombinator.com/robots.txt
[6] https://www.hnsearch.com/api
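
For concreteness, here's a minimal sketch of the sort of polite pull-parse-save loop described above. It's Python using the requests and BeautifulSoup libraries; the details of HN's comment markup (the "comtr" rows and the "indent" attribute) are assumptions about the page structure rather than a documented API, and would need checking against the live pages.

    import json
    import time

    import requests
    from bs4 import BeautifulSoup

    ITEM_URL = "https://news.ycombinator.com/item?id={}"
    DELAY = 60   # PG's comment and robots.txt suggest 30s; a minute is more cautious

    def fetch_item(item_id):
        # Pull one item page; return the HTML, or None on any failure.
        resp = requests.get(ITEM_URL.format(item_id), timeout=30)
        return resp.text if resp.status_code == 200 else None

    def extract_comments(html):
        # Recover the reply hierarchy: HN encodes depth as indentation.
        soup = BeautifulSoup(html, "html.parser")
        comments = []
        for row in soup.select("tr.athing.comtr"):
            ind = row.find("td", class_="ind")
            depth = int(ind.get("indent", 0)) if ind else 0
            body = row.find("div", class_="comment")
            text = body.get_text(" ", strip=True) if body else ""
            comments.append({"depth": depth, "text": text})
        return comments

    def scrape(start_id, count):
        for item_id in range(start_id, start_id + count):
            html = fetch_item(item_id)
            if html is not None:
                with open("item-{}.json".format(item_id), "w") as f:
                    json.dump(extract_comments(html), f)
            time.sleep(DELAY)   # wait a minute or so and go again

Rebuilding the actual tree from the saved depth sequence is then a simple stack walk over consecutive comments.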
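
As for the "check for failed requests, back off, try later" option, the obvious shape is exponential backoff. A sketch, with the caveat that the thresholds are guesses - nothing here reflects any knowledge of how HN's blocking actually works:

    import time

    import requests

    def fetch_with_backoff(url, max_tries=6, base_delay=300):
        # Start at the five-minute rate mentioned above and double the
        # wait after each failure; the doubling and the one-hour cap are
        # arbitrary choices, not known-safe values.
        delay = base_delay
        for attempt in range(max_tries):
            try:
                resp = requests.get(url, timeout=30)
                if resp.status_code == 200:
                    return resp.text
            except requests.RequestException:
                pass   # treat network errors like any other failure
            time.sleep(min(delay, 3600))
            delay *= 2
        return None    # give up - quite possibly blocked again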
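
Finally, the 2% to 3% coverage figure came from spot checks; in effect it's the following sampling estimate, where have_item is a hypothetical wrapper around whatever lookup the search API provides (I haven't reproduced the real HNSearch calls here):

    import random

    def estimated_coverage(have_item, max_id, samples=500):
        # Estimate what fraction of item ids 1..max_id the source knows
        # about, by checking a random sample rather than every item.
        ids = random.sample(range(1, max_id + 1), samples)
        hits = sum(1 for i in ids if have_item(i))
        return hits / float(samples)

A result around 0.02 to 0.03 with max_id set to a million or so would match the holes described above.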