If you’ve seen BrandisBot (+https://tombrandis.uk) in your logs, don’t worry - I’m not out to steal your blog or train AI. I’m trying to create a search engine.
I wrote a very flimsy scraper in Python.
My scraper crawls the web, starting from my favourite bookmarked sites, and saves how many times each word appears on a page. This is saved into two JSON files: one mapping words to short hashes of the URLs, and another mapping those hashes back to the actual URLs and page titles.
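The two-JSON layout could be sketched roughly like this - the function name, hash length and the simple regex tokeniser are my own guesses, not the actual scraper code:

```python
import hashlib
import json
import re
from collections import Counter

def index_page(url, title, text, word_index, url_lookup):
    # A short hash of the URL keeps the word index compact;
    # the full URL and title live in the second JSON file.
    url_hash = hashlib.sha1(url.encode()).hexdigest()[:8]
    url_lookup[url_hash] = {"url": url, "title": title}
    # Count how many times each word appears on the page.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    for word, count in counts.items():
        word_index.setdefault(word, {})[url_hash] = count

word_index, url_lookup = {}, {}
index_page("https://example.com", "Example", "hello hello world", word_index, url_lookup)
# Both dicts can now be dumped with json.dump() for the searcher to read.
```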
These JSON files can then be loaded by another Python script for searching. The searcher returns the pages with the most occurrences of any of the words in the search.
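The ranking described above - summing occurrence counts across all the query words - might look something like this sketch (again a guess at the shape, not the real script):

```python
def search(query, word_index, url_lookup):
    # Add up each page's occurrence counts for every word in the query,
    # then return titles and URLs sorted by total count, highest first.
    scores = {}
    for word in query.lower().split():
        for url_hash, count in word_index.get(word, {}).items():
            scores[url_hash] = scores.get(url_hash, 0) + count
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(url_lookup[h]["title"], url_lookup[h]["url"]) for h in ranked]
```

A page only needs to contain one of the search words to show up, which matches the "any of the words" behaviour.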
I chose Python because it has a lot of useful libraries for web requests, regex, JSON encoding and especially robots.txt parsing (I used the Protego library), which I would risk getting wrong if I did it myself, and I don’t want to be messing with people’s servers too much.
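To show the kind of check involved, here is the standard library's `urllib.robotparser` doing the same job Protego does in the scraper (this is the stdlib alternative, not the code I actually ran; the bot name and paths are made up):

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before every fetch: is this URL allowed for our user agent?
print(rp.can_fetch("BrandisBot", "https://example.com/blog/"))      # True
print(rp.can_fetch("BrandisBot", "https://example.com/private/x"))  # False
```

Protego handles more of the modern robots.txt spec (wildcards, crawl-delay, sitemaps) than the stdlib parser, which is part of why relying on a library beats rolling your own.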
I’m doing this for fun - I love a little programming challenge - but maybe someday it might become a more useful thing, or at least a way to discover more about this side of the web. In the future I want to improve it so that the information is saved in a database, and add a web form like a real search engine. I might also do another scrape at some point because the internet dropped out for a bit and it missed some links. If I do crawl again I’ll make sure that I save which pages I’ve already done - right now it’s a variable that is only stored in memory.
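Persisting the visited set so a crawl can resume could be as small as this (the filename and helpers are hypothetical - just one way to stop the set living only in memory):

```python
import json
import os

VISITED_FILE = "visited.json"  # hypothetical filename

def load_visited():
    # Restore the set of already-crawled URLs from a previous run, if any.
    if os.path.exists(VISITED_FILE):
        with open(VISITED_FILE) as f:
            return set(json.load(f))
    return set()

def save_visited(visited):
    # JSON has no set type, so store the URLs as a sorted list.
    with open(VISITED_FILE, "w") as f:
        json.dump(sorted(visited), f)
```

Calling `save_visited()` periodically during the crawl means a dropped connection only loses the pages since the last save.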