Introduction

I'm a freelance journalist, and this is my personal blog. It will host a variety of posts, but its main purpose is to give me a place to write at length about a topic in which I am immersed: searching the Internet, mostly by talking to Googlebot, for what you could loosely categorize as information of interest to the public that someone might not want the public to see.

I'm using the Blogger platform because it's free to use my custom Google domain on Blogger, and it also forces me to focus on writing and ignore aesthetics, as the latter is a lost cause. I hope you like the flowers.

"Google hacking" for research is not all about using "Google dorks" or finding marked, controlled documents. Sometimes it's about finding enough breadcrumbs -- you don't have to recreate the bread, just follow the remnants to the final destination of the slob who was sauntering down the path eating a crispy baguette (I figure I'll break the metaphor down before it can break down by itself).

Googlebot is Google's World Wide Web spider. Google's index of the WWW makes it the most powerful search engine in the world. Combined with the structured information cached on Google's servers, that index also makes Google's collective networked brain a likely candidate for Earth's most powerful Artificial Intelligence.

Google's filetype operator lets you tell Googlebot to bring back files of specified types, given by (usually) 3-character extension suffixes, like .pdf, .xls, .docx. Searching Google for filetype:kml tells the Bot to bring back Keyhole Markup Language (Google Earth) files. The format is widely used by US defense and intelligence agencies, and a filetype:kml search combined with the right key terms can find things like planned National Guard deployments for the upcoming 58th Presidential Inauguration.
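For anyone who would rather script these searches than type them, here is a minimal sketch in Python of how a filetype query could be put together. The helper names build_query and search_url are my own inventions for illustration, and the key terms are only examples. The sketch assumes the standard Google search URL with a q parameter; it only builds the query string and the link, it does not fetch or scrape anything.

```python
from urllib.parse import urlencode


def build_query(terms, filetype=None):
    """Compose a Google query string from key terms and an optional filetype.

    Hypothetical helper for illustration; not part of any Google API.
    """
    parts = []
    for term in terms:
        # Quote multi-word terms so they are searched as exact phrases.
        parts.append(f'"{term}"' if " " in term else term)
    if filetype:
        # Accept "kml" or ".kml"; the operator itself takes no leading dot.
        parts.append(f"filetype:{filetype.lstrip('.')}")
    return " ".join(parts)


def search_url(query):
    """Return a search URL for the query, percent-encoding it safely."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query(["National Guard", "inauguration"], filetype="kml")
print(query)
print(search_url(query))
```

Pasting the printed link into a browser is the same as typing the query into the search box by hand; the only thing the script buys you is consistency when you are working through many combinations of terms and file types.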

As importantly, Googlebot is prodigious -- chances are fair that if information is connected to the World Wide Web at a bare minimum of online visibility, Googlebot will have seen it. Even if it doesn't know what you seek by name, it is only a matter of finding the right way to describe it according to a limited set of strict rules, while at the same time applying the right amount of "fuzz," because Google doesn't make flawless HTML copies. Especially when crawling other formats (not just webpages with metadata more friendly to crawling), Googlebot is likely to transpose an O to a 0, skip spaces in the middle of words, etc.
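One way to apply that "fuzz" systematically is to generate the likely garbled spellings of a term and search for all of them at once. The sketch below is only that, a sketch: fuzz_variants and or_query are hypothetical names, the swap table covers just the O/0 confusion (in both directions, which is my assumption), and "skipped spaces" is read here as the spaces in the search term being dropped so that words run together. It joins the variants with Google's OR operator.

```python
from itertools import product

# The O/0 confusion described above. Treating it as two-way is an assumption;
# add other pairs only if you have actually seen them in cached copies.
SWAPS = {"O": "0", "0": "O"}


def fuzz_variants(term, max_variants=50):
    """Return the term followed by variants with O/0 swaps and dropped spaces."""
    # For each character, list the forms it might take in a garbled copy.
    options = [(ch, SWAPS[ch]) if ch in SWAPS else (ch,) for ch in term]
    variants = []
    for combo in product(*options):
        candidate = "".join(combo)
        # Also try the candidate with its spaces removed.
        for form in (candidate, candidate.replace(" ", "")):
            if form not in variants:
                variants.append(form)
            # Cap the list so a term with many O's cannot explode the query.
            if len(variants) >= max_variants:
                return variants
    return variants


def or_query(term):
    """Join all variants of a term as quoted alternatives."""
    return " OR ".join(f'"{v}"' for v in fuzz_variants(term))


print(or_query("AFRICOM"))
```

The cap on the number of variants is a design choice, not a rule: every O or 0 in a term doubles the number of spellings, and a query with hundreds of alternatives is unlikely to be useful.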

This post is just an introduction -- next post we'll go right to looking at a topic by a keyword: AFRICOM. I'll show some custom search approaches and some potential product, also using a concentrated sample of the corpus I've collected and indexed over the past several years (honestly, I'm not sure exactly what size it is; double-digit GB). I've found a lot of things I think people deserved to know about, things the public should have been told about. Below are some links to published work from some of my more striking discoveries for you to peruse while you wait for post 2, because I am not only very good at talking to Googlebot, I am nice.

-- Kenneth

networked inference
