
Byte Searching August 30, 2006

Posted by admin in: Web Tech

I occasionally have good ideas, and the better the idea, the more likely someone else already thunk it.

Please tell me if this is a good idea, and whether you know it to have been thunk.

I was trying to find an image on the web. I saw the image, but I wanted to find out whether that same image was already out there, or what its origin might be. I know I can search Google Images by keyword, which I guess looks at the file name, or perhaps image tag/link data or other contextual data to support a keyword search. But I had no keywords to go by, so I thought it would make sense to be able to submit the actual image to a search form and find every web page that hosts that exact same jpg. So it would essentially be searching by the serial data that makes up the file rather than by human-supplied tag data. I know this could be problematic, as the same image could exist in many different formats, but it still seems like it would be useful to find exact matches by bytes. This would make as much sense on a local machine as on the internet.

?

I guess the obvious flaw here is that for such a search engine to work, it would have to index all the bytes for all the hosts it intends to provide results for, rather than the comparatively cheap text-only indexing that search engines do now.
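Here is a rough sketch of how exact matching by bytes might look, assuming the index stores a fixed-size SHA-256 digest of each file instead of the bytes themselves. The crawler would still have to read every file once, but what it keeps per file is small. The function names and the example.com URLs are made up for illustration; nothing here describes how any real search engine works.

```python
import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    """Return a fixed-size hex digest standing in for the full file contents."""
    return hashlib.sha256(data).hexdigest()


def build_index(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each digest to every location (hypothetical URL or path) holding those bytes."""
    index: dict[str, list[str]] = defaultdict(list)
    for location, data in files.items():
        index[digest(data)].append(location)
    return dict(index)


def find_exact(index: dict[str, list[str]], query: bytes) -> list[str]:
    """Return every indexed location whose bytes are identical to the query file."""
    return index.get(digest(query), [])


# Hypothetical crawl results: two hosts serve the same bytes, one serves different bytes.
crawled = {
    "http://example.com/a/cat.jpg": b"\xff\xd8 same image bytes \xff\xd9",
    "http://example.org/copy.jpg": b"\xff\xd8 same image bytes \xff\xd9",
    "http://example.net/other.jpg": b"\xff\xd8 another image \xff\xd9",
}
index = build_index(crawled)
matches = find_exact(index, b"\xff\xd8 same image bytes \xff\xd9")
```

The same lookup would work on a local machine by feeding it file paths instead of URLs. It only finds byte-for-byte copies, so a resized or re-saved version of the image would not match.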

Comments»

1. Drew Bixcube - September 3, 2006

I wonder if the crawler in question could just sample the data in each file at a given site, and learn just enough data from each file that it must belong to a unique object (or at least narrow the search to a small number of objects). Maybe sample the file in three random locations within the file, so that another file with the same strings on the same lines has a high probability of being the object you’re searching for.
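A minimal sketch of the sampling idea, with one assumption added: the three "random" locations are derived from the file size, so that the crawler and the person searching end up sampling the same spots. The names, the sample width of 16 bytes, and the use of SHA-256 to pick offsets are all illustrative choices, not anything proposed in the thread.

```python
import hashlib


def sample_offsets(size: int, count: int = 3, width: int = 16) -> list[int]:
    """Pick `count` pseudo-random offsets that depend only on the file size."""
    if size <= width:
        return [0]  # file is too small to sample; read it from the start
    offsets = []
    for i in range(count):
        seed = hashlib.sha256(f"{size}:{i}".encode()).digest()
        offsets.append(int.from_bytes(seed[:8], "big") % (size - width))
    return sorted(offsets)


def sampled_key(data: bytes, count: int = 3, width: int = 16) -> tuple:
    """Build a lookup key from the file size plus a few short samples of its bytes."""
    size = len(data)
    chunks = tuple(data[o:o + width] for o in sample_offsets(size, count, width))
    return (size, chunks)


# Two copies of the same bytes produce the same key; a file of another size does not.
original = bytes(range(256)) * 40
copy = bytes(original)
key_original = sampled_key(original)
key_copy = sampled_key(copy)
key_longer = sampled_key(original + b"x")
```

A matching key only narrows the search: two different files of the same size can agree at every sampled spot, so a candidate would still need a full comparison to confirm it.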

Of course, I don’t know nothin’.

2. ectostan - September 4, 2006

I’ve many a time wanted to search for an image in a way that wasn’t limited to what the image’s name was.
I imagine sampling would work, but I also imagine that could be time consuming. Maybe jpgs, for instance, could incorporate a line of sampled information that conforms to some set standard. A ‘fingerprint’ code. That would be a background process inherent to the rendering of the image, so it would be painless. You’d just call up your jpg’s fingerprint to search for similar or identical fingerprints on the web. I can’t really think of a browser-side strategy that could do the job without being a bandwidth hog.