Whilst much has been written about index construction algorithms for textual information, we found that comparatively little addresses indexing large binary documents for queries over short binary sequences. We adopt an n-gram technique for indexing binary files, similar to the method used to index Asian languages (where there are no word spaces to use for tokenisation).
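To illustrate the core idea, here is a minimal sketch (not the authors' implementation) of extracting overlapping byte n-grams from binary data, directly analogous to character n-grams over unsegmented text: with no delimiters, every byte offset starts a potential token.

```python
def byte_ngrams(data: bytes, n: int = 4):
    """Yield every overlapping n-byte sequence in the input.

    With no token boundaries in binary data, every offset
    begins a candidate n-gram, just as with unsegmented text.
    """
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


sample = bytes.fromhex("deadbeefca")
grams = [g.hex() for g in byte_ngrams(sample, 4)]
print(grams)  # two overlapping 4-grams: ['deadbeef', 'adbeefca']
```

The n-gram width is a tuning choice: smaller n keeps the vocabulary bounded (at most 256^n distinct grams) while larger n makes posting lists more selective.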
Our technique goes to some trouble to minimise the space and time requirements of index construction and to optimise search speed. The n-gram technique can be applied on top of any content index, and our presentation contrasts the standard approach with the performance we achieved with our own code. We index executable files as well as memory dumps sourced from existing automation infrastructure. This approach permits useful indexing of raw data when automation fails, and of unpacked and decrypted data when it succeeds, thus minimising index pollution. Our presentation trawls through the algorithms used, the lessons learned and the scars earned while constructing insanely fast, fully queryable indexes for terabyte-scale binary data.
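How such an index answers queries can be sketched as follows. This is a simplified, in-memory illustration under assumed names (`build_index`, `search`), not the optimised terabyte-scale implementation the talk describes: posting lists map each n-gram to the documents containing it, a query's n-gram lists are intersected to find candidates, and candidates are then verified with an exact match to rule out false positives.

```python
from collections import defaultdict


def build_index(docs: dict, n: int = 4):
    """Posting lists: n-gram -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, data in docs.items():
        for i in range(len(data) - n + 1):
            index[data[i:i + n]].add(doc_id)
    return index


def search(index, docs, query: bytes, n: int = 4):
    """Intersect posting lists for every n-gram of the query,
    then verify candidates with an exact substring check
    (intersection alone can admit false positives)."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {d for d in candidates if query in docs[d]}


docs = {
    "a.bin": b"\x90\x90\xcc\xcc\xde\xad\xbe\xef",
    "b.bin": b"\xde\xad\xbe\xef\x00\x00",
}
idx = build_index(docs)
print(search(idx, docs, b"\xde\xad\xbe\xef"))  # finds both documents
```

A production index would of course keep compressed posting lists on disk rather than Python sets in memory, but the intersect-then-verify query shape is the same.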