In a recent blog post, I began this theme with a brief discussion of how text mining can help find white spaces for discovery and business intelligence. This post is about how, with the help of text mining techniques, you can transform masses of “unstructured” data (texts, documents, etc.) into a highly “structured” set of information elements relevant to your business needs. For instance, you would be able to answer questions such as:
This level of detail really is possible, and getting real results takes neither a long time nor a search specialist.
Through www.freepatentonline.com we have access to all relevant patents globally via our text mining tool. As you know, patents contain not only bibliographic data such as company names, inventor names, and publication dates, but also a lot of full text. It is possible to dissect the texts and extract very detailed structured information from them.
First, we define an area of interest or domain, say 40,000 patents. It is not difficult to identify the most relevant players in a given domain over, for instance, the last ten years. You can do this separately for universities, small companies, and large companies to make the exercise easier and more targeted. Then we combine a large subset of the documents into one big representative text, which we treat as the “language corpus” of the chosen domain. Next comes a feat of linguistic acrobatics: we automatically extract each distinctive term from this large text, which gives us a lexicon of terms at our disposal. The extraction itself takes a couple of seconds. All in all, getting to this point can take less than an hour.
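To make the lexicon step concrete, here is a minimal sketch in Python. It is not how our tool works internally. It assumes that a "distinctive term" is simply a word or two-word phrase that is not a stop word and appears in at least a minimum number of documents. The names `STOP_WORDS`, `candidate_terms`, `extract_lexicon` and `min_docs` are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "to", "with", "is", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(text):
    """Return the set of single words and adjacent word pairs without stop words."""
    tokens = tokenize(text)
    terms = {t for t in tokens if t not in STOP_WORDS}
    for first, second in zip(tokens, tokens[1:]):
        if first not in STOP_WORDS and second not in STOP_WORDS:
            terms.add(f"{first} {second}")
    return terms


def extract_lexicon(documents, min_docs=2):
    """Return, sorted, every candidate term found in at least min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        # Each document contributes a term at most once (document frequency).
        doc_freq.update(candidate_terms(doc))
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


# Made-up sample texts, not real patent abstracts.
docs = [
    "A lithium battery with a polymer electrolyte",
    "Polymer electrolyte for a lithium battery cell",
    "Method of coating a steel sheet",
]
print(extract_lexicon(docs))
# ['battery', 'electrolyte', 'lithium', 'lithium battery', 'polymer', 'polymer electrolyte']
```

Counting documents rather than raw occurrences keeps one very long patent from dominating the lexicon. A production system would likely add proper linguistic processing, which this sketch leaves out.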
Now we're getting serious. We take the resulting list of terms, organize it into subsections of professional terms that you consider relevant to your business, and then run them as search terms against the patent database, for instance focusing on the patents of our competitors. We submit whole lists of terms (not term by term) to the database, and the system processes tens of thousands of queries in one go, at the touch of a button. The system then returns the matches and misses by competitor, all quantified, as a matrix for further analysis. You can also have the system distribute the results over a timeline, so that you can see the usage trends per term and per player in your domain. You can have this kind of result in less than half a day of work.
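The matrix and timeline described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the actual system: patents are held in memory as hypothetical `(assignee, year, text)` tuples, and a "match" is a naive lowercase substring test. The names `PATENTS`, `TERMS`, `term_matrix` and `term_timeline` are invented for this example.

```python
from collections import defaultdict

# Hypothetical records: (assignee, publication year, full text). Made-up data.
PATENTS = [
    ("Acme", 2019, "A lithium battery with a polymer electrolyte"),
    ("Acme", 2021, "Polymer electrolyte membrane for a fuel cell"),
    ("Globex", 2021, "Lithium battery pack with a cooling plate"),
]

# Search terms must be lowercase, because the text is lowercased before matching.
TERMS = ["lithium battery", "polymer electrolyte", "cooling plate"]


def term_matrix(patents, terms):
    """Count, per assignee and per term, the patents whose text contains the term."""
    matrix = defaultdict(lambda: dict.fromkeys(terms, 0))
    for assignee, _year, text in patents:
        lowered = text.lower()
        row = matrix[assignee]  # creates an all-zero row, so misses show up as 0
        for term in terms:
            if term in lowered:  # naive substring match, an assumption of this sketch
                row[term] += 1
    return dict(matrix)


def term_timeline(patents, term):
    """Count matching patents per (assignee, year) for a single term."""
    timeline = defaultdict(int)
    for assignee, year, text in patents:
        if term in text.lower():
            timeline[(assignee, year)] += 1
    return dict(timeline)


matrix = term_matrix(PATENTS, TERMS)
print(matrix["Acme"])
# {'lithium battery': 1, 'polymer electrolyte': 2, 'cooling plate': 0}
print(term_timeline(PATENTS, "polymer electrolyte"))
# {('Acme', 2019): 1, ('Acme', 2021): 1}
```

The zeros in each row are the "misses": terms from your list that a competitor's patents never use. Those are often as informative as the matches.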
Amazing, isn't it?