In a recent blog post, I began this theme with a brief discussion of how text mining can help find white spaces for discovery and business intelligence. This post is about how text mining techniques can help you transform masses of “unstructured” data (texts, documents, etc.) into a highly “structured” set of information elements relevant to your business needs. For instance, you would be able to answer questions such as:
- Which research topics appear the most strategic, trend-wise, for my most important competitors? Can I see the topics in great detail?
- In which technologies will my competitors invest in the near future? Can I see that level of detail down to individual raw materials, chemical substances, organic materials, and application areas?
- Where are my company's strengths and weaknesses compared to my strategic research arenas? How strong are my strengths relative to the market? And weaknesses? Can I see that per single research topic? Trend-wise over an extended time period?
- Who are the small niche players that could be relevant to me because they are among the very few that deal with a particular topic?
This level of detail really is possible, and getting real results requires neither a long time nor a search specialist.
Through www.freepatentonline.com we have access to all relevant patents globally via our text mining tool. As you know, patents contain not only bibliographic data such as company names, inventor names, and publication dates, but also a lot of full text. It is possible to dissect these texts and extract very detailed structured information from them.
First, we define an area of interest or domain, say 40,000 patents. It is not difficult to identify the most relevant players in a domain, for instance over the last ten years. You can do this separately for universities, small companies, and large companies to make the exercise easier and more targeted. Then we create a large representative text out of a big subset of the documents, so that we can be reasonably confident we have captured the “language corpus” of the chosen domain. Next comes a feat of linguistic acrobatics: we automatically extract each distinctive term from this large text, which gives us a lexicon of terms at our disposal. This takes a couple of seconds. All in all, getting to this point can take less than an hour.
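To make the term-extraction step more concrete, here is a minimal Python sketch. It is not how our text mining tool works internally. It is a simple frequency-based stand-in that assumes a "distinctive term" is a one- or two-word phrase appearing in at least two documents. The sample texts, the tiny stopword list, and the names `tokenize` and `extract_terms` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the patent full texts of a domain.
documents = [
    "A polymer coating comprising titanium dioxide particles.",
    "The polymer coating is applied to a steel substrate.",
    "Titanium dioxide particles improve the coating durability.",
]

# Tiny illustrative stopword list; a real lexicon would need a much larger one.
STOPWORDS = {"a", "the", "is", "to", "of", "and", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(docs, min_doc_freq=2, max_n=2):
    """Return word n-grams (up to max_n words) found in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        seen = set()  # count each term at most once per document
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip phrases that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                seen.add(" ".join(gram))
        doc_freq.update(seen)
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


lexicon = extract_terms(documents)
print(lexicon)
```

On these three toy texts, the lexicon keeps phrases such as "titanium dioxide" and "polymer coating", and drops words that occur in only one document, such as "steel". A real corpus would call for better linguistic filtering, but the principle is the same.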
Now we're getting serious. We take the resulting list of terms and structure it into subsections of professional terms that you consider relevant to your business. We then run them as search terms against the patent database, for instance focusing on the patents of our competitors. We submit whole lists of terms (not term by term) to the database, and the system processes tens of thousands of queries in one go at the touch of a button. The system then returns the matches and misses by competitor, all quantified, as a matrix for further analysis. You can also have the system distribute the results over a timeline, so that you can see the usage trends per single term and per player in your domain. You can have this kind of result in less than half a day of work.
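The matrix and timeline described above can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions: the patent records, company names, and term lists are invented, and the matching is a naive substring test rather than the querying our tool actually performs. The helper names `count_matches`, `player_matrix`, and `trend` are hypothetical.

```python
from collections import Counter

# Hypothetical patent records: (assignee, publication year, full text).
patents = [
    ("Acme Corp", 2019, "A polymer coating with titanium dioxide particles."),
    ("Acme Corp", 2021, "Titanium dioxide pigment for a polymer coating."),
    ("Beta GmbH", 2021, "A steel substrate with a ceramic coating."),
]

# Term lists grouped by topic and submitted as a whole, not term by term.
term_lists = {
    "materials": ["titanium dioxide", "steel"],
    "applications": ["polymer coating", "ceramic coating"],
}
all_terms = [term for terms in term_lists.values() for term in terms]
players = ["Acme Corp", "Beta GmbH"]


def count_matches(records, terms):
    """Count patents containing each term, keyed by (term, assignee, year)."""
    counts = Counter()
    for assignee, year, text in records:
        lowered = text.lower()
        for term in terms:
            if term in lowered:  # naive substring match (an assumption)
                counts[(term, assignee, year)] += 1
    return counts


def player_matrix(counts, terms, all_players):
    """Build a term-by-player matrix of match counts; a zero marks a miss."""
    return {
        term: {
            player: sum(n for (t, a, _), n in counts.items() if t == term and a == player)
            for player in all_players
        }
        for term in terms
    }


def trend(counts, term, player, years):
    """Return the yearly match counts for one term and one player."""
    return [counts[(term, player, year)] for year in years]


counts = count_matches(patents, all_terms)
matrix = player_matrix(counts, all_terms, players)
for term in all_terms:
    print(term, matrix[term])
print(trend(counts, "titanium dioxide", "Acme Corp", [2019, 2020, 2021]))
```

Each row of the matrix shows at a glance which player uses a term and which does not, and `trend` spreads one cell of the matrix over the years.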
Amazing, isn't it?