Thursday, March 10, 2016
CHENG Juan InfoTrie, Singapore
Every day, millions of news items are published, all of them written by humans in natural language. These items contain information that can support decision making, risk management and algorithmic trading. A single piece of news can trigger a surge in a company's stock price, and a hidden trend may lie buried in a huge pile of news data. For now, this work is usually done by professional analysts, but the task grows harder and harder as the flow of news soars. That is why we need automated, quantitative news analysis.
Let's first put aside how to automate the analysis of news. Instead, think about what a reader needs to learn from a piece of news. Normally, as a starting point, he or she tries to answer these two questions:
- What is the subject of the news? For example, Apple or Facebook?
- In general, is the news good or bad?
Named entity recognition (NER) is the technology that answers the first question. More specifically, it quickly determines which items in the text map to proper names, such as people or places. We need to go a step further and determine which company the news involves. We split the task into two parts:
- Use popular community packages such as nltk and Stanford NER to narrow down the search.
- Search for the company name in our own company synonym database.
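The two-step lookup can be sketched roughly as follows. This is only an illustration under stated assumptions, not our production pipeline: the capitalized-word regex is a crude stand-in for a real NER tagger such as nltk or Stanford NER, and the `SYNONYMS` table, its entries, and the function names are hypothetical.

```python
import re

# Hypothetical synonym table: lowercase surface form -> canonical company name.
SYNONYMS = {
    "apple": "Apple Inc.",
    "apple inc": "Apple Inc.",
    "facebook": "Facebook Inc.",
    "fb": "Facebook Inc.",
}

# Stand-in for step 1: runs of capitalized words. A real system would take
# its candidate spans from an NER tagger instead of this regex.
CANDIDATE = re.compile(r"\b[A-Z][\w&]*(?:\s+[A-Z][\w&]*)*")


def candidate_names(text):
    """Return capitalized spans that might be proper names."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]


def find_companies(text):
    """Step 2: map candidate spans to companies via the synonym table."""
    found = []
    for span in candidate_names(text):
        lowered = span.lower()
        # Try the whole span first, then each word on its own, so that
        # "Today Apple" still resolves through the key "apple".
        for key in [lowered] + lowered.split():
            company = SYNONYMS.get(key)
            if company and company not in found:
                found.append(company)
    return found
```

Calling `find_companies("Apple and Facebook report earnings on Tuesday.")` returns the two canonical names in order of appearance. Narrowing to candidate spans first keeps the synonym lookup from being run against every word in the text.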
After the NER process, the news item is filed under the identified company name for delivery or further analysis. Sometimes a single news item mentions several companies. In that case, a relevance measure is computed. The relevance measure considers where a term appears in the text. For example, intuitively, a news item is likely to be more relevant to a company when the company's name occurs in the title.
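A position-based relevance measure of this kind might look like the sketch below. The weighting scheme is an assumption made for illustration: the `TITLE_WEIGHT` value, the linear decay over the body, and the function name `relevance` are all hypothetical.

```python
# Assumed weight: one title mention counts as much as three body mentions.
TITLE_WEIGHT = 3.0


def relevance(title, body, names):
    """Score how relevant a news item is to a company.

    `names` lists the surface forms of one company. Matching is a plain
    case-insensitive substring search, which is enough for a sketch but
    would also match "apple" inside "pineapple".
    """
    title_l, body_l = title.lower(), body.lower()
    score = 0.0
    for name in names:
        n = name.lower()
        score += TITLE_WEIGHT * title_l.count(n)
        # Body mentions count less the later they appear in the text.
        start = body_l.find(n)
        while start != -1:
            score += 1.0 - start / len(body_l)
            start = body_l.find(n, start + len(n))
    return score
```

With a title that names Apple and a body that mentions Facebook only in passing, `relevance` gives Apple the higher score, so the item would be filed primarily under Apple.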
To know whether a news item is good or bad for a company, a common approach is to search for words that express emotional states, such as “angry,” “sad,” and “happy,” and count their occurrences. In our case, we first collect a library of such words specialized for the financial community. Next, we count all the words that appear both in the library and in the text. Then we normalize the counts of positive and negative words to a score in [0, 10], where 0 means that all matched words are negative and 10 means that all are positive. These scores can be treated as a quantitative measure of sentiment that can be compared across companies and over time.
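The counting and normalization steps can be illustrated with a short sketch. The word lists below are tiny hypothetical stand-ins for a real financial sentiment library, and the formula (the share of positive words among all matched words, scaled to 10) is one assumed mapping that satisfies the endpoints described above. Returning `None` when no library word is found is also an assumption.

```python
import re

# Hypothetical miniature library of financial sentiment words.
POSITIVE = {"gain", "profit", "growth", "beat", "upgrade"}
NEGATIVE = {"loss", "lawsuit", "decline", "miss", "downgrade"}


def sentiment_score(text):
    """Return a score in [0, 10], or None if no library word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return None  # assumed convention: no evidence, no score
    # 0 when every matched word is negative, 10 when every one is positive.
    return 10.0 * pos / (pos + neg)
```

A text with one positive and one negative match scores 5, the midpoint of the scale.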
Finally, both NER and sentiment scoring run on distributed computing clusters so that the analysis results can be delivered and documented in real time.