Monday 5 Jun, 2017 Juan Cheng
News analytics in finance
1. Access to News / News management
– Visualization tools
– Filtering tools
– On demand view
Feed from multiple sources:
– Social Media
– Web based content
– Private sources
– Internal data
– News Tag Cloud
– Filtering newsfeed with Social media blotter, news blotter
– Search Engine on demand
2. News Content Alerts based on sentiment indicator
Provide accurate information from the Big Data environment and push it in front of users in real time for risk management.
– Topics detection
– Rumours alerts
– News qualification per importance
– Consolidated Dashboard
– Portfolio Alerts
– Relevant information from single screen
– Automatic Alert
– Integrated to OMS
Users receive news signals for trading, hedging, and risk management based on the sentiment indicator.
Provide relevant news analytics indicators for hedging or trade idea generation.
5. Algo Trading / Robo Trading
Real-time algorithmic trading driven by the sentiment indicator and news analytics.
News analytics signals fully integrated into algo trading strategies.
A news analytics case
Information extraction of text
Let’s look at a news article. What information would you highlight in it?
This is what I see in the article:
The article was published by Reuters on Oct 21. It reports an acquisition. The two entities mentioned are AT&T Inc and Time Warner Inc. The reporter’s tone toward the event is captured by the phrase “boldest move”. You may well see more in it.
This is the term-based way to view a news article, as I mentioned. It is extremely useful when one tries to monitor a set of events, companies, and so on.
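To make the term-based view concrete, here is a minimal sketch that scans a text for watched entities, event words, and tone phrases. The watchlists, the function name `extract_terms`, and the sample sentence are all assumptions made up for illustration; the sample is not the actual Reuters text, and a real system would more likely use large dictionaries or a trained entity recognizer.

```python
import re

# Hypothetical watchlists, hand-written for this illustration only.
ENTITIES = ["AT&T Inc", "Time Warner Inc"]
EVENT_TERMS = {"acquire": "acquisition", "acquisition": "acquisition", "merger": "merger"}
TONE_PHRASES = ["boldest move"]


def extract_terms(text):
    """Return the watched entities, event types, and tone phrases found in text."""
    lowered = text.lower()
    entities = [e for e in ENTITIES if e.lower() in lowered]
    words = set(re.findall(r"[a-z]+", lowered))
    events = sorted({label for term, label in EVENT_TERMS.items() if term in words})
    tones = [p for p in TONE_PHRASES if p in lowered]
    return {"entities": entities, "events": events, "tone": tones}


# Made-up sentence for illustration (not the original article).
sample = "AT&T Inc agreed to acquire Time Warner Inc in its boldest move yet."
highlights = extract_terms(sample)
print(highlights)
```

Monitoring a different set of companies or events then only means changing the watchlists, which is what makes this view convenient for alerts.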
Text Feature Extraction for Machine Learning Classification
Another way is to look at the news as a whole.
For a computer to understand a news article, we have to represent it with numbers.
The vector space model (VSM) is a popular approach, which uses a vector to represent a document. The basic idea is to use a vector of term frequencies to represent a document.
This is a minimal example:
We start with a vocabulary that contains the universe of words. We then simply count the frequency of each term in every document of the set and put the counts in a vector. So, for the machine, one document is a term-frequency vector.
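The counting step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and two made-up headlines; the helper names `build_vocabulary` and `tf_vector` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines do more.
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of every distinct term in the document set."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def tf_vector(document, vocabulary):
    """Term-frequency vector of one document, aligned with the vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in vocabulary]


# Made-up documents for illustration.
docs = ["shares rise on merger news", "shares fall on weak earnings news"]
vocab = build_vocabulary(docs)
vectors = [tf_vector(d, vocab) for d in docs]

print(vocab)
for v in vectors:
    print(v)
```

Every document ends up as a vector of the same length, one position per vocabulary term, which is what lets us compare documents numerically.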
Plain TF puts too much emphasis on terms that appear in almost every document of the corpus. To fix this, each term frequency is weighted by the inverse document frequency (IDF).
One document can then be represented by a TF-IDF vector.
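As a sketch of the weighting, the code below uses one common variant, weight = tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formula used in our engine is not stated here, so treat this variant, the function name `tfidf_vectors`, and the toy documents as assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Made-up documents for illustration.
docs = [
    "shares rise on merger news",
    "shares fall on weak earnings news",
    "merger talks lift shares",
]
vocab, vectors = tfidf_vectors(docs)
shares_idx = vocab.index("shares")
rise_idx = vocab.index("rise")
print([round(v[shares_idx], 3) for v in vectors])
print(round(vectors[0][rise_idx], 3))
```

With this variant, a term such as "shares" that occurs in every document gets weight zero, while a term that occurs in a single document keeps the largest weight.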
Such vectors are then ready for machine learning applications:
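As one possible illustration of what a learner can do with such vectors, here is a nearest-centroid classifier using cosine similarity. This is a stand-in chosen for brevity, not necessarily the model we run in production; the four-term vocabulary, the training vectors, and the topic labels are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train_centroids(vectors, labels):
    """Average the training vectors of each label into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy term-frequency vectors over a hypothetical vocabulary:
# ["earnings", "merger", "profit", "takeover"]
train_vectors = [[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1], [0, 1, 0, 2]]
train_labels = ["earnings", "earnings", "m&a", "m&a"]
centroids = train_centroids(train_vectors, train_labels)

prediction = classify([0, 1, 0, 1], centroids)
print(prediction)
```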
Let me summarize the process.
The news comes from all kinds of sources, such as Bloomberg, blogs, and social media. We apply feature extraction to each article in real time, then apply machine learning models trained on 15 years of historical articles, and end up with topic labels for the article.
Another thing we can do is calculate the sentiment of each article in real time. The sentiment scores are then either aggregated over time or indexed by instrument, sector, or emotion. We can also scan each article for the information it mentions and highlight it by company, people, events, and regions.
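The aggregation step can be sketched as follows. The scoring function here is a tiny word-list counter used purely as a placeholder, since the actual sentiment model is not described; the word lists, tickers, timestamps, and article texts are all invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical word lists; a real sentiment model would be far richer.
POSITIVE = {"gain", "beat", "strong"}
NEGATIVE = {"loss", "miss", "weak"}


def sentiment_score(text):
    """(positive - negative) / matched words, in [-1, 1]; 0.0 if nothing matches."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def aggregate(articles, key):
    """Average article sentiment per bucket, where key(article) names the bucket."""
    buckets = defaultdict(list)
    for article in articles:
        buckets[key(article)].append(sentiment_score(article["text"]))
    return {k: sum(scores) / len(scores) for k, scores in buckets.items()}


# Made-up articles for illustration.
articles = [
    {"time": datetime(2017, 6, 5, 9, 15), "instrument": "AAA",
     "text": "strong quarter earnings beat forecasts"},
    {"time": datetime(2017, 6, 5, 9, 40), "instrument": "AAA",
     "text": "weak guidance"},
    {"time": datetime(2017, 6, 5, 10, 5), "instrument": "BBB",
     "text": "surprise loss and weak sales"},
]

by_instrument = aggregate(articles, key=lambda a: a["instrument"])
by_hour = aggregate(
    articles, key=lambda a: a["time"].replace(minute=0, second=0, microsecond=0)
)
print(by_instrument)
print(by_hour)
```

The same `aggregate` helper serves both views: only the key function changes between the time-based index and the instrument-based index.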
After these calculations, we need to render the results or even send alerts to users in real time.
Architecture Requirements and Big Data Tools Applied
Here are 5 things we’d like to consider when implementing our applications:
- Guaranteed data processing: we cannot afford to lose any information.
- Scalability: we want our analytics engine to grow with little effort as the data grows bigger and bigger.
- The servers have to be fault-tolerant.
- Higher-level abstractions are preferred, so that the programming workload is minimal.
- The model received from batch training can be loaded into the real-time layer in order to achieve real-time classification and predictive analytics.
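To illustrate what the first requirement means in practice, here is a toy sketch of the acknowledge-and-replay idea: a message only leaves the pending queue once its handler succeeds, and a failed message is queued again instead of being dropped. This is plain Python written for illustration, not the API of any streaming framework; `process_with_replay`, `flaky_upper`, and the retry limit are hypothetical.

```python
from collections import deque


def process_with_replay(messages, handler, max_attempts=3):
    """Process every message at least once, re-queueing the ones that fail."""
    pending = deque((msg, 1) for msg in messages)
    done, dead = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            done.append(handler(msg))               # success acts as the "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # replay it later
            else:
                dead.append(msg)                    # set aside, never silently lost
    return done, dead


# Hypothetical handler that fails the first time it sees "b".
_seen = set()


def flaky_upper(msg):
    if msg == "b" and msg not in _seen:
        _seen.add(msg)
        raise RuntimeError("transient failure")
    return msg.upper()


done, dead = process_with_replay(["a", "b", "c"], flaky_upper)
print(done, dead)
```

Note that replay changes the processing order, so downstream steps should not assume messages arrive in their original sequence.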
This is the final solution we have:
We combine Apache Spark and Apache Storm to form our news analytics engine.
Spark’s role is to produce the models from massive historical text data. These models are then loaded into Storm to produce the real-time analytics.
First, when should you use distributed data processing tools? I would say when your data and computation cannot fit on a single machine.
In our case, yes: we have too much data, and the computation we are trying to do is beyond the capacity of one machine.
Hadoop is the previous generation of tools for solving this problem.
Since our purpose is to build a real-time news analytics system, we definitely need a real-time computation system. In the end, we chose Storm.
One reason is that it’s really fast. Its official website claims that it processes one million 100-byte messages per second per node on fairly common hardware. That may not be comparable to hedge fund speeds, but it’s good enough for our news analytics application.
The second reason we chose it is that it’s a really robust system. Remember our 5 requirements: it satisfies them.
Architecture that combines all
Let’s have a look at the architecture of the whole application:
The producers publish news to the Kafka message queue, and these messages are passed into the Storm cluster, where many analytics run in real time. For topic classification, for example, a machine learning model is loaded from HDFS into Storm, and the results come out immediately whenever a news item arrives. The results are then published to the web app or written to the database and search engine. That’s the real-time layer.
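The data flow of the real-time layer can be mimicked on one machine with an in-memory queue and a worker thread. This is only a sketch of the shape of the pipeline: `queue.Queue` stands in for the Kafka topic, the worker for a Storm processing step, and the `results` list for the web app or database. None of it uses the Kafka or Storm APIs, and `keyword_model` is a hypothetical placeholder for a model loaded from HDFS.

```python
import queue
import threading


def run_pipeline(articles, classify):
    """Push articles through an in-memory queue to a worker that labels them."""
    inbox = queue.Queue()   # stands in for the message queue
    results = []            # stands in for the web app / database sink
    stop = object()         # sentinel telling the worker to finish

    def worker():           # stands in for one real-time processing step
        while True:
            article = inbox.get()
            if article is stop:
                break
            results.append((article, classify(article)))

    thread = threading.Thread(target=worker)
    thread.start()
    for article in articles:    # the producer side
        inbox.put(article)
    inbox.put(stop)
    thread.join()
    return results


def keyword_model(text):
    # Hypothetical placeholder for a trained topic model.
    return "m&a" if "merger" in text.lower() else "other"


labelled = run_pipeline(["Merger talks confirmed", "Rates unchanged"], keyword_model)
print(labelled)
```

The producer never calls the classifier directly; it only writes to the queue, which is what lets producers and analytics scale independently in the real architecture.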
In the batch layer, we train and test the models with Apache Spark on our massive historical data, which lives either in a database or in the distributed file system. The resulting models are then dumped to the distributed file system (DFS).
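The hand-off between the two layers, train in batch, dump the model, load it for real-time scoring, can be sketched on a single machine as follows. Everything here is a simplified stand-in: a per-topic term-count "model" instead of a Spark-trained one, a JSON file in a temporary directory instead of the DFS, and made-up training headlines.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict


def train(labelled_articles):
    """Batch step: count how often each term appears under each topic."""
    model = defaultdict(Counter)
    for text, topic in labelled_articles:
        model[topic].update(text.lower().split())
    return {topic: dict(counts) for topic, counts in model.items()}


def dump_model(model, path):
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    with open(path) as f:
        return json.load(f)


def predict(model, text):
    """Real-time step: pick the topic whose training terms overlap most with text."""
    words = text.lower().split()
    return max(model, key=lambda topic: sum(model[topic].get(w, 0) for w in words))


# Made-up history for illustration.
history = [
    ("merger deal announced", "m&a"),
    ("takeover bid for rival", "m&a"),
    ("quarterly earnings beat", "earnings"),
    ("earnings miss estimates", "earnings"),
]

# Batch layer: train and dump (a temp file stands in for the DFS).
path = os.path.join(tempfile.mkdtemp(), "topic_model.json")
dump_model(train(history), path)

# Real-time layer: load once, then score each incoming article.
model = load_model(path)
label = predict(model, "rival makes takeover bid")
print(label)
```

Because the model travels as a file, the batch layer can retrain on its own schedule and the real-time layer only needs to reload the latest dump.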
This architecture is not only suited to our news analytics application. It also fits others: for example, scalable analysis pipelines, live stats, recommendations, predictions, real-time analytics, and online machine learning. It all depends on the messages fed to the producers and the algorithms running in Storm.
Our data is available through 3rd-party data vendors. If you are interested in our services, please contact firstname.lastname@example.org