Real Time Machine Learning Architecture & Sentiment Analysis

Monday 5 Jun, 2017 Juan Cheng

News analytics in finance

1. Access to News / News Management

– Visualization tools

– Filtering tools

– On-demand view

Feed from multiple sources:

– Social Media

– Web-based content

– Private sources

– Internal data

– News Tag Cloud

– Filtering the news feed with a social media blotter and a news blotter

– Search Engine on demand

2. News Content Alerts based on a sentiment indicator

Provide accurate information from a Big Data environment and push it in front of users in real time for risk management.

– Topic detection

– Rumour alerts

– News qualification by importance

3. Dashboard

– Consolidated Dashboard

– Portfolio Alerts

– Relevant information on a single screen

– Automatic Alert

– Integrated with the OMS

4. Actionable indicators

Users receive news signals for trading, hedging, and risk management based on the sentiment indicator.

Provide relevant news analytics indicators for hedging or trade idea generation.

5. Algo Trading / Robo Trading

Real-time algorithmic trading driven by the sentiment indicator and news analytics.

News analytics signals fully integrated into algorithmic trading strategies.

A news analytics case

Information extraction from text

Let’s look at a news article. What information would you highlight in this piece of news?

This is what I see in the article:

It was published by Reuters on Oct 21. It reports an acquisition. The two entities mentioned are AT&T Inc and Time Warner Inc. The reporter’s tone toward the event shows in the phrase “boldest move”. You may well see more in the article.

So this is the term-based way to view an article, as I mentioned. It is extremely useful when one tries to monitor a set of events, companies, and so on.

Text Feature Extraction for Machine Learning Classification

Another way is to look at the article as a whole.

For a computer to understand an article, we have to represent it by numbers.

The vector space model (VSM) is a popular approach: it uses a vector to represent a document. The basic idea is to represent a document by its vector of term frequencies.

This is a minimal example:

At the beginning, we have a vocabulary that contains the universe of words. We then count the term frequency of each word in the document and put the counts in a vector. To a machine, one document is just a term-frequency vector.
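In Python, the idea looks roughly like this (the vocabulary and document are invented for illustration):

```python
from collections import Counter

# Toy vocabulary: the universe of words (in practice, built from the corpus).
vocabulary = ["acquisition", "merger", "profit", "loss", "move"]

def term_frequency_vector(text, vocabulary):
    """Represent one document as a vector of raw term counts."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "the acquisition is the boldest move an acquisition of this size"
print(term_frequency_vector(doc, vocabulary))
# -> [2, 0, 0, 0, 1]
```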

Simple TF has the problem of putting too much weight on a term that appears in almost the entire corpus. To solve it, each term count is weighted by the inverse document frequency.

A document can then be represented by a TF-IDF vector.
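A minimal sketch of that weighting, using the plain log-IDF variant (libraries differ slightly in the exact formula; the corpus is invented):

```python
import math
from collections import Counter

corpus = [
    "att agrees to buy time warner",
    "time warner shares jump on the news",
    "the deal is the boldest move in media",
]

vocabulary = sorted({w for doc in corpus for w in doc.split()})
N = len(corpus)

# Document frequency: in how many documents does each term appear?
df = {t: sum(t in doc.split() for doc in corpus) for t in vocabulary}

def tf_idf_vector(doc):
    """Term frequency, down-weighted for terms common across the corpus."""
    tf = Counter(doc.split())
    return [tf[t] * math.log(N / df[t]) for t in vocabulary]

print(tf_idf_vector(corpus[0]))
```

A term that appears in every document gets IDF log(N/N) = 0, so it no longer dominates the vector.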

Such vectors are then ready for machine learning applications:
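For instance, with scikit-learn (an illustrative sketch; the headlines and topic labels below are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data: headlines with topic labels.
texts = [
    "att to buy time warner in media deal",
    "fed raises interest rates by a quarter point",
    "time warner shareholders approve merger",
    "central bank signals further rate hikes",
]
labels = ["M&A", "Monetary Policy", "M&A", "Monetary Policy"]

# TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["warner deal approved by regulators"]))
```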

Let me summarize the processes.

The news comes from all kinds of sources, such as Bloomberg, blogs, and social media. We apply feature extraction to each article in real time and then apply machine learning models trained on 15 years of historical articles, ending up with topic labels for the article.

Another thing we can do is calculate the sentiment of each article in real time. The sentiment is then either aggregated over time or indexed by instrument, sector, or emotion. We can also scan the information mentioned in the news and highlight the companies, people, events, and regions involved.
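As an illustration, hourly sentiment aggregation indexed by instrument might look like this with pandas (the tickers, scores, and column names are invented):

```python
import pandas as pd

# Invented per-article sentiment scores with timestamps and instruments.
articles = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-06-05 09:01", "2017-06-05 09:17",
        "2017-06-05 10:05", "2017-06-05 10:40",
    ]),
    "instrument": ["T", "TWX", "T", "T"],
    "sentiment": [0.6, 0.2, -0.1, 0.4],
})

# Hourly average sentiment per instrument: a simple time-based aggregate.
hourly = (articles.set_index("timestamp")
                  .groupby("instrument")["sentiment"]
                  .resample("1H")
                  .mean())
print(hourly)
```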

After these calculations, we need to render the results, and even send alerts to users, in real time.

Architecture Requirements and Big Data Tools Applied

So here are five things we’d like to consider when implementing our application:

  1. Guaranteed data processing: we cannot afford to lose any information.
  2. The analytics engine should scale with little extra effort as the data grow bigger and bigger.
  3. The servers have to be fault-tolerant.
  4. A higher level of abstraction is preferred, so that the programming workload is minimal.
  5. Models produced by batch training can be loaded into the real-time layer to achieve real-time classification and predictive analytics.

This is the final solution we have:

We combine Apache Spark and Apache Storm to form our news analytics engine.

Spark’s function is to produce models from massive historical text data. These models are then loaded into Storm to produce the real-time analytics.

First, when should one use distributed data processing tools? I would say: when your data and computation no longer fit on a single machine.

In our case, yes: we have too much data, and the computation we are trying to do is beyond the capacity of one machine.

Hadoop is the previous generation of tools for solving this issue.

Since our purpose is to build a real-time news analytics system, we definitely need a real-time computation engine. In the end, we chose Storm.

One reason is that it is really fast: its official website claims it can process one million 100-byte messages per second per node on a fairly ordinary machine. That may not be comparable to hedge-fund speeds, but it is quite good for our news analytics application.

The second reason we chose it is that it is a really robust system. Remember our five requirements: it satisfies our needs.

An architecture that combines it all

Let’s have a look at the architecture of the whole application:

The producers generate news into the Kafka message queue. These messages are passed into the Storm cluster, where many analytics run in real time. For topic classification, for example, a machine learning model is loaded from HDFS into Storm, and results come out immediately whenever news arrives. The results are then published to the web app or written to the database and search engine. That’s the real-time layer.
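In spirit, that path looks like the following sketch, written here with kafka-python and a pre-trained scikit-learn model rather than Storm’s native Java API; the topic name, message fields, and model path are all invented:

```python
import json
import joblib
from kafka import KafkaConsumer

# Hypothetical topic classifier produced by the batch layer (see below).
model = joblib.load("/models/topic_classifier.pkl")

consumer = KafkaConsumer(
    "news-articles",                     # invented Kafka topic name
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Score each article as it arrives; in production the result would be
# published to the web app or written to the database and search engine.
for message in consumer:
    article = message.value
    topic = model.predict([article["body"]])[0]
    print(article["headline"], "->", topic)
```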

In the batch layer, we train and test the models with Apache Spark on our massive historical data, which lives either in a database or in the distributed file system, and we produce and dump the models to the DFS.
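A sketch of such a batch job with Spark ML pipelines (the paths, column names, and model choice are illustrative assumptions, not the exact training code):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("news-topic-training").getOrCreate()

# Invented schema: historical articles with a text body and a topic label.
articles = spark.read.parquet("hdfs:///news/history")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="body", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),      # TF-IDF features
    LogisticRegression(labelCol="label", featuresCol="features"),
])

model = pipeline.fit(articles)
# Dump the fitted model to the distributed file system for the real-time layer.
model.save("hdfs:///models/topic_classifier")
```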

The architecture fits not only our news analytics application but others as well: scalable analysis pipelines, live stats, recommendations, predictions, real-time analytics, online machine learning. It all depends on the messages fed to the producers and the algorithms running in Storm.

Our data are available through third-party data vendors. If you are interested in our services, please contact contact@infotrie.com.
