As an AI-driven hedge fund, we depend on market data for our proprietary trading system. We work with several data providers, and we're now integrating a real-time market data feed from Nasdaq. The feed provides us with real-time options and equities data throughout market hours, delivered over Apache Kafka. In this article we'll explain how Kaiju approached the processing and delivery of this data to our AI systems.
Our core AI engine (known internally as theo engine) is written natively in C/C++ and processes high-volume data extremely efficiently. Our goal for this project was to deliver options and equities data to theo engine both in real time and in aggregated form. So our scope included:
- Real-time options data
- Aggregated options data (in 1-minute intervals)
- Real-time equities data
- Aggregated equities data (in 1-minute intervals)
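The 1-minute aggregation in our scope boils down to bucketing each record by the minute it falls in and folding it into a running bar. The sketch below shows that idea in plain Java; the record shape and class names are our own illustration, not the actual Nasdaq message schema or our production code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative minute-bar aggregator: one running OHLCV bar per (symbol, minute).
public class MinuteAggregator {
    public static class Bar {
        public double open, high, low, close;
        public long volume;
        Bar(double price, long qty) {
            open = high = low = close = price;
            volume = qty;
        }
        void update(double price, long qty) {
            high = Math.max(high, price);
            low = Math.min(low, price);
            close = price;            // last trade seen in the minute
            volume += qty;
        }
    }

    private final Map<String, Bar> bars = new HashMap<>();

    // Bucket a trade into its 1-minute interval and fold it into the bar.
    public void onTrade(String symbol, long epochMillis, double price, long qty) {
        long minuteBucket = epochMillis / 60_000L;       // floor to the minute
        String key = symbol + "@" + minuteBucket;
        Bar bar = bars.get(key);
        if (bar == null) bars.put(key, new Bar(price, qty));
        else bar.update(price, qty);
    }

    public Bar get(String symbol, long epochMillis) {
        return bars.get(symbol + "@" + (epochMillis / 60_000L));
    }
}
```

In production this windowing is handled by the stream processor itself rather than hand-rolled, but the bucketing logic is the same.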
As there is a huge number of stream processing and big data technologies available, we ran several proofs of concept to choose the right combination for our project (we love doing POCs).
Our current data pipeline looks like this:
Our current tech stack looks like this:
Our core stream processing application, which interprets and transforms the data for our AI systems, is implemented in Apache Flink. We wrote it in Java and chose Flink 1.11, the version most managed deployment platforms recommend. It is a highly scalable, distributed application responsible for consuming, filtering, transforming, and aggregating high-volume data in real time.
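To make the filter-and-transform step concrete: one of the first things the job does is narrow the full feed down to the symbols we actually track, then derive the fields downstream consumers need. The sketch below shows that shape in plain Java rather than the Flink DataStream API; the `Quote` record and the mid-price transform are illustrative assumptions, not the real message layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative filter-and-transform step. In the real job this logic lives
// in Flink filter()/map() operators; plain Java here for readability.
public class QuoteProcessor {
    // Simplified quote record; the actual feed message layout differs.
    public static class Quote {
        public final String symbol;
        public final double bid, ask;
        public Quote(String symbol, double bid, double ask) {
            this.symbol = symbol; this.bid = bid; this.ask = ask;
        }
    }

    private final Set<String> universe;   // the symbols we track

    public QuoteProcessor(Set<String> universe) { this.universe = universe; }

    // Keep only quotes in our universe and compute a mid price for each.
    public List<double[]> process(List<Quote> quotes) {
        List<double[]> out = new ArrayList<>();
        for (Quote q : quotes) {
            if (!universe.contains(q.symbol)) continue;              // filter
            out.add(new double[] { q.bid, q.ask, (q.bid + q.ask) / 2 }); // transform
        }
        return out;
    }
}
```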
The processed and aggregated data delivered by our Apache Flink stream processor is stored in Apache Cassandra and used by our theo engine.
Nasdaq publishes pre-market data before the market opens. We cache it in Redis, and our Apache Flink application uses it to interpret and transform market data throughout the day.
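The access pattern for this cache is simple: load the pre-market snapshot keyed by symbol before the open, then do point lookups as records arrive. Below is a minimal sketch of that pattern using an in-memory map as a stand-in for the Redis client; the `Snapshot` fields are illustrative assumptions, not the actual cached payload.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a Redis-backed pre-market cache; a real deployment would use
// a Redis client (e.g. Jedis or Lettuce) with the same get/put shape.
public class PreMarketCache {
    // Simplified, illustrative pre-market snapshot per symbol.
    public static class Snapshot {
        public final double previousClose;
        public final double openInterest;
        public Snapshot(double previousClose, double openInterest) {
            this.previousClose = previousClose;
            this.openInterest = openInterest;
        }
    }

    private final Map<String, Snapshot> cache = new HashMap<>();

    // Loaded once from the pre-market publication, before the open.
    public void put(String symbol, Snapshot snap) { cache.put(symbol, snap); }

    // Point lookup used by the stream processor on every record.
    public Snapshot get(String symbol) { return cache.get(symbol); }
}
```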
Building a cost-effective cloud solution can be challenging when it comes to processing and storing high-volume data. Here's what we decided to go with after weighing maintenance effort and cost:
As AWS is our main cloud provider, we leverage its managed services for Flink and Redis. For Apache Cassandra, we chose a managed-service provider, Aiven, over the AWS-managed offering, Amazon Keyspaces, which has some limitations in its configuration options.
Below are the initial benchmarks we observed; they give us a baseline as we iterate on performance improvements:
- Data consumption and processing: We are able to consume and process ~10 million records per minute.
- Data ingestion and storage: We store options and equities data for ~1300 symbols in our universe (Nasdaq publishes data for ~8000 symbols). We are able to ingest and store ~600K records per second in our Cassandra cluster.
In our next iteration, we plan to:
- Replace our Cassandra data storage layer with something more economical that is also better suited to our read patterns.
- Optimise the performance of our pre-market data cache.
- Implement data quality monitoring and analytics.