infoTECH Feature

July 11, 2017

The Most Advanced Big Data Streaming Systems

Big Data sets are often built by collecting and parsing streaming data: digital information that is generated and collected continuously from a large number of sources. Streamed records are often small enough to be measured in kilobytes, and they tend to conform to predefined structures.

Examples of streaming data scenarios include telematics from sensors and devices installed in automobiles, telemetry from medical devices, flight instruments, call center workflows, and many other sources serving business, government and scientific purposes. In most cases, organizations start data streaming with lightweight applications that handle one or two data sources; since Big Data is highly scalable, companies can add data sources gradually or all at once. When real-time data streams need to be run through complex algorithms, more advanced systems or platforms are required.

As of 2017, the following data streaming platforms are considered to be the most advanced:

Apache Hadoop

Even though it is one of the first Big Data frameworks, Hadoop enjoys strong support from its open source community, so it can be relied upon for continued development and evolution. Hadoop offers batch processing on a very large scale: it can read massive data sets, extract information and deliver results at a later time without disrupting the collection process. Its most significant advantages are this community support and the fact that it can be deployed on older hardware. Because Hadoop works in batches rather than on records as they arrive, it is a reasonable option for streamed data only when waiting for batch routines to complete is acceptable.

Apache Spark

For data streaming and batch processing operations that require faster turnaround than Hadoop offers, Apache Spark is a next-generation option designed to run on powerful hardware. With Spark, data extraction, complex queries and algorithmic functions can be carried out in memory, which reduces reliance on disk operations. Spark processes data streams as micro-batches: incoming records are grouped into very small batches, each of which is dispatched for processing. One advantage of this platform is that many libraries are available for complex queries and machine learning applications.
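
The micro-batch idea described above can be illustrated with a minimal sketch in plain Python. This is not Spark code: the function names, the sample readings and the count-based batch size are all assumptions made for illustration (Spark Streaming groups records by a time interval rather than by a record count).

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream of records into small batches of at most batch_size."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def average_speed(batch: List[dict]) -> float:
    """Example per-batch task: average the 'speed' field of one batch."""
    return sum(record["speed"] for record in batch) / len(batch)


# Hypothetical telematics readings standing in for a live stream.
readings = [{"vehicle": "a", "speed": s} for s in (50, 60, 70, 80, 90)]
results = [average_speed(batch) for batch in micro_batches(readings, batch_size=2)]
print(results)  # [55.0, 75.0, 90.0]
```

Each batch is processed as a unit, so results lag the incoming data by roughly one batch; smaller batches reduce that delay at the cost of more scheduling overhead.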


Apache Kafka

Like Hadoop and Spark, Kafka is an Apache Software Foundation project. Kafka started off as a LinkedIn project to improve the social network's performance. Eventually, Google engineers took notice and jumped into development. Kafka runs on clusters: processes known as producers publish records to named topics, and processes known as consumers subscribe to those topics to read the records. Kafka on AWS is one adaptation of this data streaming platform, which performs adequately when hosted in cloud environments. One notable Kafka adopter is the online music streaming service Spotify.
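
To make the producer, topic and consumer roles concrete, here is a toy in-memory model in plain Python. It is only a sketch of the concept, not the Kafka client API: the `ToyBroker` class, its `publish` and `poll` methods, and the topic and consumer names are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[Any]] = defaultdict(list)
        # Each (consumer, topic) pair remembers the next position to read.
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, record: Any) -> None:
        """Producer side: append a record to the end of a topic's log."""
        self._logs[topic].append(record)

    def poll(self, consumer: str, topic: str) -> List[Any]:
        """Consumer side: return the records this consumer has not yet read."""
        start = self._offsets[(consumer, topic)]
        records = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = start + len(records)
        return records


broker = ToyBroker()
broker.publish("plays", {"track": "t1"})
broker.publish("plays", {"track": "t2"})
first = broker.poll("billing", "plays")    # both records published so far
broker.publish("plays", {"track": "t3"})
second = broker.poll("billing", "plays")   # only the newly published record
other = broker.poll("analytics", "plays")  # a separate consumer reads all three
```

Because each consumer tracks its own read position, several consumers can read the same topic independently without interfering with one another.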


Apache Samza

As suggested by its name, Samza works with the Kafka platform to handle data streaming. With Samza, information is fed into topics intended to be read by consumers; the data is split into partitions, which are in turn managed by declarations made by producers. Samza is resilient in the sense that it can hold data for longer periods without risking data loss. This system is suitable when low latency is a requirement.
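
One common way a producer can decide which partition a record belongs to is by hashing a record key. The sketch below illustrates that idea in plain Python under stated assumptions: the function names and sample events are hypothetical, and `zlib.crc32` is used only because it is stable across runs, not because it is the hash that Kafka or Samza actually uses.

```python
import zlib
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records: List[Tuple[str, str]], num_partitions: int) -> Dict[int, List[str]]:
    """Distribute (key, value) records across partitions by key."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append(value)
    return partitions


# Hypothetical device events; records with the same key share a partition.
events = [("device-1", "on"), ("device-2", "on"), ("device-1", "off")]
partitions = route(events, num_partitions=4)
```

Since every record with the same key lands in the same partition, the relative order of that key's records is preserved within the partition.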

Apache Storm

As of 2017, Storm is the best choice for projects that require real-time processing of extremely large data sets. Data streams are handled by a topology that takes unbounded information and processes it into a structured stream. This system is used in operations that demand immediate feedback to users. Another advantage of Storm is that it supports many programming languages, whereas Kafka itself is built to run in Java Virtual Machine environments.
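
The topology idea, where records flow one at a time from a source through a chain of processing stages, can be sketched with Python generators. This is not the Storm API (Storm calls its sources "spouts" and its processing stages "bolts"); the stage names, the `sensor=value` input format and the threshold are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator


def sensor_source(lines: Iterable[str]) -> Iterator[str]:
    """Source stage: emit raw, unstructured readings one at a time."""
    for line in lines:
        yield line


def parse(stream: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Processing stage: turn 'sensor=value' text into structured records."""
    for line in stream:
        name, _, value = line.partition("=")
        yield {"sensor": name.strip(), "value": float(value)}


def alert(stream: Iterable[Dict[str, object]], threshold: float) -> Iterator[str]:
    """Processing stage: emit an alert as soon as a reading exceeds the threshold."""
    for record in stream:
        if record["value"] > threshold:
            yield f"ALERT {record['sensor']}: {record['value']}"


# Hypothetical raw readings standing in for an unbounded stream.
raw = ["temp=71.5", "temp=99.1", "pressure=30.2"]
alerts = list(alert(parse(sensor_source(raw)), threshold=90.0))
print(alerts)  # ['ALERT temp: 99.1']
```

Each record passes through every stage as soon as it arrives, rather than waiting for a batch to fill, which is what allows immediate feedback.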

In the end, the immediate future of data streaming will depend on the development of the platforms listed herein, particularly of Apache Storm since the enterprise world will likely focus on competitiveness across the Internet of Things, telemetry, telematics, and machine learning.

Edited by Alicia Young
