When Matthew Fontaine Maury was restricted to a desk job because of a leg injury, little did he know that he was showcasing an impressive example of crowdsourced, open source and big data time series analysis (as the world knows it today).
Soon after his injury he devoted his time to studying navigation, meteorology, winds and currents. He studied thousands of ships’ logs and charts and analyzed more than a billion data points to publish wind and weather charts. He made these charts widely available, which showed sailors across the globe how to use ocean currents and wind to reduce the length of voyages.
Fast-forward about 150 years and we realize how important time series data is to today’s push toward data-driven decision making. Today, millions of data points are collected by sensors, smart meters, RFIDs and other sensors — every second of our lives.
The world is becoming more and more connected, intelligent and device driven, resulting in the Three Vs of big data: Volume, Velocity and Variety. With storage costs coming down significantly in the past decade and the emergence of distributed computing, companies want to capture and leverage this data for conducting informed analysis.
In modern supply chains, the entire journey from source to destination is tightly monitored through sensors, which continuously transmit data. Telematics data transmit vehicular information like engine performance, fuel consumption, braking performance, etc.; GPS data continuously relays location; RFID tags are used to track shipments at intermediate hubs and customer data collected at different interfaces.
All this data is immensely valuable, and offers enormous opportunities of optimization and cost savings. For example, UPS proudly claims that their trucks never take a left turn. Similar examples are seen in manufacturing, power plants, energy and utilities and the financial services industry.
Why not RDBMS?
Honestly, RDBMS in its natural format was not designed for storing and analyzing time series data. For example, a company like UPS has 16 million daily shipments. Even if data is relayed every hour, that is 16 million data points added every hour.
That is more than 300 million data points added every day. Now, if the data was relayed four times an hour, we are already over a billion data points in a day. A regular relational database would store data in separate rows, which means the table keeps growing vertically. Moreover, there is a lot of duplication of data, like static shipment data such as customer information, source and origin data, etc. repeated in every row. Ideally you can prevent all that by normalizing into multiple tables and creating indexes.
A good DBA would tell you how to do away with indexes and how de-normalizing is the way to go. However, even if we create tables to efficiently store such high volumes of data, querying the data is a different story. With more than a billion rows being added every day, matching today’s customer SLAs is indeed challenging.
Reading, not storage, is the issue
In a time series database, usual queries are a comparison of data points across different time intervals. “Is the number of shipments served today more or less than a year ago? Is the average time spent on the road more or less than a year ago? Has the mileage of my truck dropped in the last six months?”
Producing a lot of data is easy and producing a lot of derived data is even easier. Solution? Compress all the data. But how do you answer the queries then? Scan through the data. Will this be fast? Definitely NOT! Waiting for queries is never productive. We need compression, but not at the cost of speed. To reduce the querying time, the goal should be to minimize IO time by reducing the number of records read each time to answer a query.
The solutions are a specialized time series databases based on open-source technologies and a smart data model to overcome said deficiencies. We at Sigmoid suggest Parquet as the file format. Apache Parquet is a columnar storage format available to any project within the Hadoop ecosystem, regardless of the choice of data processing framework, data modeling or language.
Columnar storage has several advantages. Firstly, organizing data by columns allows for better compression, as the data is homogenous. Secondly, IO is considerably reduced because we can effectively scan only a subset of the columns. Thirdly, as data of the same type are stored in each column, it allows for effective encoding techniques. Lastly, Spark is known to work better with Parquet.
Advantages of the suggested solution
The columnar design of the suggested solution offers greater speed and efficiency. It is optimized for time series databases and vastly improves on performance of queries, aggregation and analysis. At the same time, it is not restricted to a particular data type or industry.
The suggested solution uses HDFS and distributed computing at its core. As the data volume increases, more machines can be added to handle the increased load. Across industries like financial institutions, utilities, telecommunications and oil and gas, more and more companies are adopting columnar storage for time series optimized databases.
Time series databases typically have two challenges; the rate at which new data is added is very high and the historical data on which queries are made keeps building. So any solution should be able to combine fast ingestion and fast response times. Combining Spark for analysis and Parquet for storage helps on both those fronts.
Using open-source technologies and cloud infrastructure helps in drastically reducing the total cost of operations, especially when we compare much more expensive alternatives.
Now that technology is capable of storing and analyzing huge volumes of data, what does that enable us to do? There are a variety of use cases, including BI, exploratory analytics and classification and detection of anomalies. Open-source technologies and distributed computing have opened doors to new possibilities and new frontiers in BI on time series databases.