When we consider the Internet of Things (IoT), the numbers quickly reach beyond the bounds of human comprehension. According to a report by IHS Markit, the number of devices is set to double from 15 billion in 2015 to 31 billion in 2020, and to double again by 2024. Gartner predicts that by 2021, 1 million IoT devices will be coming online every hour.
But even bigger numbers emerge when we think about the “Internet of machine-generated data”. Each one of these billions of devices may be capable of sending messages as frequently as once a second. The messages may range from a few bytes to a few kilobytes in length. Multiply all these numbers together and the potential bandwidth hit becomes apparent.
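To make that multiplication concrete, here is a back-of-envelope calculation. The device count comes from the estimate cited above; the message rate and payload size are illustrative assumptions, not measured values:

```python
# Back-of-envelope estimate of aggregate IoT message bandwidth.
# Device count is the 2020 estimate cited above; the message rate
# and payload size are illustrative assumptions.
devices = 31_000_000_000        # ~31 billion devices
msgs_per_device_per_sec = 1     # assume one message per second per device
bytes_per_msg = 1_000           # assume a 1 KB payload

bytes_per_sec = devices * msgs_per_device_per_sec * bytes_per_msg
terabytes_per_sec = bytes_per_sec / 1e12

print(f"Aggregate load: {terabytes_per_sec:.0f} TB per second")  # → 31 TB per second
```

Even with far more conservative assumptions, the aggregate quickly reaches volumes no central network link could carry.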
However, pause for a moment and ask if it’s really necessary to send so much data surging through the network. What will be done with all that data? What is the business value?
The answers differ by industry, so let’s narrow our focus to one industry that has long experience of machine-generated data: manufacturing. The rapid miniaturization of and price decrease in sensor technology, together with a proliferation of new connectivity options, are enabling manufacturers to monitor and apply predictive analytics to ever more aspects of their processes.
The traditional approach to such analytics has been to route all this measurement data to a central facility with large-scale computing services, where skilled data scientists analyze the data, draw conclusions about problems and opportunities, and make recommendations to the various plants and processes. This approach is problematic for several reasons. For a large manufacturer with plants in many locations, the central facility may be distant, perhaps on the other side of the world, introducing significant delays in response time. Centrally based data scientists are less familiar with local conditions in the plants, and so less knowledgeable about the context of the data and the nuances of possible solutions. With additional sensors and more frequent measurements, the volumes of data being transferred may become challenging.
Furthermore, ongoing measures of a variable are often of less value than the ability to recognize a significant change in the data. For example, sending the mostly unchanging temperature of a furnace to headquarters every few minutes adds nothing to predictive analytics. A gradual or sharp change in temperature, combined with changes in other variables, is of much greater value for analytic purposes, but may also signal a problem that must be urgently addressed locally first.
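One way to realize this locally is a simple change-detection filter that forwards a reading only when it deviates meaningfully from the last value transmitted. The sketch below is a minimal illustration; the threshold and the furnace readings are hypothetical, not drawn from any real plant:

```python
class ChangeDetector:
    """Forward a sensor reading only when it differs from the
    last transmitted value by more than a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None

    def should_send(self, value):
        # Always send the first reading; afterwards, send only on a
        # significant change relative to the last transmitted value.
        if self.last_sent is None or abs(value - self.last_sent) > self.threshold:
            self.last_sent = value
            return True
        return False

# Example: furnace temperatures that barely move generate almost no traffic.
detector = ChangeDetector(threshold=5.0)
readings = [850.0, 850.4, 851.1, 849.8, 862.0, 862.3]
sent = [r for r in readings if detector.should_send(r)]
print(sent)  # → [850.0, 862.0]
```

Six readings collapse to two transmissions, and the one that matters, the jump to 862, still gets through immediately. More sophisticated schemes (rate-of-change detection, multivariate triggers) follow the same pattern.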
Such considerations lead inevitably to an important question. Could we move the analytic function to the data, rather than the data to the analytic function? This approach has been widely seen in the big data environment, but applying it to the analytic process for machine-generated data requires an understanding of the different phases of the process.
Initial in-depth analysis and model development is best done by expert data scientists with access to large-scale computing and extensive data sets. However, during day-to-day operations, running the models against live data is best done close to the data source. This could be on a server within the plant, if data from several machines is needed as input or if the machines have no on-board intelligence. In cases where a machine has an embedded computer, the analysis is relatively simple, and only data from that machine is required, the analytics can be run there, right at the edge of the network. In either case, only the results are returned to the business users. Model maintenance and upgrade remain the responsibility of the central group of data scientists.
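The division of labor described above can be sketched as follows: a model trained centrally is deployed to the edge, where it scores live readings and returns only the results upstream. The "model" here is a hypothetical stand-in (a mean, standard deviation, and z-score threshold), as are the readings; a real deployment would run an exported model artifact instead:

```python
# Sketch of edge-side scoring: run centrally trained model parameters
# against live readings and ship only the results upstream.

def score(reading, model):
    """Return an anomaly flag for one reading (simple z-score test)."""
    z = abs(reading - model["mean"]) / model["std"]
    return z > model["z_threshold"]

# Parameters as they might arrive from the central data-science group.
model = {"mean": 850.0, "std": 4.0, "z_threshold": 3.0}

# Live readings at the machine; only flagged results leave the plant.
readings = [851.2, 849.5, 871.0, 850.3]
results = [{"reading": r, "anomaly": score(r, model)} for r in readings]
to_send = [res for res in results if res["anomaly"]]
print(to_send)  # → [{'reading': 871.0, 'anomaly': True}]
```

The central group retrains and redistributes the model parameters as conditions change; the edge code itself never needs the raw historical data that produced them.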
Clearly, this approach requires a distributable analytic model coded in a language appropriate for the plant server or a machine’s on-board computer. Furthermore, a two-way distribution infrastructure is needed. Both these components exist within Statistica’s Native Distributed Analytics Architecture. Models can be exported from the central Statistica instance in Java, PMML, C, C++, and SQL. Statistica works with a variety of partners to deliver and leverage analytics at the edge.
This approach offers significant benefits in the manufacturing environment (as well as in other industries). On the technical side, it has a lighter network footprint; only data that absolutely needs to be transferred to the central location is actually carried. The business-side benefits are more important. First, local plant resources can prepare data and make it available immediately to address urgent situations (as well as subsetting it prior to dispatch to the central analytic environment). Second, local expertise is available for ongoing analytic work, supporting the growth of citizen data scientists across the enterprise and adding much-needed resources to this important area. Third, with plant-level skills and skin in the analytic game, faster and locally appropriate decisions and actions become possible, with improved return on investment.
With analytic resources and skills appropriately dispersed between central and distributed environments, manufacturers can achieve early and valuable benefits from the substantial volumes of additional data becoming available as modern IoT devices proliferate in plants and warehouses across the enterprise.
About the author:
Dr. Barry Devlin is a founder of the data warehousing industry, defining its first architecture in 1985. A foremost authority on business intelligence (BI), big data and beyond, he is respected worldwide as a visionary and thought-leader in the evolving industry.