Healthcare has lately been dotted with data-driven innovation. Companies with huge data warehouses like Google, Facebook, Amazon have made use of next generation technologies to deliver connectivity, ease of access, and user experience. Then, why not healthcare? The future of Big Data is promising- the analytics market in healthcare is projected to grow from $7 billion in 2016 to $24 billion in the next five years and tapping Big Data could offer several more possibilities.
Bringing Big Data to healthcare
In 2011, the entire healthcare industry managed to generate 150 billion gigabytes, or 150 Exabytes, of data- comprising of patient information, disease registries, regulatory data, etc. Managing these tremendous amounts of data with conventional hardware and software became challenging over time and healthcare began leveraging Big Data:
• Helps to track and manage population health more efficiently
• Will significantly improve, transform and reduce the cost patient care
• Enhance the ability to deliver preventive care
To deal with big data, healthcare needs technology like Hadoop, Spark, and Scala. But, very few health systems have invested in IT systems/solutions to optimize data processing, without which health care will struggle to exploit the opportunities and benefits that big data offers.
Currently, the existing standards used for integrating data are relatively slow and need an upgradation. The reason for this varies- from siloed data, to getting patient consent and furthermore, vendors charging a hefty fee for data solutions. Healthcare is far from its aim of achieving real-time data integration.
Aiding Big Data
What’s needed is to explore the better options that are out there in healthcare.
Technologies like Hadoop and Spark along with Scala will give health data the much-needed boost it needs in terms of storage, the speed of transfer and scalability. Currently, the organizations using these technologies include Yahoo, Facebook, eBay, Amazon, Federal Reserve Bank, etc. It makes no sense that the healthcare industry should lag behind sectors like retail and banking in the use of big data and work on reducing costs, improving the quality of care and delivering superior outcomes.
Addressing the elephant- Hadoop
This Linux-based, open source framework of tools was made to support the running of applications on Big Data. It addresses the challenges with big data stepwise- velocity, volume, and variety. This means that data flows in with speed, in large amounts and is of varied formats. A batch processing set of tool, Hadoop breaks higher tasks into smaller pieces and sends the final result back to the application. Hadoop assists researchers and data scientists in using data to engage clinicians in providing high-quality care with the following core softwares that work with Hadoop:
• Hive- an SQL-like query language suited for Hadoop
• Pig- a high-level query, used for MapReduce
• HBase- a column-like data store that runs on the Hadoop framework
• Spark- a general-purpose analytical and computing framework
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop breaks the data and the computation associated with it into smaller pieces to enable dealing with big data. Hadoop has two useful components- MapReduce, that allows the transfer of programs to different machines where data is stored and Hadoop Distributed File System or HDFS, that controls the storage and access of data.
Hadoop deals efficiently with disparate and skewed data. It’s well prepared for unstructured data. To scale it up, all that’s needed is to increase the number of clusters that too without needing different sets of codes for each. All the data is well distributed amongst clusters- thus making the creation of space in healthcare less challenging.
Spark in healthcare
Another execution framework is Spark wherein data processing takes place simultaneously and in bulk. It is quick, easy to use and has sophisticated analytics, with APIs in Java, Scala, Python, R, and SQL. Since Spark does all the computation and process in memory by using RAM, it is 10 times faster than Hadoop.
Spark can run on a laptop, standalone systems, cloud or with Hadoop. It can access diverse data sources including HDFS, Apache Cassandra, Apache HBase, and S3. It also performs ad hoc data analysis interactively and is actively used by Internet powerhouses such as Netflix, Yahoo, and Tencent where Spark is deployed at a massive scale and runs with over a thousand of nodes.
Apart from Java, Spark also supports Python and Scala. Since Spark stores data in memory, the time-consuming work that MapReduce undergoes by reading programs in disk back and forth is absent in Spark. Another plus point is that the coding required is much less; just one line of code in Spark as compared to hundreds in Hadoop or other frameworks- delivering fast processing.
Scala- scaling up processing
A programming language written in Java, Scala is also the language that Spark is written in. Short for scalable language, it runs on the Java Virtual Machine.
Programming libraries for Spark can be done through Scala which is better designed to work with Spark than Python or Java. Scala requires highly reduced the number of lines of code.
However, Spark does not replace Hadoop. Although Spark can be used as a standalone, both Hadoop and Spark work great together. When MapReduce does all that recording onto the disk, it allows for the possibility of restarting after failure. MapReduce makes Spark faster, more scalable and reliable. Especially for systems which are already running on Java, it’s easier to switch to Scala and improve the flexibility and the scalability of the system.
The road ahead
About two decades ago, the cost to capture data and to store it was too high. Even as early as 5 years ago, the cost of creating and managing a relational database was around $100,000 per TB, not counting the cost of support and maintenance. The introduction of Hadoop has been a great leap in our ability to store and process large volumes of data. Physicians and providers have started to worry less about storage and processing and are working towards delivering superior, value-based outcomes. Maybe in the near future, we can see healthcare industry deliver personalized patient care at controlled costs.
Abhinav Shashank s CEO of Innovaccer, a Silicon -Valley based healthcare data analytics company.