The 2011 Forrester report “Big Data Patterns for Enterprises” groups enterprise use cases for Hadoop-based infrastructure projects into five categories:
- Exploration and machine learning – e.g., social media networking
- Operational prediction – e.g., predictive analysis
- Dirty operational data store – e.g., operational reporting
- Bulk operations and extreme ETL – e.g., bulk data extractions
- Machine-generated data – e.g., streaming real-time events via complex event processing
In this blog, I will focus on the fourth category, where the low-cost, scalable, and reliable aspects of Hadoop, along with MapReduce-based batch processing, are put to good use, and show how the SAP Real Time Data Platform (SAP HANA and SAP Data Services) plays an important role in solving the complete end-to-end big data use case. “Big data” adoption is rapidly catching on with enterprise customers as software vendors release new infrastructure tools for handling both structured and unstructured data within the organization. More and more customers are running internal big data projects that bring their business and IT teams together, leveraging existing IT infrastructure and investing in new tools, so they can deliver deeper insight to executives and enable faster day-to-day business decisions.
Big Data technology is designed to extract value economically from very large volumes of wide varieties of data by enabling high velocity capture, discovery, and analysis as shown in Fig 1. As the amount of information continues to explode, organizations are faced with new challenges for storing, managing, accessing, and analyzing very large volumes of both structured and unstructured data in a timely and cost-effective manner.
Fig 1. Real time insights from a variety of data
In addition, the variety of data is changing enormously. According to Gartner, “Enterprise data will grow 650% over the next few years, with 80% of that data unstructured” – meaning that the data explosion spans traditional sources of structured information (such as point of sales, shipping records, etc.), as well as non-traditional sources (such as Web logs, social media, email, documents, etc.). Also, organizations today are facing new challenges in increasing the speed by which they process data and deliver information to users to ensure competitive advantage.
The SAP Real Time Data Platform includes in-memory and real-time capabilities delivered by SAP HANA; deep integration with Sybase database products; and EIM solutions for data movement, data integration, data quality, and data governance, with seamless modeling, universal orchestration, and a common metadata repository. With it, enterprise customers can leverage their existing IT infrastructure to integrate with Hadoop MapReduce and Hive and solve their big data business scenarios.
Fig-2. SAP Real Time Data Platform
When customers start exploring and working on big data projects, they should consider the costs at each layer of the integration:
- The costs to collect and store the data in a Hadoop infrastructure
- The costs to analyze and process the data via Map reduce techniques
- The costs to integrate with their existing enterprise infrastructure and consume the data
The biggest advantage of Hadoop is inexpensive, scalable, and robust storage, with files distributed across the cluster. SAP Data Services delivers a Hadoop connector that provides high-performance reading from and loading into Hadoop. SAP Data Services identifies, extracts, structures, and transforms the meaningful information from Hadoop/Hive and provisions the data to SAP HANA, Sybase IQ, or other data stores for deeper analysis. This allows for reliable, efficient, and optimized real-time analysis across all enterprise information assets, structured or unstructured.
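To make the extract-and-provision pattern concrete, here is a minimal Python sketch of the flow the Data Services connector automates: pull semi-structured records out of Hadoop, transform and filter them, and bulk-load only the structured result into a target store. Every name here (the sample log lines, `extract`, `transform`, `load`) is a hypothetical stand-in for illustration; real jobs read from HDFS/Hive and load into SAP HANA or Sybase IQ.

```python
# Hypothetical sketch of the extract -> transform -> load pattern.
# Stand-in for semi-structured web-log files sitting in HDFS:
RAW_WEB_LOGS = [
    "2012-06-01 10:02:11 GET /products/42 200",
    "2012-06-01 10:02:15 GET /products/42 404",
    "2012-06-01 10:03:08 GET /checkout 200",
]

def extract(lines):
    """Parse each raw log line into a structured record."""
    for line in lines:
        date, time, method, path, status = line.split()
        yield {"path": path, "status": int(status)}

def transform(records):
    """Keep only successful requests -- filter early so less data moves."""
    return [r for r in records if r["status"] == 200]

def load(records, warehouse):
    """Bulk-append the structured result into the target store."""
    warehouse.extend(records)
    return len(records)

warehouse = []  # stand-in for a SAP HANA / Sybase IQ table
loaded = load(transform(extract(RAW_WEB_LOGS)), warehouse)
print(loaded)  # 2 rows survive the filter and are loaded
```

The point of the sketch is the ordering: transformation and filtering happen on the Hadoop side, so only relevant, structured rows are moved into the warehouse.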
Fig3. Hadoop Enterprise Integration with SAP Data Services
SAP Data Services provides an easy-to-use UI design paradigm for Hadoop/Hive components. Behind the scenes, purpose-built extensions leverage the power, scale, and unique functionality of Hadoop for extraction, loading, text data processing, and the Hive data warehouse. SAP Data Services generates queries in Hive Query Language to push down operations, and Hive converts those queries into MapReduce jobs that run against files in Hadoop. Data Services then uses multiple threads to read the results from Hive and load them into SAP HANA using a high-performance bulk-loading mechanism, as shown in Fig3. The push-down logic is accomplished:
- Through SQL-like operations, utilizing Hive add-ons and Pig scripts
- Through basic functions, including aggregate functions
- Through transformations via text data processing
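To illustrate how a SQL-like aggregate becomes a MapReduce job, the following self-contained Python sketch mimics the map, shuffle, and reduce phases Hive would generate for a query such as `SELECT status, COUNT(*) FROM logs GROUP BY status`. The data and function names are hypothetical; a real Hive job would run these phases in parallel across the cluster.

```python
from collections import defaultdict

# Stand-in for lines of a file in HDFS: "<path> <http_status>".
LOG_LINES = [
    "/products/42 200",
    "/products/42 404",
    "/checkout 200",
    "/checkout 200",
]

def map_phase(line):
    """Map: emit (group-by key, 1) for each record."""
    _path, status = line.split()
    yield (status, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: apply the aggregate (COUNT(*) becomes a sum of 1s)."""
    return (key, sum(values))

intermediate = [pair for line in LOG_LINES for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'200': 3, '404': 1}
```

This is why push-down matters: a BI expert writes the `GROUP BY`, and the translation into map/shuffle/reduce stages happens automatically, without hand-coding MapReduce.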
Enterprise customers can gain unprecedented insight for competitive advantage by analyzing unstructured data stored in Hadoop alongside structured data in the enterprise data warehouse (EDW):
- Lower TCO for enterprise organizations that do not have in-house Hadoop developer expertise. With SAP Data Services, you can write a query that is pushed down into Hadoop and translated into MapReduce, so BI experts can discover information without having to program directly in the MapReduce programming language.
- Identify emerging trends, opportunities, or risks by processing text data stored in Hadoop. With SAP Data Services, text data processing can be pushed down into the Hadoop file system to extract relevant data hidden in mounds of files, Web logs, or surveys. This enables organizations to move only relevant data to a data warehouse for deeper analysis with structured data for new, contextual insights.
- Increase confidence in decision making by cleansing data from Hadoop before storing in data warehouse. With the data quality management functionality of SAP Data Services, organizations can clean, validate, and enrich data from Hadoop to ensure compliance with company standards. As a result, organizations gain confidence in the quality of data when making critical business decisions.
- Enable real-time analytics by rapidly uploading relevant data from Hadoop into high-performance data stores such as SAP HANA, letting organizations glean new insights from mounds of unstructured data combined with billions of rows of transaction data to perform up-to-the-second sales forecasting, more efficient inventory management, and more.
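As a rough sketch of the "push text processing down, move only relevant data" idea in the list above, the following Python fragment scans raw survey text for product mentions and forwards only the matching, normalized records downstream. The regex, sample text, and field names are hypothetical stand-ins for Data Services' text data processing and data quality transforms.

```python
import re

# Stand-in for free-text survey responses stored in Hadoop.
SURVEYS = [
    "Loved product SKU-1001, shipping was fast.",
    "No comment.",
    "product sku-2002 arrived damaged!!",
]

# Hypothetical entity pattern: a product SKU mention.
SKU_PATTERN = re.compile(r"SKU-(\d+)", re.IGNORECASE)

def extract_relevant(responses):
    """Keep only responses mentioning a SKU, normalizing the id --
    a toy analogue of entity extraction plus cleansing before the
    warehouse load."""
    for text in responses:
        match = SKU_PATTERN.search(text)
        if match:  # move only relevant rows downstream
            yield {"sku": f"SKU-{match.group(1)}", "text": text.strip()}

rows = list(extract_relevant(SURVEYS))
print([r["sku"] for r in rows])  # ['SKU-1001', 'SKU-2002']
```

Note that the cleansing step (uppercasing the SKU prefix) happens before the load, which is the same ordering the data quality bullet above describes: standardize in Hadoop, then trust the warehouse copy.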
In conclusion, enterprise customers should evaluate their technology and business use-case scenarios to determine the most appropriate technology combination based on factors such as data, storage, cost, concurrency, and latency. Customers should also consider how to enable new business scenarios and applications from big data: actionable, real-time insights in business-process context (volume and velocity), combined with deep batch behavior and pattern recognition (volume and variety), as shown in Fig-4. The SAP Real Time Data Platform provides a strong infrastructure foundation, with technology combinations and deployment options that let enterprise customers handle big data analytics in real time in conjunction with Hadoop batch processing.