Data is flowing and the volume is growing. In recent years, it has become fashionable to drop so many big data figures into every keynote, email, and blog post that I’d need a big data solution just to count them. The bigger the numbers, the more intimidating they become, and their only purpose seems to be to awe the reader and boast about the insurmountable complexity of the data. Maybe it’s time to stop mentioning how much data is out there and just start understanding it.
Companies that are able to tap into the potential of all this data can unlock the value and find ways to win and drive market disruption. But for some organizations, this surge can be overwhelming, and managing this rapidly growing heterogeneous and distributed data of all sizes, formats, and shapes is even more challenging.
Two weeks ago, we launched the DataOps management solution SAP® Data Hub to face these challenges and to simplify Big Data processing. It is a data sharing, pipelining, and orchestration solution that helps your company to accelerate and expand the flow of data across your diverse data landscape. SAP Data Hub provides unprecedented data access and governance, creates powerful, organization-spanning data flows, and quickly delivers results by leveraging existing data processing investments.
By using SAP Data Hub, customers can not only connect various data sources but also process and save their data within the flow. SAP Data Hub uses technologies such as “serverless computing” and “container management” to provide a scalable solution.
The motivation behind this blog is to provide an overview of SAP® Data Hub and to promote our new YouTube channel. Please check out the SAP Data Hub Channel and watch our launch and demo videos as well as the how-to videos we created for our TechEd hands-on sessions. I have tried to keep this blog simple; it aims to explain our new solution at a fundamental level. For deeper questions, please check out the FAQ blog by my colleague Marc Hartz, Product Manager of SAP Data Hub.
My blog will cover the following topics in brief:
- What are the main Big Data landscape management challenges?
- Why SAP Data Hub?
- Key features of SAP Data Hub
- How is SAP Data Hub different from other SAP tools?
- Who will benefit from using SAP Data Hub?
- Basic Hands-on
What are the main Big Data landscape management challenges?
The major pain points of enterprise data landscapes that we see:
- Data is kept in silos (files, Hadoop, Data Warehouses, etc.) across the enterprise. Users can’t access and work with the data they need across the silos where it’s stored. In particular, it is complex, time consuming, and costly to connect Big Data with enterprise data and business processes to gain insight and value from it.
- End-to-end data governance required across complex landscapes: The need to manage and govern data across a landscape is well understood. Ensuring data lineage and impact analysis of changes, managing security and privacy requirements, etc. are all critical aspects of a trusted enterprise landscape. With the increased complexity of enterprise landscapes, which can now include Hadoop data lakes, EDWs, Cloud storage, enterprise apps, etc., the ability to appropriately provide effective governance is more difficult. Without end-to-end governance across all data sources, organizations cannot trust and rely on the data’s accuracy, creating risk for anyone using analytics or operational applications that use the data.
- Big Data technologies lack enterprise readiness: Businesses generally cannot solve the complexity of their landscape simply by storing all their data in a Hadoop data lake. Hadoop solutions, while powerful, often do not have the extent of governance and security measures that enterprises require. Data lakes often have limited governance for Big Data initiatives, little automation to schedule processing in the landscape, fragmented monitoring and tracing capabilities of individual technologies, and lack common security and access management.
- Currently available tools require high effort to productize data scenarios across the enterprise. Many integration tools today are point-to-point, highly manual, and require highly trained resources to operate. This makes it challenging to rapidly connect and implement desired data outcomes.
- Specialized skill sets are often needed to implement, scale and create value out of Big Data initiatives. These specialized resources are often difficult to find and difficult to retain.
Why SAP Data Hub?
SAP Data Hub (Figure 1) offers an integration layer for data-driven processes across enterprise services to process and orchestrate data in the overall landscape for all user groups – IT, business analysts, data scientists, data engineers, and data stewards. It integrates and prepares data within a digital landscape to drive business decisions. It offers an open, big data-centric architecture with open source integration, cloud deployments, and third-party interfaces. It leverages the massive distributed processing and serverless computing capabilities provided by SAP Vora, which is bundled with SAP Data Hub.
SAP Data Hub establishes a new category of software solution, and provides a comprehensive answer to an emerging and painful challenge for enterprise customers: integrating data and establishing data-driven processes across an increasingly diverse data landscape. The solution addresses data integration, data orchestration, and data governance capabilities across a complex landscape, and harnesses big data processing to create uniquely powerful data pipelines that are based on the serverless computing paradigm. Data and processes can be managed, shared, and distributed across the enterprise with seamless, unified, and enterprise-ready monitoring and landscape management capabilities.
Figure 1: Architecture of SAP® Data Hub
Key Features of SAP Data Hub
SAP Data Hub is built to deploy on-premise, in the cloud, or in hybrid landscapes. Existing customers continue to use their data integration tools (SAP Data Services, SLT, and SAP HANA smart data integration) for data integration, data virtualization, and data replication, but with SAP Data Hub they gain the additional benefit of orchestrating the underlying integration tools and monitoring them end to end.
SAP Data Hub includes a modern UI for hub management to manage an end-to-end data landscape that connects enterprise data with big data. It enables you to create a copy of a landscape segment for data scientists, which can be erased or destroyed to meet data governance requirements and which is isolated from productive use cases for resource optimization. With fine-grained, end-to-end security provided by policy management functionality, it manages access control of all data – enterprise and big data – consistently.
The data discovery capability includes profiling data natively without leaving the source. By managing a repository of all forms of data and metadata, it offers data lineage and enables impact analysis. It provides the rich data transformation, quality, and enrichment capabilities of SAP EIM tools natively in SAP Data Hub.
One of the powerful features of SAP Data Hub is data pipelining, which connects data in different stores (for example, AWS S3 and HDFS) and makes it accessible to SAP Vora for analytics. All of these features are summarized in Figure 3. In addition to allowing the use of existing data ingestion tools and orchestrating them within SAP Data Hub, it offers an easy-to-use advanced data pipeline to move data when needed between various sources, including Amazon S3 to Hadoop HDFS, HDFS to SAP Vora, and SAP Vora to HDFS.
How is SAP Data Hub different from other SAP tools?
In the past, the focus was always on managing different data warehouses for operational systems of record. But now, facing the challenges of big data landscapes, we see that customers need to manage these extended data landscapes. By using SAP Data Hub, you can gain a clean and complete view of your data landscape and its interconnections, no matter where the data lives (on-premise, cloud, application, data warehouse, data lake, SAP or non-SAP source) and without having to centralize data. You can easily manage data accessibility and data policies to ensure appropriate data governance with security across your entire enterprise and understand the source and history of data flowing through the system.
Who will benefit from using SAP Data Hub?
Everyone who wants to get data quickly to where it needs to go, no matter the origin or destination, will benefit from using SAP Data Hub. It helps you create complex, multi-step data pipelines and a comprehensive, open data landscape by working across a diversity of data sources and applications. SAP Data Hub helps ensure data flow visibility, integrity, and security by creating and enforcing access policies and conducting lineage and impact analysis. Every organization that is searching for solutions to control and manage Big Data Lakes effectively (Data Transformations, Governance, Operations, Harmonization, Stream Integration, Coding, Scripting, Consolidation) will benefit immensely from using SAP Data Hub.
The following paragraphs should help you better understand the SAP Data Hub features. For more detailed hands-on material, please take a look at our Administration Guide and Developer Guide.
Data Pipelines and Refinement Features of SAP Data Hub
You can readily create complex, multistep data pipelines to refine and augment or enrich data at the source. And having a full landscape view of data lineage and quality enables you to make better decisions.
The canvas tool for creating unified workflows or task workflows lets you orchestrate multiple tasks and execute them in a given order. You can do this for several kinds of tasks that are created and executed through SAP Data Hub, including data pipelines, file operations, and data flow graphs. Alternatively, you can have the solution orchestrate process chains from the SAP Business Warehouse (SAP BW) application, flow graphs from SAP HANA® software, jobs in SAP Data Services software, and activities in SAP Predictive Analytics software. You can also graphically model tasks and task workflows with the modeling tool in SAP Data Hub (see Figure 2). Finally, by scheduling data pipelines in large cluster environments, you can choose to handle batch-driven jobs and streaming jobs in a single environment.
Figure 2: Workflows in SAP® Data Hub
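To make the idea of a task workflow more concrete, here is a minimal Python sketch of tasks that are executed in a given order. It is not SAP Data Hub code: the `Task` and `TaskWorkflow` classes, the task names, and the returned strings are all assumptions made up for this illustration. In the product, you model this graphically rather than in code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """One step of a workflow (hypothetical; not an SAP Data Hub class)."""
    name: str
    action: Callable[[], str]


@dataclass
class TaskWorkflow:
    """Runs its tasks strictly in the order in which they were added."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def add(self, task: Task) -> "TaskWorkflow":
        self.tasks.append(task)
        return self  # returning self allows chained calls

    def run(self) -> List[str]:
        log = []
        for task in self.tasks:
            result = task.action()
            log.append(f"{task.name}: {result}")
        return log


# Illustrative tasks only: a file operation, a data pipeline,
# and a trigger for an external process chain.
workflow = (
    TaskWorkflow("nightly_refresh")
    .add(Task("copy_file", lambda: "copied"))
    .add(Task("run_pipeline", lambda: "refined"))
    .add(Task("trigger_bw_process_chain", lambda: "started"))
)

for line in workflow.run():
    print(line)
```

The point of the sketch is only the ordering: each task finishes before the next one starts, regardless of which underlying tool performs the work.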
Data Flow Graphs
Enhance and transform data from a source data set to a target data set with the aid of data flow graphs. The data sets are abstractions of actual data or structures that reside in an SAP Vora engine or Apache Hadoop system.
Modeling Lifecycle Management
SAP Data Hub enables you to save, extend, and manage data models. You can group the modeling objects into projects, as well as import and export the model content at project level. You can use model synchronization and versioning through integration with a GitHub repository, and trace dependencies and impact on related models, flows, or pipelines.
SAP Data Hub Features for Orchestration and Monitoring
Rapidly execute powerful data flows using distributed local processing with scheduling and monitoring across the data landscape.
Quickly access tools that can be applied to end-to-end scenarios. You can embed related tools into the cockpit or create custom links to frequently used tools and pages – for example, to the SAP HANA cockpit database or the modeler in SAP Vora.
Learn more about your data by profiling, previewing, and viewing the metadata as you navigate a connected system. You can profile data sets, providing metrics such as minimum, maximum, pattern, and cardinality to verify data quality and help determine whether the metadata accurately describes the source data set. You can view the metadata and preview data in various formats, including comma-separated values (CSV), Apache Parquet, and optimized row columnar (ORC) format in Hadoop Distributed File System (HDFS), Amazon’s Simple Storage Service (S3), and SAP Vora connections.
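The profiling metrics mentioned above (minimum, maximum, pattern, and cardinality) can be illustrated with a small Python sketch. This is not how SAP Data Hub computes them. The function names, the pattern notation (9 for a digit, A for a letter), and the sample postal codes are assumptions chosen for this example.

```python
import re
from typing import Dict, List


def value_pattern(value: str) -> str:
    """Replace every digit with 9 and every ASCII letter with A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values: List[str]) -> Dict[str, object]:
    """Compute simple profiling metrics for one column of string values."""
    if not values:
        raise ValueError("cannot profile an empty column")
    return {
        "minimum": min(values),           # lexicographic minimum
        "maximum": max(values),           # lexicographic maximum
        "cardinality": len(set(values)),  # number of distinct values
        "patterns": sorted({value_pattern(v) for v in values}),
    }


# Hypothetical sample: the last value contains the letter O instead of a zero.
postal_codes = ["69190", "10115", "69190", "8O331"]
profile = profile_column(postal_codes)
print(profile)
```

In this sample, the second pattern (`9A999`) immediately exposes the value that does not look like a five-digit postal code, which is the kind of data quality check that profiling is meant to support.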
Create data sets – generic, logical data references consisting of a name, type, URL, and destination name – with metadata that is an abstraction of the actual data or structures residing in distributed stores. Design data sets for files, directories, or folders in HDFS and S3, as well as for tables in the SAP Vora engine.
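As a rough illustration of such a data set reference, the following Python sketch models the four attributes named above: name, type, URL, and destination name. The class, the list of allowed types, and the example URLs and destination names are hypothetical and do not reflect the actual SAP Data Hub metadata model.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption for this sketch: the kinds of objects a data set can point to.
SUPPORTED_TYPES = {"file", "directory", "table"}


@dataclass(frozen=True)
class DataSet:
    """A logical reference to data in a distributed store (illustrative only)."""
    name: str
    type: str
    url: str
    destination_name: str

    def __post_init__(self) -> None:
        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported data set type: {self.type}")
        if not urlparse(self.url).scheme:
            raise ValueError(f"URL has no scheme: {self.url}")

    @property
    def store(self) -> str:
        """The URL scheme, used here as a stand-in for the kind of store."""
        return urlparse(self.url).scheme


# Hypothetical references; bucket, host, and destination names are invented.
orders_file = DataSet("orders_2017", "file", "s3://sales-bucket/2017/orders.csv", "S3_SALES")
orders_dir = DataSet("orders_raw", "directory", "hdfs://namenode:8020/data/orders", "HDFS_LAKE")

print(orders_file.store, orders_dir.store)
```

The data set itself holds no data. It only describes where the data lives, which is what allows the same reference to be reused in workflows and flow graphs.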
SAP Data Hub Features to Simplify Operations
Create a comprehensive, open data landscape by working across a diversity of data sources and applications with governance.
You can create and manage zones, systems, and connections in your landscape. Zones represent logical groupings of a distributed data landscape shared for a specific use case. Systems are stand-alone data sources in the distributed data landscape. And connections are technical access points to a system through an agent of SAP Data Hub. You get smooth data connectivity to various data lakes and storage, such as Amazon S3 and Apache Hadoop, as well as to solutions such as SAP BW, SAP HANA, and SAP Vora.
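The relationship between zones, systems, and connections can be pictured as a simple hierarchy. The Python sketch below is only a mental model under that assumption. The class names mirror the terms used above, but the fields, the agent names, and the example landscape are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Technical access point to a system, reached through an agent."""
    name: str
    agent: str


@dataclass
class System:
    """Stand-alone data source in the distributed landscape."""
    name: str
    kind: str
    connections: List[Connection] = field(default_factory=list)


@dataclass
class Zone:
    """Logical grouping of systems shared for a specific use case."""
    name: str
    use_case: str
    systems: List[System] = field(default_factory=list)

    def connection_names(self) -> List[str]:
        """List every connection reachable in this zone, in definition order."""
        return [c.name for s in self.systems for c in s.connections]


# Hypothetical landscape: one zone with a data lake and a warehouse.
zone = Zone(
    name="customer_analytics",
    use_case="churn analysis",
    systems=[
        System("lake", "Apache Hadoop", [Connection("hdfs_main", "agent-01")]),
        System("warehouse", "SAP BW", [Connection("bw_prod", "agent-02")]),
    ],
)

print(zone.connection_names())
```

Grouping systems by use case rather than by technology is the design idea here: a zone answers the question of which sources belong to a scenario, while the connections answer how each source is reached.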
Centrally manage connectivity of distributed data, discover data, and schedule and monitor workflows across a connected data landscape using the visual, intuitive user interface. Figure 3 shows the cockpit in SAP Data Hub.
Figure 3: Cockpit in SAP® Data Hub
Security and Policies
Use this area for establishing security settings and policies for processes and modeling objects in SAP Data Hub and for identity control (users, groups, and roles), policy management, and security logging.
Monitoring and Scheduling
Schedule the execution of task workflows in the dedicated scheduling area of the cockpit. Through the monitoring dashboard, you can keep track of the status of task workflows that you have scheduled for execution within the cockpit or in the modeling perspective of SAP Data Hub (described above). You can also suspend or resume a task workflow if necessary.
All the features described in this blog will be explained in detail in our new YouTube channel. Please check it out and feel free to subscribe: SAP Data Hub Channel.