Article appears in SAP Startup Focus Newsletter – Issue 5
Massive amounts of unstructured data are now being captured in operational, CRM, maintenance, and call center systems, as well as in social media, blogs, forums, e-mails, and documents. Companies struggle to extract meaningful, structured information from this volume of data and to analyze the content in ways that support their business strategy and solve critical problems.
In SPS05, SAP HANA introduced Text Analysis, a technology to structure, transform, or enrich unstructured data for the purpose of discovery or analysis. Text Analysis in SAP HANA can extract meaningful information from text, apply the appropriate linguistic rules for the particular language, and then semantically interpret the data. During text analysis, the following text pre-processing steps may be executed inside HANA without data transformation:
- File filtering: converting binary document formats to text/HTML
- Tokenization: decomposing a word sequence, e.g. ‘the California sunshine’ -> ‘the’ ‘California’ ‘sunshine’
- Stemming: reduction of tokens to their linguistic base form, e.g. cars -> car; ran -> run
- Linguistic analysis: part-of-speech identification
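To make the tokenization and stemming steps concrete, here is a minimal Python sketch of the two examples from the list. It is only an illustration and not how SAP HANA implements these steps: the `tokenize` and `stem` functions and the `BASE_FORMS` lookup table are hypothetical names invented for this example, and a real stemmer relies on language-specific rules and dictionaries rather than a two-entry table.

```python
import re

# Hypothetical toy lookup table, assumed for illustration only.
# Real stemmers use language-specific rules and dictionaries.
BASE_FORMS = {"cars": "car", "ran": "run"}


def tokenize(text):
    """Decompose a word sequence into tokens of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def stem(token):
    """Reduce a token to its base form; unknown tokens are returned unchanged."""
    return BASE_FORMS.get(token.lower(), token)


tokens = tokenize("the California sunshine")
stems = [stem(t) for t in ["cars", "ran"]]
print(tokens)
print(stems)
```

In HANA these steps run inside the database, so no such client-side code is needed; the sketch only shows what each step produces.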
One example is deriving useful data from a multitude of web pages or accounts. Writing a simple program to fetch the keywords from the websites and running text analysis on the results is a simple way to extract the most popular keywords, focus areas such as social media or big data, and other good-to-know details en masse.
Extracting the header keywords, analyzing the text, calculating the frequency of occurrence, and finally visualizing the result using SAP HANA produces a simple, attractive tag cloud for a bevy of web pages.
Details on Text Analysis in SAP HANA can be found in the Developer Guide for HANA SPS05. Let us get started and enjoy the power of HANA!
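The keyword extraction and frequency counting behind such a tag cloud can be sketched in plain Python. This is a rough illustration under stated assumptions, not the workflow used in the article: it assumes "header keywords" means the `<meta name="keywords">` tag in each page's HTML head, it works on hard-coded sample pages instead of fetching real websites, and `MetaKeywordParser` and `keyword_frequencies` are hypothetical names. In the article's setup, the analysis and counting would be done by SAP HANA itself.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated content of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Normalize case and whitespace so "Big Data" and "big data" match.
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]


def keyword_frequencies(pages):
    """Count how often each keyword occurs across a list of HTML documents."""
    counts = Counter()
    for html in pages:
        parser = MetaKeywordParser()
        parser.feed(html)
        counts.update(parser.keywords)
    return counts


# Sample pages, made up for illustration.
pages = [
    '<html><head><meta name="keywords" content="big data, analytics"></head></html>',
    '<html><head><meta name="keywords" content="Big Data, social media"></head></html>',
]
freq = keyword_frequencies(pages)
print(freq.most_common(3))
```

The resulting counts are what a tag cloud needs: each keyword is drawn at a size proportional to its frequency.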
Eric Du, HANA Development Expert for SAP Startup Focus.