Data Preparation, to the Moon and Beyond
According to a University of Southern California study, less than a decade ago the total digital information held on storage devices worldwide reached 300 exabytes. To picture that volume, imagine burning it onto CD-ROMs: it would take over 400 billion discs. Stacked up, they would reach beyond the distance from the Earth to the Moon…
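The comparison is easy to sanity-check with a few lines of Python. This is only a back-of-the-envelope sketch: the 700 MB disc capacity, the 1.2 mm disc thickness, the mean Earth-Moon distance of 384,400 km and the use of decimal units are assumptions made here, not figures taken from the study.

```python
# Back-of-the-envelope check of the CD-ROM stack comparison.
# Assumptions (not from the study): decimal units, 700 MB per CD-ROM,
# 1.2 mm disc thickness, mean Earth-Moon distance of 384,400 km.

TOTAL_BYTES = 300 * 10**18        # 300 exabytes
CD_CAPACITY_BYTES = 700 * 10**6   # assumed 700 MB per disc
CD_THICKNESS_M = 1.2e-3           # assumed 1.2 mm per disc
EARTH_MOON_KM = 384_400           # assumed mean distance


def cd_count(total_bytes, capacity=CD_CAPACITY_BYTES):
    """Number of discs needed, rounded up to a whole disc."""
    return -(-total_bytes // capacity)  # ceiling division on integers


def stack_height_km(discs, thickness_m=CD_THICKNESS_M):
    """Height of a stack of discs, in kilometres."""
    return discs * thickness_m / 1000


discs = cd_count(TOTAL_BYTES)
height_km = stack_height_km(discs)
print(f"{discs:,} discs, stack of about {height_km:,.0f} km")
print(f"ratio to Earth-Moon distance: {height_km / EARTH_MOON_KM:.2f}")
```

Under these assumptions the stack comes out at roughly 429 billion discs and a little over 500,000 km, which is consistent with the "over 400 billion" and "beyond the Moon" claims.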
In the last decade, several trends have converged to propel us into the Digital Data Intelligence Era. Data repositories keep growing, the cost of storage keeps dropping, data analysis software has become far more capable, and businesses are increasingly aware of the value of data-driven decisions and of capitalizing on insights from their own data.
Modern Business Intelligence (BI) and Big Data Analytics tools, along with IoT, are now closely tied to a wide spectrum of industries, such as Marketing, Financial Technologies and Risk Management, Healthcare, Automotive and Retail, and even Biotechnology and Environmental Research.
Data Intelligence applications are numerous and will most likely multiply in the future. Here are a few key areas to consider:
- Customer acquisition and customer retention
- Fraud detection and revenue protection
- Predictive Maintenance and quality of services analysis
- Industrial Techniques Optimization
- Financial Services
- Marketing Studies
- Social Media Sentiment Analysis
All of these use cases serve the same goal: deriving insights from data. These initiatives now involve a broad spectrum of roles within each organization, including Application Developers, Data Analysts and Data Scientists, but also Business Analysts and even Business Users.
These teams draw on a wide range of data sources. Some come from historical Data Warehouses and Data Marts, and some from newer approaches such as Data Lakes. In the modern concept, Data Lakes are designed to store huge amounts of data, regardless of format and content. Most of the time, a Data Lake means Hadoop storage with no predefined schema, where data is stored according to its origin. The Data Lake’s main purpose is to combine data sources and avoid information silos, all while reducing the cost of data ownership and making it easier for teams to share and collaborate. Yet although IT groups have been given the responsibility of building, feeding, governing and referencing these sources (Data Lakes included), Business Users often struggle to access and then analyze the data.
“Analysts spend up to 80% of their time on data preparation, delaying the time to analysis and decision making. Compounding the problem is many analysts are using traditional tools like Microsoft Excel that cannot easily handle the volume or variety of new data sources.”
2016 Gartner Business Intelligence Analytics & Information Management Summit
In the modern world, the expectation is that Business Intelligence and Big Data Analytics should be easy for Business Users to use. The goal of Data Preparation is to be a complementary solution that lets users leverage the value of their data independently of their IT teams.
Good Data Preparation tools offer features such as discovery, inventory, sampling, data enrichment, data blending and filtering. They support different data models and structures, and they help detect relationships across attributes. They also offer data curation capabilities that help maintain data quality, for example.
Data Preparation tools are often used by business users because they offer a dedicated graphical environment and include powerful capabilities.
“Data preparation is an iterative process for exploring and transforming raw data into forms suitable for data science experiments, data discovery, and analytics”
- 2016 Gartner
The purpose of Data Preparation, as a self-service capability made available to all Business Users, should be to reduce the time and complexity of the discovery process that precedes analytics, so that teams can move more rapidly towards decision making.
In his post "Determining the Economic Value of Data", Bill Schmarzo states: "The cost of acquiring each of the data sources would need to consider not only the cost to acquire the data, but also the cost to clean it up, align it, transform it and enrich it."
Three tips that could be valuable for you:
- Do not waste your time and money cleaning data manually in Excel files
- Use a visual tool with a user-friendly interface to get your data ready for analysis
- Try not to reinvent the wheel each time you need to clean up data: make sure to save your processes and methodology
Data Prep should not be rocket science when you don’t aim to get to the moon.
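The third tip, saving a cleanup process so it can be replayed on new data, can be illustrated with a small Python sketch. It is a hypothetical illustration only, not Talend Data Preparation's API: the step functions, the field names (`full_name`, `email`) and the sample row are all invented for the example.

```python
# Hypothetical sketch of a reusable cleanup "recipe": an ordered list of
# small steps that can be replayed on any new batch of rows.
# The field names and steps are assumptions made for illustration.


def strip_whitespace(row):
    """Trim leading/trailing spaces from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def split_full_name(row):
    """Split 'full_name' into 'first_name' and 'last_name' on the first space."""
    row = dict(row)  # copy so the input row is left untouched
    first, _, last = row.pop("full_name", "").partition(" ")
    row["first_name"], row["last_name"] = first, last
    return row


def normalize_email(row):
    """Lower-case the 'email' field."""
    row = dict(row)
    row["email"] = row.get("email", "").lower()
    return row


# The saved recipe: order matters, since trimming must happen before splitting.
RECIPE = [strip_whitespace, split_full_name, normalize_email]


def apply_recipe(rows, recipe=RECIPE):
    """Run every step of the recipe, in order, on each row."""
    cleaned = []
    for row in rows:
        for step in recipe:
            row = step(row)
        cleaned.append(row)
    return cleaned


january = [{"full_name": "  Ada Lovelace ", "email": "ADA@Example.com "}]
print(apply_recipe(january))
```

Because the recipe is just data (a list of steps), the same `apply_recipe` call can be run on next month's export without rewriting any of the cleanup logic.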
“As a former IT manager, I was always embarrassed to tell my business owners that their simple request to reformat or split fields in our ODS would take 3 weeks to develop, test and deploy to production.”
Mark Balkenende - Technical Product Marketing Mgr for Talend
Talend combines data preparation and data integration to transform how business and IT can turn data into insight. As part of the Talend Data Fabric, Data Preparation gives IT full control to manage data access and facilitate collaboration across the enterprise.
Data Preparation enables anyone, from Data Scientists to Business Users, to access and cleanse data using browser-based, point-and-click tools. A dataset inventory makes it easy to reuse data cleanup recipes on updated data to automate your work.
Talend Data Preparation’s web-based interface and workflow are intuitive and provide users with intelligent assistance as they import, structure and transform data. Even without an IT skillset, users can quickly get data into the form they need, without having to create complicated formulas, write code, or complete the same tasks over and over again. Key features include automatic content discovery or identification, smart suggestions, cleansing and enrichment functions, as well as data visualizations that allow users to begin their data analysis immediately. The product also allows key “recipes” or formulas to be saved and reused, further speeding up future projects.
Talend Data Preparation ranked #1 in Quality Identification at micro- and macro-scale
Synaltic analyzed and presented (in French) the Data Preparation process in 7 steps: Import, Discovery, Organization, Cleansing, Enrichment, Validation and Publication. They demonstrate the value of Data Prep tools that let Business Users build reusable scripts (we call this a “recipe” in Talend Data Preparation) and show how these tools help establish an effective bridge between IT projects and Data Governance.
This white paper also shows that Data Preparation is the field where IT and Business teams pool their expertise to improve their organization’s overall Data Management strategy.