Right Approach, Right Solution to Big Data

As big data grows ever bigger, IT managers are challenged with the task of finding the right solutions and identifying the right data.

Every five or ten years, the hype around one technological innovation or trend creates a buzz in the IT community. Until recently it was the cloud; not that people have stopped talking about it, but Big Data is the new hero. The purpose of big data is still multi-layered and multi-defined. Gartner's 3Vs (Volume, Variety and Velocity) focus mostly on the overall management of data, but not all IT managers agree with the 3Vs. It all depends on the organisational need and the kind of big data solutions they plan to implement. There is another V that IT managers like to add: Value.

Big Data Bothers?

Everyone has been pondering the need for big data, but much of the activity so far is still tyre kicking.

The confusion among IT managers and CIOs is about how these 4Vs align with the real need. It has been observed in several cases that a client may not actually need a big data solution, because a traditional data warehouse is good enough to meet the requirement.

It's easier to qualify a use case, but convincing some IT managers to adopt a big data solution is a daunting task, especially if the organisation comes from a pure-play data warehousing background. To arrive at the right approach or solution, it is critical for them to understand the big data life cycle and its inherent challenges, the changes in approach that big data demands, the cues offered by the big players, the potential issues with packaged solutions and the changes required in IT thought leadership, and then work out an effective implementation plan.

Big Data Life Cycle

The data life cycle in big data environments has four stages: acquire, organise, analyse and decide. It spans large amounts of data in both new and traditional formats, often in real time.

Big data grows incredibly fast; plenty of statistics reveal the speed at which unstructured and semi-structured data accumulates. Each day, we create around 2.5 exabytes of data. Most big data is fleeting by nature: data mined from timely sources such as sensors, social media and web logs, when used in real time, is outdated before one knows it.

So, in the big data life cycle, acquiring data from different sources and organising it paves the way for intelligent analysis and better decision making; together, these stages define the entire life cycle of big data.
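To make the four stages concrete, here is a minimal sketch that walks a handful of hypothetical sensor readings through acquire, organise, analyse and decide steps; the feed, field names and alert threshold are all invented for the illustration.

```python
import json
import statistics
from datetime import datetime, timezone

def acquire(raw_lines):
    """Acquire: parse raw JSON records from a (hypothetical) sensor feed."""
    return [json.loads(line) for line in raw_lines]

def organise(records):
    """Organise: keep only well-formed readings and normalise timestamps."""
    clean = []
    for r in records:
        if "temp_c" in r and "ts" in r:
            r["ts"] = datetime.fromtimestamp(r["ts"], tz=timezone.utc)
            clean.append(r)
    return clean

def analyse(records):
    """Analyse: reduce the readings to a simple summary statistic."""
    temps = [r["temp_c"] for r in records]
    return statistics.mean(temps) if temps else None

def decide(mean_temp, threshold=75.0):
    """Decide: turn the analysis into an action."""
    return "raise alert" if mean_temp and mean_temp > threshold else "no action"

raw = ['{"ts": 1700000000, "temp_c": 71.2}',
       '{"ts": 1700000060, "temp_c": 79.8}',
       '{"bad": "record"}']
print(decide(analyse(organise(acquire(raw)))))
```

In a real environment each stage would be a distributed job rather than a function call, but the division of responsibilities stays the same.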

As one scales the big data environment, it is important to ensure that life cycle requirements can be supported within the current constraints of storage capacity, bandwidth, processor and memory speeds, and metadata depth, covering all the 3Vs.

But there is an end of life that we often overlook. Amid the excitement around big data, it is common practice not to plan for the day when data is no longer necessary, or to determine what should happen when that day comes. IT managers should review their policies periodically, as they may need to redefine which data to retain, which to archive and which to delete.
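One way to make such a retention policy concrete is a periodic job that archives files past a defined age; the directories and the 90-day cut-off below are purely illustrative.

```python
import shutil
import time
from pathlib import Path

RETENTION_DAYS = 90                  # illustrative cut-off; set per governance policy
DATA_DIR = Path("/data/raw")         # hypothetical live landing area
ARCHIVE_DIR = Path("/data/archive")  # hypothetical cold storage

def apply_retention(now=None):
    """Move files older than the retention window out of the live area."""
    now = now or time.time()
    cutoff = now - RETENTION_DAYS * 86400
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for path in DATA_DIR.glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(ARCHIVE_DIR / path.name))

if __name__ == "__main__":
    apply_retention()
```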

Challenges in Approach

A typical data warehousing project follows a traditional waterfall methodology, wherein requirements, design, implementation, verification and maintenance proceed serially.

In most cases, despite all the upfront detailed planning, what emerges at the end of implementation is a solution that satisfies past reporting requirements while doing little for the much-needed key decisions of the future.

Due to tight deadlines and the serial approach, there is always a rush to complete all the individual stages before the final goal is met, resulting in key change requests being rejected and something being delivered that was valuable in the past but may not count towards key decision making in the future.

Big Data projects are the new kid on the block; right from the concept commit through the execute commit to project execution, a different mindset and different methodologies are needed. With the shift to pay-as-you-go and shoestring budgets for newer technologies, using the waterfall methodology for a Big Data project is setting it up for certain failure. And with so many distributions available, selecting the right vendor can be a daunting task, especially when there is a strong open source option in the Apache Hadoop project.

Hadoop and the Big Players

The Apache Hadoop project develops open source software for reliable, scalable, distributed computing. It has been designed to scale across thousands of machines with minimal setup, reduced latency and high fault tolerance. Hadoop enables a computing solution that is scalable, cost effective, flexible and fault tolerant. With these changing dynamics, an agile project methodology is one of the keys to success for any big data project: develop, evaluate and demonstrate rapidly, iteratively and collaboratively, so that business team members get the opportunity to give feedback in real time, ensuring that the technical team delivers what the business team anticipated and envisioned.
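For a flavour of how Hadoop splits work across machines, the classic word-count job can be written as a mapper and reducer pair for Hadoop Streaming. This is a minimal sketch rather than a tuned production job, and the script name and input paths in the usage note are assumptions.

```python
#!/usr/bin/env python3
# Minimal word count for Hadoop Streaming: run this same script as the
# mapper (argument "map") and as the reducer (argument "reduce").
import sys

def mapper():
    # Emit <word, 1> for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming delivers keys in sorted order, so counts can be summed per key run.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Saved as, say, wordcount.py, it would be submitted with the hadoop-streaming JAR along the lines of: -files wordcount.py -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" -input /data/in -output /data/out (paths illustrative).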

Change in Approach

Look at Solution from Top Down rather than Bottom Up

Traditionally, IT tackles data problems from the bottom up, as in after-the-fact data analysis. With big data, the solution approach needs to be top down, wherein the power of the solution lies in proactive or predictive analysis.

Build the Solution for the “Unknown Unknowns”

With limited storage space and hard-hit operations budgets, IT traditionally pulls in only the data that meets the need of the hour and aligns with a defined business requirement. With big data, the approach needs to shift towards pulling in as much data as possible to find the “unknown unknowns” for better predictability and analysis.
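One common pattern that supports this is landing raw events as-is and deferring schema decisions to read time. The sketch below stores untouched JSON lines and extracts a field only when a question is eventually asked; the file name and event fields are assumptions for the illustration.

```python
import json
from pathlib import Path

RAW_LOG = Path("events.jsonl")  # hypothetical landing file: keep every event verbatim

def land(event: dict) -> None:
    """Write the raw event untouched; no upfront schema or filtering."""
    with RAW_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def ask(field: str):
    """Schema-on-read: extract a field only when a new question arises."""
    with RAW_LOG.open() as f:
        for line in f:
            event = json.loads(line)
            if field in event:
                yield event[field]

land({"user": "u1", "page": "/pricing", "referrer": "search"})
land({"user": "u2", "page": "/docs"})
print(list(ask("referrer")))   # a question nobody planned for at ingest time
```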

Build Now and Show Now

Historically, IT has built reporting solutions with a “build it and they will come” mindset. With the wide acceptance of business intelligence, that perception has changed; but IT needs to build big data applications with a “build now and show now” attitude. In short, teams need to partner well with the business owner, maintain a feedback loop at every stage of application development, and provide high-performance solutions to support big data analytic workloads.

A New Job Role: “Data Science”

Big data is not all about the technology; it has brought with it the emergence of a new discipline, “data science,” which combines mathematics, computer programming and computer science. According to experts, a data scientist is a technical expert who is curious about the details, a good storyteller, and has the creative eye to look at problems in different ways.

Newer Data Governance Policies

Typically, corporations are very sensitive about their data but do not bother to categorise governance rules for different datasets. Strict governance on sensitive data does not mean every dataset needs the same treatment; applying the same strict rules to, say, bug descriptions could be overkill. With big data, governance policies need to be relaxed and more favourable towards data access. A favourable and relaxed policy does not mean “access to all”; rather, it means policies that are more relaxed than they have historically been.

Potential Issues with Packaged Solutions

Traditionally, packaged software solutions have worked best for problem statements that change slowly. But when the problem is strategic and the business need is exclusive, packaged solutions have tended not to work. Big data being a newer technology, packaged solutions may not work as anticipated for companies whose business models are unique. Different vendors take different approaches to Big Data architecture. Having worked with Cloudera and Hortonworks, I have seen these two packaged distributions of Apache Hadoop-based solutions provide immediate value to customers from their technology stacks. Even so, vendor lock-in, recurring support costs and vendor uncertainty can become issues.

The keys to success for any Big Data project involve these steps (a rough end-to-end sketch follows the list):

  • Data identification

  • Ingesting and cleaning

  • Hardware and platform selection

  • Machine learning

  • Data storage

  • Sharing and acting (the most important)
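As a toy-scale illustration of those steps, the sketch below identifies a data source, ingests and cleans it, fits a trivial trend model in place of real machine learning, stores the result and shares a decision. Every name and number in it is invented for the example, and hardware and platform selection is a sizing decision rather than code.

```python
import csv
import io
import json
import statistics

# 1. Data identification: a (hypothetical) CSV export of daily sales.
RAW_CSV = "day,sales\n1,100\n2,110\n3,\n4,130\n"

# 2. Ingesting and cleaning: drop rows with missing values.
rows = [r for r in csv.DictReader(io.StringIO(RAW_CSV)) if r["sales"]]
days = [int(r["day"]) for r in rows]
sales = [float(r["sales"]) for r in rows]

# 3. "Machine learning": a least-squares trend line as a stand-in for a real model.
mean_x, mean_y = statistics.mean(days), statistics.mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sales)) / \
        sum((x - mean_x) ** 2 for x in days)
intercept = mean_y - slope * mean_x

# 4. Data storage: persist the model so others can reuse it.
with open("model.json", "w") as f:
    json.dump({"slope": slope, "intercept": intercept}, f)

# 5. Sharing and acting: turn the forecast into a decision.
forecast = slope * 5 + intercept
print("order extra stock" if forecast > 120 else "hold steady")
```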

Desired Changes in IT Thought Leadership

As Big Data gets really big, IT is undergoing a reversal of priorities: it is the programs and processes that need to move, not the data. Nor is it all about huge infrastructure. Big data experts consistently report that 80 per cent of the effort involved in dealing with data is cleansing it. Because of the high cost of data acquisition and cleansing, it is worth considering what you actually need to source yourself.
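The sketch below illustrates the kind of work that consumes that 80 per cent: deduplicating records, normalising inconsistent values and discarding unusable rows. The field names, mapping table and rules are assumptions made up for the example.

```python
# A minimal data-cleansing pass: dedupe, normalise and drop bad records.
raw = [
    {"email": "A@Example.com ", "country": "IN"},
    {"email": "a@example.com", "country": "India"},   # duplicate after normalising
    {"email": "", "country": "US"},                   # unusable: no key
]

COUNTRY_MAP = {"india": "IN", "in": "IN", "us": "US"}  # illustrative mapping

seen, clean = set(), []
for rec in raw:
    email = rec["email"].strip().lower()
    if not email or email in seen:
        continue                       # drop unusable rows and duplicates
    seen.add(email)
    country = COUNTRY_MAP.get(rec["country"].strip().lower(), rec["country"])
    clean.append({"email": email, "country": country})

print(clean)   # one deduplicated, normalised record per customer
```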

Remember that big data is not nirvana. You can find patterns and clues in your data, but then what? Like any investment, big data always benefits from a tangible goal.
