We Store More Data than the Measures We Have to Secure Them

Some organizations are already leveraging insights from big data, while others are storing data unnecessarily.

Erik Dörnenburg (Head of Technology) and Dave Elliman (Market Technical Principal) of ThoughtWorks Europe cut through the clutter to help you decide what to store, why, and how to leverage insights from big data.

The ‘capture-it-all’ approach to big data raises serious security and privacy concerns!
Good security is hard to attain. At a basic level we still see best practices such as the ones outlined in the OWASP Top 10 ignored, despite the fact that the OWASP list has been published for a decade now. That said, even for organizations that attempt to implement good security practices, achieving a secure system is difficult. Criminal hackers are extremely inventive; the gains from successful security breaches can be so substantial that inside jobs must be defended against, too; and, last but not least, state-sponsored agencies have resources that most organizations simply cannot defend themselves against.

“Achieving a secure system is difficult, with criminal hackers being extremely inventive and state-sponsored agencies having resources that most organizations simply can’t defend themselves against”

In light of this threat landscape, the amount of data collected today poses a serious problem. We are storing ever more data while the chances of keeping that data secure are decreasing.

The hype around big data is making businesses store humongous amounts of personal data unnecessarily!
One approach to dealing with this issue is to be selective about which data to store. If the data isn't stored, there's no danger of it ending up in the wrong hands. We have included this principle in the Technology Radar report under the term Datensparsamkeit, which is taken from German privacy legislation and roughly translates to data economy or data austerity.

A good example of this approach is the information stored in an access log. Usually the visitors' IP addresses are logged for later analysis, mainly to see which region or organization the visitors are coming from. This information, however, is available from the first three octets of an IPv4 address; there is no need to log the full address. Logging the full address, by contrast, means that the entries become potentially personally identifiable, which is undesirable from a privacy perspective.
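As a rough illustration of the access-log idea, the following Python sketch zeroes the last octet of an IPv4 address before the entry is written. The function names `truncate_ip` and `scrub_log_line` are hypothetical, the log line is assumed to start with the client address (as in the Common Log Format), and the /48 prefix used for IPv6 addresses is our own assumption rather than something stated above.

```python
import ipaddress


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Return the address with its host bits zeroed.

    For IPv4, a /24 prefix keeps the first three octets.
    The IPv6 prefix length is an assumption made for this sketch.
    """
    ip = ipaddress.ip_address(address)  # raises ValueError if not an IP
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False allows host bits to be set in the input address
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


def scrub_log_line(line: str) -> str:
    """Truncate the client address in a log line whose first field is the IP."""
    host, sep, rest = line.partition(" ")
    try:
        return truncate_ip(host) + sep + rest
    except ValueError:
        # First field is not an IP address (for example a hostname);
        # this sketch leaves such lines unchanged.
        return line


if __name__ == "__main__":
    sample = '203.0.113.77 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326'
    print(scrub_log_line(sample))
```

Truncating before the entry is written, rather than in a later clean-up job, means the full address never reaches storage in the first place.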

In summary, we see the benefits of storing and analyzing large amounts of data. However, we advocate thinking very clearly about the implications and finding solutions to business problems that play well with privacy concerns.

What's the big data equation for businesses that are new to the trend and want to leverage it for business benefit?
Big Data is often loosely defined as 'the 3 Vs': volume, velocity and variety. Often more 'Vs' are added. The most common is 'value', and this, in fact, is the only one that matters. 'Big Data' is not about a technology, or even a set of technologies, or even an analytical model. It is quite simply about a business understanding enough about the data that it has or needs, and about the questions it can ask of that data. This focuses any data analysis firmly on following the value. It is always best to start small with a short test and learn cycle, building valuable insight incrementally, and then over time adopting the larger technical frameworks often associated with big data. This approach avoids wasting time and effort up front for little payback.

Please share an example of how big data is being leveraged to business benefit.
As companies start to experiment with data, analysis and its possibilities, they begin to realize how much the sources of data have changed in the last few years; social media data is one example. They then realize that new possibilities exist. One such possibility is personalization: processing disparate data sources, and possibly correlating them, to build a picture of the consumers you wish to interact with. Doing so requires different analytical approaches to group, filter, correlate, and/or recognize patterns in data sets.

"It is always best to start small with a short test and learn cycle building valuable insight incrementally, and then over time adopting the larger technical frameworks around big data"

A good illustration of this came when researchers, led by Adam Sadilek of the University of Rochester in New York, were looking for a way to use Twitter to predict when individuals would get sick with the flu. By analyzing 4.4 million tweets with GPS location data from more than 630,000 Twitter users around New York City, the team created a heat map of where people were unwell. They then created a video mapping the spread of illness across the city over the course of a day. Based on that data, the team could predict, with 90% accuracy, when an individual would get sick up to eight days before symptoms appeared. This not only illustrates the use of different techniques but also shows how visualization was key to understanding how to interpret the results.
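To give a feel for the heat-map step, here is a minimal Python sketch that counts geo-tagged points per grid cell. It is not the researchers' method: the function name `heat_map_counts`, the 0.01-degree cell size, and the sample coordinates are all made up for illustration.

```python
import math
from collections import Counter


def heat_map_counts(points, cell_size_deg=0.01):
    """Count (latitude, longitude) points per square grid cell.

    The cell size is an arbitrary assumption for this sketch.
    """
    counts = Counter()
    for lat, lon in points:
        # Map each coordinate to the integer index of its grid cell
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    # Invented sample coordinates around New York City
    sample_points = [(40.7128, -74.0060), (40.7130, -74.0059), (40.7306, -73.9866)]
    for cell, count in heat_map_counts(sample_points).most_common():
        print(cell, count)
```

The per-cell counts are the kind of aggregate a plotting library could then render as colour intensity on a map.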
