Thoughts from the trenches: Little data, dark data, “ramps” and related things

“Big data” and its twin, AI, are a big theme, have been so for some time (for us too), and are continuing at an accelerated pace. But seeing big data “in the trenches” tends to reveal its “underbelly”: dark data, chaotic data, traffic chaos, and so on. Put differently, not all data is created equal: a lot of data is little, dark, unintelligible and unintelligent, and some of it can be the less-than-useful sand in the gears of the AI machines that are, at least in principle, moving us to the smart new future.

The key question is what we can do about it, which requires me to first define the “what”.

First, the “what”: by this I mean

1. Dark data: a lot of data actually already exists but is somewhat hidden from sight. Getting the benefits of “big data” often becomes a matter of overcoming data silos: giving this dark data visibility, cleaning it up, assigning access rights to it, and so on. The quality of the data matters, whether it is light or dark.

The opposite of this: data that is easily collected or readily available may not be the most useful for businesses. To be more specific, not all data or “data exhaust” is useful: “big data” can be bigger than big, and while some “exhaust” has great potential, a lot of it is not immediately valuable, so you need to work out what is or could be. Not only does storing data that will never be useful create waste, but data exhaust also carries risks, legal as well as commercial, since you may alienate your customers.

Avoiding this discussion may simply reflect an avoidance of the data silo problem; digital transformation comes from having the right people access the relevant data at the right time, not just from doing analytics on the “bright” data that is easy to get to.

2. Little data: the power of “big data” depends on bringing to bear the power of many little data. That power actually comes from working out which of these “little” data are powerful or useful, and from having some sense of the relationships between them – otherwise, it’s garbage in, garbage out (a short sketch of this follows below).

Points 1 and 2 are important because they often result in less dangerous biases. Bringing “dark data” to light is proving very useful in industrial settings, and bias tends to be less of an issue when this type of “big data” is used in B2B settings, for example to improve operational efficiency rather than to decide consumer tastes or product features.
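Returning to point 2 for a moment, here is a minimal sketch (in Python, with entirely invented signal names, numbers and threshold) of what “working out which little data are useful” can look like in practice: screen each candidate signal against an outcome you care about before pooling everything into a bigger model.

```python
import numpy as np

# Hypothetical "little data" signals pulled from different silos, plus an
# outcome we care about (say, downtime hours).  Every value here is invented.
signals = {
    "vibration_rms":  np.array([0.2, 0.5, 0.4, 0.9, 0.7, 1.1]),
    "ambient_temp_c": np.array([21.0, 22.0, 21.0, 23.0, 22.0, 21.0]),
    "operator_shift": np.array([1, 2, 1, 2, 1, 2]),
}
downtime_hours = np.array([0.1, 0.6, 0.3, 1.2, 0.8, 1.5])

# Screen each signal by its correlation with the outcome; weakly related
# signals are candidates for dropping before they become sand in the gears.
for name, values in signals.items():
    r = np.corrcoef(values, downtime_hours)[0, 1]
    verdict = "worth keeping" if abs(r) > 0.5 else "question its value"
    print(f"{name:15s} r = {r:+.2f} -> {verdict}")
```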

3. Data are much more useful when they are integrated into the workflow, in the physical space where you work, at the moment you need them. They are also much more useful when they are driven by user behaviour and need (in consumer applications) or by worker behaviour and need (in enterprise applications). Framing the conversation that way helps drive the collection of data that may have to cross data silos, and enables information – and potentially analysis and insight – to be developed and business value to be realised. This can mean the difference between isolated or siloed data (and their respective siloed analytics products) and in-context information, which in turn can mean the difference between simple digitisation and digital transformation.

4. Calling everything data can be misleading and can make us lose sight of the fact that there is usually a complicated human being behind a piece of data. For example, does a person’s expressed preference tell us more about his or her buying intentions, or is the preference implied by past behaviour more useful? We cannot avoid making judgements of this kind, or needing them to shape how we (construct the algorithms to) analyse the data. We also need to recognise the limitations of the data and exercise care when using the predictions.
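To make that judgement concrete, the toy sketch below blends a stated preference with a preference implied by past behaviour; the blend weight is exactly the kind of human judgement the paragraph refers to, and every category, number and function name is an assumption made up for illustration.

```python
# Toy illustration: a stated preference vs. a preference implied by behaviour.
# The blend weight encodes a human judgement about which signal to trust more;
# all categories and numbers are invented.

stated_interest = {"hiking_gear": 0.9, "office_wear": 0.2}   # survey answers
purchase_counts = {"hiking_gear": 1, "office_wear": 7}       # past behaviour

def implied_interest(counts: dict[str, int]) -> dict[str, float]:
    """Turn raw purchase counts into a normalised implied-preference score."""
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

def blended_score(category: str, trust_in_stated: float = 0.3) -> float:
    """Weighted mix of what the customer says and what the customer does."""
    implied = implied_interest(purchase_counts)
    return (trust_in_stated * stated_interest.get(category, 0.0)
            + (1 - trust_in_stated) * implied.get(category, 0.0))

for category in stated_interest:
    print(category, round(blended_score(category), 2))
```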

5. Data that carry (identification or travel) restrictions: relatedly, we are living in a world where it is becoming a “must” to distinguish between various types of data, with privacy-sensitive and personal data requiring different levels of protection and sometimes geographic restrictions – limitations on “crossing borders” or “travelling” – or what has been called “data sovereignty”.
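As a sketch of what “distinguishing between various types of data” can mean operationally, the fragment below tags each record with a sensitivity class and a home region, and refuses a cross-border transfer unless an (entirely illustrative) policy allows it; it is not a statement of any particular regulation.

```python
from dataclasses import dataclass

@dataclass
class Record:
    payload: dict
    sensitivity: str   # e.g. "public", "personal", "special-category"
    home_region: str   # where the data normally has to reside

# Illustrative policy only: which sensitivity classes may leave their home region.
MAY_CROSS_BORDERS = {"public": True, "personal": False, "special-category": False}

def can_transfer(record: Record, destination_region: str) -> bool:
    """Allow the transfer if the data stays home or its class is free to travel."""
    if destination_region == record.home_region:
        return True
    return MAY_CROSS_BORDERS.get(record.sensitivity, False)

r = Record({"heart_rate": 72}, sensitivity="special-category", home_region="EU")
print(can_transfer(r, "EU"))   # True  - stays within its home region
print(can_transfer(r, "US"))   # False - restricted class, blocked from travelling
```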

Bringing these five points together, and considering what to do about them, we come to two major considerations:

First, we need to avoid the internet of connected unintelligence. Not-useful data take up storage space, processing them takes up power, and sending them to the cloud creates a further series of problems, including wasted bandwidth, increased security risks on the network, and so on.

Second, when we work out which data to “dump” and which data need to be sent to the cloud – and which cloud – then we will know what we need to do at the “edge” (and there are also data that need to stay on the device or at the device edge). Edge-native architectures will inevitably arise; data sovereignty is increasingly a driver of edge computing.
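A minimal sketch of that triage at the edge, with invented thresholds and field names: dump readings that carry no information, keep a short raw history on the device, and forward only a compact summary to the cloud.

```python
from statistics import mean

def triage_at_edge(readings: list[float], noise_floor: float = 0.05) -> dict:
    """Decide, batch by batch, what to dump, what stays on the device,
    and what little is worth sending upstream to the cloud."""
    useful = [r for r in readings if abs(r) > noise_floor]   # dump the rest
    keep_on_device = useful[-100:]                           # short local history
    send_to_cloud = {                                        # compact summary only
        "count": len(useful),
        "mean": mean(useful) if useful else 0.0,
        "max": max(useful, default=0.0),
    }
    return {"local": keep_on_device, "upstream": send_to_cloud}

batch = [0.01, 0.02, 0.4, 0.0, 1.3, 0.03, 0.9]
print(triage_at_edge(batch)["upstream"])   # {'count': 3, 'mean': 0.87..., 'max': 1.3}
```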

Further, it is sometimes not necessary to make “big data” bigger than it needs to be.

A lot of data will go unused and be thrown away, which is probably a good thing. Some data is useful only for that day, or within a short timeframe. Unless we find ways to manage the data flows, “traffic jams” will create latency – i.e. slow response times – which reduces the value of “big data”.

This brings us full circle back to point 1 above. A lot of data is only temporarily useful; let’s deal with it quickly and dump it quickly.
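One simple way to “deal with it quickly and dump it quickly” is to give each item an explicit time-to-live, so short-lived data expires by design rather than piling up; the store and TTL values below are an invented illustration, not a prescription.

```python
import time

class ExpiringStore:
    """Tiny store where every item carries a time-to-live (TTL) in seconds."""

    def __init__(self):
        self._items = {}   # key -> (value, expiry_timestamp)

    def put(self, key, value, ttl_seconds: float):
        self._items[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        value, expiry = self._items.get(key, (None, 0.0))
        if time.time() > expiry:
            self._items.pop(key, None)   # dump it quickly
            return None
        return value

store = ExpiringStore()
store.put("traffic_snapshot", {"speed_kmh": 12}, ttl_seconds=60)   # useful for a minute
print(store.get("traffic_snapshot"))
```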

Finally, the “so-what”:

  1. Differentiating between data of different value, and assessing the quality of data, is and will become increasingly important, especially in an enterprise context.
  2. AI today and over the next few years will be focused more on the quality of the algorithms and the quality of the data.
  3. There are limits to the race-to-as-much-data-as-possible approach, and all AI solutions will have some built-in “speed ramps”, including an accelerated focus on “explainability” (on what data is used, how it is collected, and transparency over the algorithms being applied).
  4. Further, some simple observations tell us that data is not the new oil: each unit of oil looks and performs much like the next, whereas the medical data of one person can be quite different from those of another, and a person’s genomic data are themselves correlated in various (still somewhat unknown) ways with that person’s phenotype data.

Question to companies young and old: are you building your data architectures based on these considerations, and is your business model aligned accordingly?
