Artificial Intelligence and Big Data are not the same, but they are entwined and vital to each other. As remarkable as the industry’s achievements have been, there are still issues to clear up.
In the beginning…
The concept of Artificial Intelligence (if not the term) dates back to 1947, although the thinking behind it may go back further still. In that year Alan Turing gave a public lecture in which he discussed the idea of a machine having intelligence – learning from its own experience, altering and writing its own instructions. The pursuit of making machines that can think, and the phrase ‘Artificial Intelligence’ coined to describe it, began at the 1956 Dartmouth Summer Research Project on Artificial Intelligence, organised by John McCarthy.
‘Big Data’ as a concept has been current since the 1990s, and is said (although this cannot be confirmed) to have been coined by John R. Mashey, who worked at Silicon Graphics. Whilst the human compiling and storage of knowledge goes back millennia, Big Data (‘Big data happens when there is more input than can be processed using current data management systems’) didn’t really come into its own as ‘a thing’ until the appearance of smartphones and tablets.
Given the scale of the ‘big’ in ‘Big Data’, examining big data without AI is currently impossible. It could be said the two fields of work have a symbiotic relationship, with Big Data providing a massive boost to Artificial Intelligence.
AI is used in areas such as Banking & Securities, Healthcare Provision, and Manufacturing & Natural Resources. It can build models of the cosmos to test theoretical ideas such as dark matter, detect financial fraud, and match drug combinations to produce novel treatments and therapies - finding solutions to problems in hours, where humans would take years.
The ‘B’ in ‘Big Data’
One way of thinking about ‘Big Data’ is the ‘six Vs’: volume, velocity, variety, veracity, value and variability.
- Volume speaks to the scale, the sheer amount of data at hand. The flow of data from sources such as smartphones, laptops and networks can now be measured in the many billions of gigabytes.
- Velocity describes the speed at which this data is accumulated from the varied data sources we now use. To take one example, there are in the order of 3.5 billion Google searches each day.
- Variety refers to the multiplicity of types of data, be it structured, semi-structured or unstructured, and to the variety of different sources it is drawn from.
- Veracity highlights the inconsistencies and uncertainties within that data. Data can be confusing, it can be wrong and it can be difficult to draw conclusions from.
- Value within data is its utility, the usefulness of that data. Data has to be gathered and subjected to study before it can be said to be valuable, that is, before the information it contains can be made useful.
- Variability is how much the data changes and how often it changes.
There is a lot of big data about.
One long-standing estimate puts daily data creation at nearly 2.5 quintillion bytes (2.5 exabytes); more recent figures are far higher, at around 328.77 million terabytes, or 0.33 zettabytes, every 24 hours.
This year, 2024, we will create 120 zettabytes of data, while forecasts suggest 181ZB will be produced during 2025, roughly 1.5 times as much.
Indeed, the 2ZB of data generated in 2010 had grown 60-fold by the end of 2023. It is now believed that 90% of all data has been generated in just the last two years, and that the volume of data doubles every two years. Naturally, this creates enormous challenges in how the data is organised, managed and maintained.
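As a rough sanity check on these figures, a minimal Python sketch (assuming decimal units: 1 TB = 10^12 bytes, 1 ZB = 10^21 bytes) can verify the conversions and the growth rates quoted above:

```python
# Rough sanity check on the data-volume figures quoted above.
# Assumes decimal (SI) units: 1 TB = 1e12 bytes, 1 ZB = 1e21 bytes.
TB = 10**12
ZB = 10**21

# 328.77 million terabytes per day, expressed in zettabytes.
daily_bytes = 328.77e6 * TB
print(f"Daily volume: {daily_bytes / ZB:.2f} ZB")   # ~0.33 ZB

# Growth from 120 ZB in 2024 to a forecast 181 ZB in 2025.
print(f"2024 -> 2025: {181 / 120:.2f}x")            # ~1.51x, i.e. roughly 50% more

# 60-fold growth from the 2 ZB generated in 2010.
print(f"2 ZB x 60 = {2 * 60} ZB")                   # 120 ZB
```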
Without artificial intelligence, using and making sense of big data would be nearly impossible, and with the right data inputs, AI can produce impressive results in how big data is sorted and managed.
There are criticisms
The problems identified with AI range from the relatively trivial, such as the wacky targeting of ads and recommendations for products and services an individual wouldn’t use, let alone want, to the failure of Google Flu Trends to accurately predict flu levels for several years after 2011.
It is claimed that analysing large volumes of data provides greater accuracy, which is not necessarily true. Freshly acquired data is not as valuable as data with some history behind it, and the sheer volume of data may matter less than what it contains and where it has come from. The larger the data set, the more likely it is to reproduce bias and error if there are mistakes and inconsistencies within the data set itself. For example, Electronic Health Records (EHRs) in the US can give a distorted picture of the health of the population, because a substantial number of Americans have no health insurance or use healthcare only infrequently.
A sample of problems
To take one instance: in the United States in the 1980s, the large Nurses Health Study followed 48,470 postmenopausal women aged 30–63 for 10 years (337,854 person-years). Its conclusion was that Hormone Replacement Therapy (HRT) cut the rate of serious coronary heart disease nearly in half. What the study failed to take account of was the unusual nature of the sample and the conflation of oestrogen use with other positive health factors; in spite of its large size, it proved unable to recognise how atypical that sample was.
A later study which controlled for self-selection, a part of the Women's Health Initiative (WHI), showed that HRT did not lower the rate of coronary disease, instead suggesting oestrogen replacement could actually be harmful.
Subsequently, investigators were able to show that the results of the Nurses Health Study do map to the WHI if the focus is on new hormone replacement users. Observational studies can provide valuable causal information, but only when the investigators have the right model. When the underlying sampling model is wrong, large sample size can magnify the bias.
Mistakes in the sample lead to mistakes in the results. When those results are fed into an AI, they create biases and inaccuracies that affect its performance and outcomes.
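To illustrate how a flawed sampling model behaves at scale, here is a minimal, hypothetical Python simulation (entirely synthetic data, not a model of either study): a hidden ‘healthy lifestyle’ factor makes people more likely both to take a treatment and to have good outcomes, so a treatment with zero real effect looks beneficial, and the spurious benefit does not shrink as the sample grows.

```python
# Hypothetical illustration: selection bias does not shrink as the sample grows.
# Entirely synthetic data; the 'treatment' has zero real effect on the outcome.
import random

def biased_estimate(n, seed=0):
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        healthy = rng.random() < 0.5                                 # hidden confounder
        takes_treatment = rng.random() < (0.8 if healthy else 0.2)   # healthier people self-select
        good_outcome = rng.random() < (0.9 if healthy else 0.6)      # depends only on the confounder
        (treated if takes_treatment else untreated).append(good_outcome)
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: apparent benefit = {biased_estimate(n):.3f}")
# The apparent 'benefit' hovers around 0.18 and does not vanish as n grows,
# even though the true effect of the treatment is exactly zero.
```

The cure is a better sampling or causal model, not simply more of the same data.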
Say again…
Another issue revolves around a central plank of science and engineering, but is far more complex in AI.
Repeatability (also ‘reproducibility’, or ‘replicability’), the ability of a person to use the same tools and data to reproduce the results of research, lacks a standard model. Barriers to this include the availability of data and models, infrastructure and publication pressure.
Within data science, reproducing a result should be trivial in a way it isn’t for real-world, hands-on research. Because running the same code on the same data, in the same environment, can produce exactly the same results (far easier than having to buy equipment and materials, work by hand and exactly recreate laboratory conditions), the standard expected of reproducibility in computational research is very high.
Being unable to repeat results can be a serious obstacle for other researchers trying to adopt or build on a new tool or algorithm.
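In practice, much of the work is pinning down sources of nondeterminism and recording them. A minimal sketch of the idea in Python (assuming NumPy is installed; real projects would also need to pin library versions, data versions and any framework-specific or GPU-related randomness):

```python
# Minimal sketch: fixing random seeds so a computational result can be rerun
# and reproduced exactly (given the same code, data and library versions).
import random
import numpy as np

SEED = 42  # record this alongside the code, the data version and the environment

def set_seeds(seed: int) -> None:
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG

set_seeds(SEED)
print(np.random.normal(size=3))  # identical output on every run with the same seed
```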
Nearly is not nearly good enough
In November 2023, Dr James Luke, the Innovation Director at Roke, told an event hosted by the IET that 95% or even 99% accuracy in AI can be meaningless. Clients he has worked for required 100% accuracy from his AI, because the consequences of even a 1% error rate could mean a person losing an arm in an accident. For those clients, better than 95% was not optional.
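To put the point in concrete terms (illustrative numbers only, not Dr Luke’s), a short calculation shows how quickly a small error rate adds up at scale:

```python
# Illustrative only: the number of mistakes a 'nearly accurate' system makes
# when it handles a large volume of decisions.
decisions_per_day = 1_000_000  # hypothetical workload

for accuracy in (0.95, 0.99, 0.999):
    errors = decisions_per_day * (1 - accuracy)
    print(f"{accuracy:.1%} accurate -> {errors:,.0f} errors per day")
# 95.0% accurate -> 50,000 errors per day
# 99.0% accurate -> 10,000 errors per day
# 99.9% accurate -> 1,000 errors per day
```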
Less is more
Dr Luke pointed out other issues with AI and big data. He described a sector where there is a demand for ever greater capacity: larger language models, larger data centres and more servers. He expects the industry to top out at some point and be unable to progress. He believes the field needs to learn to work with less.
It could be that the days of massive volumes of data will come to an end, or that the need to work with massive inputs and large-scale infrastructure will, of necessity, become a thing of the past.
When it works
A recent article on the Live Science website serves as an excellent showcase of what AI and big data can achieve.
Researchers at the Pacific Northwest National Laboratory (PNNL) collaborated with Microsoft to identify materials that can be used in building low-lithium batteries. Microsoft’s Azure Quantum Elements tool was used to screen 32 million candidates to find new materials to use in the new battery.
Lithium is difficult, energy intensive and highly polluting to mine and produce. Replacing or reducing the volume of the metal used in batteries is clearly a valuable goal.
The main focus of the research was to find a new material to replace the liquid electrolytes currently used in batteries. The material would need to be compatible with the electrodes, allow lithium ions to pass through it, and prevent electrons from moving through the battery.
A combination of AI techniques, filtering for different properties and progressively narrowing the criteria, reduced the 32 million candidates to 18 finalists in 80 hours of computing time. Humans would have needed 20 years to reach a similar point.
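The overall shape of that kind of screening is a staged funnel: cheap, coarse filters run over every candidate first, and more expensive checks are reserved for the survivors. A generic Python sketch of the idea (the property names and thresholds here are hypothetical, not those used by PNNL or Microsoft):

```python
# Generic staged-screening funnel: apply cheap filters first, expensive ones last.
# Hypothetical candidate fields and thresholds, purely for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    predicted_stability: float       # from a fast ML surrogate model
    ionic_conductivity: float        # lithium-ion mobility estimate
    electronic_conductivity: float   # should be low for an electrolyte
    lithium_fraction: float          # share of lithium in the formula

def screen(candidates):
    stages = [
        lambda c: c.predicted_stability > 0.8,       # likely to be stable/synthesisable
        lambda c: c.ionic_conductivity > 1e-4,       # lets lithium ions through
        lambda c: c.electronic_conductivity < 1e-9,  # blocks electrons
        lambda c: c.lithium_fraction < 0.3,          # uses less lithium
    ]
    survivors = list(candidates)
    for stage in stages:
        survivors = [c for c in survivors if stage(c)]
    return survivors

finalists = screen([
    Candidate("A", 0.9, 2e-4, 1e-10, 0.2),
    Candidate("B", 0.7, 5e-4, 1e-10, 0.1),   # fails the stability filter
])
print([c.name for c in finalists])  # ['A']
```

Each stage discards the bulk of the remaining candidates, which is how tens of millions of starting points can be whittled down to a shortlist small enough for human experts to examine.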
Clearly, as this instance demonstrates, the possibilities make big-data-fuelled AI a prize too important to set aside.