“Big Data” is the new hot topic in academia and among the public. With the increased ability to collect and analyze information, there is a quest to mine the great sources of data out in the wild. The deeper question is not whether this is a positive or negative development, but what the data actually means. Does our ability to analyze and display greater amounts of information make us more knowledgeable, and potentially more secure? I would suggest that in the cyber security field, big data is at best trivial, and at worst fear-inflating without careful analysis and theory.
Vox recently posted an article linking to a “terrifying and mesmerizing” display of cyber attacks as they occur, released by the group Norse. What the map really shows is the flow of data between automated bots and a series of honey pots, traps set out to look like ATM units. Norse is a “cyber intelligence” firm where potential clients can “request a demo” and see lists of live cyber attacks by country, 48 million from China and 36 million from the United States at the time of this writing. This display of information is not news but unfiltered, unspecific network flow data weighted towards financial attacks.
Vox is certainly not the only offender in the misuse of big data. The Chinese news agency Xinhua reported thousands of attacks launched by the United States. China's National Computer Network Emergency Response Team alleged that from March 19 to May 18, 2014, 2,077 Trojans or botnets controlled 1.18 million host computers in China. Further, 135 host computers in the US led 14,000 phishing operations, and US-based computers had installed 1,754 backdoors in Chinese websites, which led to 57,000 attacks.
US-CERT (the United States Computer Emergency Readiness Team) reports 49,562 incident reports in 2012, compared to 5,503 in 2006. There has obviously been an increase in the number of cyber attacks and incidents, but what does this all mean? These increases sound like a massive number of infiltrations and a serious security threat. Yet, compared to reporting in the U.S., China is relatively restrained in its rhetoric. The U.S. and UK governments, private industry, and the news media report massive numbers of infiltrations:
- The U.S. Department of Defense reports 10 million cyber attacks a day.
- British Petroleum reports 50,000 cyber attacks a day.
- The Nuclear Security Enterprise (the U.S. stockpile) experiences 10 million significant cyber security incidents a day.
- The UK suffered 44 million cyber attacks in 2011.
- Even Utah is getting in on the game, reporting 20 million attacks per day.
The problem is that we do not question these numbers. We do not investigate the context and scale of the reporting on the issue. Instead, these figures are accepted as fact and become a reason to marshal massive amounts of money and resources to fight the coming cyber scourge. This is the true downside of “big data”: we do not take the time to analyze the information being thrown around as fact.
The question then becomes what is to be done about the situation. My coauthor (Ryan Maness) and I have found that we can gain greater leverage on cyber security data by taking what might be termed the Correlates of War method. Like the Militarized Interstate Disputes project, we coded incidents and disputes based on government sources, cyber security industry case reports, and news accounts corroborated by more than one source. Are we missing data? Sure, every dataset has its flaws and misses some information, but by cutting through the hyperbole found in the massive amounts of big data that various sources report, our method is reliable and forms a useful starting point for the cyber security debate. We report just over 100 government-based or government-condoned cyber incidents from 2001 to 2011.
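The corroboration rule described above can be sketched in a few lines. This is only an illustration, not our actual coding pipeline: the incident names, source types, and two-source threshold here are assumptions for the example.

```python
# Hypothetical sketch of multi-source corroboration: keep a candidate
# cyber incident only if it is reported by at least two independent
# source types (government, industry, news).
from collections import namedtuple

Report = namedtuple("Report", ["incident_id", "source_type"])

def corroborated_incidents(reports, min_sources=2):
    """Return incident ids backed by >= min_sources distinct source types."""
    sources_by_incident = {}
    for r in reports:
        sources_by_incident.setdefault(r.incident_id, set()).add(r.source_type)
    return sorted(i for i, s in sources_by_incident.items()
                  if len(s) >= min_sources)

reports = [
    Report("stuxnet", "government"),
    Report("stuxnet", "industry"),
    Report("ghostnet", "news"),        # a single uncorroborated report
    Report("titan-rain", "news"),
    Report("titan-rain", "government"),
]
print(corroborated_incidents(reports))  # ['stuxnet', 'titan-rain']
```

The point of the filter is the same as in the Militarized Interstate Disputes tradition: a single unverified report, however dramatic, does not make an event.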
There are many promising avenues for incorporating big data back into the analysis of cyber threats once the ground has been sifted. Using our dataset, we know where to focus our efforts in the future. Data on thousands of botnet attacks against fake banking honey pots set up by a self-interested cyber security firm is of no use to us in the academic community. There will always be threats to businesses that allow transactions online, but this threat is much different from the threat of a militarized cyber attack.
In the future, focusing solely on DDoS attacks is a promising place to start in investigating big data sources and cyber interactions. DDoS stands for distributed denial of service: a server is flooded with so many requests that it cannot keep up with the traffic, and the goal is to bring down the target, which then goes offline. The method of attack usually relies on botnets, networks of hijacked computers controlled without their owners' knowledge. Since this method produces a large, noticeable spike in internet traffic, these attacks are observable and ready for machine coding. This data would tell us the shape and direction of attacks, and in combination with our dataset on specific government-condoned attacks, we might glean a better picture of the landscape.
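Because a DDoS attack shows up as a spike against baseline traffic, even a crude detector illustrates why these events are machine-codable. The sketch below is an assumption-laden toy, not a production detector: the window size, the 3x threshold, and the traffic figures are all invented for the example.

```python
# Illustrative spike detector: flag DDoS-like intervals by comparing
# each interval's request count against a trailing-window baseline.
def flag_spikes(counts, window=5, factor=3.0):
    """Return indices where traffic exceeds `factor` x the trailing mean."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline > 0 and counts[i] > factor * baseline:
            flagged.append(i)
    return flagged

# Requests per minute to a server: steady traffic, then a flood.
traffic = [100, 110, 95, 105, 100, 98, 102, 5000, 5200, 4900]
print(flag_spikes(traffic))  # [7, 8]
```

Note that index 9 is not flagged even though traffic is still high: the flood itself has inflated the trailing baseline. Real detectors must handle exactly this kind of baseline contamination, which is one reason machine coding alone is insufficient.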
Perhaps another route would focus on automated traffic data collected by national governments. National-level Computer Emergency Response Teams (CERTs) do provide a massive amount of publicly available data, but the data focuses on business organizations seeking help. The problem is that this data varies widely by country, and the typology of the information is not consistent. It will take some time to sort through it and find a use for it. The MIT Data Dashboard provides a head start on this process, but it is unclear at this point how comprehensive and comparable the information is right now.
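The typology problem can be made concrete with a small sketch: different national CERTs label similar incidents differently, so cross-national comparison requires mapping each local label onto a shared taxonomy first. The CERT names, labels, and mappings below are invented for illustration, not drawn from any actual CERT schema.

```python
# Hypothetical mapping from (cert, local label) to a shared category.
# Real CERT typologies differ far more than this toy table suggests.
COMMON_TAXONOMY = {
    ("us-cert", "phishing"): "phishing",
    ("us-cert", "malicious code"): "malware",
    ("cncert", "trojan/botnet host"): "malware",
    ("cert-uk", "social engineering"): "phishing",
}

def normalize(cert, label):
    """Map a national CERT's local label to a shared category, if known."""
    return COMMON_TAXONOMY.get((cert, label.lower()), "unclassified")

print(normalize("cncert", "Trojan/Botnet Host"))   # malware
print(normalize("cert-uk", "Social Engineering"))  # phishing
print(normalize("us-cert", "insider misuse"))      # unclassified
```

The "unclassified" residue is the analytically important part: counts of incidents that cannot be mapped onto a common category are exactly the figures that should not be compared across countries.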
The future is bright for using large data sources to analyze cyber security interactions, provided we accept the limitations and keep a skeptical view of the data. We must remember that relying on machine-coded events alone is likely insufficient. To be accurate, these sources need to be supported by human coders who cut through the noise. While traffic data might be useful, without a proper method and theory to analyze such data, the effort might be fruitless and simply “mesmerizing” without any specific context.
In the end, any amount of data is a good thing. Most cyber security experts prognosticate using a few examples to make sweeping claims. The field cannot reject the advances of “big data” in order to rely on isolated folklore to produce policy. Big data can help us understand the cyber security landscape, but only if we are willing to dive in and work with the data. That has been our goal since the initial dataset was released; now we must move towards finding more sources and information, but this must be a careful and considered process.
*The majority of this post comes from a talk I gave at the University of Glasgow Big Data in the Social Sciences Workshop.