
Coding Cyber Security Incident Data


By Brandon Valeriano and Ryan Maness

Cyber Security Data

It is exciting to report that the field of cyber security has finally moved towards accepting that data on cyber incidents can be collected and analyzed. The Department of Homeland Security now has an incident reporting feature and ongoing efforts to collect data. Without this step, we are operating in a knowable environment while wearing a blindfold.

There is no need to restrict ourselves from understanding the basic contours of the cyber security field through the analysis of empirical events. To not take this step is a self-defeating strategy that betrays our standard operating procedures in other military and political domains. The first step is always to understand the behavior of the key threat actors in a domain, yet in the cyber security field we seem to think that the adversary is inherently unknowable and without a past. This is an unhelpful conjecture.

The plethora of new sources of information is heartening, but it also reinforces a key point we made in our book Cyber War versus Cyber Realities: we have observed restraint in cyber interactions. Escalation is rare, and most disputes piggyback on previously known foreign policy conflicts and crises that are well established, often connected to territorial disputes.

In this post we will cover what is needed in any dataset of cyber interactions, explain how we expanded our data, survey other recent efforts to code cyber events, and review the need for empirical work in the cyber security field.

Important Components of any Dataset

Many groups keep lists of cyber events; the most prominent might be Hackmageddon. The key point to understand is that making a list of cyber events is not enough to produce social science inferences or data analyses. Much more is required. There are efforts in progress, and we are hopeful that many will see the value in utilizing data for cyber research endeavors.

However, we must be clear that datasets need some things in common to make them usable to the wider community.

  • Clear coding rules
  • Independent and associated variables
  • Compatibility with other coding efforts
  • Reliability

Clear coding rules are critical: how do we know what is coded? This is tied to the condition of replication. Can someone come behind your effort and produce something similar? Clear instructions are critical to ensure the progression of knowledge.

A dataset cannot simply be a list of events; that is just a list. Independent variables are critical for any data source. These should include location, characteristics of the unit of observation, linkages to other events, damage and severity, and a host of other factors.

The whole purpose of data collection efforts and analyses being clear and replicable is to ensure that knowledge is moving forward based on some sort of basic consensus. Others should be able to build on your work and push things forward. The data should also be compatible with other sources: the cyber events coded in our Dyadic Cyber Incident and Dispute (DCID) dataset carry country codes, dates, and other fields that can be linked and merged with other data efforts. This effort is based on the Correlates of War project, a long-standing data collection effort.
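As a toy illustration of why shared identifiers matter, here is a minimal Python sketch of linking incident records to another dataset via Correlates of War (COW) country codes. The field names and records are hypothetical, not the actual DCID schema:

```python
# Illustrative only: toy DCID-style records keyed by COW country codes.
dcid = [
    {"initiator": 2,   "target": 630, "year": 2009},  # USA -> Iran
    {"initiator": 710, "target": 2,   "year": 2012},  # China -> USA
]

# A toy slice of a COW-style country code table.
cow = {2: "USA", 365: "Russia", 630: "Iran", 710: "China"}

# Because both tables share COW codes, merging is a simple lookup.
merged = [
    {**row,
     "initiator_name": cow[row["initiator"]],
     "target_name": cow[row["target"]]}
    for row in dcid
]
print(merged[0]["initiator_name"], "->", merged[0]["target_name"])  # USA -> Iran
```

The same join works against any dataset that uses COW codes, which is exactly what makes compatibility a design requirement rather than an afterthought.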

Reliability is likely the most critical aspect of any dataset. Are we sure it was coded correctly and with as little bias as possible, and could others take the coding rules and agree with the basic judgments made? Our DCID was independently checked by three additional hired coders at both rounds of data coding. Version 1.1 of the DCID also had a group of 15 military officers go through all the coding of the more subjective elements to ensure that our coding of success, impact, and actors was reliable.
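Inter-coder agreement of the kind described above is often summarized with a chance-corrected statistic such as Cohen's kappa. A minimal sketch, using made-up severity ratings from two hypothetical coders:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' category labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if both coders labeled at random according to
    # their own marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical severity ratings (1-5 scale) from two coders.
a = [1, 2, 2, 3, 5, 1, 4, 2]
b = [1, 2, 3, 3, 5, 1, 4, 2]
print(round(cohens_kappa(a, b), 2))  # 0.84
```

Values near 1 indicate strong agreement beyond chance; in practice one would compute this across all coders and all subjective variables.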

Idean Salehyan has a useful review of what is needed to produce data in the conflict studies field. There are a host of other issues we have not even begun to mention, such as source bias, source inclusion, scalability, information extraction, and the challenges of analysis.

One such challenge rarely admitted in cyber security is the problem of selection effects. If we take only a sample, such as state-based actions reported by the press, or in our case, only actions between rivals, we are coding only a selection of the wider possible universe of cases. This constraint is critical to understanding the implications of any analysis done on the data.
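A toy simulation makes the selection-effect point concrete. Assuming, purely for illustration, that the press only reports incidents above a severity threshold, estimates built from reported cases alone will overstate average severity:

```python
import random

random.seed(0)

# Hypothetical full universe of incidents with severities 1-10.
universe = [random.randint(1, 10) for _ in range(10_000)]

# Suppose only incidents with severity >= 4 get reported.
reported = [s for s in universe if s >= 4]

true_mean = sum(universe) / len(universe)      # about 5.5
biased_mean = sum(reported) / len(reported)    # noticeably higher
print(round(true_mean, 2), round(biased_mean, 2))
```

Any inference drawn from the reported sample alone inherits this bias, which is why a dataset must state its sampling frame up front.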

Our Data Expansion

Our team (including Ryan Maness and now Benjamin Jensen) has been coding cyber incident data since 2010. Our first peer-reviewed published work appeared in 2014 in the Journal of Peace Research, entitled the Dynamics of Cyber Conflict. Of course, publishing is a long and complicated process, and this article started in 2012. In it we noted that cyber conflict is much more restrained than generally understood in the popular discourse. Threat inflation is rife in cyber security, and the real use of cyber tools seems to be to enhance the power of strong states. (All my research is ungated at

Subsequent work in our book Cyber War versus Cyber Realities reinforced these points and added case studies to support our empirical findings. Our next book will be out in 2018 and will include cyber incident data from 2000 to 2014 between rival states. We released this data a few weeks ago, and it can be found here (codebook). Our cut point is 2014 because the majority of the coding effort was done in 2016, and we firmly believe that while cyber incidents can be coded, one needs to wait at least a year to be sure the sources, actors, and targets are confidently known.

The main addition in our new book, Cyber Strategy, is a consideration of the efficacy of cyber actions. Simply put, do they work? To that end we have now coded concessions and targets in the data. We also altered the severity coding to account for a wider scale of events.

All cyber incidents in the DCID are dyadic, and the countries must be considered rivals: states with recent past animosities toward each other. For an incident to be added to the dataset, the initiation must come from a government, or there must be evidence that the incident or dispute was government sanctioned (see below on confirming responsibility). Non-state actors or entities can be targets, but not initiators, as long as they are critical to state-based systems or the original hack escalates into an international incident in the non-cyber domain.

For the target state, the object must be a government entity, either military or non-military; a private entity that is part of the target state’s national security apparatus (power grids, defense contractors, and security companies); an important media organization (the fourth estate); or a critical corporation. Third parties are noted and coded as an additional variable in the data.

We are also now including information on cyber strategies, breaking this down into a four-point typology that is mutually exclusive and logically exhaustive.

  1. Disruptions: taking down websites or disrupting online activities; usually low-cost, low-pain incidents such as vandalism or DDoS techniques.
  2. Short-term espionage: gaining access that enables a state to leverage critical information for an immediate advantage; an example is the Russian theft of DNC emails and their public release in a disinformation campaign during the 2016 US presidential election.
  3. Long-term espionage: seeking to manipulate the decision calculus of the opposition far into the future by leveraging information gathered during cyber operations to enhance credibility and capability; an example is China’s theft of Lockheed Martin’s F-35 plans.
  4. Degrade: attempting the physical degradation of a target’s capabilities, or creating chaos in a country to provoke a foreign policy response; an example is the US Stuxnet operation against Iran.
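For readers who work with the data programmatically, the typology above maps naturally onto a simple enumeration. The names and example codings below are illustrative, not the official DCID encoding:

```python
from enum import Enum

class CyberStrategy(Enum):
    """Four-point strategy typology: mutually exclusive, logically exhaustive."""
    DISRUPTION = 1            # low-cost, low-pain: vandalism, DDoS
    SHORT_TERM_ESPIONAGE = 2  # access leveraged for an immediate advantage
    LONG_TERM_ESPIONAGE = 3   # shapes the opponent's future decision calculus
    DEGRADE = 4               # physical degradation or induced chaos

# Illustrative codings of the examples discussed in the text.
examples = {
    "DNC email theft and release (2016)": CyberStrategy.SHORT_TERM_ESPIONAGE,
    "F-35 plans theft": CyberStrategy.LONG_TERM_ESPIONAGE,
    "Stuxnet": CyberStrategy.DEGRADE,
}
for incident, strategy in examples.items():
    print(f"{incident}: {strategy.name}")
```

Because the categories are mutually exclusive and exhaustive, every incident receives exactly one value, which keeps downstream cross-tabulations unambiguous.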

New Data Sources

The Council on Foreign Relations cyber operations tracker covers state-sponsored cyber incidents from 2005 to 2017 (the data can be explored here). It includes incidents that are “suspected” to have state sponsorship. This is a problem for datasets of this kind, as laying blame on a state for cyber actions has enormous geopolitical implications. Throwing suspected state-sponsored incidents in with verifiable ones is problematic coding and raises the possibility of retractions at a later date.

For the variable coded as affiliation, which attempts to attribute the group responsible for the cyber incident, 105 cells are left blank. A further 37 cells begin with the phrases “believed to be” or “possibly,” indicating additional uncertainty about who might be responsible. Taken together, this means the coders were uncertain that the culprit was a state actor for 142 of the 191 coded incidents, or 74 percent.
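The arithmetic behind that figure can be checked directly, on the assumption that the 105 blank cells and the 37 hedged cells are counted together against the tracker's 191 incidents:

```python
# Counts taken from the CFR tracker's affiliation variable as described
# in the text; the aggregation into one "uncertain" share is our reading.
blank, hedged, total = 105, 37, 191

uncertain_share = (blank + hedged) / total
print(f"{uncertain_share:.0%}")  # 74%
```

A single uncertain-attribution rate like this is a useful summary statistic to report alongside any analysis built on the data.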

In the DCID, we wait for at least one calendar year to pass before we begin to code a given year. Right now our latest version, 1.1, covers all dyadic cyber incidents between rival states from 2000 to 2014. We are in the process of coding version 1.5, which will include state-initiated incidents from 2000 through 2016.

Many cyber incidents can take months to attribute properly, especially covert espionage incidents. The analogy of the iceberg is often made, with the idea that much of what happens in cyber interactions falls below the surface. Instead we argue that at some point the iceberg flips over, and we then know most of the interactions. What remains unknown is important, but it is also unknowable.

For an incident to make it into the DCID, we must have at least two verifiable sources that give us enough confidence to place the blame on a state actor. Sources include government intelligence reports and cyber security forensic reports.
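That two-source rule amounts to a simple inclusion filter. A minimal sketch with hypothetical records (the real verification is, of course, a human judgment about source quality and independence, not a count):

```python
# Hypothetical candidate incidents with their attribution sources.
candidates = [
    {"name": "incident A",
     "sources": ["government intelligence report", "forensic report"]},
    {"name": "incident B",
     "sources": ["anonymous blog post"]},
]

# Keep only incidents attributed by at least two verifiable sources.
MIN_SOURCES = 2
included = [c for c in candidates if len(c["sources"]) >= MIN_SOURCES]
print([c["name"] for c in included])  # ['incident A']
```

Encoding the rule explicitly, even in this trivial form, makes the inclusion criterion auditable by anyone trying to replicate the dataset.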

What We Can Learn from Data

Establishing knowledge about the cyber security domain is critical. By undertaking data exploration efforts we can make progress on critical security questions. There appears to be a consensus in the field that there is evident restraint in cyberspace. This finding is supported by the CFR data, which locates 191 incidents from 2005-2017. The DCID data, which is restricted to rival states, locates 192 incidents from 2000-2014.

Future datasets need to expand to investigate non-state actors and internally repressive cyber incidents. We believe this is the critical future of cyber security investigations. Investigating the macro data inherent in cyber processes can help us understand much more about the domain than the conjecture that seems to dominate the field. All these efforts are works in progress, but working as a team and avoiding duplication is the only way to move forward.

This is not to say that data is the only way forward in cyber security. Rigorous case study logic that establishes critical causal mechanisms is welcome. Examining wargames and responses in combat scenarios is also important. Formal modeling would be useful in deducing behavioral options and the constraints imposed by institutions. The cyber security field is ripe for more social science-based investigations, but these must include the direct collaboration of social scientists who have experience coding data, practitioners who experience the events firsthand, and policy-makers who seek to transform the data into actionable insights.

Author: Brandon Valeriano

Brandon Valeriano is the Donald Bren Chair of Armed Politics at the Marine Corps University.