Generate a panel version of the UCDP/PRIO Armed Conflict dataset

I spent the morning writing a short R script that transforms the UCDP/PRIO Armed Conflict dataset into a panel dataset in which each conflict is observed from the year it started to the year it ended (or the last year in the dataset). The unit of observation in the original dataset is an active conflict observed in a year t. More precisely, it is an ongoing conflict episode (for a particular conflict) in a year t. Inactive years between conflict episodes are not recorded (see the figure with an example from Ethiopia and two variables from the dataset). The purpose of the code below (especially the second part) is to fill in the observations between conflict episodes. In addition to the panel structure, I wanted a more natural variable structure and more natural variable names (this is what the first part does).

[Figure: ucdp-data (example from Ethiopia with two variables from the dataset)]

The code below might not be optimal for all applications, but it is probably still useful for others who need a panel version of the UCDP/PRIO dataset. If you spot mistakes or have ideas for improvements, please share them in the comments. For an introduction to the dataset, definitions of variables, etc., please read the original codebook.

First, load packages, load the data and make the variable names more comprehensible to ordinary humans.

sapply(c("plyr", "stringr", "zoo"), require, character.only=TRUE)
ucdp <- read.csv("ucdp.prio.armed.conflict.v4.2013.csv", sep=",")
ucdp <- rename(ucdp, c(	"ID" = "ConID",
	"YEAR" = "Year",
	"Startdate" = "ConStartDate",
	"StartPrec" = "ConStartPrec",
	"StartDate2" = "EpStartDate",
	"Startprec2" = "EpStartPrec",
	"GWNOA" = "COWSideA",
	"SideA2nd" = "SideASupport",
	"GWNOA2nd" = "COWSideASupport",
	"GWNOB" = "COWSideB",
	"SideBID" = "NSASideB",
	"SideB2nd" = "SideBSupport",
	"GWNOB2nd" = "COWSideBSupport",
	"GWNOLoc" = "COWLocation"
))

Second, generate an ID for each conflict episode and fill in each episode’s end date (and its measurement precision); if the episode is ongoing, fill in NA.

ucdp$EpID <- as.factor(paste(ucdp$ConID, ucdp$EpStartDate, sep="_"))
levels(ucdp$EpID) <- seq(1,nlevels(ucdp$EpID))

ucdp$EpEndDate <- as.character(ucdp$EpEndDate)
ucdp <- ddply(ucdp, c("ConID", "EpID"), function(x) {
	end <- unique(x$EpEndDate[x$EpEndDate!=""])
	prec <- unique(x$EpEndPrec[x$EpEndDate!=""])
	if( length(end)==0 ) {x$EpEndDate <- NA; x$EpEndPrec <- NA }
	 else { x$EpEndDate <- end; x$EpEndPrec <- prec }
	return(x) })
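
To make the fill-in logic of this step easier to follow, here is a minimal base-R sketch on made-up data. The data frame toy and the helper fill_end are hypothetical and only serve as illustration; the sketch uses ave() instead of ddply() and assumes that each episode has at most one distinct non-empty end date.

```r
# Toy data (made up): conflict 1 has one finished and one ongoing episode.
toy <- data.frame(
  ConID     = c(1, 1, 1, 1),
  EpID      = c("a", "a", "b", "b"),
  Year      = c(1990, 1991, 1995, 1996),
  EpEndDate = c("", "1991-12-31", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the non-empty end date of an episode,
# or NA if the episode has none (i.e. it is still ongoing).
fill_end <- function(d) {
  end <- unique(d[d != ""])
  if (length(end) == 0) NA_character_ else end[1]
}

# Spread the end date over all rows of the same episode.
toy$EpEndDate <- ave(toy$EpEndDate, toy$ConID, toy$EpID,
                     FUN = function(d) rep(fill_end(d), length(d)))
print(toy)
```

After this, both rows of episode "a" carry the end date and both rows of episode "b" are NA.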

Third, reformat the date variables and reorder the variables. Why reorder the variables? I like to arrange variables so that they show the analyst how units are grouped and how variables vary within these groups (or levels).

In the logic of the UCDP dataset, the highest data level is the conflict. Hence, the conflict ID (ConID) comes first, followed by variables that take the same values across all observations of the same conflict. That applies to the variables SideA/COWSideA (countries that have the primary claim remain the same, [but see footnote 2]), Region, ConStartDate and ConStartPrec [see 1]. The second grouping level in the UCDP dataset is the conflict episode (EpID), which is nested in the conflict ID. Each episode has a unique start and end date (EpStartDate, EpStartPrec, EpEndDate, EpEndPrec) and one geographic location (Location, COWLocation). The third data level is the observed year. Across years (but within an episode), the opposition actor can change (SideB, NSASideB, COWSideB), and so can the support group for both sides (SideASupport, COWSideASupport, SideBSupport, COWSideBSupport). For each year, there is also the possibility of different intensity levels (Int, CumInt), a change in conflict type (Type) and a change in the underlying incompatibility (Incomp, Terr). When we order variables in this way, it is, for example, easier to see that each observation is uniquely identified by the combination of conflict ID (ConID) and year (Year), or that countries can experience multiple conflicts.

ucdp$ConStartDate <- as.Date(ucdp$ConStartDate, format="%Y-%m-%d")
ucdp$EpStartDate <- as.Date(ucdp$EpStartDate, format="%Y-%m-%d")
ucdp$EpEndDate <- as.Date(ucdp$EpEndDate, format="%Y-%m-%d")

colorder <- c("ConID", "SideA", "COWSideA" ,"Region", "ConStartDate", "ConStartPrec", "EpID", "EpEnd", "EpStartDate", "EpStartPrec", "EpEndDate", "EpEndPrec", "Location", "COWLocation", "Year", "SideB", "NSASideB", "COWSideB", "SideASupport", "SideBSupport", "COWSideASupport", "COWSideBSupport", "Incomp",  "Terr", "Int", "CumInt",  "Type", "Version")
ucdp <- ucdp[,colorder]
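
The statement that each observation is uniquely identified by ConID and Year can be checked with a few lines of base R. The sketch below runs on a made-up data frame (toy is hypothetical); on the real data one would apply the same duplicated() call to ucdp.

```r
# Toy data (made up): two conflicts observed in the same two years.
toy <- data.frame(ConID = c(1, 1, 2, 2),
                  Year  = c(1990, 1991, 1990, 1991))

# Number of rows that repeat an earlier (ConID, Year) combination;
# 0 means the pair identifies each observation uniquely.
n_dupes <- sum(duplicated(toy[, c("ConID", "Year")]))

# Number of conflicts recorded per year.
conflicts_per_year <- table(toy$Year)

print(n_dupes)
print(conflicts_per_year)
```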

Fourth, expand the dataset to a panel that runs from the conflict’s start year to the last year observed in the data. Afterwards, fill in some information by carrying forward the last observed value of the preceding episode. This makes sense for most variables, since they either cannot change within a conflict (SideA, COWSideA, Region, ConStartDate, ConStartPrec, Version) or their value is logically implied (Location, COWLocation, Int, CumInt). For other variables (Type, Incomp, Terr), however, carrying forward the last observation requires accepting certain assumptions (e.g. that the conflict’s fundamental nature doesn’t change in ‘peace’ episodes).

Note also that I create a new intensity category (Int=0) indicating that a conflict is inactive in a particular year (fewer than 25 battle-related deaths).

tmp <- ddply(ucdp, c("ConID"), function(x){
	return(data.frame(Year=seq(min(x$Year),max(x$Year)), ConID=unique(x$ConID)))
	})
ucdp <- merge(tmp, ucdp, by=c("Year", "ConID"), all=TRUE)
ucdp <- ucdp[order(ucdp$ConID,ucdp$Year), ]

ucdp <- ddply(ucdp, "ConID", function(x){
	x$SideA <- na.locf(x$SideA)
	x$COWSideA <- na.locf(x$COWSideA)
	x$Region <- na.locf(x$Region)
	x$ConStartDate <- na.locf(x$ConStartDate)
	x$ConStartPrec <- na.locf(x$ConStartPrec)
	x$Version <- na.locf(x$Version)
	x$Location <- na.locf(x$Location)
	x$COWLocation <- na.locf(x$COWLocation)
	x$Int[is.na(x$Int)] <- 0
	x$CumInt <- na.locf(x$CumInt)
	x$Type <- na.locf(x$Type)
	x$Incomp <- na.locf(x$Incomp)
	x$Terr <- na.locf(x$Terr)
	return(x)
	})
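
The expansion and carry-forward of this step can be illustrated on a tiny made-up example. The sketch below uses only base R: toy and grid are hypothetical data frames, and locf is a hypothetical stand-in for zoo's na.locf that assumes the first value of the vector is not NA.

```r
# Toy data (made up): one conflict active in 1990-1991 and again in 1994.
toy <- data.frame(ConID = 1, Year = c(1990, 1991, 1994),
                  Region = 3, Int = c(1, 2, 1))

# Full grid of years from the first to the last observed year.
grid <- data.frame(ConID = 1, Year = seq(min(toy$Year), max(toy$Year)))
panel <- merge(grid, toy, by = c("ConID", "Year"), all = TRUE)
panel <- panel[order(panel$ConID, panel$Year), ]

# Hypothetical stand-in for zoo::na.locf: replace each NA by the last
# non-NA value before it (assumes the first element is not NA).
locf <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))
  v[idx]
}

panel$Region <- locf(panel$Region)   # constant within a conflict
panel$Int[is.na(panel$Int)] <- 0     # inactive years get the new category 0
print(panel)
```

The years 1992 and 1993 appear as new rows with the region carried forward and Int set to 0.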

Fifth, clean up and insert an episode ID that identifies episodes of conflict (Int!=0) and ‘peace’ (Int==0). All EpIDs of the form ‘XYZ_1’ are peace episodes; those of the form ‘XYZ_0’ are conflict episodes.

ucdp$EpEnd <- NULL
ucdp$exp <- 0
ucdp$exp[is.na(ucdp$EpStartDate)] <- 1
ucdp$EpID <- paste(na.locf(ucdp$EpID),ucdp$exp, sep="_")
# EpEnd was removed above, so leave it out of the column order
ucdp <- ucdp[,setdiff(colorder, "EpEnd")]
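
As a small illustration of how the new episode ID marks conflict and peace years, here is a base-R sketch on made-up data. The data frame toy and the helper locf (a simple stand-in for na.locf that assumes the first value is not NA) are hypothetical.

```r
# Toy data (made up): episode 7 in 1990-1991, two filled-in years, episode 8 in 1994.
toy <- data.frame(
  ConID = 1,
  Year  = 1990:1994,
  EpID  = c("7", "7", NA, NA, "8"),
  EpStartDate = as.Date(c("1990-03-01", "1990-03-01", NA, NA, "1994-06-01")),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for zoo::na.locf (assumes the first element is not NA).
locf <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))
  v[idx]
}

# Filled-in years have no episode start date; flag them with 1.
toy$exp <- as.integer(is.na(toy$EpStartDate))

# Carry the preceding episode ID forward and append the flag.
toy$EpID <- paste(locf(toy$EpID), toy$exp, sep = "_")
print(toy[, c("ConID", "Year", "EpID")])
```

The two filled-in years end up with the ID 7_1, i.e. the peace episode that follows conflict episode 7.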

Footnotes:
[1] With one exception: ConID=29 has three different start dates (same year, but different months). I am not sure whether that is a mistake or intentional.
[2] For ConID==33 the SideA variable changes (North Yemen until the nineties, Yemen in recent years), but the corresponding code (COWSideA) remains the same.

Update Nov 6, 2013
There was a typo in the code above. Replace the ‘x’ in max(x$Year) with ‘ucdp’, otherwise the panel for each conflict ends after the last episode ended (which might not be the case, since potentially a new episode could start in the future).

tmp <- ddply(ucdp, c("ConID"), function(x){
	return(data.frame(Year=seq(min(x$Year),max(ucdp$Year)), ConID=unique(x$ConID)))
	})

One Comment on “Generate a panel version of the UCDP/PRIO Armed Conflict dataset”

  1. Bine says:

    Hi, thank you very much for the code and the great explanation! It works fine as well with older versions of the UCDP data. Just one little thing, in step five (clean up) the colorder hasn’t work in my case because for some reason EpEnd drops out of the dataset. Thus, removing EpEnd from colorder and everything works fine.

