
Power-Outage-Investigation

by David Sun & Yijun Luo

Click me to see all visualizations

The data used for this exploratory analysis is here.

Introduction

This analysis works on a dataset of major power outages across the US from January 2000 to July 2016. Gathered and compiled by Sayanti Mukherjee and others in this article, the dataset includes the time of each outage, the causes of the outage, regional climate information, the impact of the outage, geographic and economic statistics of the affected state, and regional land usage and population information.

One of the major objectives of analyzing power outages is to understand the underlying causes, identify attributes that contribute to an incident, and summarize an overall risk factor for each region. In this analysis, we took the duration of each outage as the measure of impact severity and centered our focus on the question: what are the major causes of the varying duration of outages? More specifically, which attributes tend to produce longer outages? Answering this question is significant, as it could lead to future research on a holistic risk factor that considers all attributes, and could help with outage prevention in real-world scenarios.

There are 1534 rows of data in total, each corresponding to a single observed power outage within the time frame in the continental US. The major columns related to this investigation are:

Cleaning and EDA

Data Cleaning

After loading the data, a few of the columns mentioned above caught our attention. There are missing values in the OUTAGE.RESTORATION.DATE, OUTAGE.RESTORATION.TIME, CAUSE.CATEGORY.DETAIL, OUTAGE.DURATION, DEMAND.LOSS.MW, and CUSTOMERS.AFFECTED columns. The missing restoration values are likely due to a permanent shutdown or unknown reasons, so we could not arbitrarily impute values to replace them. The missingness in the CAUSE.CATEGORY.DETAIL column is likely NMAR, as explained in the Assessment of Missingness section below, and it isn’t necessary to replace the null values, since most rows simply repeat the CAUSE.CATEGORY column.

The OUTAGE.DURATION, DEMAND.LOSS.MW, and CUSTOMERS.AFFECTED columns all contain both NaN values and 0s, but we have determined that the 0s are valid and faithful data. OUTAGE.DURATION is the result of subtracting the start time from the restoration time, so if power was restored immediately, the duration can be 0 minutes. In both DEMAND.LOSS.MW and CUSTOMERS.AFFECTED, the zeros represent minimal damage during the outage: less than 1 megawatt of demand loss, and no significant share of the population affected.

We added two columns, OUTAGE.START and OUTAGE.RESTORATION, to the dataframe by combining the corresponding date and time columns and storing the elements as Timestamp objects. This makes it convenient to treat the outages as time series data.

These are the first five rows of the cleaned data:
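The date and time combination described above can be sketched as follows. This is a minimal illustration, not necessarily the code used in the analysis: the helper name `combine_date_time` is hypothetical, and it assumes the time column can be read as strings such as "17:00:00". The toy rows reuse start dates and times from the data preview, plus one made-up missing row.

```python
import pandas as pd


def combine_date_time(df, date_col, time_col):
    """Return a Series of Timestamps built from a date column and a time column.

    Hypothetical helper: assumes times look like "17:00:00" once cast to str.
    """
    dates = pd.to_datetime(df[date_col])
    # Anything that is not a valid time (including missing values) becomes NaT.
    offsets = pd.to_timedelta(df[time_col].astype(str), errors="coerce")
    return dates + offsets


toy = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
})
toy["OUTAGE.START"] = combine_date_time(toy, "OUTAGE.START.DATE", "OUTAGE.START.TIME")
print(toy["OUTAGE.START"])
```

Rows with a missing date or time end up as NaT rather than being imputed, which matches the decision not to fill in missing restoration values.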

OBS YEAR MONTH U.S._STATE POSTAL.CODE NERC.REGION CLIMATE.REGION ANOMALY.LEVEL CLIMATE.CATEGORY OUTAGE.START.DATE OUTAGE.START.TIME OUTAGE.RESTORATION.DATE OUTAGE.RESTORATION.TIME CAUSE.CATEGORY CAUSE.CATEGORY.DETAIL HURRICANE.NAMES OUTAGE.DURATION DEMAND.LOSS.MW CUSTOMERS.AFFECTED RES.PRICE COM.PRICE IND.PRICE TOTAL.PRICE RES.SALES COM.SALES IND.SALES TOTAL.SALES RES.PERCEN COM.PERCEN IND.PERCEN RES.CUSTOMERS COM.CUSTOMERS IND.CUSTOMERS TOTAL.CUSTOMERS RES.CUST.PCT COM.CUST.PCT IND.CUST.PCT PC.REALGSP.STATE PC.REALGSP.USA PC.REALGSP.REL PC.REALGSP.CHANGE UTIL.REALGSP TOTAL.REALGSP UTIL.CONTRI PI.UTIL.OFUSA POPULATION POPPCT_URBAN POPPCT_UC POPDEN_URBAN POPDEN_UC POPDEN_RURAL AREAPCT_URBAN AREAPCT_UC PCT_LAND PCT_WATER_TOT PCT_WATER_INLAND OUTAGE.START OUTAGE.RESTORATION
1 2011 7 Minnesota MN MRO East North Central -0.3 normal 2011-07-01 00:00:00 17:00:00 2011-07-03 00:00:00 20:00:00 severe weather nan nan 3060 nan 70000 11.6 9.18 6.81 9.28 2.33292e+06 2.11477e+06 2.11329e+06 6.56252e+06 35.5491 32.225 32.2024 2308736 276286 10673 2595696 88.9448 10.644 0.411181 51268 47586 1.07738 1.6 4802 274182 1.75139 2.2 5348119 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2011-07-01 17:00:00 2011-07-03 20:00:00
2 2014 5 Minnesota MN MRO East North Central -0.1 normal 2014-05-11 00:00:00 18:38:00 2014-05-11 00:00:00 18:39:00 intentional attack vandalism nan 1 nan nan 12.12 9.71 6.49 9.28 1.58699e+06 1.80776e+06 1.88793e+06 5.28423e+06 30.0325 34.2104 35.7276 2345860 284978 9898 2640737 88.8335 10.7916 0.37482 53499 49091 1.08979 1.9 5226 291955 1.79 2.2 5457125 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2014-05-11 18:38:00 2014-05-11 18:39:00
3 2010 10 Minnesota MN MRO East North Central -1.5 cold 2010-10-26 00:00:00 20:00:00 2010-10-28 00:00:00 22:00:00 severe weather heavy wind nan 3000 nan 70000 10.87 8.19 6.07 8.15 1.46729e+06 1.80168e+06 1.9513e+06 5.22212e+06 28.0977 34.501 37.366 2300291 276463 10150 2586905 88.9206 10.687 0.392361 50447 47287 1.06683 2.7 4571 267895 1.70627 2.1 5310903 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2010-10-26 20:00:00 2010-10-28 22:00:00
4 2012 6 Minnesota MN MRO East North Central -0.1 normal 2012-06-19 00:00:00 04:30:00 2012-06-20 00:00:00 23:00:00 severe weather thunderstorm nan 2550 nan 68200 11.79 9.25 6.71 9.19 1.85152e+06 1.94117e+06 1.99303e+06 5.78706e+06 31.9941 33.5433 34.4393 2317336 278466 11010 2606813 88.8954 10.6822 0.422355 51598 48156 1.07148 0.6 5364 277627 1.93209 2.2 5380443 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2012-06-19 04:30:00 2012-06-20 23:00:00
5 2015 7 Minnesota MN MRO East North Central 1.2 warm 2015-07-18 00:00:00 02:00:00 2015-07-19 00:00:00 07:00:00 severe weather nan nan 1740 250 250000 13.07 10.16 7.74 10.43 2.02888e+06 2.16161e+06 1.77794e+06 5.97034e+06 33.9826 36.2059 29.7795 2374674 289044 9812 2673531 88.8216 10.8113 0.367005 54431 49844 1.09203 1.7 4873 292023 1.6687 2.2 5489594 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2015-07-18 02:00:00 2015-07-19 07:00:00

Univariate Analysis

Below is a choropleth of the median power outage duration in each state. From this plot, we can see that the Northeast suffers longer power outages, while the West Coast in general suffers shorter ones. This could be attributed to the climate regions those states belong to.

Below is a histogram showing the distribution of outage durations. We can see that most outages cluster at the lower end, and over half of them are shorter than 600 minutes. This means that most power outages can be fixed within a reasonable amount of time, while the rest take much longer to restore.

Bivariate Analysis

Below is a bar chart showing the proportion of outages caused by severe weather each year. In general, we find that this proportion decreases over the years. (Year 2001 does not follow this trend because it has too few recorded outages.) One possible explanation is that the U.S. has improved its power infrastructure over the years to better withstand severe weather.

Below is a scatterplot showing the relationship between mean total power sales and population for each state. It suggests that total power sales are positively correlated with population, which means that a higher state population corresponds to more power consumption.
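The yearly proportion plotted above could be computed along these lines. This is a sketch under assumptions: the function name `severe_weather_share_by_year` is hypothetical, and the toy rows are made up for illustration; only the column names YEAR and CAUSE.CATEGORY and the label "severe weather" come from the dataset.

```python
import pandas as pd


def severe_weather_share_by_year(df):
    """Fraction of each year's outages whose cause category is severe weather."""
    is_severe = df["CAUSE.CATEGORY"].eq("severe weather")
    # The mean of a boolean Series is the proportion of True values.
    return is_severe.groupby(df["YEAR"]).mean()


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "YEAR": [2010, 2010, 2011, 2011, 2011, 2011],
    "CAUSE.CATEGORY": [
        "severe weather", "intentional attack",
        "severe weather", "islanding", "public appeal", "equipment failure",
    ],
})
share = severe_weather_share_by_year(toy)
print(share)
```

A year with very few recorded outages produces an unstable proportion, which is why a single year can break the overall trend.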

Interesting Aggregates

The following pivot table breaks down the average outage duration by state and cause category. This helps to show which cause category has the largest impact on the length of outages in a particular state.

Mean outage duration (in minutes) for each state by cause category

U.S._STATE equipment failure fuel supply emergency intentional attack islanding public appeal severe weather system operability disruption
Alabama nan nan 77 nan nan 1421.75 nan
Arizona 138.5 nan 639.6 nan nan 25726.5 384.5
Arkansas 105 nan 547.833 3 1063.71 2701.8 nan
California 524.81 6154.6 946.458 214.857 2028.11 2928.37 363.667
Colorado nan nan 117 2 nan 2727.25 279.75
Connecticut nan nan 49.125 nan nan 2262.6 nan
Delaware 50 nan 38.9189 nan nan 2153.5 nan
District of Columbia 159 nan nan nan nan 4764.11 nan
Florida 554.5 nan 50 nan 4320 6420.19 205.7
Georgia nan nan 108 nan nan 1422.75 nan
Hawaii nan nan nan nan nan 997.5 237
Idaho nan nan 307.5 nan 1548 nan 179.667
Illinois 149 2761 1450 nan 120 1650.7 nan
Indiana 1 12240 421.875 125.333 nan 4523.29 4671.6
Iowa nan nan 5657.8 nan nan 3353.67 nan
Kansas nan nan 561 nan 913 9346 nan
Kentucky 652 12570 108 nan nan 4480.11 nan
Louisiana 176.333 28170 nan nan 1359.21 7186.93 1144.67
Maine nan 1676 82.6667 881 nan 1669.4 nan
Maryland nan nan 225.32 nan nan 4006.94 304
Massachusetts nan 2891 384.25 nan nan 1556.57 67
Michigan 26435.3 nan 3635.25 1 1078 4831.65 2610
Minnesota nan nan 369.5 nan nan 3585.55 nan
Mississippi nan nan 12 nan nan nan 300
Missouri nan nan 408 nan nan 4483.82 65
Montana nan nan 93 34.5 nan nan nan
Nebraska nan nan nan nan 159 3221.33 nan
Nevada nan nan 553.286 nan nan nan nan
New Hampshire nan nan 60 nan nan 1597.5 nan
New Jersey nan nan 91.125 nan nan 6372.86 748.5
New Mexico nan 76 174.5 nan nan nan 0
New York 247 16687.2 309.083 nan 2655 6034.58 1176.57
North Carolina nan nan 1063.75 nan nan 1738.93 82.2
North Dakota nan nan nan nan 720 nan nan
Ohio nan nan 327.286 nan nan 4322.27 1744.5
Oklahoma nan nan 75.6667 984 704 4206.47 nan
Oregon 200 nan 394.105 nan nan 2295.8 nan
Pennsylvania 376 nan 1526.83 nan nan 4314 329
South Carolina nan nan nan nan nan 3135 nan
South Dakota nan nan nan 120 nan nan nan
Tennessee 404 nan 171 nan 2700 1386.35 20
Texas 405.6 13920 298.769 nan 1140.41 3854.89 810.8
Utah 15 nan 142.286 nan 2275 957 537.5
Vermont nan nan 35.4444 nan nan nan nan
Virginia nan nan 2 nan 683.5 1132.28 241
Washington 1204 1 371.871 73.3333 248 5473.55 25
West Virginia nan nan 1 nan nan 9305 nan
Wisconsin nan 33971.2 459 nan 388 1527.43 nan
Wyoming 61 nan 0.333333 32 nan 106 nan
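A table of this shape can be built with `DataFrame.pivot_table`. The sketch below is an illustration rather than the exact code behind the table above: the Minnesota rows are the five rows from the data preview, and the Texas row is a made-up value added only to show how an empty cell appears.

```python
import pandas as pd

toy = pd.DataFrame({
    "U.S._STATE": ["Minnesota"] * 5 + ["Texas"],
    "CAUSE.CATEGORY": [
        "severe weather", "intentional attack", "severe weather",
        "severe weather", "severe weather",
        "severe weather",  # hypothetical Texas row
    ],
    # Durations in minutes; the last value is made up.
    "OUTAGE.DURATION": [3060, 1, 3000, 2550, 1740, 100],
})

mean_duration = toy.pivot_table(
    index="U.S._STATE",
    columns="CAUSE.CATEGORY",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
print(mean_duration)
```

State and cause combinations with no recorded outage have nothing to average, so they show up as nan, as in the table above.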

The following pivot table shows the number of outages that occurred in each state, broken down by cause category. This helps to show the most common cause of outages in a particular state.

Number of outages in each state by cause category

U.S._STATE equipment failure fuel supply emergency intentional attack islanding public appeal severe weather system operability disruption
Alabama 0 0 1 0 0 4 0
Alaska 0 0 0 0 0 0 0
Arizona 4 0 15 0 0 4 2
Arkansas 1 0 6 1 7 10 0
California 21 10 24 28 9 67 39
Colorado 0 0 5 1 0 4 4
Connecticut 0 0 8 0 0 10 0
Delaware 1 0 37 0 0 2 0
District of Columbia 1 0 0 0 0 9 0
Florida 4 0 2 0 3 26 10
Georgia 0 0 1 0 0 16 0
Hawaii 0 0 0 0 0 4 1
Idaho 0 0 4 0 1 0 3
Illinois 1 1 1 0 1 40 0
Indiana 1 1 8 3 0 24 5
Iowa 0 0 5 0 0 3 0
Kansas 0 0 3 0 1 3 0
Kentucky 1 2 1 0 0 9 0
Louisiana 3 1 0 0 14 14 6
Maine 0 1 6 1 0 10 0
Maryland 0 0 25 0 0 32 1
Massachusetts 0 1 8 0 0 7 2
Michigan 3 0 4 1 1 83 3
Minnesota 0 0 4 0 0 11 0
Mississippi 0 0 3 0 0 0 1
Missouri 0 0 3 0 0 11 1
Montana 0 0 1 2 0 0 0
Nebraska 0 0 0 0 1 3 0
Nevada 0 0 7 0 0 0 0
New Hampshire 0 0 12 0 0 2 0
New Jersey 0 0 8 0 0 22 2
New Mexico 0 1 6 0 0 0 1
New York 2 12 12 0 4 33 7
North Carolina 0 0 4 0 0 30 5
North Dakota 0 0 0 0 1 0 0
Ohio 0 0 14 0 0 26 2
Oklahoma 0 0 3 1 3 15 0
Oregon 1 0 19 0 0 5 0
Pennsylvania 1 0 6 0 0 48 2
South Carolina 0 0 0 0 0 8 0
South Dakota 0 0 0 2 0 0 0
Tennessee 2 0 6 0 1 20 2
Texas 5 3 13 0 17 64 20
Utah 1 0 35 0 1 2 2
Vermont 0 0 9 0 0 0 0
Virginia 0 0 1 0 2 32 1
Washington 1 1 62 3 1 20 1
West Virginia 0 0 1 0 0 3 0
Wisconsin 0 4 7 0 1 7 0
Wyoming 1 0 3 1 0 1 0

Assessment of Missingness

NMAR Analysis

The missingness mechanism of the CAUSE.CATEGORY.DETAIL column is likely NMAR. This column appears to have been written by hand by the researchers, as the labels used for detailed causes are messy and inconsistent. For example, there are two nearly identical labels, “Coal” and “ Coal”, both of which correspond to a power outage caused by an issue at a coal power plant. Another example is the various notations for wind damage, including “heavy wind”, “wind/rain”, “wind storm”, and “wind”. These clues imply that the column was filled in manually and that the labels vary from one person to another. It is therefore very likely that the missing values result from human choices made while collecting the information: if the cause details were unknown to the researcher, or the cause was obvious enough that its details did not seem worth recording, the researcher was more likely to leave the field blank. In other words, whether a value is missing depends on the unobserved value itself.

Values found in CAUSE.CATEGORY.DETAIL: [nan, ‘vandalism’, ‘heavy wind’, ‘thunderstorm’, ‘winter storm’, ‘tornadoes’, ‘sabotage’, ‘hailstorm’, ‘uncontrolled loss’, ‘winter’, ‘wind storm’, ‘computer hardware’, ‘public appeal’, ‘storm’, ‘ Coal’, ‘ Natural Gas’, ‘hurricanes’, ‘wind/rain’, ‘snow/ice storm’, ‘snow/ice ‘, ‘transmission interruption’, ‘flooding’, ‘transformer outage’, ‘generator trip’, ‘relaying malfunction’, ‘transmission trip’, ‘lightning’, ‘switching’, ‘shed load’, ‘line fault’, ‘breaker trip’, ‘wildfire’, ‘ Hydro’, ‘majorsystem interruption’, ‘voltage reduction’, ‘transmission’, ‘Coal’, ‘substation’, ‘heatwave’, ‘distribution interruption’, ‘wind’, ‘suspicious activity’, ‘feeder shutdown’, ‘100 MW loadshed’, ‘plant trip’, ‘fog’, ‘Hydro’, ‘earthquake’, ‘HVSubstation interruption’, ‘cables’, ‘Petroleum’, ‘thunderstorm; islanding’, ‘failure’]

Missingness Dependency

Missingness of Outage Duration (OUTAGE.DURATION) depends on Cause (CAUSE.CATEGORY)

We tested whether the missingness of the OUTAGE.DURATION column depends on the values of the CAUSE.CATEGORY column. We performed a permutation test, using the total variation distance (TVD) as our test statistic.

The following grouped bar chart shows the observed distribution of CAUSE.CATEGORY, separated by whether the corresponding outage duration is missing.

This is the resulting distribution of permutation TVDs versus observed TVD. We can see that the red line is to the right of the entire blue distribution, meaning that our test generated a p-value of approximately 0. This means that the difference between the observed distribution of cause category when duration is missing, versus the observed distribution of cause category when duration is not missing, is significant. Thus, the missingness of OUTAGE.DURATION is likely dependent on the value of CAUSE.CATEGORY, making the missingness MAR.
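A permutation test with TVD as the test statistic can be sketched as follows. The function names, the number of repetitions, and the toy data are assumptions made for illustration; the toy labels are deliberately extreme (missingness is fully determined by the category) so that the outcome is easy to verify by hand.

```python
import numpy as np
import pandas as pd


def tvd_between_groups(categories, is_missing):
    """TVD between the category distributions of missing and non-missing rows."""
    # Each column of `dist` is a conditional distribution that sums to 1.
    dist = pd.crosstab(categories, is_missing, normalize="columns")
    return (dist.iloc[:, 0] - dist.iloc[:, 1]).abs().sum() / 2


def permutation_test_tvd(categories, is_missing, n_repetitions=1000, seed=0):
    """Return the observed TVD and the fraction of shuffles at least as large."""
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    is_missing = np.asarray(is_missing, dtype=bool)
    observed = tvd_between_groups(categories, is_missing)
    # Shuffling the missingness labels breaks any link to the categories.
    simulated = np.array([
        tvd_between_groups(categories, rng.permutation(is_missing))
        for _ in range(n_repetitions)
    ])
    return observed, float(np.mean(simulated >= observed))


# Made-up data: every "fuel supply emergency" row has a missing duration.
toy_causes = ["severe weather"] * 50 + ["fuel supply emergency"] * 50
toy_missing = [False] * 50 + [True] * 50
observed_tvd, p_value = permutation_test_tvd(toy_causes, toy_missing, n_repetitions=200)
print(observed_tvd, p_value)
```

TVD suits this test because CAUSE.CATEGORY is categorical: it compares two whole distributions of labels rather than two averages.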

Missingness of Outage Duration (OUTAGE.DURATION) does not depend on number of Customers Affected (CUSTOMERS.AFFECTED)

We further tested whether the missingness of the OUTAGE.DURATION column depends on the values of the CUSTOMERS.AFFECTED column. We used the absolute difference in means as our test statistic for this permutation test.

This distribution graph shows the distribution of customers affected when the duration is missing and when it is not missing.

Similarly, this is the resulting distribution of the permuted absolute differences in means versus the observed absolute difference in means. We can see that the red line, the observed statistic, falls in the middle of the blue distribution. This corresponds to a p-value of 0.696, which is far higher than the standard significance level of 0.05. We therefore fail to reject the null hypothesis: the test gives no evidence that OUTAGE.DURATION’s missingness depends on the values of CUSTOMERS.AFFECTED.
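For a numeric column, the same permutation idea works with the absolute difference in means as the statistic. The sketch below is an illustration under assumptions: the function name, the choice to drop rows where the numeric value is itself NaN, and the toy numbers are not taken from the original analysis. The toy data is balanced on purpose so that the two group means are equal.

```python
import numpy as np


def permutation_test_mean_diff(values, is_missing, n_repetitions=1000, seed=0):
    """Return the observed |difference in means| and its permutation p-value."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)
    # Assumption: rows whose numeric value is NaN cannot contribute to a mean.
    keep = ~np.isnan(values)
    values, is_missing = values[keep], is_missing[keep]

    def statistic(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = statistic(is_missing)
    simulated = np.array([
        statistic(rng.permutation(is_missing)) for _ in range(n_repetitions)
    ])
    return observed, float(np.mean(simulated >= observed))


# Made-up customer counts; both groups have the same mean (25).
toy_customers = [10, 20, 30, 40]
toy_missing = [True, False, False, True]
observed_diff, p_value = permutation_test_mean_diff(toy_customers, toy_missing, n_repetitions=200)
print(observed_diff, p_value)
```

When the observed difference sits well inside the simulated distribution, as in this toy case, the p-value is large and the null hypothesis is not rejected.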

Hypothesis Testing

Null Hypothesis: Severe weather related outage durations are randomly sampled from the population of outage duration.

Alternative Hypothesis: Severe weather related outage durations are not randomly sampled from the population of outage duration.

Observation: outage durations caused by severe weather

Population: all outage durations (from data)

Test Statistic: mean of sampled durations

Sample Size: number of outage durations that have been categorized as caused by severe weather

Significance level: 0.05

P-Value: 0.0

Conclusion: Since the p-value is 0.0, which is lower than 0.05, we reject the null hypothesis in favor of the alternative: severe weather related outage durations are not randomly sampled from the population of outage durations. The test suggests that the mean duration of outages caused by severe weather is significantly higher than the mean duration of all outages.

After this hypothesis test, we have evidence that severe weather is a major contributor to the duration of a power outage, and this deserves attention when assessing risk factors for each region to prevent future power outages.
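The hypothesis test above can be sketched as a simulation: repeatedly draw a random sample of the same size from all durations and compare its mean with the observed severe-weather mean. The function name, sampling without replacement, the one-sided comparison, and the toy numbers are all assumptions made for illustration, since the write-up does not spell out these details.

```python
import numpy as np


def sample_mean_test(population, observed_sample, n_repetitions=1000, seed=0):
    """Return the observed sample mean and a one-sided simulated p-value."""
    rng = np.random.default_rng(seed)
    population = np.asarray(population, dtype=float)
    observed_sample = np.asarray(observed_sample, dtype=float)
    observed_mean = observed_sample.mean()
    # Under the null, the sample is a random draw from the population.
    simulated = np.array([
        rng.choice(population, size=observed_sample.size, replace=False).mean()
        for _ in range(n_repetitions)
    ])
    # One-sided: how often is a random sample's mean at least as large?
    return observed_mean, float(np.mean(simulated >= observed_mean))


# Made-up durations in minutes: the "severe weather" sample is the 20 longest.
toy_population = np.arange(1, 101)
toy_severe = toy_population[-20:]
observed_mean, p_value = sample_mean_test(toy_population, toy_severe, n_repetitions=500)
print(observed_mean, p_value)
```

A simulated p-value of 0.0 means that none of the random samples reached the observed mean, not that such a sample is strictly impossible.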