An Investigation of Power Outages Over the Years

Introduction

In both industrial production and daily life, electricity is used every day, so power outages can have a significant impact on people. By analyzing power outage data, we can better understand how to improve the quality of electricity supply, identify potential causes of outages, and develop corresponding improvement measures.

This project centers on one question: “Does the climate category affect the duration of a power outage?” An answer to this question could reveal a pattern that helps identify or predict major power outages, which in turn helps people prepare for sudden outages and supports further research.

The dataset contains 56 columns and 1,534 rows in total. Not all 56 columns are needed to answer the question above, so the analysis uses only the relevant columns, listed below.

Column	Description
`OUTAGE.START.DATE`	The date when an outage occurs
`OUTAGE.START.TIME`	The time when an outage occurs
`OUTAGE.DURATION`	The duration of an outage, in minutes
`CLIMATE.CATEGORY`	The climate type (warm, normal, cold) where an outage occurs
`U.S._STATE`	The state where an outage occurs
`OUTAGE.RESTORATION.DATE`	The date when an outage ends
`CAUSE.CATEGORY.DETAIL`	The detailed reason for an outage

We noted that the dataset contains a lot of redundant data. For example, the YEAR and MONTH columns duplicate information already present in OUTAGE.START.DATE. Likewise, given OUTAGE.DURATION, OUTAGE.START.DATE and OUTAGE.START.TIME, we can directly compute OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME. Therefore, we exclude some of these columns from our analysis.

Note that we keep some columns that seem unrelated to our question because they are useful for the assessment of missingness below.

Data Cleaning and EDA

Data Cleaning

Combine Date Column and Time Column

It is inconvenient for a single datetime to be described by two columns, so we first combine the OUTAGE.START.DATE column and the OUTAGE.START.TIME column into a single OUTAGE.START column of type pd.Timestamp.
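A minimal sketch of this step is given here. It assumes the date is stored in a form that pd.to_datetime can parse and the time as an "HH:MM:SS" value; the helper name combine_date_time and the toy rows are illustrative and not taken from the project code.

```python
import pandas as pd

def combine_date_time(df, date_col, time_col, out_col):
    """Merge a date column and a time column into one datetime column."""
    # Missing or unparseable dates become NaT instead of raising an error.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    # Read each time as an offset from midnight; missing values become NaT.
    offsets = pd.to_timedelta(df[time_col].astype(str), errors="coerce")
    combined = dates + offsets
    # Drop the two source columns and keep only the combined one.
    return df.drop(columns=[date_col, time_col]).assign(**{out_col: combined})

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "OUTAGE.START.DATE": ["2015-07-18", None],
    "OUTAGE.START.TIME": ["02:00:00", "04:30:00"],
})
cleaned = combine_date_time(
    toy, "OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.START"
)
```

A row with a missing date or time ends up with a missing OUTAGE.START, which keeps the missingness visible for the later analysis.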

The cleaned dataframe’s datatype for each useful column is listed below.

OUTAGE.START               datetime64[ns]
OUTAGE.DURATION                    object
U.S._STATE                         object
CLIMATE.CATEGORY                   object
OUTAGE.RESTORATION.DATE            object
CAUSE.CATEGORY.DETAIL              object
dtype: object

The cleaned dataframe is shown below (with only representative rows selected for display).

OUTAGE.START	OUTAGE.DURATION	U.S._STATE	CLIMATE.CATEGORY	OUTAGE.RESTORATION.DATE	CAUSE.CATEGORY.DETAIL	OUTAGE.RESTORATION.MISSING	OUTAGE.DURATION.MISSING	OUTAGE.START.MISSING	CAUSE.CATEGORY.DETAIL.MISSING	RANDOM	NO_RANDOM
2015-07-18 02:00:00	1740	Minnesota	warm	2015-07-19 00:00:00	nan	False	False	False	True	a	False
2012-06-19 04:30:00	2550	Minnesota	normal	2012-06-20 00:00:00	thunderstorm	False	False	False	False	b	False
2010-10-26 20:00:00	3000	Minnesota	cold	2010-10-28 00:00:00	heavy wind	False	False	False	False	b	True
2014-01-24 00:00:00	108653	Wisconsin	cold	2014-04-09 00:00:00	Coal	False	False	False	False	b	True
2013-08-12 11:55:00	4	Oregon	normal	2013-08-12 00:00:00	suspicious activity	False	False	False	False	b	False

Univariate Analysis

Distribution of Outage Duration

In the univariate analysis, we analyze the distributions of the outage duration and of the U.S. states.

This shows that, apart from a very few data points, most durations are short. About half of the outages last less than 1,000 minutes and most last less than 5,000 minutes, while only a few outages last longer than that. It also suggests that the variance may be large, since the minimum and maximum values differ greatly.
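The numbers behind such a plot can be checked with a short summary. The sketch below is only an illustration: the function name summarize_durations and the cut-off values are assumptions, and the sample reuses the durations of the five rows displayed earlier, stored as strings to mimic the object dtype of the column.

```python
import pandas as pd

def summarize_durations(durations, cutoffs=(1000, 5000)):
    """Summarize outage durations given in minutes."""
    # The column has object dtype, so coerce it to numbers first;
    # entries that cannot be parsed become NaN and are dropped.
    minutes = pd.to_numeric(durations, errors="coerce").dropna()
    summary = {
        "min": minutes.min(),
        "median": minutes.median(),
        "max": minutes.max(),
    }
    for cutoff in cutoffs:
        # The mean of a boolean Series is the share of True values.
        summary[f"share_below_{cutoff}"] = (minutes < cutoff).mean()
    return summary

# Durations of the five displayed sample rows, plus one missing value.
sample = pd.Series(["1740", "2550", "3000", "108653", "4", None], dtype="object")
summary = summarize_durations(sample)
```

On the full dataset, a large gap between the median and the maximum would be consistent with the long right tail described above.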

Distribution of Outage U.S. States

Then, we analyze the distribution of the outages across U.S. states in the same way.

This shows that power outages occur frequently in some states (more than 50 times in total) and less frequently in most others, and several states have only a handful of outages. Visually, the distribution resembles a Gaussian distribution.
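Counting outages per state can be sketched as follows. The function name outage_counts and the toy data are hypothetical; the threshold of 50 is taken from the text above and is lowered in the toy call only because the toy data is tiny.

```python
import pandas as pd

def outage_counts(df, state_col="U.S._STATE", frequent_threshold=50):
    """Count outages per state and pick out the frequently affected states."""
    # value_counts sorts the states from most to fewest outages.
    counts = df[state_col].value_counts()
    frequent = counts[counts > frequent_threshold]
    return counts, frequent

# Toy data for illustration only.
toy = pd.DataFrame(
    {"U.S._STATE": ["Minnesota", "Minnesota", "Minnesota", "Wisconsin", "Oregon"]}
)
counts, frequent = outage_counts(toy, frequent_threshold=1)
```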

Bivariate Analysis

Next, we perform a bivariate analysis of the outage duration and the state where the outage occurs.

The scatter plot suggests that there may be a strong association between the outage duration and the state. Since the state is a categorical variable, this is a difference between groups rather than a positive or negative trend.

To explore this further, we calculate the average outage duration (in minutes) for each state and plot a line graph.

The average outage duration appears to differ between states. To check this further, we calculate the median and the quartiles below.

The outage duration differs considerably between states, so we can roughly conclude that duration and state may be strongly associated.

Interesting Aggregates

In the bivariate analysis above, we obtained a grouped table, stat_df. Here we show it again in table form. Below are some rows of this table, which is grouped by state and contains the mean, median, and quartiles of the outage duration.

U.S._STATE	mean	median	Q1	Q3
Mississippi	84	17.5	4	97.5
Hawaii	845.4	543	237	1367
Massachusetts	944.167	211	19	1443.75
Tennessee	1041.97	310	39	1230
New York	6034.96	2880	268.5	8156.25
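A grouped table of this shape could be built roughly as follows. This is a sketch under assumptions: the function name duration_stats_by_state is hypothetical, the quartiles use the default linear interpolation of pandas, and the toy rows reuse the Minnesota and Oregon durations from the sample rows displayed earlier.

```python
import pandas as pd

def duration_stats_by_state(df, state_col="U.S._STATE",
                            duration_col="OUTAGE.DURATION"):
    """Mean, median, and quartiles of the outage duration for each state."""
    # Coerce to numbers because the duration column has object dtype.
    minutes = pd.to_numeric(df[duration_col], errors="coerce")
    grouped = minutes.groupby(df[state_col])
    return pd.DataFrame({
        "mean": grouped.mean(),
        "median": grouped.median(),
        "Q1": grouped.quantile(0.25),
        "Q3": grouped.quantile(0.75),
    })

toy = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Minnesota", "Minnesota", "Oregon"],
    "OUTAGE.DURATION": ["1740", "2550", "3000", "4"],
})
stat_df = duration_stats_by_state(toy)
```

Reporting the median and quartiles next to the mean is useful here because a few very long outages can pull the mean far above the typical duration, as the New York row suggests.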

Furthermore, we want to explore the relationship between the climate category (i.e. the CLIMATE.CATEGORY column) and the outage duration in different states, so we build a pivot table below (again, only some rows are shown).

U.S._STATE	cold	normal	warm
Wisconsin	12545.6	3962.56	1605
Hawaii	1367	205.5	1224.5
Massachusetts	160.333	1159.38	721
Tennessee	1476.58	1015.75	341.857
New York	8914.58	3673.56	6092.83
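A pivot table like this one can be sketched with pivot_table. The text does not state which aggregate the cells hold, so the use of the mean is an assumption, as are the function name mean_duration_pivot and the toy rows (which reuse values from the sample rows displayed earlier).

```python
import pandas as pd

def mean_duration_pivot(df):
    """Mean outage duration per state (rows) and climate category (columns)."""
    # Coerce the object-dtype duration column to numbers before aggregating.
    data = df.assign(
        **{"OUTAGE.DURATION": pd.to_numeric(df["OUTAGE.DURATION"], errors="coerce")}
    )
    return data.pivot_table(
        index="U.S._STATE",
        columns="CLIMATE.CATEGORY",
        values="OUTAGE.DURATION",
        aggfunc="mean",  # assumption: each cell is an average duration
    )

toy = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Minnesota", "Minnesota", "Wisconsin", "Oregon"],
    "CLIMATE.CATEGORY": ["warm", "normal", "cold", "cold", "normal"],
    "OUTAGE.DURATION": ["1740", "2550", "3000", "108653", "4"],
})
pivot = mean_duration_pivot(toy)
```

State and climate combinations with no recorded outage appear as NaN in the result.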

To observe it more intuitively, we plot the table below.

It seems that the outage duration differs considerably between climates: the warmer the climate, the longer an outage lasts. However, this is not necessarily true, so we check it in the hypothesis test below.

Assessment of Missingness

NMAR Analysis

One column in our dataset whose missing values are possibly NMAR is the OUTAGE.DURATION column. Its missingness is most likely caused by whatever causes OUTAGE.START.DATE or, more frequently, OUTAGE.RESTORATION.DATE to be missing: if either of these columns is missing, OUTAGE.DURATION is missing as well. However, the dataset does not record why these values are missing. If we could obtain additional data recording why the start and restoration datetimes of an outage are missing, the missingness could become MAR.

Missingness Dependency

We constructed permutation tests to determine the relationship. We decided to test whether the missingness of OUTAGE.DURATION depends on two columns: OUTAGE.RESTORATION.DATE and CAUSE.CATEGORY.DETAIL.

OUTAGE.DURATION and OUTAGE.RESTORATION.DATE (MAR)

There is little doubt that the missingness of OUTAGE.DURATION depends on the missingness of OUTAGE.RESTORATION.DATE. This is easy to see both from observing the data and from reasoning about how the data are recorded. We now check it by performing a permutation test.

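Since it is a categorical column, we use the TVD (Total Variation Distance) as the test statistic for the permutation test. Recall that, for two categorical distributions X and Y over n categories with proportions X_i and Y_i: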

\[TVD(X,Y) = \dfrac 12\sum_{i=1}^{n} \left| X_i - Y_i \right|\]
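The formula translates directly into a few lines of pandas. The sketch below is one possible implementation; the function name total_variation_distance is illustrative, and missing values are ignored because value_counts drops them by default.

```python
import numpy as np
import pandas as pd

def total_variation_distance(group_a, group_b):
    """TVD between the category proportions of two groups."""
    p = pd.Series(group_a).value_counts(normalize=True)
    q = pd.Series(group_b).value_counts(normalize=True)
    # Align on the union of categories; a category that is absent
    # from one group gets proportion 0 there.
    p, q = p.align(q, fill_value=0)
    return float(np.abs(p - q).sum() / 2)

# Proportions (0.5, 0.5) versus (0.25, 0.75): (0.25 + 0.25) / 2 = 0.25
example = total_variation_distance(["x", "x", "y", "y"], ["x", "y", "y", "y"])
```

The TVD is 0 when the two groups have identical proportions and 1 when they share no category at all.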

The null hypothesis for the permutation test is that the column does not depend on the other column. If the p-value is less than the significance level, we reject the null hypothesis, which suggests that the column depends on the other. Otherwise, we fail to reject the null hypothesis, which means that we do not have enough evidence to conclude that the column depends on the other.

Below, we compute the observed TVD, run a permutation test with 500 rounds, set the significance level to 0.05, calculate the p-value, draw the plot, and reach a conclusion.
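The procedure just described can be sketched as follows. This is an illustration under assumptions rather than the project's actual code: the function names, the fixed random seed, and the toy data are hypothetical, and the test shuffles the missingness indicator while comparing the category proportions of the "missing" and "not missing" groups.

```python
import numpy as np
import pandas as pd

def tvd(a, b):
    """Total variation distance between two categorical Series."""
    p = a.value_counts(normalize=True)
    q = b.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0)
    return float(np.abs(p - q).sum() / 2)

def missingness_permutation_test(categories, is_missing, rounds=500, seed=0):
    """Return the observed TVD and the permutation p-value."""
    categories = pd.Series(categories).reset_index(drop=True)
    is_missing = np.asarray(is_missing, dtype=bool)
    rng = np.random.default_rng(seed)
    observed = tvd(categories[is_missing], categories[~is_missing])
    simulated = np.empty(rounds)
    for i in range(rounds):
        # Shuffling the indicator breaks any link to the categories.
        shuffled = rng.permutation(is_missing)
        simulated[i] = tvd(categories[shuffled], categories[~shuffled])
    # Share of shuffles at least as extreme as the observed statistic.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value

# Toy data: the restoration date is missing exactly when the duration is.
duration_missing = [True] * 20 + [False] * 20
restoration_missing = [True] * 20 + [False] * 20
observed, p_value = missingness_permutation_test(
    restoration_missing, duration_missing
)
```

With perfectly aligned toy indicators the observed TVD is 1, and essentially no shuffle reaches it, so the p-value falls below the 5% significance level.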

The result is shown below.

The result shows a p-value of 0.00. Since this is below the significance level of 5%, we reject the null hypothesis. Therefore, we conclude that the missingness of OUTAGE.DURATION very likely depends on the missingness of OUTAGE.RESTORATION.DATE.

OUTAGE.DURATION and CAUSE.CATEGORY.DETAIL (MCAR)

On the other hand, we expect the missingness of CAUSE.CATEGORY.DETAIL to have nothing to do with the missingness of OUTAGE.DURATION, since it only provides additional information about CAUSE.CATEGORY. So we should find that the missingness of OUTAGE.DURATION is independent of the missingness of CAUSE.CATEGORY.DETAIL. Below, we check this conjecture.

The plot shows a p-value of 0.89, which is much greater than the significance level of 5%, so we fail to reject the null hypothesis. Hence, we find no evidence that the missingness of OUTAGE.DURATION depends on the missingness of CAUSE.CATEGORY.DETAIL.

Hypothesis Testing

Permutation Test

Returning to our investigation topic, we ask whether the climate category affects the duration of a power outage. Recalling the plot in the Interesting Aggregates part, we intuitively expect some relationship between the climate and the duration of power outages. But observation alone is not a reliable indicator, so we run a hypothesis test.

We run a permutation test comparing the distribution of outage durations in warm climate areas with the distribution in cold climate areas, to see whether durations actually increase in warmer areas.

Null hypothesis: the outage durations in the warm climate and the cold climate come from the same distribution.

Alternative hypothesis: the outage durations in the warm climate and the cold climate come from different distributions.

Significance level: 5%.

Number of permutation test rounds: 500.

Test statistic: the outage duration is numeric rather than categorical. Recall that numeric data uses the difference of means, while categorical data uses the TVD. Therefore, we use the absolute difference between the means of the two groups as our test statistic. Recall the formula:

\[diff\_of\_means(A,B) = \left| \dfrac 1n\sum_{i=1}^{n} A_i - \dfrac 1m\sum_{i=1}^{m} B_i \right|\]

In a similar way, we perform the permutation test below.
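A minimal sketch of such a test is given here, assuming the warm and cold durations have already been extracted as numeric arrays with missing values removed. The function names, the fixed seed, and the toy numbers are illustrative only and are not results from the dataset.

```python
import numpy as np

def diff_of_means(a, b):
    """Absolute difference between the means of two groups."""
    return abs(np.mean(a) - np.mean(b))

def permutation_test_means(group_a, group_b, rounds=500, seed=0):
    """Return the observed statistic and the permutation p-value."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = diff_of_means(a, b)
    pooled = np.concatenate([a, b])
    simulated = np.empty(rounds)
    for i in range(rounds):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled values and split them by group size.
        shuffled = rng.permutation(pooled)
        simulated[i] = diff_of_means(shuffled[:len(a)], shuffled[len(a):])
    p_value = float(np.mean(simulated >= observed))
    return float(observed), p_value

# Toy durations (hypothetical): the two groups are clearly separated.
warm = [10, 12, 11, 13, 9, 10, 12, 11]
cold = [110, 112, 111, 113, 109, 110, 112, 111]
observed, p_value = permutation_test_means(warm, cold)
```

The p-value is the share of the 500 shuffles whose statistic is at least as large as the observed one, which is then compared with the 5% significance level.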

Conclusion

Contrary to our expectation, the result shows a p-value of 0.71, which is larger than 0.05, so we fail to reject the null hypothesis. We therefore have no evidence that the outage durations in the warm climate and the cold climate come from different distributions. That is, our test does not show a relationship between the outage duration and the climate.

power_outages_stats

A sample homework for Project 3 of the DSC80 course. Some practice with Python pandas and hypothesis/permutation tests.