Effects of network trace sampling methods on privacy and utility metrics
report
Abstract
Researchers studying computer networks rely on the availability of traffic trace data collected from live production networks. Those choosing to share trace data with colleagues must first remove or otherwise anonymize sensitive information. This process, called sanitization, represents a tradeoff between the removal of information in the interest of identity protection and the preservation of data within the trace that is most relevant to researchers. While several metrics exist to quantify this privacy-utility tradeoff, they are often computationally expensive. Computing these metrics using a sample of the trace, rather than the entire input trace, could potentially save precious time and space resources, provided the accuracy of these values does not suffer. In this paper, we examine several simple sampling methods to discover their effects on measurement of the privacy-utility tradeoff when anonymizing network traces prior to their sharing or publication. After sanitizing a small sample trace collected from the Dartmouth College wireless network, we tested the relative accuracy of a variety of previously implemented packet and flow-sampling methods on a few existing privacy and utility metrics. This analysis led us to conclude that, for our test trace, no single sampling method we examined allowed us to accurately measure the trade-off, and that some sampling methods can produce grossly inaccurate estimates of those values. We were unable to draw conclusions on the use of packet versus flow sampling in these instances.