The Singularity Is Not All That Near

By now, no one with electricity hasn’t heard about the NSA data center that is planned for Utah. The first mention of it appeared in the wild as far back as May 2009 in H.R. 2346 (“Making Supplemental Appropriations for the Fiscal Year Ending September 30, 2009, and for Other Purposes”; search for ‘Utah’). Digging just a bit deeper takes the timeline back as far as April, in a supplemental White House document that included Department of Defense appropriations (see page 66) for “site preparation in advance of data center facilities construction projects to be constructed at the Utah National Guard site at Camp Williams, Utah”. The same document also provides some evidence of the seriousness of our nation’s posture on cyberwarfare, as page 67 explains: “The FY 2009 Defense Appropriations Act funded a National Security Agency project in the Operation and Maintenance, Defense-wide account. It has been determined that the project is more properly funded through the Military Construction, Defense-wide account. This provision would realign these funds from Operations and Maintenance, Defense-Wide to Military Construction, Defense-Wide.”

Understandably, Utah has been abuzz about the NSA data center for some time. Performing a search for “NSA” on the Salt Lake Tribune site yields a number of variably interesting results, each shedding bits of light on the plan and its progress. The earliest piece, dating back to July 1st, does a good job rationalizing the decision to build the massive data center in Utah, opening with: “Hoping to protect its top-secret operations by decentralizing its massive computer hubs…” and later explaining that: “The NSA’s heavily automated computerized operations have for years been based at Fort Meade, Maryland, but the agency began looking to decentralize its efforts following the terrorist attacks of Sept. 11, 2001. Propelling that desire was the insatiable energy appetite of the agency’s computers. In 2006, the Baltimore Sun reported that the NSA — Baltimore Gas & Electric’s biggest customer — had maxed out the local grid and could not bring online several supercomputers it needed to expand its operations.” Environmentalists will both mourn and be fueled by the juicy tidbit from this same piece that the data center “will also require at least 65 megawatts of power — about the same amount used by every home in Salt Lake City combined.”

Curiously, an additional piece of historical information the article fails to mention as possible site-selection rationalization is that Utah was previously selected by the NSA, back in February of 2006, for the linguistic capabilities of its returned missionaries. It would not be at all surprising if this was a factor in Utah’s being the place for what’s been described as a “collection point for surveillance of domestic and international telecommunications”.
So as they say: السلام عليكم (“peace be upon you”), 企鹅性骚扰 (“penguin sexual harassment”), יישר כח (“more power to you”), etc., Utah.

The piece from July 2 provides some more information on the purpose, cost, and composition of the data center: “The supercomputers in the center will be part of the NSA’s signal intelligence program, which seeks to ‘gain a decisive information advantage for the nation and our allies under all circumstances’” and “President Barack Obama last week signed a spending bill that included $181 million for preparatory construction of the Camp Williams facility and tentatively agreed to two future phases of construction that could cost $800 million each” and “About $70 million has been budgeted for security, including vehicle inspection facilities, fencing, surveillance and separate visitor control centers for construction and technical personnel.”

I can’t yet say anything about the collection of supercomputers, but the eyewitness commentary I can provide as a commuter who drives past the planned construction site every day is that it seems they’ve already spent more than $70 million on fencing alone, and it’s mostly resulted in heaping piles of deer roadkill. Inexplicably, the rutting deer seem to excel at finding their way through the fence to get onto the road, but can’t seem (literally, to save their own lives) to find their way back through to get off the road. Perhaps experimentation on bloated stinking mangled deer is somehow part of the grand government conspiracy.

July 7 offers up two pieces. The first objectively treats the data center as little more than fiscal stimulus (construction is planned to employ 4,000 to 5,000 people), while the second seems its calculated social counterbalance, offering up the obligatorily banal “we’ll follow orders and won’t ask any pesky questions about civil rights” shtick. At least the comments prove to be far more entertaining than the articles themselves.

The piece from October 23 was the first “mainstream” report in the Tribune on the event, getting somewhat lost in the echo chamber of reports and blogs and tweets that hit at about the same time, triggered by the Office of the Director of National Intelligence press conference (video and transcript). Win-win-win!

It’s an excerpt from this piece that is among the most important of all the coverage offered, noting the crucially irreplaceable role of people in the technologically-driven field of cybersecurity, and citing a report that is recommended reading for anyone in the information security field or the Intelligence Community:

But only a very small slice of the information stored at the center in southern Salt Lake County will ever be scanned by human eyes. And that’s the reality for most of what is collected by the nation’s other spy agencies as well. In a report commissioned by the Department of Defense last year, the Jason defense advisory group warned that the millions of terabytes of data coming into U.S. spy agencies through ever-improving sensors are being wasted. … It cited Massachusetts Institute of Technology defense expert Pete Rustan, who complained that “70 percent of the data we collect is falling on the floor” [because sensor data was failing to be captured and processed].

“We have been blessed with a lot more sensor-type capabilities,” [said George Eanes, vice president of business development at Modus Operandi, a Florida software company that serves the defense intelligence community.] “That can be a big advantage to have in the theater, but it’s just data. You still got to have the humans in the loop before you make any decisions.”

Data Analysis Challenges

The same report cited above was also recently referenced by FAS (Federation of American Scientists) through their Secrecy News project (“Through research, advocacy, and public education, the FAS Project on Government Secrecy works to challenge excessive government secrecy and to promote public oversight”) in a post on the challenges of dealing with large data sets. The December 2008 JASON (not an acronym) report titled “Data Analysis Challenges” is a must read. Seriously – read it. Notable concepts from this “study commissioned by the Department of Defense (DOD) and the Intelligence Community (IC) on the emerging challenges of data analysis in the face of increasing capability of DOD/IC sensors”:

As the amount of data captured by these sensors grows, the difficulty in storing, analyzing, and fusing the sensor data becomes increasingly significant with the challenge being further complicated by the growing ubiquity of these sensors.  (page 1)

The JASON report opens by summarily describing the challenges facing the Intelligence Community as storing, analyzing and fusing the ever-increasing amounts of data. Storing the data, obviously, should be recognized as foundational to anything but the most cursory analysis, the kind of superficial examination that the report describes as “rapid time scale” (more on this later). Yet despite storage being an unmistakable prerequisite to any kind of deeper, longer time scale analysis, there are today technology vendors hawking data-analysis wares that fail to meet this basic requirement. Because they haven’t figured out how to solve the technical challenge, they attempt to dismiss their critical deficiency with one of two arguments from ignorance: either that high-speed data capture is not possible, or that it’s not necessary.

No one would disagree that in intelligence work, data analysis is more productive than raw data capture, but likewise, no one should suggest that meaningful data analysis is possible today without having all of the data to analyze. As the report states on page 3: “the notion of fully automated analysis is today at best a distant reality.” Companies making a claim that effectively amounts to “we analyze 100% of the data that we don’t fail to capture” do nothing but betray their lack of understanding of the requirements of the Intelligence Community. Best-effort approaches can make sense when coping with current real limitations of computation or storage, but only when employed sensibly; failing to store all relevant data means never being able to analyze that un-captured data, whereas failing to analyze captured data superficially in real time still means being able to analyze it more deeply subsequent to capture.

But storage should really only be considered table stakes. The practical utility of any storage system comes from the combination of efficient capture *and* efficient retrieval. The capture of the data should be considered the relatively easy part, and the report correctly makes clear that “the main issues in managing this volume of data are not rooted in hardware but in software” (page 23). It goes on to offer an example from the Pan-STARRS project of how commodity-off-the-shelf (COTS) hardware can be used to “serve 3 Petabytes for roughly $1M”. (The Pan-STARRS “Distilling Science from Petabytes” presentation itself is cited in the report’s end notes. A web search by name will turn up a link to the presentation, which is also worth a glance. It’s worth noting the bullet on slide 10 which advises: “Science goals require all the data to be accessible and useful: waste no photons”.) Further, the JASON report describes some of the greater challenges when dealing with these quantities of data, namely, those of managing large data sets:

These include dealing with the complexity in the name-space that is introduced by the enormous capacity of these high performance file systems, and managing the vast archives of data that are produced by both simulation and data collection systems. Old paradigms for locating data based on a simple file path name break down when the number of files exceeds 10^9 as they now frequently do. Users have expressed the desire to locate data based on other properties of the data beyond its file name, including but not limited to its contents, its type and other semantic properties. Such a location service will require new indexing techniques that are currently subjects of academic research. (page 28)

In addition to limitations of conventional filesystems, the report also describes frustrations with commercially available databases, focusing on the paradigmatic experiences of the scientific community:

Broadly speaking the segment of the scientific community that is pushing the forefront of large-data science has been disappointed with the capability and the performance of existing databases. Most projects have either resorted to partitioned smaller databases, or to a hybrid scheme where metadata are stored in the database, along with pointers to the data files. In this hybrid scheme the actual data are not stored in the database, and SQL queries are run on either the metadata or on some aggregated statistical quantities. (page 61)
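The hybrid scheme described here is easy to sketch: metadata and a pointer to each data file go into the database, the bulk data stays outside it, and SQL runs only over the metadata. A minimal illustration using Python’s built-in sqlite3 (the table layout, attribute names, and file paths are my own inventions, not from the report):

```python
import sqlite3

# Hybrid scheme: only metadata and a file pointer live in the database;
# the bulk data stays in ordinary files outside it.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE captures (
    start_ts   INTEGER,  -- capture window start (epoch seconds)
    end_ts     INTEGER,  -- capture window end
    src_ip     TEXT,     -- an aggregated attribute, not the packets themselves
    byte_count INTEGER,
    path       TEXT      -- pointer to the data file on disk
)""")

db.executemany(
    "INSERT INTO captures VALUES (?, ?, ?, ?, ?)",
    [
        (1000, 1060, "10.0.0.5", 48_000_000, "/capture/seg-0001.pcap"),
        (1060, 1120, "10.0.0.9", 51_000_000, "/capture/seg-0002.pcap"),
        (1120, 1180, "10.0.0.5", 47_500_000, "/capture/seg-0003.pcap"),
    ],
)

# SQL runs over the metadata only; the query returns file pointers, and
# the actual (much larger) data would then be read from those files.
rows = db.execute(
    "SELECT path FROM captures WHERE src_ip = ? ORDER BY start_ts",
    ("10.0.0.5",),
).fetchall()
print([r[0] for r in rows])
```

A query like this touches only the small metadata table; the expensive step of reading the actual data from the returned paths happens afterwards, and only for the files that matched.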

The authors of the report were astute to make this connection, acknowledging in the executive summary (page 1) that “it is of value to consider the evolution of data storage requirements arising from data-intensive work in scientific fields such as high energy physics or astronomy.” This perspective strongly validates some of Solera Networks’ inventions in the areas of massively scalable (DSFS) and attribute-based (GaugeFS) filesystems and databases (SoleraDB) (details available under NDA); it also helps illuminate the unique value of having a Chief Scientist (Matt Wood) who is hours from completing his Theoretical Physics PhD work in the Telescope Array Physics group at the University of Utah.

As a quick exercise to appreciate the value of real solutions to the problems encountered with traditional filesystems and databases when attempting to capture and use large sets of network traffic, consider the following:

So you’ve captured just over 3 days of traffic on your generally 1/3 utilized 10Gbps network:

  • That’s about 100TB of data
  • For around 183 billion “average” sized packets (600 bytes)
  • At an average of 650,000 packets per second

And now you want to find all the packets from a given IP address

  • Do you read through 50 x 2TB or 50,000 x 2GB files?
  • Wouldn’t it be helpful to have an index?
  • Which databases efficiently handle 650,000 inserts per second?
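Those round figures can be sanity-checked with a few lines of arithmetic. This is my own back-of-envelope calculation from the stated averages; depending on how each figure was rounded, the results land in the same ballpark as the numbers in the list above:

```python
# Back-of-envelope check of the capture scenario above, using the post's
# stated averages (10 Gbps link, one-third utilized, 600-byte packets).
link_bps = 10e9          # 10 Gbps link
utilization = 1 / 3      # roughly one-third utilized
avg_packet_bytes = 600   # "average" packet size

bytes_per_sec = link_bps * utilization / 8    # sustained capture rate
seconds_for_100tb = 100e12 / bytes_per_sec    # time to accumulate 100 TB
days = seconds_for_100tb / 86_400             # roughly three days

packets = 100e12 / avg_packet_bytes           # total packets in 100 TB
pps = packets / seconds_for_100tb             # sustained insert rate

print(f"{days:.1f} days, {packets/1e9:.0f} billion packets, {pps:,.0f} pkt/s")
```

The sustained packet rate is the punchline: whatever index you build has to keep up with hundreds of thousands of inserts per second, continuously.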

Time Scales

As mentioned earlier, there are different time scales on which data analysis can be performed. Sensitivity to different time scales is important, and the report notes this in the executive summary: “The key challenge is to empower the analyst by ensuring that results requiring rapid response are made available as quickly as possible while also insuring that more long term activities such as forensic analysis are adequately supported.” In greater detail, it broadly distinguishes three cases (page 51):

Long time scale: Here there is no critical timeliness requirement and one may want to establish results on a time scale of perhaps days. Applications which match well include retrospective analysis of multiple data sources, fusing of new data to update existing models such as geographic information systems or to establish correlations among events recorded through different information gathering modalities.

Medium time scale: Such a time scale corresponds to activities like online analysis with well structured data. Typically this is accomplished in an interactive way using a client-server or “pull based” approach.

Rapid time scale: In this scenario, one wants to be cued immediately for the occurrence of critical events. The time scale here may be very near real time. We will argue that a “push based” or event driven architecture is appropriate here.

The long time scale section makes cloud-computing recommendations, pointing to MapReduce / Hadoop, and also makes the wise suggestion “to move computation close to the data rather than move the data to a central point where computation takes place. This minimizes congestion and is more scalable in that there are fewer load imbalance bottlenecks due to data motion or computation” (page 59). With regard to network forensics, it would be reasonable to consider such tasks as cryptanalysis, steganalysis, and statistical data mining as likely long time scale candidates.
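The “move computation close to the data” idea can be sketched in miniature: each storage node runs the map step over its own local shard, and only small partial summaries, never the raw data, travel to the reducer. The node and shard layout below is invented for illustration:

```python
from collections import Counter
from functools import reduce

# Each node holds its own shard of the data; in a real deployment these
# would be large local files, not in-memory lists.
shards = {
    "node-a": ["10.0.0.5", "10.0.0.9", "10.0.0.5"],
    "node-b": ["10.0.0.9", "10.0.0.9"],
    "node-c": ["10.0.0.5"],
}

def map_local(shard):
    """Map step: runs where the shard lives, returns a small summary."""
    return Counter(shard)

def merge_partials(a, b):
    """Reduce step: merging summaries is all that crosses the network."""
    return a + b

# Only the compact per-node Counters move; the raw records stay put.
partials = [map_local(shard) for shard in shards.values()]
totals = reduce(merge_partials, partials)
print(totals["10.0.0.5"], totals["10.0.0.9"])  # 3 3
```

The payoff is in the data volumes: each node ships a summary a few bytes per distinct key, regardless of how many terabytes its shard holds.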

The medium time scale section recommends the use of a service oriented architecture (SOA – the fundable version of RPC), noting its attractiveness in “applications where large data stores need to interoperate and where fusion of their data is required at a higher level.” It covers IARPA‘s open-source Blackbook 2 project (“a graph analytic processing platform for semantic web”), which appears to be a non-commercial alternative to the impressively scalable and extensible Palantir data analysis platform (you can get a feel for it by playing an online game they provide, or using it to work with data from …). In the spirit of the JASON report’s recommendation of modularity and sharing, and consistent with Solera Networks’ practice on platform collaboration, Palantir avers a “fundamental belief that this openness will lead to long-term customer success over inflexible, closed, and proprietary solutions.” Most sorts of collaborative data analysis could fit into the medium time scale, and scalable, high-performing, intuitive platforms will make it easier for human analysts to find interesting and valuable results in the data.

The rapid time scale is also described by the report as an “event driven architecture” (EDA) where an event is “simply a significant change of state associated with some data that is being constantly monitored.” The report differentiates an EDA from an SOA by explaining

EDA applications use a publish/subscribe model where loosely coupled software components subscribe to event streams and then either react by actuating some response or by emitting subsequent events to other components. The key idea behind this approach is asynchronous broadcasting or “push” of events.
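A publish/subscribe event bus of the kind described above can be sketched in a few lines. The stream names, events, and handlers here are invented for illustration, and synchronous dispatch stands in for the asynchronous “push” of a real system:

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub hub: components stay loosely coupled through it."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, stream, handler):
        self.subscribers[stream].append(handler)

    def publish(self, stream, event):
        # A real EDA would push asynchronously; synchronous loop for brevity.
        for handler in self.subscribers[stream]:
            handler(event)

bus = EventBus()
alerts = []

def on_signature_match(event):
    # React to a rapid-time-scale detection by emitting a derived event,
    # exactly the "emit subsequent events to other components" pattern.
    bus.publish("alerts", {"severity": "high", "detail": event})

bus.subscribe("signature-match", on_signature_match)
bus.subscribe("alerts", alerts.append)

bus.publish("signature-match", {"rule": "known-exploit", "src": "10.0.0.5"})
print(alerts[0]["severity"])  # high
```

Neither the detector nor the alert consumer knows the other exists; each only knows the stream it publishes to or subscribes on, which is what makes the coupling loose.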

This fairly accurately describes the sort of integration that exists between Solera Networks platforms and other event-generating platforms such as SonicWALL and ArcSight, where pre-classified security events are detected on a rapid time scale through DPI pattern-matching, or security information/log aggregation. Since the platforms generating these sorts of events (either directly, or indirectly, e.g. through a SIEM) are generally in-line traffic-processing devices, their classification of events must occur in real time (i.e. with latencies imperceptible to users), and cannot today be compared to the deeper sorts of data mining analyses that are possible in medium to long time scales. That is not to say that longer time scales are better than rapid time scales, but rather that both are necessary. I am simply recognizing the difference that exists today between a necessarily fast-twitch intrusion detection/prevention system, and a necessarily more persistent data analysis platform. Rapid time scale, event driven architectures are very good at detecting and preventing reconnaissance attempts, denial of service attacks, and known-exploit attacks easily identifiable by machines, and this type of defense is essential to protecting the day-to-day operation of information systems against tools like those found on milw0rm or exploit-db. But it requires longer time scales and the neocortex of a human analyst to detect the unique and unpredictable actions executed by a competent and determined criminal or terrorist agent.

That’s A Very Expensive Cat

I don’t take the fact that Utah’s NSA data center is expected to include more than 1 million square feet of space staffed by only 200 people as an indication that the NSA believes computers provide more value than analysts. Instead, I see these numbers as acknowledgment of a recognized shortage of qualified analysts. Whether it’s a DHS initiative to hire 1,000 cybersecurity experts over the next 3 years, or a Booz Allen Hamilton study stating that “There is a radical shortage of people who can fight in cyber space: penetration testers, aggressors and vulnerability analysts… My sense is it is an order of magnitude short, a factor of 10 short,” there’s no shortage of evidence that we need more human analysts. Today’s silicon and algorithms, fast and clever as they are, get ever better at assisting humans, but they are still far from being up to the task of understanding or analyzing the behaviors and actions (particularly the pathological behaviors and actions) of other humans.

For perspective on where state-of-the-art computing is relative to human analytic capabilities, I’ll close with one of the more interesting announcements that just came out of SC09:

Scientists, at IBM Research-Almaden, in collaboration with colleagues from Lawrence Berkeley National Lab, have performed the first near real-time cortical simulation of the brain that exceeds the scale of a cat cortex and contains 1 billion spiking neurons and 10 trillion individual learning synapses.

The simulation was performed using the cortical simulator on Lawrence Livermore National Lab’s Dawn Blue Gene/P supercomputer with 147,456 CPUs and 144 terabytes of main memory.

We need more human analysts, and they need the government, academic, and private sectors to understand their needs well enough to provide them genuinely functional, constantly evolving tools. Kurzweil (either unfortunately or fortunately) was off by a few years. We still have quite a while to go.
