User:Jesse/Vegas2011/TestingStaticAnalysis

Talk title: Sticking to the Facts: Scientific Study of Static Analysis Tools

Background

  • NSA "Center for Assured Software" (cas@nsa.gov)
  • Purpose: "To increase the confidence that software used within DoD's critical systems is free from intentional and unintentional exploitable vulnerabilities"
  • This static analysis team doesn't run the analyses itself; instead, it recommends which tools to use and what counts as "appropriate use of automation".
  • The results of the study are anonymized, but the testcases are publicly available, so you can re-run the study yourself.

Talk

Advantages of static analysis

  • good at finding some types of issues
  • examines all parts of the software, not just parts that are hit in some execution
  • automated, scalable, repeatable: can be used early and often

Disadvantages of static analysis

  • the tools we're looking at don't report positive properties (lack of bugs)
  • Many false positives; need human confirmation
  • May report issues you consider unimportant
  • bad at finding some types of flaws, especially design flaws
  • potential for false sense of security: false negatives, exaggerated vendor claims

Study focus

  • tools that identify and report issues in software ("code weakness analysis tools", "static application security testing tools") (presumably in contrast to formal methods proof programs)
  • for C/C++ and Java
    • including standard libraries but not popular third-party libraries
    • including Windows-specific libraries (but not other OSes)
  • looking only at results from the tools (including false positives), but not cost, speed, customization, or usability (IDE integration, reporting across builds, understandability of messages)

Study methods

  • default tool configuration
  • no code annotation
  • synthetic testcases
    • often with both flawed and non-flawed variants: the best static analysis will complain only about the flawed variant
    • for most CWEs
      • other than design issues and meta-issues
      • grouped into "weakness classes", such as "buffer handling" and "control flow management"
    • with various control flow and data flow patterns, to test the "depth" of each tool's analysis (for example, a scanf-to-printf source-and-sink pattern; see the sketch after this list). Templates generated all the variants for all the relevant CWEs :)
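
A minimal sketch of what one of those flawed/non-flawed testcase pairs might look like, using the scanf-to-printf source-and-sink pattern for a format string weakness (CWE-134). This is my own illustration, not a testcase from the CAS suite; a good tool should flag bad() and stay quiet about good():

  #include <stdio.h>

  /* Flawed variant: user input from scanf flows into printf's format
   * argument (CWE-134, uncontrolled format string). */
  static void bad(void)
  {
      char data[100] = "";
      if (scanf("%99s", data) == 1)
          printf(data);           /* flaw: data used as the format string */
  }

  /* Non-flawed variant: same source and sink, but the input is passed as
   * an argument to a constant format string. */
  static void good(void)
  {
      char data[100] = "";
      if (scanf("%99s", data) == 1)
          printf("%s", data);     /* safe: constant format string */
  }

  int main(void)
  {
      bad();
      good();
      return 0;
  }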

Scoring

  • scoring doesn't adjust for how common or serious each flaw type is, but maybe MITRE's CWE scoring system can help.
  • scoring penalties for false positives vs false negatives?
  • only counts tool reports that "match" the CWE being tested (so a "possible null deref" report is ignored when the testcase is a format string vulnerability). wtf? I guess this helps with inadvertent bugs in the testcases, but it seriously lowers the measured false-positive rate and probably also the true-positive rates.
  • precision: percentage of results that are true positive (TP / (TP+FP))
  • recall: percentage of flaws that a tool correctly reported (sensitivity, soundness) (TP / (TP+FN)) = (TP / flaws)
  • F-score: the harmonic mean of precision and recall, (2PR) / (P + R), which tends toward the lower of the two values (see the worked example after this list)
  • To avoid giving a high score to a "stupid grep" tool, only give points when a tool correctly identifies the bad version as buggy and the good version as okay.
    • D = (Discriminations / Flaws)
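
A worked sketch of the scoring math, using made-up counts (none of these numbers come from the study); it also shows how the harmonic mean skews toward the lower of precision and recall:

  #include <stdio.h>

  int main(void)
  {
      /* Hypothetical tool results against a suite of 200 flaws. */
      double tp = 60.0;               /* flawed variants correctly reported   */
      double fp = 40.0;               /* non-flawed variants wrongly reported */
      double fn = 140.0;              /* flawed variants missed               */
      double discriminations = 50.0;  /* flaw reported AND good variant clean */
      double flaws = tp + fn;         /* 200 total flaws in the suite         */

      double precision = tp / (tp + fp);                          /* 0.60 */
      double recall    = tp / (tp + fn);                          /* 0.30 */
      double f_score   = 2 * precision * recall
                         / (precision + recall);                  /* 0.40 */
      double disc_rate = discriminations / flaws;                 /* 0.25 */

      printf("precision = %.2f\n", precision);
      printf("recall    = %.2f\n", recall);
      printf("F-score   = %.2f  (below the 0.45 arithmetic mean)\n", f_score);
      printf("disc rate = %.2f  (never higher than recall)\n", disc_rate);
      return 0;
  }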

Conclusions

  • Tools are not interchangeable. Many are stronger in some areas than in others.
    • So using complementary tools may make sense for important projects (or for security teams)
  • Tools perform differently on different languages. Even if a tool supports both C++ and Java, it might do much better on C++.
    • So if you ask "which tool will work best for C#", we don't have a clue
  • Every tool failed to report a significant portion of the flaws.
    • The average tool even tried to cover only 8 of the 13 weakness classes
    • The average tool found only 22% of the flaws in the classes it did cover
  • If you ask "How many tools found this flaw?" across all the testcases, the largest slices are "no tools" and "exactly one tool". Very few flaws were found by 4 or more tools!
  • Java tools slightly worse than C++ tools? Despite/Because of not having to deal with buffer overflow testcases?
  • Java tools completely fall over on discrimination (no whole-program analysis - dynamic linking assumption?)

Planned future work

  • The next version of the study, which will start in October 2011, will
    • update testcases based on feedback
    • solicit input from vendors as to appropriate configuration (at least enabling all the weakness classes!)
    • add about 5 tools

Jesse's opinions

Lame that they anonymized the names of the tools.

I think the best thing to come out of this talk is that there's now a benchmark for tools to compete on.

They seemed to have a grudge against open source. "We'd expect commercial tools to do better." Only actually tried one open-source tool for C++. Didn't even try Linux tools. Anonymized the results, except for saying which tools are open-source.