TR2010-064

Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier


    •  Yerazunis, W.S., Kato, M., Kori, M., Shibata, H., Hackenberg, K., "Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier", Black Hat Technical Security Conference, July 2010.
      BibTeX:
      @inproceedings{Yerazunis2010jul,
        author = {Yerazunis, W.S. and Kato, M. and Kori, M. and Shibata, H. and Hackenberg, K.},
        title = {Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier},
        booktitle = {Black Hat Technical Security Conference},
        year = 2010,
        month = jul,
        url = {https://www.merl.com/publications/TR2010-064}
      }
  • Research Areas: Artificial Intelligence, Data Analytics

Abstract:

In this whitepaper we consider the problem of outbound filtering of emails to prevent accidental leakage of confidential information. We examine how to do this with the GPL-licensed open-source spam filter CRM114, and we test the filter's accuracy against a corpus of 10,000+ hand-classified emails in Japanese (both confidential and non-confidential). We look at the moving parts involved in these filters and how they can be set up. The results show that a hybrid of multiple CRM114 filters outperforms a human-crafted regular-expression filter by nearly 100x in recall, detecting more than 99.9% of confidential documents with a simultaneous false alarm rate of less than 6%. Because the programmers who created the machine-learning programs cannot read or write Japanese, this problem is an almost ideal case of Searle's "Chinese Room" problem.
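
The two figures quoted in the abstract, recall on confidential documents and false alarm rate on non-confidential ones, can be made concrete with a short Python sketch. Everything in it is illustrative: the keyword filters are toy stand-ins for trained classifiers, the corpus is invented, and the "flag if any member filter flags" rule is only an assumed way of combining filters, since the abstract does not say how the hybrid is built. None of this uses CRM114 itself.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps a document to True when it flags the document as confidential.
Classifier = Callable[[str], bool]


def keyword_filter(keywords: List[str]) -> Classifier:
    """Toy stand-in for a trained filter: flags text containing any keyword."""
    def classify(text: str) -> bool:
        return any(word in text for word in keywords)
    return classify


def flag_if_any(filters: List[Classifier]) -> Classifier:
    """Assumed hybrid rule: flag a document when any member filter flags it."""
    def hybrid(text: str) -> bool:
        return any(f(text) for f in filters)
    return hybrid


def recall_and_false_alarm(
    clf: Classifier, labelled: Iterable[Tuple[str, bool]]
) -> Tuple[float, float]:
    """Return (recall on confidential docs, false alarm rate on the rest)."""
    tp = fn = fp = tn = 0
    for text, is_confidential in labelled:
        flagged = clf(text)
        if is_confidential:
            if flagged:
                tp += 1  # confidential and caught
            else:
                fn += 1  # confidential but leaked
        else:
            if flagged:
                fp += 1  # harmless but blocked (false alarm)
            else:
                tn += 1  # harmless and passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_alarm


# Invented toy corpus: (document text, is_confidential).
TOY_CORPUS = [
    ("internal budget forecast", True),
    ("secret project roadmap", True),
    ("customer contract draft", True),
    ("lunch menu for friday", False),
    ("project kickoff party", False),
    ("public press release", False),
    ("office closed monday", False),
]

filter_a = keyword_filter(["secret", "budget"])
filter_b = keyword_filter(["contract", "project"])
hybrid = flag_if_any([filter_a, filter_b])

print("filter_a:", recall_and_false_alarm(filter_a, TOY_CORPUS))
print("hybrid:  ", recall_and_false_alarm(hybrid, TOY_CORPUS))
```

On this toy data the single filter catches two of the three confidential documents with no false alarms, while the any-flag hybrid catches all three at the cost of one false alarm in four non-confidential documents. It illustrates the general trade-off between recall and false alarm rate that the abstract's numbers describe, not the paper's actual method or results.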

 

  • Related News & Events

    •  NEWS    MERL researcher's spam filter finds automobile safety defects at NHTSA
      Date: June 25, 2015
      MERL Contact: William S. Yerazunis
      Research Area: Data Analytics
      Brief
      • The CRM114 Discriminator, an open-source spam filter / text classifier created by William Yerazunis in MERL's Data Analytics group, continues to turn up in interesting places - and apparently one of them is in the US Department of Transportation's process for analysis of car safety defect reports.

        Although CRM114 is usually used as a spam filter, it has also been used to analyze resumes for jobseekers, scan outgoing emails for accidental leaks of confidential information, peruse blogs for relevance, scan syslog files for interesting events, and now, apparently, search complaints sent to NHTSA for safety-related vehicle malfunctions.
    •  NEWS    Black Hat Technical Security Conference 2010: publication by William S. Yerazunis and others
      Date: July 24, 2010
      Where: Black Hat Technical Security Conference
      MERL Contact: William S. Yerazunis
      Research Area: Data Analytics
      Brief
      • The paper "Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier" by Yerazunis, W.S., Kato, M., Kori, M., Shibata, H. and Hackenberg, K. was presented at the Black Hat Technical Security Conference.