[e2e] network coding and spam and anonymous email...

Saikat Guha saikat at cs.cornell.edu
Thu Jul 6 07:30:07 PDT 2006

On Thu, 2006-07-06 at 10:54 +0100, Jon Crowcroft wrote:
> one could _code_ legitimate messages simply as a set of references
> to spam - the nice thing about this is that there is so much spam that
> it acts as fairly uniform random cover traffic

Awesome idea -- change the role of spam from background noise to a
carrier signal!

A first-stab feasibility analysis is promising. Based on my corpus of
spam and e2e mails, spam has a feature set (think unique words) of
roughly 300K, while e2e has only 30K. Unfortunately, they have only 7.5K
words in common, so a simple mapping may not be sufficient, but one can
easily construct a dictionary that maps the basis vectors for spam onto
the basis vectors for e2e. With that mapping, a legitimate email can
become a linear combination of the spam messages.

If the mapping is based on the most frequent words, here is what it
might look like.

For e2e (words in decreasing frequency of use)
1. tcp    (makes sense)
2. but    (apparently we disagree a lot)
3. internet
4. there
5. e2e    (duh)

For spam [*]
1. software  (makes sense; targeted spam)
2. please    (apparently they are more persuasive)
3. $69.95
4. free
5. viagra    (no comment)


* it took me a while to cherry-pick my spam corpus for the desired effect
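
A minimal Python sketch of the rank-for-rank dictionary described above,
assuming the simplest possible scheme: the i-th most frequent e2e word is
replaced by the i-th most frequent spam word. The function names
(ranked_vocab, build_codebook, encode, decode) are hypothetical, and the
toy vocabularies are just the two top-five lists above. This shows only
the word-level dictionary, not the encoding of a message as a combination
of whole spam messages.

```python
from collections import Counter

def ranked_vocab(corpus):
    """Return the unique words of a corpus in decreasing frequency of use."""
    counts = Counter(word for message in corpus for word in message.split())
    return [word for word, _ in counts.most_common()]

def build_codebook(legit_ranked, spam_ranked):
    """Map the i-th most frequent legitimate word to the i-th spam word."""
    if len(spam_ranked) < len(legit_ranked):
        raise ValueError("spam vocabulary too small to cover legitimate one")
    return dict(zip(legit_ranked, spam_ranked))

def encode(message, codebook):
    """Rewrite a legitimate message using spam vocabulary only."""
    return " ".join(codebook[word] for word in message.split())

def decode(cover, codebook):
    """Invert the codebook to recover the legitimate message."""
    inverse = {spam: legit for legit, spam in codebook.items()}
    return " ".join(inverse[word] for word in cover.split())

# Toy vocabularies: the top-five lists from this message (assumed ranks).
E2E_TOP = ["tcp", "but", "internet", "there", "e2e"]
SPAM_TOP = ["software", "please", "$69.95", "free", "viagra"]
CODEBOOK = build_codebook(E2E_TOP, SPAM_TOP)

if __name__ == "__main__":
    cover = encode("but there tcp", CODEBOOK)
    print(cover)
    print(decode(cover, CODEBOOK))
```

Since spam has roughly ten times as many unique words as e2e, a
one-to-one codebook like this leaves most of the spam vocabulary unused;
both ends would also need to agree on the same ranked lists.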
