Why aren't poisoning attacks solved by the network generalizing?
-Jacob: the influence of a single point is ~d/n, so only a ~1/d fraction of points is needed
-Nati: this isn't true, influence is 1/n (0-1 loss)
-Jacob: that assumes 0-1 loss; it isn't true for hinge loss [see editorial note at end of notes]
-Percy: the attacker can go after specific test inputs
-Alex: example: getting past a passport check
-Jacob: agrees with Nati that good 0-1 loss on train => good 0-1 loss on test
-Jacob & Nati both agree that:
 1. good 0-1 loss on train => good 0-1 loss on test
 2. it's at least possible for an adversary to make training hard
-two possibilities for #2:
 -add data to fool a specific algorithm
 -or make training computationally hard for any algorithm
-editorial note: both of these have been shown to be possible; see the work of Servedio & Long for the first and Vitaly Feldman for the second

Dan: two issues, let's focus on one:
-targeted attacks
-causing the training algorithm to fail

Possible test case:
-can we cause a specific person's face to be mis-classified?
-note: Dawn Song's group is working on this
-formal question: suppose we control k adversarial data points and want to mess up k' people's faces; how big does k need to be relative to k'? [toy simulation at end of notes]
-when k' = 1, it seems k can be ~5 and still succeed
-possible defenses: some sort of pre-filtering? [sketch at end of notes]
-Percy: this seems hard in NLP tasks with lots of rare words, etc.
-possible solution that was brought up: move into an embedding space that gets rid of the sparsity issues

Alex: maybe some attacks are too strong, and we need to curtail our expectations
-e.g. in the rare-words case, just drop the rare words [sketch at end of notes]
-editorial note: it would be really interesting if we could do this automatically rather than by hand

Stephen Wright:
-does sanitization for astronomical data
-the issue is that sanitization might throw out good data as well
-but here at least you know that something was thrown out, rather than failing silently

Tara Javidi: what about different possible output formats?
-e.g. being able to say "don't know", or list decoding [sketch at end of notes]

Martin Abadi: give some measure of support for the prediction
-Jacob: uncertain whether this can be done in a meaningful way

Percy: maybe there's a systems solution to this
-things people suggested here: a measure of reputation for accounts / data sources / etc. [sketch at end of notes]
-maybe a good abstraction is something like: X% highly trustworthy, Y% not clearly trustworthy, Z% fully adversarial [how to define the middle class?]

Tudor: is it realistic to assume that the adversary controls the label?
-two settings: (1) labels are always correct [labeled by hand], (2) labels are created automatically
-in case (1), it's something like adversarial distributional shift
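
editorial note: a rough way to reconcile the d/n vs. 1/n exchange (my framing, not anyone's exact argument). For a smooth, strongly convex surrogate loss (e.g. regularized hinge or logistic), the standard influence-function approximation for adding or removing one training point z_i is

  \hat\theta_{-i} - \hat\theta \approx -\frac{1}{n} H_{\hat\theta}^{-1} \nabla_\theta \ell(z_i, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{j=1}^n \nabla^2_\theta \ell(z_j, \hat\theta),

so one point moves the parameters by a 1/n factor times a vector whose size can grow with the dimension d; that is one reading of "influence is d/n", and it suggests an Ω(1) change to the model needs on the order of n/d poisoned points, i.e. a 1/d fraction. For 0-1 loss, by contrast, adding or removing one point changes the empirical error by at most 1/n, which is the sense in which influence is 1/n.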
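
editorial note: the k vs. k' question above is easy to probe on toy data. The sketch below is my own illustration, using synthetic Gaussian data and logistic regression rather than faces and a deep network: it adds k mislabeled copies of a single target point to the training set, retrains, and checks when the target's prediction flips. All constants are illustrative.

```python
# Probe of the "k poisoned points vs. k' targets" question for k' = 1:
# how many mislabeled copies of one target point does it take to flip
# that target after retraining?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 20, 500

# Two Gaussian classes.
X = np.vstack([rng.normal(-0.5, 1.0, size=(n // 2, d)),
               rng.normal(+0.5, 1.0, size=(n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Target: a single class-1 point the attacker wants classified as 0.
x_target = rng.normal(+0.5, 1.0, size=(1, d))

for k in [0, 1, 2, 5, 10, 20]:
    # Poison: k copies of the target with the wrong label.
    X_p = np.vstack([X, np.repeat(x_target, k, axis=0)]) if k > 0 else X
    y_p = np.concatenate([y, np.zeros(k, dtype=int)]) if k > 0 else y
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_p, y_p)
    print(f"k={k:2d}  target predicted as {clf.predict(x_target)[0]}  "
          f"clean train acc {clf.score(X, y):.3f}")
```

The printout shows both whether the target flips and how little the clean training accuracy moves, which is the worrying combination for targeted attacks.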
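
editorial note: one concrete version of the pre-filtering / embedding-space defense discussed above: compute a per-class centroid in some embedding space and drop training points that sit unusually far from their class centroid. The embedding itself is left abstract here (for text or faces it would be some learned dense representation), and the quantile cutoff is arbitrary. As Stephen pointed out, this will also discard some legitimate data, but at least the discarding is visible.

```python
# Centroid-distance sanitization in an embedding space (sketch).
import numpy as np

def sanitize(X_emb, y, quantile=0.95):
    """Return a boolean mask keeping points within the per-class distance quantile.

    X_emb: (n, d) array of embedded training points (embedding left abstract).
    """
    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X_emb[idx].mean(axis=0)
        dists = np.linalg.norm(X_emb[idx] - centroid, axis=1)
        keep[idx] = dists <= np.quantile(dists, quantile)
    return keep

# usage:
# mask = sanitize(X_emb, y)
# X_clean, y_clean = X_emb[mask], y[mask]
```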
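
editorial note: Alex's "just drop the rare words" mitigation, made concrete. Removing tokens below a frequency threshold limits how much an attacker can hide in sparse, rarely-seen features; the threshold here is illustrative, and as noted above it would be nicer to choose it automatically.

```python
# Drop tokens that appear fewer than min_count times in the training corpus.
from collections import Counter

def drop_rare_tokens(docs, min_count=5):
    """docs: list of token lists; returns the same docs with rare tokens removed."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]
```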
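
editorial note: a minimal version of the "don't know" output Tara raised: abstain whenever the model's top class probability falls below a threshold, so low-support predictions are flagged instead of silently returned. The threshold is arbitrary, and calibrating these probabilities meaningfully under attack is exactly the part Jacob was skeptical about.

```python
# Abstaining classifier wrapper (sketch): return "don't know" for low-confidence inputs.
import numpy as np

def predict_with_abstention(clf, X, threshold=0.9):
    probs = clf.predict_proba(X)          # clf is any fitted probabilistic classifier
    top = probs.max(axis=1)
    labels = clf.classes_[probs.argmax(axis=1)]
    return [lab if p >= threshold else "don't know" for lab, p in zip(labels, top)]
```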
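
editorial note: one way to make the reputation idea operational: attach a trust weight to each data source and fit with per-sample weights, so data from clearly adversarial sources contributes little or nothing. The three trust levels mirror the X/Y/Z abstraction above; the scores themselves, and the reputation system that produces them, are assumed rather than specified.

```python
# Trust-weighted training (sketch): weight each point by its source's reputation.
import numpy as np
from sklearn.linear_model import LogisticRegression

TRUST = {"verified": 1.0, "unknown": 0.3, "flagged": 0.0}  # illustrative scores

def fit_with_trust(X, y, sources):
    """sources: list of per-example source labels keyed into TRUST."""
    weights = np.array([TRUST[s] for s in sources])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```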