Automatic Threat Identification
As someone who works in technology, the part that disturbed me the most was that the suspect posted a video on YouTube titled ‘Elliot Rodger’s Retribution’ 24 hours before the killing spree. In the video, the suspect vents his frustration and describes, rather graphically and beyond any reasonable doubt, his deadly intentions.
Just a few weeks after a 14-year-old was arrested over a prank threat to an airline, it would seem natural to expect that the ‘Elliot Rodger’s Retribution’ video should provide enough evidence to trigger some precautionary measure.
However, there’s a fundamental difference between those two cases. The Twitter prank was directed at humans, namely those reading the tweets sent to American Airlines, whereas Elliot Rodger’s video wasn’t directed at anyone in particular.
Only a few cold-shouldered commenters could see it before it was too late. And the YouTube servers.
If you can bear to take a quick look at the video, the cues that hint at “negativity” are so overwhelming, so bluntly stated, that it would seem natural to think that, three years after a computer took down the best of us humans at “Jeopardy!”, the same computer should now be employed full time peering through social media to detect, rather than play, real jeopardy.
I have worked as an engineer for a couple of search engines and have some basic understanding of automatic speech recognition and text mining, so I figured I would spend an afternoon testing my hunch: could current technology have helped flag the suspect’s video, and thus potentially helped thwart his plan?
SENTIMENT ANALYSIS ON ‘ELLIOT RODGER’S RETRIBUTION’ VIDEO
Detecting “negativity” in video content requires two steps:
- convert the video audio track into text;
- use some tool to automatically infer emotions from the extracted text.
The first step is called “Speech Recognition” (SR) or “Speech to Text”. The research in SR has roots in the dawn of computer science, and recent advances have made possible turning SR into widely adopted consumer products such as Siri or Windows Speech Recognition.
YouTube has a built-in SR system for automatically captioning uploaded videos, based on the Google Speech API.
That’s right: YouTube has been automatically creating transcripts from videos since 2009.
The quickest way to get the transcript automatically generated by youtube is to use the great youtube-dl. The following command downloads the .srt (subtitles) for the video, and saves the audio track in .wav format, in case we want to try some other SR software later.
youtube-dl --write-auto-sub --sub-format srt \
  --extract-audio --audio-format wav --audio-quality 0 \
  '<video-url>'
The .srt file for the ‘Elliot Rodger’s Retribution’ video can be found here, and it looks like this:
00:00:42,480 --> 00:00:46,510
girls gave their affection
00:00:46,510 --> 00:00:50,960
sec two other men
00:00:50,960 --> 00:00:55,360
but never to me and 22 years old
00:00:55,360 --> 00:00:59,170
still virgin never even kissed a girl
00:00:59,170 --> 00:01:02,550
have been to college
00:01:02,550 --> 00:01:06,720
two and a half years more than that
The timestamps you see allow the captioning to be synched with the video. Let’s extract the bare text:
cat threat.srt | egrep -v "^[0-9]" \
  | tr '\n' ' ' | tr -C -d "a-zA-Z " | tr -s ' '
This is the result:
Not great. But not too bad either. You can already see quite a few negative keywords there: revenge, rejection, torturous, punish, slaughtering…
The sub-par quality of the extracted text stems from the fact that speech recognition is not a trivial task. That is especially true when the recognition system can’t be trained beforehand on the voice to be recognized, as in this case. I explored some other solutions to try to achieve better extraction accuracy. Below I describe the attempt with one such alternative, the Google Speech API v2.
Google Speech API V2
The Google Speech API v2 was released unofficially a few days ago, and it is still undocumented. To reproduce my results, follow the steps described here, after getting a developer key here (make sure you follow the extra steps described here to enable the Speech API in your API list).
Then, install ffmpeg (brew install ffmpeg on Mac) and convert your .wav file into .flac at 44100 Hz:
ffmpeg -i threat.wav -ss 00:03:00 \
  -ar 44.1k -ac 2 -y -t 20 threat.flac
The previous command takes a 20-second chunk starting at the 3rd minute, for the sake of explanation. The Google Speech API doesn’t cope well with longer files, so you need to script some chunking logic and submit the chunks one by one (making sure you don’t exceed the 50 calls/day cap). When you finally submit the chunk:
curl -s -X POST --data-binary @threat.flac \
  --header 'Content-Type: audio/x-flac; rate=44100;' \
  'https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=<YOUR_KEY>'
The result doesn’t seem to be any better than the YouTube transcripts:
The correct text, according to this human-curated transcription, should read:
true alpha male. [laughs]
You can see how the recognition software creatively assigned some meaning to the laughs.
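The chunk-and-submit loop mentioned above can be sketched as follows. This is a sketch under my own assumptions: the duration probe via ffprobe, the chunk file names, and the `<YOUR_KEY>` placeholder are additions of mine, not part of the original commands.

```shell
#!/bin/bash
# Probe the total duration of the audio track, truncated to whole seconds.
DURATION=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 threat.wav | cut -d '.' -f 1)
CHUNK=20

# Cut a 20-second .flac chunk at each offset and submit it to the Speech API.
for ((start=0; start<DURATION; start+=CHUNK)); do
  ffmpeg -loglevel error -i threat.wav -ss "$start" -t "$CHUNK" \
    -ar 44.1k -ac 2 -y "chunk_$start.flac"
  curl -s -X POST --data-binary "@chunk_$start.flac" \
    --header 'Content-Type: audio/x-flac; rate=44100;' \
    "https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=<YOUR_KEY>"
done
```

Keep the 50 calls/day cap in mind when sizing the chunks: at 20 seconds per call, 50 calls cover less than 17 minutes of audio per day.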
Sentiment Analysis Tools
After trying a few more speech recognition programs, the YouTube transcripts seemed to be the most accurate.
Now the question is, could some automatic method identify that amorphous text blob as bearing some ominous meaning?
The area of computer science and linguistics that studies the automatic identification of emotions in text is called sentiment analysis (or opinion mining). Sentiment analysis applies machine learning techniques to text analysis to derive the polarity of a given text (positive/neutral/negative).
The rise of social media in recent years has fostered a flurry of research around the topic (and links thereafter).
Despite the thriving academic research on the topic, though, not many readily available software packages for accurate sentiment analysis can be found online. See here for a comprehensive list of the available tools.
The commercial tool that seems to yield the best results on the text we’re considering is provided by Lexalytics. This is the result of running their web demo on the extracted text:
The document sentiment is identified as negative, with a polarity score of -0.201 (on a scale from -1 = negative to 1 = positive). The absolute value of the detected negativity might not seem very high but, as we’ll see in the next section, it is quite telling when compared with what the same tool derives for other videos.
Lexalytics is also able to detect the themes present in the text, and it seems pretty accurate in identifying the negative ones:
Another tool, Textalytics, suggests the following categorization for the ‘Elliot Rodger’s Retribution’ transcript (with the source set to “blog”):
- social issue > family > courtship (relevance: 89)
- arts, culture and entertainment > customs and tradition (relevance: 83)
I have experimented with some other text mining and sentiment analysis software, but none provided better results than Lexalytics. Most of the tools I tried seemed unable to deal properly with the lack of grammar and punctuation in the text extracted during the speech recognition step. Of all the packages I tried, TextBlob (Python) deserves a mention as the most promising and easiest to use.
Sentiment on Popular Youtube Videos
To put the polarity score returned by Lexalytics for ‘Elliot Rodger’s Retribution’ into context, I ran the same sentiment analysis on the 200 most popular videos on YouTube (as downloaded on May 24, 2014). youtube-dl came to the rescue again, as it can deal with playlists as well:
youtube-dl --write-auto-sub --sub-format srt \
  --extract-audio --audio-format wav \
  '<playlist-url>'
The command above downloads the 200 most popular YouTube videos. In my case, youtube-dl was able to find automatically extracted subtitles for only 60 of them.
Performing the lexalytics sentiment analysis on each of the 60 subtitle files downloaded involves a few steps:
- go to http://www.lexalytics.com/web-demo and submit some text until they ask you to register;
- fill out and submit the registration form;
- open the Chrome Developer Console in the “Network” panel and submit some new text;
- you’ll see the text is first submitted via ajax to the url http://www.lexalytics.com/demo/ajax/process; results are then fetched from http://www.lexalytics.com/demo/ajax/result using an id and config_id returned by the first request;
- grab both requests by using “Copy as cURL” from Chrome’s contextual menu;
- use the information retrieved in the previous step in the script below.
This script assumes you have all the .srt subtitle files in the same folder. Fill in the blanks (YOUR_ADDITIONAL_PARAMETERS, YOUR_CONFIG_ID) with the matching parts of the full requests as copied from the Chrome Developer Console in the “Copy as cURL” step above.
Please don’t abuse the script below; be nice to Lexalytics.
for file in *.srt; do
  echo "$file"
  content=$(cat "$file" | egrep -v "^[0-9]+" | tr '\n' ' ' | tr -C -d "a-zA-Z " | tr -s ' ')
  result=$(curl -s 'http://www.lexalytics.com/demo/ajax/process' <YOUR_ADDITIONAL_PARAMETERS> --data "language=English&text=$content&data-mode=document&config-id=<YOUR_CONFIG_ID>&sc=300" --compressed)
  id=$(echo $result | cut -d '"' -f 6)
  config=$(echo $result | cut -d '"' -f 14)
  curl -s "http://www.lexalytics.com/demo/ajax/result?success=true&id=$id&mode=document&config_id=$config&language=English&sc=816" <YOUR_ADDITIONAL_PARAMETERS> > "$file".result
done
Once the script above completes, you should have a few .result files in your folder. Each .result file corresponds to a .srt file and contains the JSON returned by Lexalytics. If you just want to extract the polarity score from each of them, do:
for file in *.result; do cat "$file" | python -m json.tool \
  | grep -A 1 '"sentiment_polarity"'; done \
  | grep "score" | cut -d ':' -f 2 | tr -d ',' | sort -nr
which returns the sorted list of the sentiment scores for the subtitle files:
The list above tells us that the -0.201 polarity score previously identified for ‘Elliot Rodger’s Retribution’ would have ranked 4th for negativity out of 60, i.e. in the 5th percentile of the distribution of polarity scores for YouTube’s most popular videos.
The above analysis provides evidence that available technology could be utilized to help automatically detect, and therefore act upon, potentially dangerous content in an online video.
Clearly, the above exercise is just a proof of concept and many issues remain to be addressed before one can claim any practical applicability.
For one, as with any anomaly detection system, the rate of false positives can quickly make such a system impractical. A quick back-of-the-envelope calculation shows that even if only 5% of the 100 hours of video uploaded to YouTube every minute were flagged as dangerous and escalated to human vetting, YouTube would need 300 people watching videos 24/7 to confirm the dangerousness.
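That estimate is easy to reproduce with shell arithmetic, using the figures quoted above:

```shell
# 100 hours of video uploaded per minute, 5% flagged for review.
# Minutes of flagged footage arriving per wall-clock minute equals
# the number of full-time reviewers needed to keep up.
echo $((100 * 60 * 5 / 100))   # 300
```

Three hundred reviewers, around the clock, just to keep pace with the flagged stream.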
On the other hand, it’s also true that all of the above is the result of an afternoon hack leveraging only publicly available tools, with no additional effort put into improving accuracy. In some sense, what I described above is the worst possible baseline for comparison, a toy example. Many simple improvements could dramatically increase accuracy, for instance:
- train the speech recognition software on other videos from the same person (Elliot Rodger’s YouTube channel, in this case); that would radically improve the speech recognition step;
- infer pauses in speech and punctuation, to improve the grammar in the final result;
- use a sentiment analysis tool resilient to the poor grammar that the SR step might still generate.
In addition, while it serves the purpose of illustrating the concept well, a generic “polarity score” would at best constitute only one of the possible signals that the detection system would use. A smarter detection system would:
- be trained on hate speech and restricted to detecting similar content, rather than generic “negative” concepts;
- not simply categorize the extracted concepts/themes as positive/negative, but also weight them according to some measure of “dangerousness”;
- take into account other social indicators, such as comments from other users;
- consider video-specific features such as length, facial expressions, voice pitch, etc.
I fervently hope to see a lot of research on the topic in the forthcoming years. I’d like to see the bleeding edge tools for video content analysis paired with the state of the art of social media sentiment analysis. I’d also like to see such research converted into consumer technology, and deployed on a large scale.
Scaling such a technology would be challenging in its own right: even if detection accuracy were made close to 100%, the responsiveness of the system could still be an issue, that is, the time it takes from when a user uploads dangerous content to when the detection system acts on it. The YouTube video pipeline processes gigabytes of data every second, and it’s not unexpected that most of the video analysis processing is batched rather than performed in real time, which would ultimately increase the time to reaction.
Finally, even if the video were correctly handed over to the authorities within minutes of its upload, the actions taken might not help avert the ensuing crime, as already happened in the specific case of Elliot Rodger, who had been visited by the police in April, acting on the complaints of his mother, who was alarmed by videos he had posted online.
The effectiveness of intervention by the authorities is a very complex issue that would require much more than a blog post to discuss. What this analysis tries to make the case for is that, all things being equal, such effectiveness could likely be improved by utilizing currently available technology.