UserAgentStringClassification

From MozillaWiki

Synopsis

User agent strings are messy (see http://www.nczonline.net/blog/2010/01/12/history-of-the-user-agent-string/).

While web applications really shouldn't use them to determine what features of a page will be delivered to the agent, they are one of the few resources available to website administrators for understanding where their traffic is coming from.

Several sites attempt to classify and categorize the user agent strings they see (http://www.useragentstring.com/; http://user-agent-string.info/).  All of them rely on curated data, though: the administrators review new unclassified strings that have enough usage to be interesting and write rules that classify them.

I don't feel this problem is really manageable by pure curation or parsing rules, though.  New search bots, spiders, and browsers are released every day.  While most of these follow the loose specification of how a UA string is supposed to be formatted, there are also unpleasantly frequent cases where an add-on or plug-in attempts to inject its own token and messes up the string (sometimes splitting a legitimate token, sometimes failing to prepend a space, or even adding its token many, many times).  Mobile browsers are also a problem, especially on feature phones, where the UA string will often have dozens of tokens describing different capabilities and characteristics of the mobile device.
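To make the "loose specification" concrete, here is a minimal tokenizer sketch in Python. It assumes the common convention of `product/version` tokens plus parenthesised comments whose fields are separated by semicolons. The names `tokenize` and `dedupe`, and the `ExampleToolbar` token, are hypothetical and used only for illustration. Nested parentheses and split tokens are deliberately not handled, which is exactly the kind of gap that hand-written rules keep running into.

```python
import re

# Either a parenthesised comment, or a product token with an optional /version.
# Assumption: comments are not nested; real-world strings sometimes break this.
TOKEN_RE = re.compile(r"\(([^)]*)\)|([^\s()/]+)(?:/(\S+))?")

def tokenize(ua):
    """Split a UA string into ('product', name, version) and ('comment', text) tuples."""
    tokens = []
    for m in TOKEN_RE.finditer(ua):
        if m.group(1) is not None:
            # Comment fields are conventionally separated by semicolons.
            for part in m.group(1).split(";"):
                part = part.strip()
                if part:
                    tokens.append(("comment", part))
        else:
            # Version is None when the product token has no "/version" suffix.
            tokens.append(("product", m.group(2), m.group(3)))
    return tokens

def dedupe(tokens):
    """Drop repeated tokens (e.g. an add-on injecting itself several times), keeping order."""
    seen = set()
    unique = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            unique.append(tok)
    return unique

if __name__ == "__main__":
    noisy = ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; "
             "ExampleToolbar; ExampleToolbar; ExampleToolbar)")
    for tok in dedupe(tokenize(noisy)):
        print(tok)
```

A tokenizer like this only normalizes the input. It does not decide what the tokens mean, which is where the curation burden described above comes from.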

Our existing rules-based UA parsing system, which I wrote a few years back, is busted.  It wasn't flexible enough to handle new versions of Firefox like Fennec, and worse, when IE9 and Firefox 4 come out, their UA strings will change enough to require new parsing logic anyway.

I believe that an ML approach could provide a strong baseline, able to categorize and classify both existing user agent strings and ones that haven't come into existence yet, and giving a confidence score that can be used to determine whether a classification should be used in analysis.  Also, storing and using the attributes instead of the raw strings increases user privacy, because it eliminates most of the "fingerprint" that can be gathered from the full dirty UA string.
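As a rough illustration of what "classification with a confidence score" could look like, here is a minimal sketch of one possible baseline: a multinomial naive Bayes classifier over lowercase word tokens, written in plain Python. The choice of naive Bayes, the `features` function, the `NaiveBayes` class, and the four hand-labelled training strings are all assumptions made for this example. The page does not prescribe a particular model, and a real system would need far more training data and richer features.

```python
import math
import re
from collections import Counter, defaultdict

def features(ua):
    # Lowercased alphabetic runs; version numbers are dropped on purpose
    # so that unseen versions do not produce unseen features.
    return re.findall(r"[a-z]+", ua.lower())

class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, ua, label):
        self.label_counts[label] += 1
        for tok in features(ua):
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)

    def classify(self, ua):
        """Return (best_label, confidence), confidence being the normalised posterior."""
        toks = features(ua)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in toks:
                # Laplace (add-one) smoothing so unseen tokens do not zero the score.
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Subtract the max before exponentiating for numerical stability.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

# Toy, hand-labelled training set (illustrative only).
model = NaiveBayes()
model.train("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) "
            "Gecko/20100914 Firefox/3.6.10", "browser")
model.train("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)", "browser")
model.train("Mozilla/5.0 (compatible; Googlebot/2.1; "
            "+http://www.google.com/bot.html)", "bot")
model.train("msnbot/2.0b (+http://search.msn.com/msnbot.htm)", "bot")

if __name__ == "__main__":
    unseen = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Firefox/3.6.8"
    print(model.classify(unseen))
```

The confidence value is what would let an analysis pipeline set a threshold: classifications below it could be kept out of reports or queued for manual review instead of being trusted blindly.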

I would like to see an open source project that could take a UA string and parse the following attributes out of it.  This would be incredibly useful to the metrics team in the short term, and I believe it would be widely appreciated and used beyond that.


User Agent String Classification Dimensions

Agent

type
browser, bot, spider, worm, cloaked, corrupted

Engine

What about agents that report more than one engine layer, such as Chrome, which reports both WebKit and Safari tokens?

Name
Gecko, Trident, WebKit, Java, ???

Version

type
alpha, beta, rc, release
number
1.0, 4.0.7
build
201009231313

Platform

There are several aspects to platform for which I haven't yet decided on the best hierarchy or labels.  CPU is frequently reported in UAs, but there is also the concept of the device type (phone, tablet, desktop, game platform...).

CPU
Intel, ARM, AMD, ?
arch
32bit, 64bit, ?

Operating System

There are several aspects to OS for which I haven't yet decided on the best hierarchy or labels.  There is a super type, a more specific type, and also an OS version.  Where do you distinguish Android from Linux, and where do you record the Android version?

Name
Windows, Linux, Mac
Version
NT4, XP, Vista, Win7, X11, Android 1.5, Android 2.2

Browser

There is some overlap between family and engine, but I am looking for a way to distinguish Firefox and Camino from Thunderbird.  Also, if a browser has changed engines, it might still be good to keep the new and old versions associated by something other than name.

Family
Firefox (what is a better name for this family?), IE, Opera
Name
Minefield, Shiretoko, Opera cloaked as Firefox or IE, SeaMonkey, Thunderbird, Camino

Version

type
alpha, beta, rc, release
number
1.0, 4.0.7
build
201009231313
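
The dimensions above could be captured in a record along the following lines. This is only a sketch under assumptions: the class and field names are hypothetical, the nesting (a Version under both Engine and Browser) is one reading of the outline, `device` reflects the still-open device type question, and `confidence` is added to carry the classifier score discussed in the synopsis. Every field is optional because many UA strings report only some of these attributes. The confidence value in the example is arbitrary.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class Version:
    type: Optional[str] = None     # alpha, beta, rc, release
    number: Optional[str] = None   # e.g. "4.0.7"
    build: Optional[str] = None    # e.g. "201009231313"

@dataclass
class Engine:
    name: Optional[str] = None     # Gecko, Trident, WebKit, Java, ...
    version: Version = field(default_factory=Version)

@dataclass
class Platform:
    cpu: Optional[str] = None      # Intel, ARM, AMD, ...
    arch: Optional[str] = None     # 32bit, 64bit, ...
    device: Optional[str] = None   # phone, tablet, desktop, ... (open question)

@dataclass
class OperatingSystem:
    name: Optional[str] = None     # Windows, Linux, Mac
    version: Optional[str] = None  # XP, Win7, Android 2.2, ...

@dataclass
class Browser:
    family: Optional[str] = None   # Firefox, IE, Opera
    name: Optional[str] = None     # Minefield, SeaMonkey, Camino, ...
    version: Version = field(default_factory=Version)

@dataclass
class Classification:
    agent_type: Optional[str] = None  # browser, bot, spider, worm, cloaked, corrupted
    engine: Engine = field(default_factory=Engine)
    platform: Platform = field(default_factory=Platform)
    os: OperatingSystem = field(default_factory=OperatingSystem)
    browser: Browser = field(default_factory=Browser)
    confidence: float = 0.0           # assumed: score from the classifier

# Hand-filled example for a Firefox 3.6.10 UA string on Windows 7.
example = Classification(
    agent_type="browser",
    engine=Engine("Gecko", Version(number="1.9.2.10", build="20100914")),
    os=OperatingSystem("Windows", "Win7"),
    browser=Browser("Firefox", "Firefox", Version("release", "3.6.10")),
    confidence=0.97,  # arbitrary illustrative value
)

if __name__ == "__main__":
    print(asdict(example))
```

Storing a record like this instead of the raw string is also what gives the privacy benefit mentioned in the synopsis, since the add-on tokens and other incidental details never reach the analysis data.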