Purpose of the Project

To figure out a system for data mining the support forum and gathering insights about the top issues in SUMO.

Why Carrot2?

We were looking for an open source tool that would take data from our support forum database and text mine it so that we get meaningful information out of it. We currently get 3000 posts each week, and manually reading them to figure out the top issues can be very exhausting. After asking around and searching the web for open source text mining tools, we settled on Carrot2.
Carrot2 is an open source tool that can automatically cluster a collection of documents into thematic categories. Carrot2 is available in several forms; for our project the best option was Carrot2 Workbench, since we could feed it data from our database. It is important to note that Carrot2 only works properly if the input file is in the Carrot2 XML format. In that file every question is called a document, and each document has a title (the title of the question) and a snippet (the content of the question).

When should we use Carrot2?

It gives the best results with a small set of documents (a few hundred) and when the file provided is about one particular issue. To get all or most of the documents related to a particular issue, it is recommended to select them with the SQL workbench first and then use Carrot2 to cluster them, to get insight into the reasons behind that issue.
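As a rough sketch of that first selection step, the Python snippet below pulls the questions that mention a keyword within a time span. The table name questions, its columns (title, content, created) and the sample rows are made up for illustration and are not the real SUMO schema; an in-memory SQLite database stands in for the forum database.

```python
import sqlite3

def fetch_questions(conn, keyword, start, end):
    """Return (title, content) pairs that mention `keyword` within a date range."""
    pattern = f"%{keyword}%"
    # "questions", "title", "content" and "created" are hypothetical names.
    rows = conn.execute(
        """
        SELECT title, content
        FROM questions
        WHERE created >= ? AND created < ?
          AND (title LIKE ? OR content LIKE ?)
        ORDER BY created
        """,
        (start, end, pattern, pattern),
    )
    return rows.fetchall()

# Tiny in-memory stand-in for the forum database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (title TEXT, content TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?)",
    [
        ("Blurry icons", "Toolbar icons look fuzzy", "2011-06-01"),
        ("Tabs not saved", "Session is lost on restart", "2011-06-02"),
        ("Text is blurry", "Fonts are hard to read", "2011-06-03"),
    ],
)
blurry = fetch_questions(conn, "blurry", "2011-06-01", "2011-06-08")
print(blurry)
```

The same SELECT could be run directly in the SQL workbench; the Python wrapper is only there to make the example self-contained.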

Steps

1. Use the SQL workbench to get all the questions (title and content) that were posted in the desired time span.
2. Export the results in XML format.
3. You can choose any of the algorithms, but Lingo works best.
4. Modify the exported file to match the Carrot2 format. Follow these steps:

a. Find and replace all <ROW> and </ROW> with <document> and </document> respectively.
b. Add the following three lines at the top:

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>typenamehere</query>

c. Make sure the closing tag </searchresult> is at the bottom of the file.

d. Make sure each document keeps the same order: <title> followed by <snippet>.

e. You can’t make up your own tags in the XML file; Carrot2 will show an error. Follow the same naming convention as their format: http://download.carrot2.org/head/manual/index.html#section.architecture.input-xml

5. Open Carrot2, set the Source to XML, enter the path to your file in the XML resource field, and click Process.
6. You will see a list of clusters and their corresponding documents.
7. To improve the cluster quality, use the attributes on the right side to get more informative and meaningful clusters.
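
The manual edits in step 4 could also be scripted. The sketch below is one possible way to do it in Python, not part of the original workflow. It assumes the export wraps each question in a <ROW> element with <title> and <content> children under a <RESULTS> root; those three names, the function name and the sample data are assumptions, so adjust them to whatever your export actually contains.

```python
import xml.etree.ElementTree as ET

def to_carrot2_xml(export_xml, query):
    """Convert exported <ROW> records into a Carrot2 <searchresult> document."""
    source = ET.fromstring(export_xml.strip())
    result = ET.Element("searchresult")
    ET.SubElement(result, "query").text = query
    for row in source.iter("ROW"):
        doc = ET.SubElement(result, "document")
        # Keep the required order: <title> first, then <snippet>.
        # "title" and "content" are assumed column names in the export.
        ET.SubElement(doc, "title").text = (row.findtext("title") or "").strip()
        ET.SubElement(doc, "snippet").text = (row.findtext("content") or "").strip()
    body = ET.tostring(result, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Made-up export with a hypothetical <RESULTS> root element.
sample_export = """
<RESULTS>
  <ROW><title>Blurry icons</title><content>Icons look fuzzy &amp; small</content></ROW>
  <ROW><title>Tabs not saved</title><content>Session is lost</content></ROW>
</RESULTS>
"""
carrot_xml = to_carrot2_xml(sample_export, "blurry")
print(carrot_xml)
```

Building the file with an XML library rather than find and replace also takes care of escaping characters such as & in the question text.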

Carrot2 with a small set of documents (works best with a known topic)

Works well with small to medium sized sets of documents. The optimum range for Carrot2 is between 100 and 500 documents.
Gives good and meaningful clusters if we have questions about a known topic. For example, when I search for questions related to blurry, I get:
blurry icons, blurry pictures, hard to read, hardware acceleration, cleartype enabled, etc.
The only thing to keep in mind is that achieving these results requires some fine tuning of the attributes.
A small set of documents is easy and fast to cluster. It doesn’t require much memory, and tuning is faster too. We also had a small “other topics” group, which helped because there was less to read manually in that group.
Another good thing about working with a small set is that documents are usually placed in the correct clusters. We see a gradual increase in inefficient and unnecessary clustering as the set of documents grows. It’s always a good idea to have Exact phrase assignment checked in the Preprocessing attributes. This makes sure that only documents containing the cluster name somewhere in their text are clubbed together.
Almost every time you cluster documents (no matter what the size), you will see a cluster called firefox/FF 4.0/FF/mozilla etc. This is natural, since those are the names of the product and the company, and most people tend to include them in their questions. However, you can avoid getting clusters with these names by tuning the Maximum Word Document Frequency. Setting this attribute to, for example, 0.4 will filter out all words that appear in more than 40% of the documents. This helps remove a few variations of firefox, if not all.
If the known topic is a commonly used word, for example “tab”, it can be difficult to cluster properly because a lot of documents overlap. It also results in several inefficient clusters. However, if the topic is something more descriptive, for example “saving tabs”, it will generate a better result.
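
To make the Maximum Word Document Frequency setting more concrete, here is a small Python sketch of what a 0.4 threshold means. It only illustrates the idea described above and is not how Carrot2 implements it; the function name, the tokenization and the sample documents are all made up.

```python
import re
from collections import Counter

def frequent_words(documents, max_word_df=0.4):
    """Return the words that appear in more than max_word_df of the documents."""
    df = Counter()
    for text in documents:
        # Count each word at most once per document.
        df.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    limit = max_word_df * len(documents)
    return {word for word, count in df.items() if count > limit}

# Made-up sample questions.
docs = [
    "firefox icons are blurry",
    "firefox crashes on startup",
    "firefox tabs not saved",
    "blurry text in menus",
    "cannot print page",
]
print(sorted(frequent_words(docs)))
```

With these five documents only firefox is over the threshold (3 of 5 documents, more than 40%), so it is the only word that would be ignored; blurry appears in exactly 40% and stays.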

Carrot2 with most-voted documents/questions

The most voted questions usually cover a variety of different issues, with very few of the same kind.
When I tried clustering the most voted issues, questions/documents that were similar (for example fvd/invalid security certificate) got clubbed together, but many solo clusters were formed (for example Babylon/mp3tube). Even though many people might have issues with Babylon, it still forms a solo cluster because no other documents in the most voted set talk about Babylon. However, if you search the database for documents specifically about Babylon and run Carrot2 on that set, it will cluster more efficiently.
This results in small clusters, and many clusters contain just 1 document since the issue is completely different from the others.
We can increase the minimum cluster size, but that leaves only a few meaningful clusters.
However, the few meaningful clusters we do get are helpful, as we can figure out which issue is the most talked about amongst the most voted questions.

Carrot2 with a large set of data (unknown topic)

Carrot2 is designed mainly for small to medium collections of documents. It does cluster up to ~8500 documents, but it becomes slow and takes a lot of memory as the number of documents increases. It takes 5-10 minutes with around 8500 documents, and this happens each time you make any change/tuning to your clusters.
This is the most problematic set of documents, as it is not filtered in any way. There may or may not be issues of similar kinds. It’s all uncertain.
This set produces a lot of clusters, and the tuning that can be done is limited.
The biggest issue is that commonly used words, for example “print” or “tabs”, form very big clusters. Such a cluster basically takes in all the documents that contain that word, and finding the main issue with tabs/print requires the user to manually read all the documents in that cluster.
This also creates a very big “other topics” group, adding to the manual work required of the user.
Using the Lingo algorithm on a very large set of documents tends to be slow and memory hungry. Using STC helps if you want it to be faster, but personally I prefer Lingo as it provides more meaningful clusters.
Carrot2 doesn’t allow you to name a cluster and then collect documents based on that name. So if you are looking for something specific, especially in a large dataset, it will be difficult unless there are enough documents about that issue for Carrot2 to club together.
Carrot2 doesn’t allow the user to search for a particular cluster either. This can be a limitation with a large set of documents, as many clusters are formed and you have to manually go through the entire cluster list to find the cluster, and its documents, that you were looking for.
It works better if you provide a snippet. For the SUMO forum project this helped us, since forum questions already have a title and content/snippet, but with some other database we might have to add snippets manually to get good clustering. There is no fixed tuning setup. The best tuning changes with the number of documents provided.
A large set of unknown data doesn’t give really good results. However, if you input only the week’s questions that have more than 5 votes, it should give a better result.

Pointers about Carrot2

People often use different words for the same issue, like restore/save tabs, blurry/fuzzy, etc. There is no specific way in Carrot2 to club these words together.
Carrot2 does not always use the cluster names that you might be looking for. For example, all issues related to fvd were grouped together under the cluster name “invalid security certificate”. You cannot change the cluster names according to your preference.
It does a good job of clustering questions that contain script/code together, but it can’t break them down further into different kinds of script questions.
When people misspell words, Carrot2 is not able to recognize it. There is no way to tell Carrot2 that crash/crsh/cras are all the same. This results in incorrect clustering of a few documents.
Carrot2 will cluster documents and give you a better idea of what the issues are, so that you can look out for questions of those kinds in the forum. However, it does not provide a reliable count of documents with a specific problem, because documents overlap between differently named clusters. You also need to manually sort the documents in Other Topics into clusters.
Filtering out firefox 4.0+ issues from firefox 3.6 issues doesn’t work well, as many questions filed under firefox 3.6 are actually about 4.0. Either people don’t know their version or they are asking from another computer.
Filtering for the questions with the highest number of replies isn’t efficient either, since most of these questions are about how firefox 4 sucks/crashes/freezes etc. Rather than discussing the issues they have, most people curse/complain.
There is not much difference between Most Requested and Most Requested + No replies. It is better to take most voted/Most Requested only.
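
The pointer above about overlapping documents can be illustrated with a short Python sketch. The cluster names come from the blurry example earlier on this page, but the document ids and the function are made up; the point is only that adding up cluster sizes counts some documents more than once.

```python
def overlap_report(clusters):
    """Compare summed cluster sizes with the number of distinct documents."""
    summed = sum(len(ids) for ids in clusters.values())
    distinct = len(set().union(*clusters.values()))
    return {"summed": summed, "distinct": distinct, "double_counted": summed - distinct}

# Hypothetical clusters, each holding the ids of the documents assigned to it.
clusters = {
    "blurry icons": {1, 2, 3},
    "hardware acceleration": {2, 3, 4},
    "hard to read": {3, 5},
}
report = overlap_report(clusters)
print(report)
```

Here the cluster sizes add up to 8 although there are only 5 distinct documents, which is why the per-cluster numbers should be read as a rough indication rather than a count of affected users.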

Support/Intern/2011/Brinda/Carrot2
