Support/Intern/2011/Brinda/Carrot2

From MozillaWiki

Purpose of the Project

To work out a system for data mining the support forum and gathering insights about the top issues in SUMO.


Why Carrot2?

We were looking for an open source tool that would take data from our support forum database and text mine it so that we get meaningful information out of it. We currently get 3000 posts each week, and manually reading them to figure out the top issues is exhausting. After asking around and searching the web for open source text mining tools, we settled on Carrot2.
Carrot2 is an open source tool that automatically clusters collections of documents into thematic categories. Carrot2 is available in several forms; for our project the best option was Carrot2 Workbench, since it let us input data from our database. Note that Carrot2 only operates properly if the input file is in the Carrot2 XML format. Every question is a document in the XML file, and each document has a title (the title of the question) and a snippet (the content of the question).


When should we use Carrot2?

Carrot2 works best with a small to medium number of documents. The following sets of documents are ranked by how well Carrot2 handles them, best first:

  1. Known topic
  2. Most Voted
  3. Unknown topic

Carrot2 can be used with up to ~8500 documents. However, as the number of documents increases, it becomes slower and takes more memory. Tuning a large set of documents is also slow: tuning a single attribute takes around 5-10 minutes. It's important to note that Carrot2 clusters documents and gives you a better idea of what the issues are, so that you can look out for questions of those kinds in the forum. It does not, however, give a reliable count of documents with a specific problem, because the same document can appear under several cluster names. You also need to manually sort the documents in the Other Topics group into clusters, especially with a large set of documents.
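To see why overlapping clusters make the counts unreliable, here is a minimal Python sketch. The cluster names come from the blurry example on this page, but the document IDs and the cluster memberships are made up for illustration. It compares the sum of the cluster sizes with the number of distinct documents.

```python
def overlap_summary(clusters):
    """Return (sum of cluster sizes, number of distinct documents)."""
    total_assignments = sum(len(ids) for ids in clusters.values())
    unique_documents = len(set().union(*clusters.values()))
    return total_assignments, unique_documents


# Hypothetical clusters: each maps a cluster name to a set of document IDs.
clusters = {
    "blurry icons": {101, 102, 103},
    "hardware acceleration": {102, 103, 104},
    "cleartype enabled": {103, 105},
}

total, unique = overlap_summary(clusters)
print(f"cluster sizes add up to {total}, but there are only {unique} documents")
```

Whenever the first number is larger than the second, at least one document sits in more than one cluster, so the cluster sizes cannot be read as issue counts.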

One of the issues of working with unclustered data is that people don't use the same words for the same issue, for example restore/save tabs or blurry/fuzzy. Carrot2 doesn't provide a way to club these words together, so users have to read the clusters manually and then club them together. Similarly, Carrot2 cannot detect spelling mistakes. There is no way of telling the tool that words like crash/cras/crsh are all the same, which results in a few documents being clustered incorrectly.
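One possible workaround, which is not a Carrot2 feature and was not part of this project, is to normalize the question text before exporting it. The Python sketch below assumes a small hand-maintained synonym table and vocabulary (both hypothetical, built from the examples above) and uses the standard library's difflib to map near-miss spellings onto known words. The 0.8 cutoff is a guess and would need tuning, since a loose cutoff can merge unrelated words.

```python
import difflib
import re

# Hypothetical synonym table: variant -> preferred word.
SYNONYMS = {"fuzzy": "blurry", "save": "restore"}

# Hypothetical list of correctly spelled words we care about.
VOCABULARY = ["crash", "blurry", "restore", "tabs"]


def normalize_word(word, cutoff=0.8):
    """Map a word to its preferred form, fixing near-miss spellings."""
    word = word.lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalize_text(text):
    """Normalize every alphabetic word in the text, leaving the rest as is."""
    return re.sub(r"[A-Za-z]+", lambda m: normalize_word(m.group(0)), text)


print(normalize_text("Firefox crsh with fuzzy text"))
```

Running the titles and snippets through a step like this before building the input file should let variants such as crash/cras/crsh end up in the same cluster.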

Carrot2 doesn’t allow you to name clusters and then collect documents based on those names. So if you are looking for something specific, especially in a large dataset, it will be difficult to find unless there are enough documents about that issue for Carrot2 to club together. Also, Carrot2 does not always use the cluster names that you might be looking for. For example, all issues related to fvd were grouped together under the cluster name “invalid security certificate”. You cannot change the cluster names according to your preference.

Steps

1. Use the SQL workbench to get all the questions (title and content) that were posted in the desired time span.
2. Export the results in XML format.
3. When you later cluster the file in Carrot2, you can choose any of the algorithms, but Lingo works best.
4. Modify the exported file to match the Carrot2 format. Follow these steps:

a. Find and replace all <ROW> and </ROW> with <document> and </document> respectively.
b. Add the following three lines at the top:

<?xml version="1.0" encoding="UTF-8"?>

<searchresult>

<query>typenamehere</query>

c. Make sure you have the end tag </searchresult> at the bottom.

d. Make sure each document follows the same order: <title> followed by <snippet>.

e. You can’t use your own tag names in the XML file; Carrot2 will show an error if you do. Make sure you follow the same naming convention as their format: http://download.carrot2.org/head/manual/index.html#section.architecture.input-xml

5. Open Carrot2, set the Source to XML, enter the path to your file in the XML resource field and click Process.
6. You will see a list of clusters and their corresponding documents.
7. To improve the cluster quality, use the attributes on the right side to get more informative and meaningful clusters.
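The manual edits in step 4 could also be scripted. The following is a minimal Python sketch, not the procedure we actually used. It assumes the export wraps each question in a <ROW> element whose columns are named title and content, inside a root element called ROWSET. Those three names and the sample data are assumptions, so adjust them to match your export.

```python
import xml.etree.ElementTree as ET


def to_carrot2_xml(export_xml, query):
    """Convert a <ROW>-based export into the Carrot2 input XML layout."""
    rows = ET.fromstring(export_xml).iter("ROW")
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for row in rows:
        doc = ET.SubElement(root, "document")
        # Assumed column names: "title" and "content". Order matters:
        # <title> first, then <snippet>.
        ET.SubElement(doc, "title").text = row.findtext("title", default="")
        ET.SubElement(doc, "snippet").text = row.findtext("content", default="")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Made-up sample export with two questions.
SAMPLE_EXPORT = """<ROWSET>
  <ROW><title>Blurry text</title><content>Fonts look fuzzy after update</content></ROW>
  <ROW><title>Tabs lost</title><content>My tabs are not restored</content></ROW>
</ROWSET>"""

print(to_carrot2_xml(SAMPLE_EXPORT, "blurry"))
```

Save the returned string to a file with UTF-8 encoding and point the XML resource field at that file.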


Carrot2 with a small set of documents (works best with a known topic)

This is the best set of documents to work with. Using Carrot2 with documents on a known topic gives insight into what is causing that problem. However, if the known topic is a commonly used word, for example “tab”, it can be difficult to cluster properly because many documents overlap. It also results in several inefficient clusters.
Also, there is no specific tuning for this or any other kind of document set. It doesn't matter whether you are tuning one known topic like blurry or two known topics like blurry+crash. You will have to fine-tune the clusters every single time.

We can make the clusters better by using a more descriptive topic, for example "saving tabs" instead of "tabs". This generates a smaller set of documents and also results in more meaningful clustering.

Some more pointers for this set of documents:

  1. Works well with small to medium sets of documents. The optimum range for Carrot2 is between 100 and 500 documents.
  2. Gives good and meaningful clusters if we have questions about a known topic. For example, when I search for questions related to blurry, I get:
    blurry icons, blurry pictures, hard to read, hardware acceleration, cleartype enabled etc.
  3. A small set of documents is easy and fast to cluster. It doesn’t require much memory, and tuning is faster too. We also had a small “other topics” group, which helped because we didn’t have to manually read many documents in that group.
  4. Another good thing about working with this set of documents is that documents are usually placed in the correct clusters. We see a gradual increase in inefficient and unnecessary clustering as the set of documents grows. It's always a good idea to have Exact phrase assignment checked in the Preprocessing attributes. This makes sure that only documents containing the cluster name somewhere in their text are clubbed together.
  5. Almost every time you cluster documents (no matter what the size), you will see a cluster called firefox/FF 4.0/FF/mozilla etc. This is natural, since that is the name of the company and most people tend to include it in their questions. However, you can avoid getting clusters with these names by tuning the Maximum Word Document Frequency. Setting this attribute to, for example, 0.4 filters out all words that appear in more than 40% of the documents. This will help remove a few variations of firefox, if not all.
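To illustrate what a document-frequency cutoff does, here is a small Python sketch. It shows the idea only and is not Carrot2's implementation. The function name and the sample questions are made up. It lists the words that appear in more than a given fraction of the documents, which are the words a 0.4 setting would be expected to drop.

```python
import re
from collections import Counter


def words_above_document_frequency(documents, max_df=0.4):
    """Return the words that occur in more than max_df of the documents."""
    doc_counts = Counter()
    for text in documents:
        # Count each word once per document, however often it repeats.
        doc_counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    total = len(documents)
    return {word for word, count in doc_counts.items() if count / total > max_df}


# Made-up sample questions.
questions = [
    "firefox crashes on startup",
    "firefox tabs missing",
    "blurry text in firefox",
    "printing is slow",
    "bookmarks not restored",
]

print(words_above_document_frequency(questions, max_df=0.4))
```

Checking the real export this way before tuning can show in advance which words, such as firefox, a given threshold would remove.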


Carrot2 with most-voted documents/questions

This set of documents is good for figuring out the most popular issues among the most-voted questions. For example, 3 questions related to fvd were clubbed together, each having more than 100 "me too" votes. This helped us figure out what most people were complaining about in that particular week or month. Apart from this, however, Carrot2 has no other significant use for this set of documents.

The following is a more detailed explanation of using Carrot2 with this set of documents:

  1. The most-voted questions usually cover a variety of different issues, with very few questions of the same kind.
  2. When I tried clustering the most-voted issues, questions/documents that were similar, for example fvd/invalid security certificate, got clubbed together, but many solo clusters were formed, for example Babylon/mp3tube. Even though many people might have issues with Babylon, it will still form a solo cluster because there aren’t any other documents talking about Babylon in the most-voted document set. However, if you search the database for documents specifically about Babylon and run Carrot2 on that set, it will cluster more efficiently.
  3. This results in small clusters, and many clusters contain just 1 document because the issue is completely different from the others.
  4. We can increase the minimum cluster size, but that leaves only a few meaningful clusters.
  5. However, the few meaningful clusters we do get are helpful, as we can figure out what the most talked-about issue is among the most-voted questions.


Carrot2 with a large set of data (unknown topic)

This is the most problematic set of documents, as it is not filtered in any way. There may or may not be issues of similar kinds; it's all uncertain. The only way to make it relatively better is to make it more like the most-voted set of documents. For example, if you input only that week's questions with even just more than 5 votes, it should give a better result.

The following are the different kinds of problems that we face with a large set of documents:

  1. This set produces a lot of clusters, and the tuning that can be done is limited.
  2. The biggest issue is that commonly used words, for example “print” or “tabs”, form very big clusters. Such a cluster basically takes in all the documents that contain those particular words, and finding the main issue with tabs/print requires the user to manually read all the documents in that cluster.
  3. This also creates a very big “other topics” group, adding to the manual work required of the user.
  4. Using the Lingo algorithm on a very large set of documents tends to be slow and memory-hungry. STC would be helpful if you want it to be faster, but personally I prefer Lingo as it provides more meaningful clusters.
  5. It works better if you provide a snippet. For the SUMO forum project this was easy, since we were dealing with forum questions (which already have a title and content/snippet), but with some other database we might have to add snippets manually to get good clustering.
  6. There is no fixed tuning setup. The best tuning changes with the number of documents provided.


Conclusion

After testing Carrot2 with different kinds of documents, we conclude that Carrot2 works best with a small set of documents and when the topic is known. The user should try to make the input document set as close to a known topic as possible. If it's a large set of documents, the user should try to cut down the noise and reduce the size of the set by taking only the most popular issues that people are talking about. This will not only give better clusters but also make clustering faster.



Next Steps

  1. Find out top issues the way Cheng does
  2. Use SQL workbench + Carrot2 on the most-voted/popular questions for a particular week to figure out the top issues, which should reduce the manual reading and figuring out.
  3. Are we missing issues? People might not be voting "I have this problem too" on the same question. The votes might be spread out over several questions. How can we find a way to club them all together so that the issue doesn't go unnoticed?
  4. How can we use analytics (Webtrends, Google Analytics etc.) to get a better idea of what the users want?
  5. Have we tried asking Webtrends questions like: What is the average duration per visit? Where do visitors start from? Where do they end? Do they know about SUMO, or are they coming from Google or another website? How many visitors do we have? Out of those, what percentage end up reading an article? Are they satisfied, or do they end up posting a question?
  6. Is there a way to find out what people are searching for on SUMO/Google about Mozilla?
  7. Will looking at user feedback help in any way?
  8. Have we tried asking contributors like Corel and Xircal for their feedback on SUMO? They might have mentioned it on the contributors forum, but is anyone looking at it? Can we use those suggestions, since they deal the most with users?