Support/Intern/2011/Brinda/Mid internship

Purpose of the Project

Figuring out a system to text-mine the support forum and gather insights about top issues in SUMO.


Why Carrot2?

We were looking for an open source tool that could take data from our support forum database and text-mine it to produce meaningful information. We currently get 3000 posts each week, and reading them manually to figure out the top issues is exhausting. After asking around and searching the web for open source text mining tools, we settled on Carrot2.
Carrot2 is an open source tool that can automatically cluster a collection of documents into thematic categories. Carrot2 is available in several forms; for our project the best option was Carrot2 Workbench, since it lets us input data from our database. Note that Carrot2 only works properly if the input file is in the Carrot2 XML format. In that file every question is called a document, and each document has a title (the title of the question), a url and a snippet (the content of the question).
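
To make the document structure concrete, here is a minimal sketch in Python that builds such a file from a list of questions. The function name build_carrot2_input, the id attribute on each document and the sample questions are assumptions made for illustration; only the searchresult, query, document, title, url and snippet element names come from the format described above.

```python
import xml.etree.ElementTree as ET


def build_carrot2_input(query, questions):
    """Return Carrot2-style input XML (as a string) for a list of questions.

    Each question is a dict with 'title', 'url' and 'snippet' keys.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for number, question in enumerate(questions):
        # The id attribute is an assumption; it gives each document a
        # stable number that cluster output can refer back to.
        document = ET.SubElement(root, "document", id=str(number))
        # One child element per field: title, url, snippet.
        for field in ("title", "url", "snippet"):
            ET.SubElement(document, field).text = question[field]
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, not taken from the SUMO database.
sample_questions = [
    {
        "title": "Firefox crashes on startup",
        "url": "https://example.org/questions/1",
        "snippet": "Firefox closes right after I open it.",
    },
    {
        "title": "Tabs are not restored",
        "url": "https://example.org/questions/2",
        "snippet": "My tabs are gone after a restart.",
    },
]

carrot2_xml = build_carrot2_input("sample", sample_questions)
```

Building the tree with an XML library rather than by string concatenation means characters such as & and < in question text are escaped automatically.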


What have I done till now?

-Since my project involves the database, I read a lot of forum questions to get an idea of the different kinds of questions people ask. This helped me familiarize myself with some of the content in our SUMO database.
-Started querying our database directly using MySQL Workbench.
-Learnt how to get the desired output and export it in XML format. This included exploring MySQL in greater depth and understanding how to write different kinds of queries.
-Installed Carrot2 and learnt how it works. Since Carrot2 is available in different forms, I had to choose one that could take our input data and create good clusters.
-Figured out the different sets of questions and the number of questions to input into Carrot2.
-Fine-tuned the attributes to get better clusters.
-Exported the result.
-Converted the result from XML to HTML using XSLT.
-Learnt (and am still learning) how to write a script in XSLT. Managed to filter out only the cluster names from the output file and replace the refid below each cluster name with the respective question title.
-Figured out how to hyperlink each title to its respective url.
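
The refid-to-title replacement and the hyperlinking described in the last two points can also be prototyped outside XSLT. The following Python sketch is not the XSLT script itself, only an equivalent illustration. It assumes the Carrot2 output lists each question as a top-level document element with an id, and each cluster as a group element whose name sits in title/phrase and whose members are document elements carrying a refid. The sample XML is hand-written and the function name clusters_to_html is hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed Carrot2 output layout.
SAMPLE_RESULT = """<searchresult>
  <query>sample</query>
  <document id="0">
    <title>Firefox crashes on startup</title>
    <url>https://example.org/questions/1</url>
    <snippet>Firefox closes right after I open it.</snippet>
  </document>
  <document id="1">
    <title>Crash when opening a PDF</title>
    <url>https://example.org/questions/2</url>
    <snippet>The browser crashes on PDF links.</snippet>
  </document>
  <group id="0" size="2">
    <title><phrase>Crashes</phrase></title>
    <document refid="0"/>
    <document refid="1"/>
  </group>
</searchresult>"""


def clusters_to_html(result_xml):
    """Return an HTML fragment: one heading per cluster, one link per question."""
    root = ET.fromstring(result_xml)
    # Index the top-level documents by id so each refid can be resolved.
    documents = {doc.get("id"): doc for doc in root.findall("document")}
    lines = []
    # Only top-level groups are handled; nested sub-clusters are ignored.
    for group in root.findall("group"):
        name = group.findtext("title/phrase", default="(unnamed cluster)")
        lines.append("<h2>%s</h2>" % html.escape(name))
        lines.append("<ul>")
        for ref in group.findall("document"):
            doc = documents[ref.get("refid")]
            title = html.escape(doc.findtext("title", default=""))
            url = html.escape(doc.findtext("url", default=""), quote=True)
            lines.append('<li><a href="%s">%s</a></li>' % (url, title))
        lines.append("</ul>")
    return "\n".join(lines)


page = clusters_to_html(SAMPLE_RESULT)
```

The dictionary lookup plays the role of the refid replacement: each reference inside a cluster is swapped for the title and url of the document it points to.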


Steps

1. We start off by using MySQL Workbench to get all the questions (title and content) that were posted in the desired time span.
2. Export the results in XML format.
3. Modify the exported file to match the Carrot2 format. Follow these steps:
a. Find and replace all <ROW> and </ROW> with <document> and </document> respectively.
b. Add the following three lines at the top:
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>typenamehere</query>
c. Make sure the end tag </searchresult> is at the bottom.
d. Make sure each document follows the same order: <title> followed by <snippet>.
e. You can’t define your own tags in the XML file; Carrot2 will show an error. Follow the same naming convention as their format: http://download.carrot2.org/head/manual/index.html#section.architecture.input-xml
4. Open Carrot2, set the Source to XML, enter the path to your file in the XML resource field, choose an algorithm (any of them will do, but Lingo works best) and click Process.
5. You will see a list of clusters and their corresponding documents.
6. To improve the cluster quality, tune the attributes on the right side until the clusters are more informative and meaningful.
7. Once you have the clusters you want, export the results by clicking File > Save as.
8. To extract just the cluster names and the links to those questions, we use XSLT.
9. I have written a script to do this, so you just need to link the XML output file to my script.
10. You can do this either with an XML editor or by simply putting these 2 lines at the top of your XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="brindaXSLT.xsl"?>
11. If you use an XML editor, a new HTML page will be generated. If you instead add the 2 lines mentioned above, you get the result by simply opening your XML file in Firefox.
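
The file modification sub-steps (a to e) can also be scripted instead of being done by find and replace. Below is a minimal sketch in Python. It assumes the export wraps each question in a <ROW> element whose columns are already named title, url or snippet (for example through column aliases in the query). The <DATA> root element in the sample and the function name export_to_carrot2 are assumptions, not something verified against MySQL Workbench.

```python
import xml.etree.ElementTree as ET

# Hypothetical export; the <DATA> root and the column names are assumptions.
SAMPLE_EXPORT = """<DATA>
  <ROW>
    <title>Firefox crashes on startup</title>
    <snippet>Firefox closes right after I open it.</snippet>
  </ROW>
  <ROW>
    <title>Tabs are not restored</title>
    <snippet>My tabs are gone after a restart.</snippet>
  </ROW>
</DATA>"""

ALLOWED_FIELDS = ("title", "url", "snippet")


def export_to_carrot2(export_xml, query):
    """Rewrite <ROW> elements as <document> elements inside <searchresult>."""
    source = ET.fromstring(export_xml)
    result = ET.Element("searchresult")
    ET.SubElement(result, "query").text = query
    for row in source.iter("ROW"):
        document = ET.SubElement(result, "document")
        # Copy only the fields Carrot2 knows, in a fixed order (sub-steps
        # d and e); any other column in the export is dropped.
        for field in ALLOWED_FIELDS:
            value = row.findtext(field)
            if value is not None:
                ET.SubElement(document, field).text = value
    body = ET.tostring(result, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


converted = export_to_carrot2(SAMPLE_EXPORT, "typenamehere")
```

When saving the returned string, write the file with UTF-8 encoding so that it matches the declaration on its first line.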


Advantages

  1. Instead of us manually reading questions from the forum, Carrot2 makes it easier to prioritize which questions to focus on, based on the clusters formed.
  2. It helps surface questions that might otherwise go unnoticed. For example, an issue that is asked about several times, but with few votes on each individual question, usually goes unnoticed.
  3. Carrot2 works best with the following sets of documents, in decreasing order of effectiveness: known topic, most votes, unknown topic.
  4. Converting the XML to HTML makes it easier to access the questions under each cluster. Instead of copying and pasting links, we simply click on a link.


Disadvantages

  1. Doesn’t work well with a large dataset. As the dataset grows, Carrot2 becomes slower and uses more memory. Tuning a large dataset can take 5-10 minutes.
  2. Though Carrot2 clusters documents to give a better idea of the issues to focus on, it does not provide a reliable count of documents with a specific problem, because the same document can appear under several cluster names.
  3. You also need to manually sort the documents under Other Topics into clusters, especially with a large set of documents.
  4. Carrot2 cannot recognize spelling mistakes. There is no way of telling the tool that words like crash/cras/crsh are all the same, which results in a few documents being clustered incorrectly.
  5. One of the issues of working with unclustered data is that people don't use the same words for the same issue, for example restore/save tabs or blurry/fuzzy. Carrot2 doesn't provide a way to group these words together, so users end up reading the clusters and merging them manually.
  6. Carrot2 doesn’t allow you to name clusters and then collect documents based on those names. Also, Carrot2 does not always produce the cluster names you might be looking for. For example, all issues related to fvd were grouped together under the cluster name “invalid security certificate”.
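
One possible workaround for points 4 and 5, which has not been tried in this project, is to normalize known variants in the question text before the file is handed to Carrot2. The Python sketch below illustrates the idea. The mapping is a hypothetical, hand-maintained table seeded with the examples above, and the function name normalize is an assumption.

```python
import re

# Hand-maintained, hypothetical mapping from variants to one canonical word.
CANONICAL = {
    "cras": "crash",
    "crsh": "crash",
    "fuzzy": "blurry",
}

WORD = re.compile(r"[A-Za-z]+")


def normalize(text):
    """Replace known variants with their canonical word; leave the rest as is."""

    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; the replacement is always lower case.
        return CANONICAL.get(word.lower(), word)

    return WORD.sub(swap, text)


normalized = normalize("Firefox crsh every time, and the text looks fuzzy")
# normalized == "Firefox crash every time, and the text looks blurry"
```

A table like this only catches the variants someone has already noticed, so it would reduce the manual merging rather than remove it.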


What needs to be done?

  1. Work on the XSLT script and improve it: better background and layout.
  2. Create a table in XSLT and explore XSLT further. See what features can be added to make the data easier for the user to understand, for example the number of votes next to each question, a graph etc.
  3. Using the XSLT file inside Carrot2 currently gives an error; fix that.
  4. Figure out the best way to get the result: having an XSLT file in Carrot2, using an XML editor etc.
  5. We can’t use self-defined tags in Carrot2. Try using field to see if we can have information like votes, id etc as separate information fields.
  6. How can we make reporting better?
  7. Webtrends: understand Webtrends properly. Write a wiki page for each report.
  8. Read analytics blogs like Occam’s Razor and see if there is anything new.
  9. Think of ways to make the HTML page and the reporting better and more efficient.