Support/Intern/2011/Brinda/Summer Internship: Difference between revisions
No edit summary |
No edit summary |
||
| Line 8: | Line 8: | ||
#Xml editor- [http://www.oxygenxml.com/download_oxygenxml_editor.html Oxygen] (good to start with to get an idea on how to use it). Other examples are [http://www.jedit.org/ jedit] (limited version of Oxygen), [http://saxon.sourceforge.net/#F9.3HE Saxon] (command line processor)<br> | #Xml editor- [http://www.oxygenxml.com/download_oxygenxml_editor.html Oxygen] (good to start with to get an idea on how to use it). Other examples are [http://www.jedit.org/ jedit] (limited version of Oxygen), [http://saxon.sourceforge.net/#F9.3HE Saxon] (command line processor)<br> | ||
<br> | <br> | ||
== Mobile specific<br> == | == Mobile specific<br> == | ||
| Line 61: | Line 61: | ||
<br>2. '''Run the file through Carrot with defined parameters''' | <br>2. '''Run the file through Carrot with defined parameters''' | ||
These are some good parameters for Carrot attributes. There is no fixed paramters but the two examples below usually give good and meaningful clusters for Mobile data. | These are some good parameters for Carrot attributes. There is no fixed paramters but the two examples below usually give good and meaningful clusters for Mobile data. You can always play around with the attributes to explore and fine tune it further <br> | ||
<br> Cluster Count Base: 30 34<br> | <br> Cluster Count Base: 30 34<br> | ||
| Line 243: | Line 243: | ||
d) You can now remove the groups with 0 votes and manually club some more questions based on your requirements. | d) You can now remove the groups with 0 votes and manually club some more questions based on your requirements. | ||
<br> | |||
== Desktop specific<br> == | |||
Because the process doesn’t work for desktop -- too much data to process for Carrot. I made a new process that finds new issues (and potentially new issues) using a combination of Carrot and SQL. It’s a little more complicated but successfully manages to give a more detailed insight into the support forum compared to Cheng's way . | |||
1.'''Run a defined SQL query to look at top voted questions that were asked this week. -- Right now Cheng is only looking at the Most Requested ones on the site this helps filter out a lot of the less useful questions and generic “Firefox sucks” questions.''' | |||
<br> ** Run Carrot with defined parameters (which are different from the Mobile parameters) [CW: This gives you a list of clusters but you have to make human decisions here... you can't just take the top ones and go with it... it's not like mobile.]<br> ** Carrot provides keywords that you can then use in a defined SQL script. -- This lets you get much more accurate numbers for the number of people with a problem -- not just the count off the top 20 pages but from all the questions.<br> ** For investigating issues, I made another Carrot parameter set that lets you take generic keywords (like hotmail) and helps break that down into specific issues. You need to use this with the dedupe script -- it provides a lot more insight on those keywords. It can tell you (for example) that on top of the usual hotmail questions, we suddenly see more of one specific issue.<br> ** You can use the formatting script here if you need to present information as well (although it doesn’t work as well as in the mobile case). | |||
In addition, I worked on a number of useful SQL queries that aren’t part of the usual feedback gathering. | |||
I have a query that gives votes on a day-wise basis for specific keywords/issues/threads to see how these problems are trending among users and if anything is flaring up. This was particularly useful after the Firefox 5 release.<br> * I have a query that provides top threads for any issue which helps identify good things to link developers to.<br> * I have a query that lets you break down questions by browser version and OS version to provide more information on specific issues or just in general.<br> * I have a query to see how often certain articles are used in forum replies.<br> | |||
<br> | |||
<br> | |||
<br> | |||
<br> | |||
<br> | <br> | ||
Revision as of 23:25, 9 August 2011
Things needed
Please make sure you have the following items
- MySQL Workbench
- VPN access to SUMO database
- Carrot2 Workbench
- Xml editor- Oxygen (good to start with to get an idea on how to use it). Other examples are jedit (limited version of Oxygen), Saxon (command line processor)
Mobile specific
I designed a mobile feedback gathering process. It’s only 4 main steps and basically automated (running scripts and queries). It groups all the main questions together, provides the vote numbers into a presentable format. It summarizes all of the mobile questions on SUMO on a single page making feedback reporting very fast.
These steps are:
1. Run a defined SQL query to get the questions and export the file.
a) Connect MySQL Workbench to sumo_dump table and run the queries
Query for New Questions asked
select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qq.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
order by count(distinct(qqv.id)) desc
Popular questions with votes > 3
select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qqv.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
having count(distinct(qqv.id))>3
order by count(distinct(qqv.id)) desc
b) Export the result in XML format. You can do this by clicking on that small floppy icon right above the result area
c)Open and input XML file and FIND and REPLACE all <ROW>, </ROW> with <document>, </document> respectively. Similary change <DATA>, </DATA> to <searchresult>, </searchresult>
d) Top three lines of your XML file should look like this. Add the two extra lines to look like this
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>data</query>
2. Run the file through Carrot with defined parameters
These are some good parameters for Carrot attributes. There is no fixed paramters but the two examples below usually give good and meaningful clusters for Mobile data. You can always play around with the attributes to explore and fine tune it further
Cluster Count Base: 30 34
Size Score Sorting Ratio: 0 0
Label Filtering: (All checked) (All checked)
Cluster Label Assignment: Unique Unique
Cluster Merging threshold: 0.70 0.40
Phrase Label Boost: 2.04 6.32
Phrase length penalty start: 8 8
Phrase length penalty stop: 8 8
Title word Boost: 2 6.32
Factorization Method: NonNegative Matrix Factorization ED Factory NonNegative Matrix Factorization ED Factory
Factorization Quality: High High
Maximum Matrix size: 37500 37500
Maximum word document frequency: 0.90 0.90
Term weighting: Log Tf Ldf term Weighting Log Tf Ldf Term Weighting
Default clustering language: English English
Language aggregation language: Flatten clusters from the majority language Flatten clusters from the majority language
Phrase document frequency threshhold: 1 2
Trancated label threshold: 0.65 0.56
Preprocessing: Merge lexical resources (check) Merge lexical resources (check)
Minimum cluster size: 2 2
Word document frequency threshold: 1 1
Once you get the satisfactory clusters, Click on File> Save as and save the clusters in XML format and call it carrot.xml
3. Run a script I wrote to remove duplicates and clean up the output
a)Save the following script and name it dedupe.xsl
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="kDocById" match="document" use="@refid"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="group">
<xsl:copy>
<xsl:apply-templates select=
"@*
|
node()[not(self::document)]
|
document
[generate-id()
=
generate-id(key('kDocById', @refid)[1])
]"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
b)Open Oxygen(or any other XML editor) and put carrot.xml in input and name the output file transfer.xml
c)Once you run the editor, you will get a new file with the name transfer.xml
4. Run a script to format the data for easy presenting
a) Save the following script and name it format.xsl
<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet version="1.1" extension-element-prefixes="math" xmlns:exslt="http://exslt.org/common" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="xalan://java.lang.Math">
<xsl:template match="/">
<html>
<body>
<style type="text/css">
#background
{
background-color: #BBE664;
padding: 40px;
font-size:2.0em;
}
#top_issues
{
font-family:"Trebuchet MS", Arial, Helvetica, sans-serif;
border-collapse:collapse;
}
#top_issues td, #top_issues th
{
font-size:0.9em;
border:1px solid #98bf21;
padding:3px 7px 2px 7px;
}
</style>
<a href="https://support.mozilla.com/en-US/home">
<img src="/Users/brindagupta/Desktop/sumo.png" align= "right"/>
</a>
<div id="background" align="center">Popular Questions asked between 8/1- 8/7 </div>
<xsl:text>Total records: </xsl:text>
<xsl:value-of select="count(/searchresult/document[@id])"/>
<br>
<xsl:text>Total clusters: </xsl:text>
</br>
<xsl:value-of select="count(/searchresult/group[@id])"/>
<br/> <br/> <!-->
<a href="/Users/brindagupta/Desktop/carrot.png">Click here for Cluster Visualization </a>
</!-->
<table id="top_issues">
<tr bgcolor="#9acd32">
<th>Cluster</th>
<th>Sum</th>
<th>Votes</th>
<th>Title</th>
</tr>
<xsl:for-each select="/searchresult/group">
<tr>
<td>
<xsl:value-of select="."/>
</td>
<td>
<xsl:variable name="total">
<xsl:for-each select="document">
<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
<p/>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="sum(exslt:node-set(($total/text())))"/>
</td>
<td>
<xsl:for-each select="document">
<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
<p/>
</xsl:for-each>
</td>
<td>
<xsl:for-each select="document">
<a href="{/searchresult/document[@id = current()/@refid]/url}">
<p>
<xsl:value-of select="substring-before(/searchresult/document[@id = current()/@refid]/title, '::')"/>
</p>
</a>
</xsl:for-each>
</td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
b)Open Oxygen(or any other XML editor) and put transfer.xml in input and name the output file result.html
c)Once you run the editor, you will get a new file with the name result.html
d) You can now remove the groups with 0 votes and manually club some more questions based on your requirements.
Desktop specific
Because the process doesn’t work for desktop -- too much data to process for Carrot. I made a new process that finds new issues (and potentially new issues) using a combination of Carrot and SQL. It’s a little more complicated but successfully manages to give a more detailed insight into the support forum compared to Cheng's way .
1.Run a defined SQL query to look at top voted questions that were asked this week. -- Right now Cheng is only looking at the Most Requested ones on the site this helps filter out a lot of the less useful questions and generic “Firefox sucks” questions.
** Run Carrot with defined parameters (which are different from the Mobile parameters) [CW: This gives you a list of clusters but you have to make human decisions here... you can't just take the top ones and go with it... it's not like mobile.]
** Carrot provides keywords that you can then use in a defined SQL script. -- This lets you get much more accurate numbers for the number of people with a problem -- not just the count off the top 20 pages but from all the questions.
** For investigating issues, I made another Carrot parameter set that lets you take generic keywords (like hotmail) and helps break that down into specific issues. You need to use this with the dedupe script -- it provides a lot more insight on those keywords. It can tell you (for example) that on top of the usual hotmail questions, we suddenly see more of one specific issue.
** You can use the formatting script here if you need to present information as well (although it doesn’t work as well as in the mobile case).
In addition, I worked on a number of useful SQL queries that aren’t part of the usual feedback gathering.
I have a query that gives votes on a day-wise basis for specific keywords/issues/threads to see how these problems are trending among users and if anything is flaring up. This was particularly useful after the Firefox 5 release.
* I have a query that provides top threads for any issue which helps identify good things to link developers to.
* I have a query that lets you break down questions by browser version and OS version to provide more information on specific issues or just in general.
* I have a query to see how often certain articles are used in forum replies.