Revision as of 17:49, 10 August 2011

Things needed

Please make sure you have the following items

MySQL Workbench
VPN access to SUMO database
Carrot2 Workbench
Xml editor- Oxygen (good to start with to get an idea on how to use it). Other examples are jedit (limited version of Oxygen), Saxon (command line processor)

Mobile specific

I designed a mobile feedback gathering process. It’s basically automated (running scripts and queries). It groups all the main questions together, provides the vote numbers into a presentable format. It summarizes all of the mobile questions on SUMO on a single page making feedback reporting very fast.

These steps are:

1. Run a defined SQL query to get the questions and export the file.

a) Connect MySQL Workbench to sumo_dump table and run the queries

Query for New Questions asked

select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qq.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
order by count(distinct(qqv.id)) desc

Popular questions with votes > 3

select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qqv.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
having count(distinct(qqv.id))>3
order by count(distinct(qqv.id)) desc

b) Export the result in XML format. You can do this by clicking on that small floppy icon right above the result area

c)Open and input XML file and FIND and REPLACE all <ROW>, </ROW> with <document>, </document> respectively. Similary change <DATA>, </DATA> to <searchresult>, </searchresult>

d) Top three lines of your XML file should look like this. Add the two extra lines to look like this

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>data</query>

2. Run the file through Carrot with defined parameters

These are some good parameters for Carrot attributes. There is no fixed paramters but the two examples below usually give good and meaningful clusters for Mobile data. You can always play around with the attributes to explore and fine tune it further

Cluster Count Base: 30 34

Size Score Sorting Ratio: 0 0

Label Filtering: (All checked) (All checked)

Cluster Label Assignment: Unique Unique

Cluster Merging threshold: 0.70 0.40

Phrase Label Boost: 2.04 6.32

Phrase length penalty start: 8 8

Phrase length penalty stop: 8 8

Title word Boost: 2 6.32

Factorization Method: NonNegative Matrix Factorization ED Factory NonNegative Matrix Factorization ED Factory

Factorization Quality: High High

Maximum Matrix size: 37500 37500

Maximum word document frequency: 0.90 0.90

Term weighting: Log Tf Ldf term Weighting Log Tf Ldf Term Weighting

Default clustering language: English English

Language aggregation language: Flatten clusters from the majority language Flatten clusters from the majority language

Phrase document frequency threshhold: 1 2

Trancated label threshold: 0.65 0.56

Preprocessing: Merge lexical resources (check) Merge lexical resources (check)

Minimum cluster size: 2 2

Word document frequency threshold: 1 1

Once you get the satisfactory clusters, Click on File> Save as and save the clusters in XML format and call it carrot.xml

3. Run a script I wrote to remove duplicates and clean up the output

a)Save the following script and name it dedupe.xsl

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:key name="kDocById" match="document" use="@refid"/>
    
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="group">
        <xsl:copy>
            <xsl:apply-templates select=
                "@*
                |
                node()[not(self::document)]
                |
                document
                [generate-id()
                =
                generate-id(key('kDocById', @refid)[1])
                ]"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

b)Open Oxygen(or any other XML editor) and put carrot.xml in input and name the output file transfer.xml

c)Once you run the editor, you will get a new file with the name transfer.xml

4. Run a script to format the data for easy presenting

a) Save the following script and name it format.xsl

<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet version="1.1" extension-element-prefixes="math" xmlns:exslt="http://exslt.org/common" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="xalan://java.lang.Math">
	<xsl:template match="/">
		<html>
			<body>
				<style type="text/css">

#background
{
background-color: #BBE664;
padding: 40px;
font-size:2.0em;
}



#top_issues
{
font-family:"Trebuchet MS", Arial, Helvetica, sans-serif;
border-collapse:collapse;
}
#top_issues td, #top_issues th 
{
font-size:0.9em;
border:1px solid #98bf21;
padding:3px 7px 2px 7px;
}

				</style>
			
					<div id="background" align="center">Popular Questions asked between 8/1- 8/7 </div>
						
						
				
			
				<xsl:text>Total records: </xsl:text>
				<xsl:value-of select="count(/searchresult/document[@id])"/>
				<br>
					<xsl:text>Total clusters: </xsl:text>
				</br>
				
				<xsl:value-of select="count(/searchresult/group[@id])"/>
				<br/> <br/> 
				<table id="top_issues">
					<tr bgcolor="#9acd32">
						<th>Cluster</th>
						<th>Sum</th>
						<th>Votes</th>
						<th>Title</th>
					</tr>
					<xsl:for-each select="/searchresult/group">
						<tr>
							<td>
								<xsl:value-of select="."/>
							</td>
							<td>
								<xsl:variable name="total">
									<xsl:for-each select="document">
										<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
										<p/>
									</xsl:for-each>
								</xsl:variable>
								<xsl:value-of select="sum(exslt:node-set(($total/text())))"/>
							</td>
							<td>
								<xsl:for-each select="document">
									<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
									<p/>
								</xsl:for-each>
							</td>
							<td>
								<xsl:for-each select="document">
									<a href="{/searchresult/document[@id = current()/@refid]/url}">
										<p>
											<xsl:value-of select="substring-before(/searchresult/document[@id = current()/@refid]/title, '::')"/>
										</p>
									</a>
								</xsl:for-each>
							</td>
						</tr>
					</xsl:for-each>
				</table>
			</body>
		</html>
	</xsl:template>
</xsl:stylesheet>

b)Open Oxygen(or any other XML editor) and put transfer.xml in input and name the output file result.html

c)Once you run the editor, you will get a new file with the name result.html

d) You can now remove the groups with 0 votes and manually club some more questions based on your requirements.

Desktop specific

Because the process doesn’t work for desktop -- too much data to process for Carrot. I made a new process that finds new issues using a combination of Carrot and SQL. It’s a little more complicated but successfully manages to give a more detailed insight into the support forum compared to Cheng's way .

1.Run a defined SQL query to look at top voted questions that were asked this week. -- Right now Cheng is only looking at the Most Requested ones on the site this helps filter out a lot of the less useful questions and generic “Firefox sucks” questions.

a) Connect MySQL Workbench to sumo_dump table and run the queries

Query for New questions asked

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where qq.created > "2011-08-01" 
group by qq.id
having count(distinct(qqv.id))>15
order by count(distinct(qqv.id)) desc

Query for Popular questions asked

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where qqv.created > "2011-08-01" 
group by qq.id
having count(distinct(qqv.id))>20
order by count(distinct(qqv.id)) desc

b) Export the result in XML format. You can do this by clicking on that small floppy icon right above the result area

c)Open and input XML file and FIND and REPLACE all <ROW>, </ROW> with <document>, </document> respectively. Similary change <DATA>, </DATA> to <searchresult>, </searchresult>

d) Top three lines of your XML file should look like this. Add the two extra lines to look like this

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>data</query>

2.Run Carrot with defined parameters (which are different from the Mobile parameters) [CW: This gives you a list of clusters but you have to make human decisions here... you can't just take the top ones and go with it... it's not like mobile.]
a) Run Carrot with the following parameters

There is no fixed parameters as a lot depends on the type of questions and the number of questions. If your dataset is small(~50) after using the SQL query, you can use same parameters as Mobile:

Cluster Count Base: 30 34

Size Score Sorting Ratio: 0 0

Label Filtering: (All checked) (All checked)

Cluster Label Assignment: Unique Unique

Cluster Merging threshold: 0.70 0.40

Phrase Label Boost: 2.04 6.32

Phrase length penalty start: 8 8

Phrase length penalty stop: 8 8

Title word Boost: 2 6.32

Factorization Method: NonNegative Matrix Factorization ED Factory NonNegative Matrix Factorization ED Factory

Factorization Quality: High High

Maximum Matrix size: 37500 37500

Maximum word document frequency: 0.90 0.90

Term weighting: Log Tf Ldf term Weighting Log Tf Ldf Term Weighting

Default clustering language: English English

Language aggregation language: Flatten clusters from the majority language Flatten clusters from the majority language

Phrase document frequency threshhold: 1 2

Trancated label threshold: 0.65 0.56

Preprocessing: Merge lexical resources (check) Merge lexical resources (check)

Minimum cluster size: 2 2

Word document frequency threshold: 1 1

If its more than double of that, use the following parameters

Cluster Count Base: 45

Size Score Sorting Ratio: 0

Label Filtering: (All checked)

Cluster Label Assignment: Unique

Cluster Merging threshold: 0.70/ 0.40/0.29

Phrase Label Boost: 1.50/5.30

Phrase length penalty start: 8

Phrase length penalty stop: 8

Title word Boost: 2

Factorization Method: NonNegative Matrix Factorization ED Factory

Factorization Quality: High

Maximum Matrix size: 37500

Maximum word document frequency: 0.90

Term weighting: Log Tf Ldf term Weighting

Default clustering language: English

Language aggregation language: Flatten clusters from the majority language

Phrase document frequency threshhold: 1

Truncated label threshold: 0.47/0.75

Preprocessing: Merge lexical resources (check)

Minimum cluster size: 2

Word document frequency threshold: 1

b)Once you get an idea about the issues from the Carrot clusters, search for those keywords(say hotmail) in SQL by using the following query

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where
(qq.title like "%hotmail%" or qq.content like "%hotmail%")
and 
qq.created > "2011-08-01" 
group by qq.id
order by count(distinct(qqv.id)) desc

This lets you get much more accurate numbers for the number of people with a problem -- not just the count off the top 20 pages but from all the questions.

c) Export the result from MySQL Workbench in XML format and save it as hotmail.xml

d) Make it into carrot Format just like Step 1

3.For investigating issues, I made another Carrot parameter set that lets you take generic keywords (like hotmail) and helps break that down into specific issues. You need to use this with the dedupe script -- it provides a lot more insight on those keywords. It can tell you (for example) that on top of the usual hotmail questions, we suddenly see more of one specific issue.

a) Run hotmail.xml with the following Carrot parameters

Based on the size of yoru questions, use one of the Carrot parameters as step 2 a). Or try the following parameter:

Cluster Count Base: 70

Size Score Sorting Ratio: 1

Label Filtering: (All checked)

Cluster Label Assignment: Simple

Cluster Merging threshold: 0.35

Phrase Label Boost: 1.50

Phrase length penalty start: 8

Phrase length penalty stop: 8

Title word Boost: 2

Factorization Method: K-means

Factorization Quality: High

Maximum Matrix size: 37500

Maximum word document frequency: 0.90

Term weighting: Log Tf Ldf term Weighting

Default clustering language: English

Language aggregation language: Flatten clusters from the majority language

Phrase document frequency threshhold: 1

Trancated label threshold: 0.14

Preprocessing: Merge lexical resources (check)

Minimum cluster size: 2

Word document frequency threshold: 1

b) On getting satisfactory insight into why hotmail is a problem, export the result from Carrot by going to File>Save as and save it as carrothotmail.xml

c) In order to remove the duplicates, run dedupe.xsl script on carrothotmail.xml and save the output as transfer.xml

4.You can use the formatting script here if you need to present information as well (although it doesn’t work as well as in the mobile case).
a) To get a better view of the reasons behind hotmail issue, run format.xsl on transfer.xml. You will see the number of votes and the total sum of all votes for each reason behind the hotmail problem

However since this is only one of the issues, you will have to repeat the process for other keywords like Google toolbar etc.

SQL Queries

In addition, I worked on a number of useful SQL queries that aren’t part of the usual feedback gathering.

1.I have a query that gives votes on a day-wise basis for specific keywords/issues/threads to see how these problems are trending among users and if anything is flaring up. This was particularly useful after the Firefox 5 release.

select date(qqv.created) as day, count(distinct(qqv.id)) as votes
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
join questions_questionmetadata qm ON qm.question_id= qq.id
where
(qq.title like "%google%toolbar%" or qq.content like "%google%toolbar%")
and
(qqv.created> "2011-06-21" and qqv.created < "2011-06-27" and qm.name= "useragent" and qm.value LIKE "%Firefox/5.%")
group by day
order by day asc;

2. To see daywise breakdown of top(say Google toolbar) questions, this query is helpful-

select  qq.created, date(qqv.created) as day, count(distinct(qqv.id)), qq.title, qq.content
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
join questions_questionmetadata qm ON qm.question_id= qq.id
where
(qq.title LIKE "%google%toolbar%" or qq.content LIKE "%google%toolbar%")
and
(qqv.created> "2011-06-22" and qm.name= "useragent" and qm.value LIKE "%Firefox/5.%")
group by qq.id, day
order by count(distinct(qqv.id)) desc, day

3. I have a query for weekwise top 10 questions and how their votes are trending. By searching question id in the searchbox, you will see weekwise trend for that particular question

select t1.created, t1.id, week(q2.created) as week, count(distinct(q2.id)), t1.title
from
(select
count(distinct(qqv.id)) as votes, qq.title as title, qq.created as created, qq.id as id
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
where
(qqv.created> "2011-08-01")
group by qq.id
order by count(distinct(qqv.id)) desc
limit 10
) as t1
join questions_questionvote q2 ON q2.question_id = t1.id
group by t1.id, week
order by count(distinct(q2.id)) desc

4.I have a query that provides top threads for any issue which helps identify good things to link developers to.

select  qq.created, count(distinct(qqv.id)), qq.id, qq.title, qq.content 
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where
( qq.title like "%youtube%" or qq.content like "%youtube%")
and
 qqv.created > "2011-08-01"
group by qq.id
order by count(distinct(qqv.id)) desc

5.I have a query that lets you break down questions by browser version and OS version to provide more information on specific issues or just in general.

select tv.date, q2.value as os,  tv.version, tv.id2, tv.votes, tv.title
from questions_questionmetadata q2
join
(select qq.created as date,
IF(LOCATE("Firefox/", q1.value),SUBSTRING_INDEX(SUBSTR(q1.value,LOCATE("Firefox/",q1.value)+8)," ",1), tf.ff) as version,
 qq.id as id2,  qq.title as title, count(distinct(qqv.id)) as votes
from questions_questionvote qqv
join questions_question qq ON qqv.question_id=qq.id
join questions_questionmetadata q1 ON q1.question_id=qq.id
join
(select q3.value  as ff, q3.id as id3
from
questions_questionmetadata q3
where q3.name= "ff_version"
) as tf
ON (tf.id3= qq.id)
where qqv.created>"2011-08-01" and q1.name="ff_version"
group by qq.id
order by count(distinct(qqv.id)) desc
) as tv
ON (q2.question_id = tv.id2)
where q2.name="os"
limit 40000

6. I have a query to see how often certain articles are used in forum replies.

select qq.created,qq.id, count(distinct(qqv.id)), qq.title, qq.content, qa.content
from questions_question qq
JOIN questions_answer qa ON qq.id= qa.question_id
JOIN questions_questionvote qqv ON qqv.question_id= qq.id
where
(qq.created > "2011-06-21" and qq.created < "2011-06-28")
and
(qa.content LIKE "%[[%safe mode%]]%")
group by qq.id, qa.content
order by count(distinct(qqv.id)) desc

Support/Intern/2011/Brinda/Summer Internship: Difference between revisions

Revision as of 17:49, 10 August 2011

Contents

Things needed

Mobile specific

Desktop specific

SQL Queries

Navigation menu

Support/Intern/2011/Brinda/Summer Internship: Difference between revisions

Revision as of 17:49, 10 August 2011

Things needed

Mobile specific

Desktop specific

SQL Queries

Navigation menu

Search