Support/Intern/2011/Brinda/Summer Internship

From MozillaWiki
Jump to: navigation, search

Things needed

Please make sure you have the following items

  1. MySQL Workbench
  2. VPN access to SUMO database
  3. Carrot2 Workbench
  4. Xml editor- Oxygen (good to start with to get an idea on how to use it). Other examples are jedit (limited version of Oxygen), Saxon (command line processor)


Mobile specific

I designed a mobile feedback gathering process. It’s basically automated (running scripts and queries). It groups all the main questions together, provides the vote numbers into a presentable format. It summarizes all of the mobile questions on SUMO on a single page making feedback reporting very fast.

These steps are:

1. Run a defined SQL query to get the questions and export the file.

a) Connect MySQL Workbench to sumo_dump table and run the queries (click on the yellow scar like thing)
Remember each week you will need to change the dates based on your requirements. For example if you want the data for August 1 to August 5, simply change one line in the query

where qq.created > "2011-08-01" and qq.created < "2011-08-05".
Similarly you can do it for the 2nd SQL query-

where qqv.created > "2011-08-01" and qqv.created < "2011-08-05"

Query for New Questions asked

select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qq.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
order by count(distinct(qqv.id)) desc


Popular questions with votes > 3

select CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title , CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
JOIN taggit_taggeditem tti ON qq.id= tti.object_id
JOIN taggit_tag tt ON tt.id= tti.tag_id
where qqv.created > "2011-08-01"
and
(tt.name like "%mobile%")
group by qq.id
having count(distinct(qqv.id))>3
order by count(distinct(qqv.id)) desc


b) Export the result in XML format from MySQL Workbench. You can do this by clicking on that small floppy icon right above the result area

c)Open and input XML file in XML editor(for ex Oxygen) and FIND and REPLACE ALL <ROW>, </ROW> with <document>, </document> respectively. Similary change <DATA>, </DATA> to <searchresult>, </searchresult>

d) Top three lines of your XML file should look like this. Add the two extra lines to look like this

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>data</query>

Make sure the end tag in XML is </searchresult> and not </DATA>


2. Run the file through Carrot with defined parameters

Open Carrot and on the right hang side, choose Source as XML and Algorithm as Lingo. Input your XML file(path) in the XML resource.

These are some good parameters for Carrot attributes. There is no fixed paramters but the two examples below usually give good and meaningful clusters for Mobile data. You can always play around with the attributes to explore and fine tune it further


Cluster Count Base:  30                                                                                                             34

Size Score Sorting Ratio:  0                                                                                                        0

Label Filtering: (All checked)                                                                                                       (All checked)

Cluster Label Assignment:  Unique                                                                                             Unique

Cluster Merging threshold:  0.70                                                                                                  0.40

Phrase Label Boost:  2.04                                                                                                            6.32

Phrase length penalty start:  8                                                                                                      8

Phrase length penalty stop:  8                                                                                                      8

Title word Boost:  2                                                                                                                       6.32

Factorization Method:  NonNegative Matrix Factorization ED Factory                                         NonNegative Matrix Factorization ED Factory

Factorization Quality:  High                                                                                                          High

Maximum Matrix size:  37500                                                                                                      37500

Maximum word document frequency:  0.90                                                                                 0.90

Term weighting:  Log Tf Ldf term Weighting                                                                                 Log Tf Ldf Term Weighting

Default clustering language:  English                                                                                           English

Language aggregation language:  Flatten clusters from the majority language                           Flatten clusters from the majority language

Phrase document frequency threshhold:  1                                                                                  2

Trancated label threshold:  0.65                                                                                                    0.56

Preprocessing:  Merge lexical resources (check)                                                                          Merge lexical resources (check)

Minimum cluster size:  2                                                                                                                2

Word document frequency threshold:  1                                                                                        1


Once you get the satisfactory clusters, Click on File> Save as and save the clusters in XML format and call it carrot.xml


3. Run a script I wrote to remove duplicates and clean up the output

a)Save the following script and name it dedupe.xsl

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:key name="kDocById" match="document" use="@refid"/>
    
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="group">
        <xsl:copy>
            <xsl:apply-templates select=
                "@*
                |
                node()[not(self::document)]
                |
                document
                [generate-id()
                =
                generate-id(key('kDocById', @refid)[1])
                ]"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

b)Open Oxygen(or any other XML editor) and put carrot.xml in input and name the output file transfer.xml

c)Once you run the editor, you will get a new file with the name transfer.xml

4. Run a script to format the data for easy presenting

a) Save the following script and name it format.xsl

<?xml version="1.1" encoding="UTF-8" ?>
<xsl:stylesheet version="1.1" extension-element-prefixes="math" xmlns:exslt="http://exslt.org/common" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="xalan://java.lang.Math">
	<xsl:template match="/">
		<html>
			<body>
				<style type="text/css">

#background
{
background-color: #BBE664;
padding: 40px;
font-size:2.0em;
}



#top_issues
{
font-family:"Trebuchet MS", Arial, Helvetica, sans-serif;
border-collapse:collapse;
}
#top_issues td, #top_issues th 
{
font-size:0.9em;
border:1px solid #98bf21;
padding:3px 7px 2px 7px;
}

				</style>
			
					<div id="background" align="center">Popular Questions asked between 8/1- 8/7 </div>
						
						
				
			
				<xsl:text>Total records: </xsl:text>
				<xsl:value-of select="count(/searchresult/document[@id])"/>
				<br>
					<xsl:text>Total clusters: </xsl:text>
				</br>
				
				<xsl:value-of select="count(/searchresult/group[@id])"/>
				<br/> <br/> 
				<table id="top_issues">
					<tr bgcolor="#9acd32">
						<th>Cluster</th>
						<th>Sum</th>
						<th>Votes</th>
						<th>Title</th>
					</tr>
					<xsl:for-each select="/searchresult/group">
						<tr>
							<td>
								<xsl:value-of select="."/>
							</td>
							<td>
								<xsl:variable name="total">
									<xsl:for-each select="document">
										<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
										<p/>
									</xsl:for-each>
								</xsl:variable>
								<xsl:value-of select="sum(exslt:node-set(($total/text())))"/>
							</td>
							<td>
								<xsl:for-each select="document">
									<xsl:value-of select="number(substring-after(/searchresult/document[@id = current()/@refid]/title, ' ::'))"/>
									<p/>
								</xsl:for-each>
							</td>
							<td>
								<xsl:for-each select="document">
									<a href="{/searchresult/document[@id = current()/@refid]/url}">
										<p>
											<xsl:value-of select="substring-before(/searchresult/document[@id = current()/@refid]/title, '::')"/>
										</p>
									</a>
								</xsl:for-each>
							</td>
						</tr>
					</xsl:for-each>
				</table>
			</body>
		</html>
	</xsl:template>
</xsl:stylesheet>
							

b)Open Oxygen(or any other XML editor) and put transfer.xml in input and name the output file result.html

c)Once you run the editor, you will get a new file with the name result.html

d) You can now remove the groups with 0 votes and manually club some more questions based on your requirements.
When you manually want to change the cluster name(copy paste etc), open transfer.xml in Oxygen(or any other XML editor). Make the desired changes and save it. Run format.xsl and you will see the changes are made. Refresh result.html if you dont see the changes you made :)


Desktop specific

Because the process doesn’t work for desktop -- too much data to process for Carrot. I made a new process that finds new issues using a combination of Carrot and SQL. It’s a little more complicated but successfully manages to give a more detailed insight into the support forum compared to Cheng's way .

1.Run a defined SQL query to look at top voted questions that were asked this week


a) Connect MySQL Workbench to sumo_dump table and run the queries

Query for New questions asked

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where qq.created > "2011-08-01" 
group by qq.id
having count(distinct(qqv.id))>15
order by count(distinct(qqv.id)) desc


Query for Popular questions asked

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where qqv.created > "2011-08-01" 
group by qq.id
having count(distinct(qqv.id))>20
order by count(distinct(qqv.id)) desc

b) Export the result in XML format. You can do this by clicking on that small floppy icon right above the result area

c)Open and input XML file and FIND and REPLACE all <ROW>, </ROW> with <document>, </document> respectively. Similary change <DATA>, </DATA> to <searchresult>, </searchresult>

d) Top three lines of your XML file should look like this. Add the two extra lines to look like this

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>data</query>


2.Run Carrot with defined parameters (which are different from the Mobile parameters) [CW: This gives you a list of clusters but you have to make human decisions here... you can't just take the top ones and go with it... it's not like mobile.]
a) Run Carrot with the following parameters

There is no fixed parameters as a lot depends on the type of questions and the number of questions. If your dataset is small(~50) after using the SQL query, you can use same parameters as Mobile:


Cluster Count Base:     30                                                                                                                         34

Size Score Sorting Ratio:    0                                                                                                                    0

Label Filtering: (All checked)                                                                                                                     (All checked)

Cluster Label Assignment: Unique                                                                                                             Unique

Cluster Merging threshold: 0.70                                                                                                                  0.40

Phrase Label Boost: 2.04                                                                                                                            6.32

Phrase length penalty start: 8                                                                                                                      8

Phrase length penalty stop: 8                                                                                                                      8

Title word Boost: 2                                                                                                                                       6.32

Factorization Method: NonNegative Matrix Factorization ED Factory                                                         NonNegative Matrix Factorization ED Factory

Factorization Quality: High                                                                                                                           High

Maximum Matrix size: 37500                                                                                                                        37500

Maximum word document frequency: 0.90                                                                                                   0.90

Term weighting: Log Tf Ldf term Weighting                                                                                                  Log Tf Ldf Term Weighting

Default clustering language: English                                                                                                            English

Language aggregation language: Flatten clusters from the majority language                                            Flatten clusters from the majority language

Phrase document frequency threshhold: 1                                                                                                   2

Trancated label threshold: 0.65                                                                                                                    0.56

Preprocessing: Merge lexical resources (check)                                                                                          Merge lexical resources (check)

Minimum cluster size: 2                                                                                                                                 2

Word document frequency threshold: 1                                                                                                         1



If its more than double of that, use the following parameters

(There is no fixed parameter but this seems to work decently for the two datasets I had-- Hopefully works well for rest also)

Cluster Count Base: 23

Size Score Sorting Ratio: 0

Label Filtering: (All checked)

Cluster Label Assignment: Unique

Cluster Merging threshold: 0.33

Phrase Label Boost: 4.45

Phrase length penalty start: 8

Phrase length penalty stop: 8

Title word Boost: 2

Factorization Method: NonNegative Matrix Factorization ED Factory

Factorization Quality: High

Maximum Matrix size: 37500

Maximum word document frequency: 0.90

Term weighting: Log Tf Ldf term Weighting

Default clustering language: English

Language aggregation language: Flatten clusters from the majority language

Phrase document frequency threshhold: 1

Truncated label threshold: 0.47

Preprocessing: Merge lexical resources (check)

Minimum cluster size: 2

Word document frequency threshold: 1





b)Once you get an idea about the issues from the Carrot clusters, search for those keywords(say hotmail) in SQL by using the following query

select  CONCAT(CONCAT(qq.title," ::"),count(distinct(qqv.id))) as title ,  CONCAT("http://support.mozilla.com/en-US/questions/",
CAST(qq.id as char)) AS url, qq.content as snippet
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where
(qq.title like "%hotmail%" or qq.content like "%hotmail%")
and 
qq.created > "2011-08-01" 
group by qq.id
order by count(distinct(qqv.id)) desc

This lets you get much more accurate numbers for the number of people with a problem -- not just the count off the top 20 pages but from all the questions.

c) Export the result from MySQL Workbench in XML format and save it as hotmail.xml

d) Make it into carrot Format just like Step 1

3.For investigating issues, I made another Carrot parameter set that lets you take generic keywords (like hotmail) and helps break that down into specific issues. You need to use this with the dedupe script -- it provides a lot more insight on those keywords. It can tell you (for example) that on top of the usual hotmail questions, we suddenly see more of one specific issue.

a) Run hotmail.xml with the following Carrot parameters

Based on the size of yoru questions, use one of the Carrot parameters as step 2 a). Or try the following parameter:


Cluster Count Base: 70

Size Score Sorting Ratio: 1

Label Filtering: (All checked)

Cluster Label Assignment: Simple

Cluster Merging threshold: 0.35

Phrase Label Boost: 1.50

Phrase length penalty start: 8

Phrase length penalty stop: 8

Title word Boost: 2

Factorization Method: K-means

Factorization Quality: High

Maximum Matrix size: 37500

Maximum word document frequency: 0.90

Term weighting: Log Tf Ldf term Weighting

Default clustering language: English

Language aggregation language: Flatten clusters from the majority language

Phrase document frequency threshhold: 1

Trancated label threshold: 0.14

Preprocessing: Merge lexical resources (check)

Minimum cluster size: 2

Word document frequency threshold: 1



b) On getting satisfactory insight into why hotmail is a problem, export the result from Carrot by going to File>Save as and save it as carrothotmail.xml

c) In order to remove the duplicates, run dedupe.xsl  script on carrothotmail.xml and save the output as transfer.xml. Open Oxygen(or any other XML editor) and have carrothotmail.xsl as the input and transfer.xml as the output file.


4.You can use the formatting script here if you need to present information as well (although it doesn’t work as well as in the mobile case).
a) To get a better view of the reasons behind hotmail issue, run format.xsl(from Oxygen) on transfer.xml. You will see the number of votes and the total sum of all votes for each reason behind the hotmail problem. Save the output file as result.html

However since this is only one of the issues, you will have to repeat the process for other keywords like Google toolbar etc.



SQL Queries


In addition, I worked on a number of useful SQL queries that aren’t part of the usual feedback gathering.

1.I have a query that gives votes on a day-wise basis for specific keywords/issues/threads to see how these problems are trending among users and if anything is flaring up. This was particularly useful after the Firefox 5 release.

select date(qqv.created) as day, count(distinct(qqv.id)) as votes
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
join questions_questionmetadata qm ON qm.question_id= qq.id
where
(qq.title like "%google%toolbar%" or qq.content like "%google%toolbar%")
and
(qqv.created> "2011-06-21" and qqv.created < "2011-06-27" and qm.name= "useragent" and qm.value LIKE "%Firefox/5.%")
group by day
order by day asc;

2. To see daywise breakdown of top(say Google toolbar) questions, this query is helpful-

select  qq.created, date(qqv.created) as day, count(distinct(qqv.id)), qq.title, qq.content
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
join questions_questionmetadata qm ON qm.question_id= qq.id
where
(qq.title LIKE "%google%toolbar%" or qq.content LIKE "%google%toolbar%")
and
(qqv.created> "2011-06-22" and qm.name= "useragent" and qm.value LIKE "%Firefox/5.%")
group by qq.id, day
order by count(distinct(qqv.id)) desc, day


3. I have a query for weekwise top 10 questions and how their votes are trending. By searching question id in the searchbox, you will see weekwise trend for that particular question


select t1.created, t1.id, week(q2.created) as week, count(distinct(q2.id)), t1.title
from
(select
count(distinct(qqv.id)) as votes, qq.title as title, qq.created as created, qq.id as id
from questions_questionvote qqv
join questions_question qq ON qqv.question_id = qq.id
where
(qqv.created> "2011-08-01")
group by qq.id
order by count(distinct(qqv.id)) desc
limit 10
) as t1
join questions_questionvote q2 ON q2.question_id = t1.id
group by t1.id, week
order by count(distinct(q2.id)) desc



4.I have a query that provides top threads for any issue which helps identify good things to link developers to.

select  qq.created, count(distinct(qqv.id)), qq.id, qq.title, qq.content 
from questions_questionvote qqv
JOIN questions_question qq ON qq.id= qqv.question_id
where
( qq.title like "%youtube%" or qq.content like "%youtube%")
and
 qqv.created > "2011-08-01"
group by qq.id
order by count(distinct(qqv.id)) desc


5.I have a query that lets you break down questions by browser version and OS version to provide more information on specific issues or just in general.

select tv.date, q2.value as os,  tv.version, tv.id2, tv.votes, tv.title
from questions_questionmetadata q2
join
(select qq.created as date,
IF(LOCATE("Firefox/", q1.value),SUBSTRING_INDEX(SUBSTR(q1.value,LOCATE("Firefox/",q1.value)+8)," ",1), tf.ff) as version,
 qq.id as id2,  qq.title as title, count(distinct(qqv.id)) as votes
from questions_questionvote qqv
join questions_question qq ON qqv.question_id=qq.id
join questions_questionmetadata q1 ON q1.question_id=qq.id
join
(select q3.value  as ff, q3.id as id3
from
questions_questionmetadata q3
where q3.name= "ff_version"
) as tf
ON (tf.id3= qq.id)
where qqv.created>"2011-08-01" and q1.name="ff_version"
group by qq.id
order by count(distinct(qqv.id)) desc
) as tv
ON (q2.question_id = tv.id2)
where q2.name="os"
limit 40000



6. I have a query to see how often certain articles are used in forum replies.

select qq.created,qq.id, count(distinct(qqv.id)), qq.title, qq.content, qa.content
from questions_question qq
JOIN questions_answer qa ON qq.id= qa.question_id
JOIN questions_questionvote qqv ON qqv.question_id= qq.id
where
(qq.created > "2011-06-21" and qq.created < "2011-06-28")
and
(qa.content LIKE "%[[%safe mode%]]%")
group by qq.id, qa.content
order by count(distinct(qqv.id)) desc