- Scribe: Dennis
- Chair: Raul
Labeling for the new reports (Ksenia)
We could add a labeling mechanism to make triage a bit easier and hide issues that get an invalid or nsfw label. These could be stored in a separate table in BQ as a relation <report_id> -> <label_id>. Alternatively, we could introduce a "local" db so that labels can be edited, as the current system is read-only.
1. ML labeling We could use web-bugs data and would need to create a new model in bugbug that only takes the description into account (instead of the full body). The current model considers issues that were ever moved to needsdiagnosis to be the "positive" class. Perhaps this model needs a different approach.
Some obvious invalid candidates (mention of virus, scam, spam, etc.): https://github.com/webcompat/web-bugs-private/issues?q=is%3Aissue+virus+is%3Aclosed+milestone%3Aml-autoclosed+
2. Nsfw labeling There is a db with nsfw domains on webcompat.com based on https://oisd.nl/, so perhaps we could use a similar approach. We could also add the ability to mark a report as nsfw in the UI: once its domain is added to the list, the report gets hidden.
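As a rough illustration of the domain-list idea in item 2, here is a minimal Python sketch. It assumes the nsfw list is available as a plain set of domain names and that a report counts as nsfw if its host or any parent domain is on that list. The function names `is_nsfw` and `mark_nsfw` and the example domain are hypothetical, not part of webcompat.com.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Extract the lowercased hostname from a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower().rstrip(".")


def is_nsfw(url: str, blocklist: set[str]) -> bool:
    """Return True if the URL's host or any parent domain is on the blocklist."""
    parts = _host(url).split(".")
    # For "a.b.c" this checks "a.b.c", then "b.c", then "c".
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


def mark_nsfw(url: str, blocklist: set[str]) -> None:
    """Hypothetical UI action: add the report's host to the blocklist."""
    host = _host(url)
    if host:
        blocklist.add(host)


# Hypothetical blocklist entry, for illustration only.
blocklist = {"example-adult.test"}
flagged = is_nsfw("https://www.example-adult.test/page", blocklist)
```

Matching on parent domains means a single list entry also covers subdomains. Whether that is the desired behavior would depend on how the existing list is structured.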
- Dennis: I like the idea, but I don't think we need our own DB for that. As we're not going to operate on/query the full telemetry pings every time, we'll have some ETL script that reads from telemetry and stores it in a simplified format (which is the report only) anyway. We should be able to label the reports during that stage.
- Ksenia: This is also about James' idea from last week where we add some data to the bugs to further improve our models.
- Dennis: Yes, for supervised learning we need to store state. We can still do this in BQ, since the data should be nearly append-only. We can create a new table to link the reports to the labels. We might need to change this approach if we need to delete labels a lot. I'm not sure whether a delete rewrites the entire table.
- James: I assume that deletes also rewrite the entire table, as otherwise updates would be cheap. There are two different cases: one is if we label issues as "diagnosable", where Bob marks an issue as diagnosable, but we later figure out it isn't, so we need to mutate the labeling there. There will probably also be cases where we need to remove labels. It could still be append-only, where instead of deleting a label, you add a tombstone for that label. Of course, that adds costs to querying, but if we're not mutating too frequently, that might be worth it. It also pushes a lot of complexity into the application layer.
- Ksenia: Yeah, for the model to learn, it would be good to tag false-positives, and it seems like BigQuery isn't really designed for editing.
- Dennis: BigQuery is good at big queries. It's not really designed for mutating data, but it's good for querying large amounts of data. We should look into getting some persistent mutable storage, e.g. cloud storage, for these use cases that require mutation. The second possibility is that we ignore the problem of rewrites, because the cost of the overwrites might be small at the scale we're operating at. We might need to talk to the data team again.
- James: There's a very decent chance that we're worrying too much.
- Dennis: CloudSQL comes with fixed overheads.
- Ksenia: Another use case for editing could be tagging/labeling duplicates.
- James: For duplicates, that's an interesting problem. If you get enough reports, you can read the descriptions and decide if the issue fits into an existing cluster. For those without a description, that's not possible, but for those with comments, we could have a flag that says "this is likely to be something we don't know".
- Honza: I have a question about the labeling: we'd be querying the Telemetry data and copying the data into our own table, right? And if we'd do labeling, that'd be stored in another table.
- Dennis: BigQuery billing is based on the amount of data read, not the amount returned. So we use the ETL script to transform the input data into smaller tables that only contain the interesting parts. Another downside of BQ is that you can't change existing records; it requires rewriting the entire table. So we'll extract into our own table and have a new table just for labeling. That adds some complexity to querying, but it's necessary given the storage model. Generally, updates are expensive, and although we will have relatively little data, we still need to be careful.
- James: The table for labels can have more than one row for each report, for multiple labels for example. But the number of bytes is going to be comparable. We're either in a space where it's fine to re-write the table or not, and most likely, the table is going to be small.
- ACTION: Dennis to ask the Data Team about this.
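For reference, the append-only approach with tombstones that James describes could look roughly like the Python sketch below. It is an in-memory illustration only, not BigQuery code. The `LabelEvent` record, its `seq` ordering field, and the `current_labels` helper are hypothetical names introduced here, and it assumes each label row carries some monotonically increasing sequence number or timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelEvent:
    """One appended row in a hypothetical report -> label table."""
    report_id: str
    label: str
    seq: int                  # assumed insertion order (e.g. a timestamp)
    tombstone: bool = False   # True means "this label was removed"


def current_labels(events: list[LabelEvent]) -> dict[str, set[str]]:
    """Resolve the effective labels per report from an append-only log."""
    latest: dict[tuple[str, str], LabelEvent] = {}
    # Later events override earlier ones for the same (report, label) pair.
    for ev in sorted(events, key=lambda e: e.seq):
        latest[(ev.report_id, ev.label)] = ev
    result: dict[str, set[str]] = {}
    for (report_id, label), ev in latest.items():
        if not ev.tombstone:
            result.setdefault(report_id, set()).add(label)
    return result


log = [
    LabelEvent("r1", "diagnosable", seq=1),
    LabelEvent("r1", "nsfw", seq=2),
    # Instead of deleting the first row, append a tombstone for it.
    LabelEvent("r1", "diagnosable", seq=3, tombstone=True),
    LabelEvent("r2", "invalid", seq=4),
]
labels = current_labels(log)
```

Nothing is ever updated or deleted in the log. The cost is that every read has to resolve the latest state per (report, label) pair, which is the query-side complexity mentioned in the discussion.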