Statistical Analysis of Tribunal Cases
One of the acknowledged benefits of the Tribunal reform card system introduced to League of Legends a few weeks ago was the ability to look up any Reform Card, at any time, just by making up a plausible case number. I soon realized that, with an automated system, I could download large numbers of games and perform some analysis on them, trying to learn about the Tribunal and the game as a whole.
NOTE 1: Gathering these data did not involve hacking, cracking, or breaking any form of Riot's web security. All of the information is available on publicly accessible web pages, and automated downloading of those pages was in compliance with leagueoflegends.com's webcrawler guidelines as specified in robots.txt.
NOTE 2: These data are only meaningful so far as they represent an accurate sample of Tribunal cases, or of League of Legends games. I assume that every Tribunal case produces a publicly accessible Reform Card, and I have pretty good evidence that this is the case, but if I am wrong, my conclusions may be invalid.
I downloaded 9254 Reform Cards, randomly selected between case 5572301 and case 5637015. This represents about a week's worth of Reform Cards. My dataset is a sample which includes about 1/7th of the cases in that range, but since it is a random sample, it should be more than sufficient to draw conclusions about the entire block.
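For anyone who wants to check the "about 1/7th" figure, here is a minimal sketch. It assumes every case number in the range corresponds to exactly one case, and draw_case_numbers is a hypothetical helper showing one way to pick a uniform random sample, not necessarily how the original download was done.

```python
import random

# Case-number range and sample size as reported in the post.
FIRST_CASE = 5572301
LAST_CASE = 5637015
SAMPLE_SIZE = 9254


def sampling_fraction(first, last, n):
    """Fraction of the case-number block covered by n sampled cases."""
    block_size = last - first + 1  # the range is inclusive at both ends
    return n / block_size


def draw_case_numbers(first, last, n, seed=0):
    """Pick n distinct case numbers uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(first, last + 1), n))


fraction = sampling_fraction(FIRST_CASE, LAST_CASE, SAMPLE_SIZE)
print(f"block size: {LAST_CASE - FIRST_CASE + 1}")
print(f"sampling fraction: {fraction:.3f} (about 1/{1 / fraction:.1f})")
```

Sampling without replacement (random.sample) matters here: it guarantees no case is downloaded twice, so every card in the dataset is a distinct case.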
The dataset includes everything you can see when loading a reform card on the website, plus a number of other facts that your browser actually downloads every time but that do not display. For example, the k/d/a, gold, cs, and item builds of the enemy team are actually sent to your browser, and I have captured them.
The simplest question to ask is "How often does the Tribunal vote to punish?" And already, we find the first curious fact about Reform cards.
38.78% of cases are voted Pardon, and 61.22% Punish. However, 39.99% of cases show a punishment of None. In 1.21% of cases (112 cases in my dataset) a Punish vote is listed but the punishment assigned is None. These cases are intriguing, and we'll come back to them in a moment.
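To give a feel for how tightly a sample of this size pins down the punish rate, here is a rough sketch of the usual margin-of-error calculation. It assumes the sample is a simple random sample drawn without replacement and uses the normal approximation; the block size is taken from the case-number range quoted earlier.

```python
import math

N_SAMPLE = 9254    # Reform Cards downloaded (from the post)
N_BLOCK = 64715    # case numbers 5572301..5637015, inclusive
P_PUNISH = 0.6122  # share of sampled cases voted Punish (from the post)


def margin_of_error(p, n, population=None, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation. If population is given, the finite
    population correction for sampling without replacement is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


moe = margin_of_error(P_PUNISH, N_SAMPLE, N_BLOCK)
print(f"punish rate: {P_PUNISH:.2%} +/- {moe:.2%}")
```

Under these assumptions the margin comes out below one percentage point, which is why a 1-in-7 sample is plenty for headline percentages like this one.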
Next up is the question "How often are warnings, temp bans, and permabans issued?"
Limiting ourselves to the Punish cases, the data are as follows.
None (see above) - 1.35%
Warning - 47.34%
Time Ban - 47.53%
Permaban - 3.77%
Proportionally, very few cases result in a permanent ban. However, that number represents 198 cases captured, which extrapolates to around 1400 over the entire period.
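The extrapolation is just a scale-up by the inverse of the sampling fraction. The sketch below reproduces it and adds a rough range; the range assumes the sampled count behaves like a Poisson count, which is an assumption of this example and not something stated in the post.

```python
import math

SAMPLED_CASES = 9254
BLOCK_CASES = 5637015 - 5572301 + 1  # inclusive case-number range
PERMABANS_IN_SAMPLE = 198


def extrapolate(count, sample_size, population_size):
    """Scale a count seen in the sample up to the whole block."""
    return count * population_size / sample_size


def rough_range(count, sample_size, population_size, z=1.96):
    """Rough 95% range, ASSUMING the sampled count is roughly Poisson."""
    spread = z * math.sqrt(count)
    low = extrapolate(count - spread, sample_size, population_size)
    high = extrapolate(count + spread, sample_size, population_size)
    return low, high


estimate = extrapolate(PERMABANS_IN_SAMPLE, SAMPLED_CASES, BLOCK_CASES)
low, high = rough_range(PERMABANS_IN_SAMPLE, SAMPLED_CASES, BLOCK_CASES)
print(f"estimated permabans in the block: {estimate:.0f} "
      f"(roughly {low:.0f} to {high:.0f})")
```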
Next, how often is the decision controversial?
Controversial - 38.42%
Majority - 42.41%
Overwhelming - 19.17%
These numbers alone aren't very meaningful, because we don't know how Riot divides results up into these categories. Controversial may only contain cases that are within 1% of the decision break point, or it may contain cases within 5%, or 10%.
However, it is probably safe to assume that those bins are the same from case to case, so we can ask "Does the degree of agreement vary by the verdict of the case?"
Here, for the first time, I won't just post a bunch of numbers. Now we are asking proper statistical questions, and the answer comes in two parts. First, is there a difference, and second, is that difference meaningful?
Yes. Pardon cases are 12% more likely to have an overwhelming majority, and punish cases are 12% more likely to be controversial. There is no significant difference between the agreement rates for the various punishments.
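The post does not say which test was used to decide whether the difference is meaningful. One common choice for a verdict-by-agreement table is Pearson's chi-squared test, sketched below. The counts in the example table are HYPOTHETICAL and chosen only to show the mechanics; they are not taken from the dataset.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if verdict and agreement
            # level were independent of each other.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# HYPOTHETICAL counts, for illustration only (not from the dataset).
# Rows: Pardon, Punish. Columns: Controversial, Majority, Overwhelming.
example = [
    [110, 150, 100],
    [260, 270, 110],
]

stat, dof = chi_squared(example)
# 5.991 is the standard 5% critical value for 2 degrees of freedom.
CRITICAL_5PCT_DOF2 = 5.991
print(f"chi2 = {stat:.2f}, dof = {dof}, "
      f"significant at 5% = {stat > CRITICAL_5PCT_DOF2}")
```

A statistic above the critical value means the agreement profile differs between Pardon and Punish cases by more than chance alone would plausibly produce.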
One possible interpretation of this (and this is just speculation) is that about 2-3% of Tribunal cases are blatant errors, and Tribunal voters notice them and reject them. Another interpretation is that the break point for punish is well above 50%, so the "Overwhelming Majority" bin for Pardon simply covers more possible outcomes.
Remember those punish cases without punishments? What about them? It turns out, the agreement profile on those cases matches the Punish cases far closer than the Pardon cases, which means the verdict matches the votes. It's still not clear why there is no punishment, however.
Game and Report Totals
Next up, how does total number of reports and total number of games figure into things?
First, how many reports does the typical game have? The median case has 5, the average case has 5.48. The mode (most common value) is 4. Two cases in the sample have no reports (both are pardons); I can only assume that is due to a Tribunal bug. There are very few cases above 15 or so reports, but one enterprising person managed to collect 33.
As reports go from 1 to 7, the chance of punishment increases quickly from around 20% to around 80%, then climbs very slowly thereafter.
We can do a similar analysis to the number of games in a case. The mean is 2.73, the median is 3, and the mode is 2. Pardon chances range from 62% in single game cases down to 24% in 5 game cases.
Lastly, we can consider reports per game. Again, as reports per game goes up, so does the chance of punishment.
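The three breakdowns above (by reports, by games, by reports per game) are all the same operation: group the cases by some number and compute the share voted Punish in each group. A minimal sketch follows; the records and field names ("reports", "games", "verdict") are HYPOTHETICAL stand-ins, not the actual dataset layout.

```python
from collections import defaultdict


def punish_rate_by(cases, key):
    """Group cases by key(case) and return {group: share voted Punish}."""
    totals = defaultdict(int)
    punished = defaultdict(int)
    for case in cases:
        group = key(case)
        totals[group] += 1
        if case["verdict"] == "Punish":
            punished[group] += 1
    return {g: punished[g] / totals[g] for g in sorted(totals)}


# HYPOTHETICAL records, for illustration only.
cases = [
    {"reports": 1, "games": 1, "verdict": "Pardon"},
    {"reports": 1, "games": 1, "verdict": "Pardon"},
    {"reports": 4, "games": 2, "verdict": "Punish"},
    {"reports": 4, "games": 2, "verdict": "Pardon"},
    {"reports": 9, "games": 3, "verdict": "Punish"},
    {"reports": 10, "games": 5, "verdict": "Punish"},
]

by_reports = punish_rate_by(cases, key=lambda c: c["reports"])
by_games = punish_rate_by(cases, key=lambda c: c["games"])
# Reports per game, rounded to the nearest whole number to form groups.
by_reports_per_game = punish_rate_by(
    cases, key=lambda c: round(c["reports"] / c["games"])
)
print(by_reports)
print(by_games)
print(by_reports_per_game)
```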
Also, remember the mystery cases, that are a punish but no punishment? In these stats, they fall about halfway between the straight punishes and the pardons, which doesn't help explain them.
Unfortunately, there isn't a lot to say about the rest of the cases, because we don't know how these games are being chosen. The 1 report cases seem almost worthless to include in the Tribunal at all, though.
I will continue to do analysis and update this thread, probably once every few days.
Next, in Part 2, I will look at statistics derived from individual game records. Champion selection, k/d/a, win/loss, game length...
In Part 3, I plan to get even more ambitious and find out if there really are instapunish words in chat, and what they are. Also, I will possibly look at item builds.
In Part 4, if I can keep this up that long, I will look at things I haven't thought of yet, or take user requests, or something.
A note about Part 1: It turns out the mystery data are duplicate reports and due to a Tribunal bug. However, I chose not to remove them from the set, because I have no way of finding, or even confirming the existence of any comparable duplicate Pardon cases.
Part 2: Per-Game Data
In Part 1, I was looking exclusively at data gathered from the main page describing the case. Now, I will start looking at the individual games.
This analysis will be a little different from Part 1, because by looking at individual games, I can try to answer questions not just about the Tribunal, but about the game as a whole.
We'll start with a simple question: Who is the most popular champion? Remember, this covers all players in games which appear in a Tribunal case for one week, about 2.5 weeks ago.
The top 5 champions are:
Ashe - 3.59%
Teemo - 3.08%
Master Yi - 2.70%
Ezreal - 2.43%
Darius - 2.19%
The bottom 5 champions, on the other hand, are:
Gragas - 0.31%
Xerath - 0.29%
Viktor - 0.22%
Trundle - 0.17%
Karma - 0.17%
(Zyra) - 0.13%
Zyra isn't really that underplayed; she was not available for the entire period of data gathering, so her data aren't useful to compare.
Now, let's compare that with reported players (top 5, then bottom 5 plus Zyra):
Master Yi - 3.91%
Teemo - 3.30%
Ashe - 3.24%
Ezreal - 2.63%
Darius - 2.59%
Yorick - 0.28%
Xerath - 0.28%
Viktor - 0.23%
Karma - 0.17%
(Zyra) - 0.14%
Trundle - 0.11%
And punished players:
Master Yi - 3.85%
Teemo - 3.26%
Ashe - 3.15%
Darius - 2.59%
(bottom 5 are roughly the same as above)
So now comes the big question...which champion is punished most out of proportion to how often they are played or reported?
Zyra is excluded from all of the following tables, since she was new enough at the time of data gathering that the Tribunal would not have had time to build many cases for her. Remember, however, that every game has a reported champion in this set, so these numbers will all be much higher than the LoL population in general.
First, we'll compare how often a champion is reported to how often it is played.
Tryndamere - 15.8% report rate
Twitch - 15.6%
Master Yi - 15.1%
Evelynn - 14.9%
Mordekaiser - 13.9%
Ahri - 6.8%
Sona - 6.0%
Taric - 5.9%
Leona - 5.6%
Janna - 5.4%
So when seen, Tryndamere is the most likely to be reported, and Janna the least.
That covers reports, but once reported, who is most likely to be punished?
Irelia - 77.7% conviction rate
Udyr - 77.6%
Volibear - 75.9%
Trundle - 73.3%
Xerath - 73.2%
Maokai - 57.6%
Malphite - 56.8%
Skarner - 55.7%
Jayce - 54.2%
Leona - 52.6%
Irelia? Really? I would never have guessed that one. (Note that Trundle and Xerath have few enough cases that their numbers are likely not particularly stable from week to week)
Lastly, we can combine the two and directly compare plays to punishment verdicts (remember, 6.8% of all players in the sample were punished):
Evelynn - 10.8% punishment rate
Twitch - 10.4%
Master Yi - 9.7%
Tryndamere - 9.6%
Twisted Fate - 9.4%
Taric - 4.0%
Ahri - 3.9%
Sona - 3.7%
Janna - 3.3%
Leona - 2.9%
That is to say, Evelynn is the champion most punished by the Tribunal per time she is played, and Leona the least punished.
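The three tables hang together roughly as punishment rate = report rate x conviction rate. As a sanity check, the sketch below divides the third table by the first for the champions that appear in both, giving an implied conviction rate. The relationship is only approximate (the published figures are rounded, and the post does not say the tables were computed this way), so treat the output as a consistency check and not as new data.

```python
# Percentages copied from the tables in the post.
report_rate = {
    "Tryndamere": 15.8, "Twitch": 15.6, "Master Yi": 15.1, "Evelynn": 14.9,
    "Ahri": 6.8, "Sona": 6.0, "Taric": 5.9, "Leona": 5.6, "Janna": 5.4,
}
punishment_rate = {
    "Evelynn": 10.8, "Twitch": 10.4, "Master Yi": 9.7, "Tryndamere": 9.6,
    "Taric": 4.0, "Ahri": 3.9, "Sona": 3.7, "Janna": 3.3, "Leona": 2.9,
}


def implied_conviction_rate(champion):
    """Punishments per report, in percent, derived from the two tables."""
    return 100 * punishment_rate[champion] / report_rate[champion]


for name in sorted(report_rate, key=implied_conviction_rate, reverse=True):
    print(f"{name:<12} {implied_conviction_rate(name):5.1f}%")
```

Leona is the one champion here who also appears in the conviction table (52.6%), and her implied figure lands within about a point of it, which is what rounding alone would explain.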
I find it interesting that Ahri is very near the bottom of the first and third lists, which are otherwise filled with supports, but I don’t know what to make of that.
Game Outcome, Kills, Deaths, and Assists
What champion gets the most kills per game?
Twitch - 8.5
Darius - 8.5
Fizz - 8.3
Akali - 8.3
Katarina - 8.0
Leona - 2.8
Sona - 2.6
Taric - 2.0
Soraka - 1.8
Janna - 1.6
This supports the many “DARIUS=OP QQ” threads over in General Discussion, but I didn’t expect Twitch to top him by a few hundredths...
What champion dies the most?
Karthus - 7.3
Twitch - 7.2
Master Yi - 6.9
Xin Zhao - 6.9
(Zyra) - 6.9
Twisted Fate - 6.8
Malphite - 4.8
Taric - 4.7
Soraka - 4.6
Anivia - 4.6
Janna - 4.0
Apparently, egg really is that good at keeping Anivia alive. And maybe sacrifice Karthus is more popular than I thought.
And lastly, assists by champion.
Taric - 12.2
Soraka - 12.1
Sona - 12.0
Janna - 12.0
Alistar - 11.4
Darius - 6.1
LeBlanc - 6.1
Fiora - 6.0
Vayne - 5.6
Master Yi - 5.4
Supports at the top, burst-heavy carries at the bottom.
Winners vs. Losers
How different are the k/d/a results between winning teams and losing teams? (these numbers are per champion)
Winners get 7.2 kills/game, and losers 4.5 kills/game.
Winners die 4.5 times a game, and losers 7.3 times. (These are the mirror of the above, since every kill is a death, and nearly every death is a kill)
Winners get 10 assists/game, and losers get 6 assists/game.
Now, back to explicitly Tribunal related data, how different is the reported player from his team? It’s clear that these numbers need to be normalized by outcome, so I’ll consider losing reported players and winning reported players separately.
Remember, these stats only cover the team with the reported player on it, so they won’t quite match up to the numbers above.
In losing games, the reported player’s allies averaged 4.4 kills, and the reported player only averaged 3.3 kills. Allies averaged 7 deaths, the reported player 7.6 deaths. Allies averaged 5.8 assists, and the reported player averaged 5 assists.
In winning games, the numbers aren’t so clear. Allies averaged 7.5 kills, while the reported player averaged 8.1 kills. Deaths were 5.1 for allies, 5.8 for the reported player. Assists were 10.7 for allies, 9.5 for the reported player.
Limiting the numbers to only punished players, and not just reported ones, doesn’t result in too many changes. For the losing team, an offender averages 3.9 kills, 8.6 deaths, and 5 assists. On the winning team, the numbers are 8/6.1/9.5. Worse play seems to lead to a slightly higher conviction rate (easily accounted for by blatant intentional feeding cases), except that convicted losing players have a higher kill count than reported losers overall.
There is a lot more I could do in this section, normalizing reported player kills by champion, deriving a game impact single value that accounts for kills, deaths, and assists, or looking at gold or other numbers. But I’ve already gone pretty long for one section, and I don’t want to burn out on the boring number crunching before Part 3, where I start looking at chat logs. So that’s it for this section. If there’s anything I didn’t cover that you are burning to know, ask in the thread and I’ll see if I can get it into part 4.
Part 3: The Chat Log
The main tool the Tribunal voter is given to judge a case is the complete chat log of the game. This is the most subjective and difficult place to make a judgement. Who is trolling? How much of an insult is too much?
I have decided to try to see if I can spot trends in Tribunal behavior by looking at certain categories of words. First, however, a big caveat. This analysis required me to pick a few representative words for each behavior. Typos, misspellings, words I didn’t think of, words used in a different context, and many other things may throw these conclusions completely off.
Second, I’m looking at one game at a time here, instead of one case at a time. That means that high game count cases will skew the data from the numbers in part one.
So, let’s begin.
The first word I want to talk about is the most commonly discussed instapunish word, n****r. And indeed, it is pretty much an instapunish word.
0 usages - 65% Punish
1 usage - 87% Punish
2 usages - 94% Punish
3 usages - 93% Punish
4+ usages - 100% Punish
As for the overall prevalence of the word, 0.52% of players overall in the database use it at least once, and 2.11% of reported players use it.
The Other N-word
This one I can type in without self-censoring. Noob. The more it is said, the more likely a punishment is, but not so dramatically as above.
0 usages - 62% Punish
1-8 usages - ~75% Punish
9+ usages - ~85% Punish
As for the overall prevalence of this word, 11.9% of players overall in the database use it at least once, and 31.3% of reported players use it.
Now it gets a little fuzzier, because there are too many swear words to look for comprehensively. I made two categories (bad swears and not-so-bad swears), and the punishment curves for both of them look about like noob does.
Game Skill Insults
I put this one in because it bothers me a fair amount. These are words that specifically mean that the other person shouldn’t be playing the game. “l2p” and “uninstall” are what the algorithm searches for. The progression isn’t nearly as clean as for the other categories above, but conviction does go up as these words are used more.
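All of these word categories work the same way: count how often a player uses any word from a short list, then look at the punish rate for each usage count. Here is a minimal sketch of that idea. The two skill-insult words come from the post; the tokenizing rule and the sample games are HYPOTHETICAL, and the actual matching code was not published.

```python
import re
from collections import Counter

# Word list from the post.
SKILL_INSULTS = {"l2p", "uninstall"}


def count_usages(chat_lines, words):
    """Count tokens in chat_lines that match any listed word (case-insensitive)."""
    count = 0
    for line in chat_lines:
        tokens = re.findall(r"[a-z0-9]+", line.lower())
        count += sum(1 for token in tokens if token in words)
    return count


def punish_rate_by_usage(games, words):
    """Map usage count -> share of games whose case was voted Punish."""
    totals, punished = Counter(), Counter()
    for chat_lines, was_punished in games:
        n = count_usages(chat_lines, words)
        totals[n] += 1
        punished[n] += was_punished  # True counts as 1, False as 0
    return {n: punished[n] / totals[n] for n in sorted(totals)}


# HYPOTHETICAL (chat lines, punished?) pairs, for illustration only.
games = [
    (["gg wp", "mia top"], False),
    (["l2p noob", "just uninstall"], True),
    (["L2P!!"], True),
    (["l2p"], False),
]

rates = punish_rate_by_usage(games, SKILL_INSULTS)
print(rates)
```

Exact token matching like this is what makes the caveat about typos and misspellings bite: "l2play" or "unistall" would slip straight past it.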
These are non-swear word insults; things like “idiot”, “moron”, and so forth. For this case, the curve is almost exactly the same as the noob one again.
Lest you think I am obsessed with bad language, I am also looking for polite words, like “please”, “thanks”, and “gj”. And here is where things get interesting. Being polite correlates to a small but measurable increase in conviction chances. This could be from people who are being insulting but using polite words, of course, but it’s food for thought.
0 usages - 63% Punish
1+ usages - 67% Punish
Maybe it’s not what you say, but how much you say? Look at the number of lines of chat in a game, and things get really interesting.
0 lines - 52% Punish (this represents only 2.4% of games, though. Most people talk.)
1-10 lines - 50% Punish
11-50 lines - 63% Punish
50+ lines - 72% Punish
So, there is at least a little to be said for the idea that if you don’t want to be reported, don’t talk. Of course, we’re only looking at reported people here, so it’s approaching the situation backwards. Plenty of people talk all the time and are very rarely reported; they just aren’t in the Tribunal.
Other Chat Facts
Another thing I’ve tried is to identify the language being spoken by a player. This is a lot harder than it sounds. Ultimately, rather than try anything clever or complicated, I decided to look for a few simple words in both English and in Portuguese/Spanish (difficult to distinguish without proper diacritics), and accept that I would be missing a lot.
My results are as follows.
44.8% of the talking population of LoL is recognizably speaking English.
0.9% of the talking population of LoL is recognizably speaking Portuguese or Spanish.
Obviously, this isn’t catching everyone in either case (either because they are just saying mia, re, etc, or because they are speaking recognizable English/Portuguese and I’m not detecting it).
Assuming that in actuality, those should total 100%, the adjusted numbers are 98.1% English and 1.9% Portuguese.
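The keyword approach described above can be sketched in a few lines. The marker words here are HYPOTHETICAL (the post does not list the words actually used), and the tie-breaking rule is an assumption of this example. The normalize function shows the adjustment step: rescale the two recognized shares so they total 100%.

```python
import re

# HYPOTHETICAL marker words; the post does not list the real ones.
ENGLISH_MARKERS = {"the", "you", "and", "what", "why"}
IBERIAN_MARKERS = {"que", "voce", "para", "porque", "nao"}


def detect_language(chat_lines):
    """Return 'english', 'portuguese/spanish', or None if unrecognized."""
    tokens = set()
    for line in chat_lines:
        tokens.update(re.findall(r"[a-z]+", line.lower()))
    english_hits = len(tokens & ENGLISH_MARKERS)
    iberian_hits = len(tokens & IBERIAN_MARKERS)
    if english_hits == iberian_hits:
        return None  # nothing recognized, or an ambiguous tie
    return "english" if english_hits > iberian_hits else "portuguese/spanish"


def normalize(shares):
    """Rescale shares so they total 100%."""
    total = sum(shares.values())
    return {name: 100 * value / total for name, value in shares.items()}


# Recognized shares as reported in the post.
adjusted = normalize({"english": 44.8, "portuguese/spanish": 0.9})
print({name: round(value, 1) for name, value in adjusted.items()})
```

Run on the rounded inputs, the adjusted shares differ from the quoted 98.1% and 1.9% by about a tenth of a point, which is what you would expect if the original calculation used unrounded figures.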
Three months ago, I would have said that number was obviously wrong, but I can’t remember being matchmade with Portuguese speakers recently. Maybe the beta Brazilian server is attracting them away?
Can looking at chat habits predict who wins a match? If a player sends more than 50 lines of chat in a match, their winning percentage is only 40% (the average is, for obvious reasons, 50%).
What about game related chat, though? I built an “on topic words” filter like the ones above, but for “mia”, “top”, “bot”, “baron”, etc. Usage of on-topic language does not influence winning percentage in any meaningful way, however.
I know I said I would try to add item analysis here, but the coding necessary looks complicated and while I find this project entertaining, running these numbers over and over is getting a bit boring. So I’m going to call this section here, and spot check user requests or other random tidbits I’ve thought of for part 4 later on.
Part 4: Random stuff
In this section, I'll answer analysis questions, present a few miscellaneous findings, and muse about what I really figured out.
However, (and this one is a bit surprising) there is no viable correlation between the ratio of reports from allies to enemies and the chances of a punishment. Each extreme has a higher punishment rate than the middle, but that’s simply because in order to have a 10:1 ratio you need 11 reports total.
8.1% of reported players used an antigay slur. One of these managed a mere 107 in a single game, and was indeed banned for his efforts.
As for the effects on punishment rates, any usage at all halves the pardon rate from 36% to 18%. More than 5 usages drops the pardon rate into the single digits, but there are still occasional pardons at any degree of usage (including one pardon with 13 usages).
As for the word gay itself, it is, rather surprisingly, used less often than its more offensive counterparts (at least when those are combined). The total prevalence is 2.2%, and usage among reported players is at 4.8%.
The punishment profile for gay itself actually looks rather similar to the other one, starting at 36% pardon and dropping to single digits around 5 usages/game.
I was expecting to see more usage of gay compared to harsher insults, and less punishment for it. I'm not sure what that means.
Not all report types are equally used. This is the breakdown.
27.7% Intentionally feeding
25.8% Offensive Language
21.3% Verbal Abuse
14.1% Negative Attitude
7.8% Assisting Enemy Team
1.5% Inappropriate Name
A small fraction of cases have no report type specified.
Remember, this is the proportion of most common report types, not individual report types. Refusal to communicate does not appear at all, meaning it is very uncommon.
Most of these types are punished at roughly the same rate as cases as a whole. Only one shows real statistical significance: Inappropriate Name cases, which are punished more often (at a 69% conviction rate).
Report type has no correlation to the nature of the punishment (warning, ban, or permaban).
I wrote code to analyze item builds, but it was complicated and non-intuitive in output, and I didn't know how to analyze it.
I never managed to extract summoner spells in a useful fashion.
14.1% of players are using a skin. Reported players are slightly more likely to have a skin than the population at large (15.2%). The use of a skin does not affect conviction rate, though permabanned players only have skins 13.1% of the time (possibly indicating that a non-trivial number of permabanned players are on second or third accounts and have learned not to spend real money on the game).
Fizz, Lee Sin, Cassiopeia, Evelynn, and Nocturne players talk the most (measured by lines), and Heimerdinger, Nasus, Caitlyn, Kog’Maw, and Sona players talk the least. Trundle players are the most erudite, averaging more than 3 words/line (there are so few Trundles, however, that this could be an aberration), followed by Nautilus and Evelynn. Kassadin, Tryndamere, and Wukong have the shortest lines at around 2.75 words per line.
1.) The tribunal handles a LOT of cases.
My estimate is that the Tribunal handles sixty thousand cases a week. If the average Tribunal voter judges 40 cases a week (an aggressive estimate), and a case needs 20 votes to close (a conservative estimate), there must be 30,000 Tribunal judges.
To manage this workload, it would take roughly 400 full-time Riot employees. I figure an employee would average 4 minutes a case, and would be able to spend 30 hours a week actually reviewing cases and not in meetings, breaks, training, review, etc, and that each case would go before 3 employees.
Even just manually reviewing the 1400 permabans a week with these estimates takes about 10 people.
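The workload estimates above are back-of-envelope arithmetic, and they are easy to rerun with different guesses. The sketch below uses exactly the figures from the post; every constant is an estimate, so swap in your own if you disagree with them.

```python
# All figures are the estimates given in the post.
CASES_PER_WEEK = 60_000
VOTES_PER_CASE = 20
CASES_PER_VOTER_PER_WEEK = 40
REVIEWS_PER_CASE = 3
MINUTES_PER_REVIEW = 4
REVIEW_HOURS_PER_WEEK = 30
PERMABANS_PER_WEEK = 1_400


def voters_needed(cases, votes_per_case, cases_per_voter):
    """Voters required to supply every vote cast in a week."""
    return cases * votes_per_case / cases_per_voter


def staff_needed(cases, reviews_per_case, minutes_per_review, hours_per_week):
    """Full-time reviewers required to cover the same cases in-house."""
    total_hours = cases * reviews_per_case * minutes_per_review / 60
    return total_hours / hours_per_week


voters = voters_needed(CASES_PER_WEEK, VOTES_PER_CASE, CASES_PER_VOTER_PER_WEEK)
staff_all = staff_needed(CASES_PER_WEEK, REVIEWS_PER_CASE,
                         MINUTES_PER_REVIEW, REVIEW_HOURS_PER_WEEK)
staff_permabans = staff_needed(PERMABANS_PER_WEEK, REVIEWS_PER_CASE,
                               MINUTES_PER_REVIEW, REVIEW_HOURS_PER_WEEK)

print(f"Tribunal voters needed: {voters:,.0f}")
print(f"employees for all cases: {staff_all:,.0f}")
print(f"employees for permabans only: {staff_permabans:.1f}")
```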
2.) Autopunishing is not a problem in the Tribunal.
Tribunal cases are reasonably proportioned across the spectrum of overwhelming majority pardon to overwhelming majority punish. This means that either Riot has picked a punish threshold that accounts for and effectively ignores autopunishers, or there are in fact very few autopunishers (or Riot has an undiscussed vote weighting algorithm that compensates for them, or some other unknown solution.)
3.) By deciding how many reports and incidents to collect before generating a case, Riot controls the punish rate.
The difference in punish rates between cases with very few reports or incidents and many is stronger than nearly any other describing factor. But these factors aren’t directly within the control of the reported player; they are under Riot’s control. Rather than making a 1 report Tribunal case that has an 80% Pardon rate, Riot could wait for more reports before generating it, or decide not to submit it at all. Sometimes they do, and sometimes they don’t. I don’t know why.
4.) Punishable behaviors are non-orthogonal.
It is easy to find explanations for why a player was punished. If you examine deaths, you’ll find that players are punished more often for more deaths. If you examine kills, you’ll find players are punished less often for more kills. This might lead you to conclude that gameplay, not chat, was the driving force behind the Tribunal.
But, if you start with the chat, you’d find just as strong an explanation there. The answer is that most punishable behaviors come hand in hand, due to the Tribunal’s agglomerative case nature. A punished player may have one game in which they fed but didn’t insult, and another in which they insulted but didn’t feed. It’s impossible to tell which game was the justification for punishment.
The Tribunal-report infrastructure is a complicated system, and despite having access to reams of data from it, most of the deeper questions can't really be answered without seeing report rates and understanding more about how the cases are built.
What I have been able to learn has not given me any reason to doubt the Tribunal's effectiveness. Autopunishers do not dominate the voting results, and most hypotheses that I tested and wanted to be true (racism gets punished, etc) turned out to be true.
The dominance of report total in predicting verdict bothers me, but without understanding more about how the decision is made, I can't claim that it makes the Tribunal worse.
so what is your point?
IIRC Lyte said that the ones that say both "Verdict: Punish" and "Punishment: None" were bugs that were in fact punished in some way. Let me see if I can dig up his post quickly.
EDIT: Here we go. My bad, they were actually duplicate bugs, rather than discrete cases. You should scrub them from your data, most likely - they're weighting the analysis incorrectly.
I'll be honest....I require part 2 ASAP.
And can you provide any PROOF as to the sources of your data? Not that I don't believe you per se, but you're gonna start getting a lot of naysayers in this thread real quick.
People on the forums refuse to acknowledge any problems with the Tribunal; even if Lyte himself says "we're working on improving feature x" etc etc.
I'm not really sure how I can prove either of those. For the first case, I can...post some links to some of the more extreme cases, which I would be unlikely to have found manually? That's the best I can think of. The raw data is 1.7 GB, so it's not like I can post it anywhere.
I claimed that I found 112 cases with a punishment verdict but no punishment. Here are the case numbers. You can check them yourselves. I don't know how I could have gotten them manually. Actually, in looking at them, I see that they are all from a similar range in my data by sequence (though not contiguous), which supports the theory that they are just due to a temporary bug. Anyway, sorry about the numbers...
5606497 5606509 5606556 5606582 5606599 5606600 5606613 5606641 5606663
For the second case, I have no idea, other than showing every step of my work and all my data, to convince people that I've done what I've done. But, as I've already said, it's far too much data to share.
Does that satisfy you?
I'd speculate that cases at the very edge of the punish/pardon limit are flagged for audit.
edit- decided to poke into it, and in these cases there's some... pretty obvious intentional feeding going on. Another theory could be that the cases are marked no punishment due to concurrent punishment....
edit the second- those cases have a high incidence of duplication. Likely database error and, as such, duplicates don't get a punishment.