I'm a ramblin' man #announcement blog
This'll probably be my last Shy Says post here. After this, they'll have their own separate blog hosted on the site.
Also, I typed this up more or less stream-of-consciousness, and afterward I put [aside] tags around the parts where I strayed too far from my original topic. I haven't done anything resembling proofreading on any of it, but I'm letting it stand as it is for now.
Something I really want to do, but don't know how to even figure out where to start, is making public all of the statistics that I've collected from what the comment filter evolved into. Funny how The Frog thought he 'won' because he made me waste my time making that filter. In reality, I love statistics, especially corpus statistics, and I obviously love programming, so it should go without saying that I legitimately enjoyed making the filter, so much so that it's morphed into something well beyond its original purpose. It now analyzes comments in other ways besides detecting Le Frog and other trolls, and it even does the same kind of analysis on quotes too. Unfortunately, it ain't easy to put all of the data and statistics together in a user-friendly form that you guys can read, browse, manipulate, explore, and have fun with.
That's not to mention the problem of organizing the code and getting it into a form that can even run on the current host. For the curious, the filter started its life as a well-organized and structurally coherent set of three VB module files and one C# class file. The C# class was later translated into VB when the quote-comment page was, since it's closely tied to it. It originally intercepted comments with unusually high, low, or average troll scores and sent a copy of them to my FSTDT email so I could add them to the corpus of training material. It would then discard comments and apply a progressive IP ban if they had a high score, or return them to the quote-comment page code to be posted as normal if they had a low or average score.
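For the even more curious, the original flow boiled down to something like the sketch below. This is a hedged Python rendition, not the actual VB; the function, the thresholds, the mailer and ip_bans objects, and the address are all made up for illustration.

    # Hypothetical sketch of the original filter's dispatch logic; the
    # real code was VB/C#, and every name and threshold here is invented.
    HIGH = 0.9   # assumed score above which a comment gets discarded and banned
    LOW  = 0.1   # assumed score below which a comment is clearly fine

    def handle_comment(comment, troll_score, mailer, ip_bans):
        # Unusually high, low, or dead-average scores made good training
        # material, so a copy went off to the corpus mailbox.
        if troll_score >= HIGH or troll_score <= LOW or abs(troll_score - 0.5) < 0.05:
            mailer.send("corpus@example.com", comment.text)  # hypothetical address
        if troll_score >= HIGH:
            ip_bans.escalate(comment.ip)  # progressive IP ban
            return None                   # discarded: never reaches the page
        return comment                    # low or average: posted as normal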
Now it's become a disorganized, ad hoc set of seven VB modules, three Object Pascal* unit files (compiled with either Delphi or FreePascal, depending on the operating system I'm compiling on), one Ada file mostly written by my BFF / modly minion Mikey, and two JavaScript files that I wrote to run in Node.js when I was testing out how well it worked with databases (the verdict on that: so-so). Code in three of those languages won't even run on the current host, though I could probably get the Object Pascal to compile with Delphi.NET after some adjustments. The code only started branching away from things like "written in one language" and "organized structure" after I started analyzing comments for fun and the whole thing was taking on a life of its own. Around that time, it also somehow began to take on a secondary role as a personal playground for experimenting with programming languages.
[aside]*I'm using Object Pascal here in the sense that a lot of Pascal-dabblers nowadays use it, i.e. to describe a modern and quasi-standard dialect or "style" of writing Pascal code that can be compiled by both Delphi and FreePascal at least, and possibly by other Pascal compilers (e.g. GNU Pascal) if you adhere to a stricter subset of "Object Pascal" in this sense. Confusingly, Object Pascal is also originally what Borland called the last couple of versions of its Pascal compiler for DOS (whose very last version also apparently had a hilariously bad, half-assed Windows port). Indeed, this old-school Object Pascal is essentially Turbo Pascal with object orientation (or a ridiculous attempt thereat in the case of the Windows port). Aside from their core syntax and lexicon, that Object Pascal and "Object Pascal" in the sense used here are dramatically different. Most code more complex than "Hello world" written in the latter is not compatible with the original Object Pascal in any useful sense unless it was intentionally written to be. And that's your programming-language history lesson for today.[/aside]
In addition to being written in as many languages as your average Dutchman can speak, another fairly major hurdle to making this little pet project public is that both the quotes and comments I've fed it to analyze (originally, to "train" it) and the interesting parts of the resulting analyses are stored on my private server in an SQLite database. Our current host Does Not Allow Using SQLite, despite their terms of use saying nothing to that effect or even suggestive of it. Apparently SQLite is Too Forbidden to even put in writing. TIL SQLite is Lord Voldemort.
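To give a rough idea of the shape of the thing (emphasis on rough: these table and column names are purely illustrative, not the real schema):

    # Purely illustrative Python/SQLite sketch of the private database's
    # shape; the actual schema differs.
    import sqlite3

    con = sqlite3.connect("fstdt_corpus.db")  # hypothetical filename
    con.executescript("""
    CREATE TABLE IF NOT EXISTS documents (
        id    INTEGER PRIMARY KEY,
        kind  TEXT CHECK (kind IN ('quote', 'comment')),
        label TEXT,     -- 'good', 'bad', or NULL for unlabeled
        body  TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS word_counts (
        word  TEXT,
        label TEXT,
        n     INTEGER,  -- occurrences of word under that label
        PRIMARY KEY (word, label)
    );
    """)
    con.close()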
Only the word and word co-occurrence probability data needed for the filter to run was stored and updated here on FSTDT, in a secondary SQL Server database. (It still is, but it hasn't been updated in a while and isn't currently being used.) This probability data is basically just how likely (or unlikely) certain words and co-occurrences of words are to appear in 'good' and 'bad' comments. These probabilities are the product of other statistical and meta-statistical analyses stored in the external SQLite database.
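In rough terms, each stored figure is an estimate of something like P(word | 'bad' comment), derived from the raw corpus counts. Here's a minimal sketch of one way to compute such an estimate, with Laplace smoothing; the filter's actual formula is more involved:

    # Hedged sketch: deriving a per-word probability from raw corpus
    # counts. The real filter's smoothing and co-occurrence handling
    # are fancier than this.
    def word_prob(count_in_bad, total_bad_words, vocab_size):
        # Laplace-smoothed estimate of P(word | 'bad' comment), so that
        # words unseen in the corpus don't get a probability of exactly zero.
        return (count_in_bad + 1) / (total_bad_words + vocab_size)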
Getting the probability data used by the filter requires collecting statistics about a corpus of training data; the larger the corpus, the better. Those statistics are stored so they can be subjected to statistical analysis of their own, and then those statistics are subjected to further statistical analysis. That's two layers of meta-statistics. I originally decided against storing this latter data on the FSTDT server because I vastly overestimated the amount of space that would be required to store statistics about statistics pertaining to statistics of statistics about tens of thousands of words and the frequency with which they co-occur with other words. (How could you not??) Yo dawg, I heard you liked statistics, so I gave you some statistics collected from your statistics about your statistics...
We call this madness naive Bayesian filtering. To make a "non-naive" Bayesian filter, you must venture even further down this meta-statistical rabbit hole.
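For the uninitiated: the 'naive' part means the filter pretends every word in a comment is independent of every other and just multiplies the per-word probabilities together. A bare-bones sketch of the scoring step, assuming you already have per-word probability tables (p_word_bad and p_word_good here are hypothetical dicts mapping words to probabilities):

    import math

    # Minimal naive Bayes troll-scorer; illustrative only, not the
    # filter's actual code. Works in log space to avoid underflow.
    def troll_score(words, p_word_bad, p_word_good, prior_bad=0.5):
        log_bad  = math.log(prior_bad)
        log_good = math.log(1.0 - prior_bad)
        for w in words:
            log_bad  += math.log(p_word_bad.get(w, 1e-6))
            log_good += math.log(p_word_good.get(w, 1e-6))
        # Normalize back to a probability that the comment is 'bad'.
        m = max(log_bad, log_good)
        bad, good = math.exp(log_bad - m), math.exp(log_good - m)
        return bad / (bad + good)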
TL;DR: Bayes' Theorem is postmodern statistics.
Anyway, I think the corpus in the training database is a good, representative cross-section of the actual FSTDT database. There's tons of cool, fascinating, and just plain weird stuff to glean from it, like the fact that the fundie index of a post and the number of times the word 'when' appears in it seem to be directly correlated. Why?? Anyone wanna hazard a guess? Quotes also have way more hapax legomena (words that occur only once in a corpus) than comments do, but I don't find that nearly as interesting, because there are already a couple of very likely explanations. One, certain fundies absolutely love to invent "words" like abortuarydeathscortagandistism.* Two, a whooole lot of fundies just can't spell. Perhaps they try to hide that fact with word puree like homocommunofascofemininazis?**
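(Counting hapax legomena is about the simplest corpus statistic there is. A quick sketch, with a deliberately crude tokenizer:)

    import re
    from collections import Counter

    # Quick sketch: find the hapax legomena (words occurring exactly
    # once) in a corpus. The tokenizer is deliberately crude.
    def hapax_legomena(texts):
        counts = Counter(w for t in texts
                           for w in re.findall(r"[a-z']+", t.lower()))
        return sorted(w for w, n in counts.items() if n == 1)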
[aside]*Protip: English is a mostly isolating language, so we Anglophones generally prefer to create names for novel concepts by compounding existing words together into phrases instead of creating new words by adding prefixes and suffixes to other words or word roots. For example, to describe the practice of treating medical problems with things that actually exist in reality (as opposed to quackpot woo-woo), we coined the term evidence-based medicine instead of inventing a completely new word like vercomadhealancy (ver-com-ad-heal-anc(e)-cy, lit. "truth~reality | with~using | from~by | heal | having the quality of | the activity or state of"), no matter how much cooler and more phonoaesthetic vercomadhealancy sounds.
If that gloss is incoherent to you, read it backwards: "the activity or state of | having the quality of | heal(ing) | by | using | reality." In English, the order that a word's morphemes follow is generally a mirror image of the order that words usually follow in clauses and sentences. In linguistic parlance, English clauses and words branch in opposite directions: clauses are head-initial (they branch to the right of their head), while words are head-final (they branch to the left of their head). Oh, and only one morpheme of that word, heal, can stand alone, while all three words of evidence-based medicine can. And that's your linguistics lesson of the day. And this is exactly why I love corpus statistics: it fuses my three favorite subjects: linguistics, mathematics, and, in the modern age, computers.[/aside]
[aside]**True fact: I'm also a terrible speller, but I actually heed the little red underlines that tend to pop up a lot in the things I write.[/aside]