User:ChenzwBot

From Simple English Wikipedia, the free encyclopedia
This user is a bot run by Chenzw (talk).

It is used for helping to make many small changes that would take a long time for a person to do alone.
Administrators: if this bot isn't working right or doing bad things, please stop it.


ChenzwBot This user is a Bot
ChenzwBot (Talk · Contribs)
ChenzwBot patrols the sea of recent changes.
Operator: Chenzw
Flagged? Yes (11 April 2008)
Edit rate: Variable (Anti-vandalism)
Edit period/s: Always
Automatic or manual? Automatic
Programming language/s: PHP and Python
Source code published? https://gitlab.com/antivandalbot-ng (partial)
Emergency shutoff-compliant? No

This bot reverts vandalism on the wiki. It is a rewrite of the code used by User:GoblinBot4 (and User:Chris G Bot), and is powered by the revscoring library.

History

  • Since 2010: Anti-vandalism task begins, using Chris G Bot's code. Vandalism is detected by evaluating edits against regular expressions. This approach is prone to low detection rates and high false positive rates.
  • Approximately Dec 2015: Bot begins using the revscoring library (which powers ORES) to extract features (e.g. the number of characters added/removed) from each edit. Vandalism probability is predicted by a random forest classifier.
  • Mid-2016: Bot core rewritten in line with the reactor design pattern.
  • 7 May 2018: Bot core rewritten (again) in Python, which is vastly more efficient than the original PHP implementation. Classifier changed to XGBoost.
  • Mid-October 2019: Classifier changed to LightGBM, with substantial improvements to how each diff is evaluated. Words added by editors are transformed to tf–idf vectors and fed into a separate Bayesian classifier. These words are also tagged by part of speech (noun, verb, pronoun, etc.), and the count of words in each category becomes a new input for the main LightGBM classifier.

Summary of algorithm

Machine learning

Unlike previous-generation anti-vandalism bots (Chris G Bot, GoblinBot4, and previous codebases of ChenzwBot; a brief overview of how they worked can be found in History above), ChenzwBot learns what is considered vandalism from a list of edits that a human has already classified. This list is the training dataset.

Classification pipeline

The bot evaluates each edit according to the following:

  1. Feature extraction: features are extracted from each edit using the revscoring library. Such features usually come in the form of metrics, such as (but not limited to): total length of edit, number of nouns added/removed, number of special characters added/removed, number of wikilinks added/removed. This step is similar to how the revision scoring tool on Recent Changes works.
  2. Part-of-speech tagging: the same word can have different meanings depending on context. For example, "film" can be used both as a noun and as a verb. To distinguish between the two uses, the bot tags each word with its part of speech using the spaCy library.
  3. Understanding edit content: words added in the edit are then counted. Because of the tagging in step 2, the same word used in different contexts is counted separately. To avoid giving commonly used words undue weight, the counts are transformed to tf–idf vectors. This gives more weight to words that are relatively uncommon in the dataset, and less weight to words that are common in the dataset, when they appear in the edit.
  4. Word-based scoring: the tf–idf vector is provided as an input to a Bayesian classifier. The classifier provides a score representing the probability that the words used in that edit are characteristic of vandalism.
  5. The features in step 1 and the probability score in step 4 are supplied as inputs to a gradient boosting classifier. The output of the classifier is the final calculated probability that the edit is vandalism. ChenzwBot uses the LightGBM library.
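
Steps 3 and 4 above can be sketched in plain Python. This is only an illustration under stated assumptions, not the bot's actual code: the training edits, the "word_TAG" token format (standing in for the tagger output of step 2), the function names, and the choice of a multinomial naive Bayes model with Laplace smoothing are all hypothetical. The final gradient boosting step, which would combine this score with the features from step 1, is left out.

```python
import math
from collections import Counter

# Hypothetical training data: each edit is a list of "word_TAG" tokens,
# labelled 1 for vandalism and 0 for constructive.
TRAIN = [
    (["stupid_ADJ", "film_NOUN", "lol_INTJ"], 1),
    (["lol_INTJ", "stupid_ADJ", "page_NOUN"], 1),
    (["director_NOUN", "film_VERB", "scene_NOUN"], 0),
    (["film_NOUN", "release_VERB", "year_NOUN"], 0),
]


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}


def tfidf(doc, idf):
    """Map a token list to {token: tf * idf}; tokens not seen in training are ignored."""
    tf = Counter(tok for tok in doc if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def train_nb(train, idf, alpha=1.0):
    """Fit a multinomial naive Bayes model on tf-idf weights with Laplace smoothing."""
    vocab = sorted(idf)
    model = {}
    for label in (0, 1):
        docs = [doc for doc, y in train if y == label]
        totals = Counter()
        for doc in docs:
            totals.update(tfidf(doc, idf))  # accumulate per-class token weights
        denom = sum(totals.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(len(docs) / len(train)),
            "log_like": {t: math.log((totals[t] + alpha) / denom) for t in vocab},
        }
    return model


def vandalism_probability(doc, model, idf):
    """Posterior probability of the vandalism class (label 1) for one edit."""
    weights = tfidf(doc, idf)
    log_scores = {
        label: params["log_prior"]
        + sum(w * params["log_like"][tok] for tok, w in weights.items())
        for label, params in model.items()
    }
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


idf = idf_table([doc for doc, _ in TRAIN])
model = train_nb(TRAIN, idf)
score = vandalism_probability(["stupid_ADJ", "lol_INTJ"], model, idf)
```

Because "film_NOUN" and "film_VERB" are distinct tokens, the two uses of "film" get separate weights, which is the effect described in step 3.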

A note about probability thresholds

  • The final probability given by the LightGBM classifier is compared against a defined threshold. Edits with a probability higher than the threshold are considered vandalism and reverted, while those with a lower probability are considered constructive. The threshold is calculated on a separate dataset (not used in training) so that it matches a specific false positive rate. This false positive rate is set by a human.
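
One way such a threshold could be derived is sketched below. It assumes a held-out list of classifier scores with human labels (1 for vandalism, 0 for constructive); the function names and the sample numbers are made up for illustration, and the bot's own calibration procedure may differ.

```python
def threshold_for_fpr(scores, labels, target_fpr):
    """Return a threshold whose false positive rate on the held-out
    set does not exceed target_fpr. An edit is flagged when its score
    is strictly greater than the threshold."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not negatives:
        raise ValueError("need at least one constructive edit")
    allowed = int(target_fpr * len(negatives))  # false positives we may tolerate
    if allowed >= len(negatives):
        return float("-inf")  # every constructive edit may be flagged
    # Only the `allowed` highest-scoring constructive edits (or fewer, when
    # scores tie) lie strictly above this value.
    return negatives[allowed]


def false_positive_rate(scores, labels, threshold):
    """Fraction of constructive edits that would be flagged as vandalism."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s > threshold for s in negatives) / len(negatives)


# Hypothetical held-out scores and labels.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold = threshold_for_fpr(scores, labels, target_fpr=0.2)
```

Lowering the target false positive rate raises the threshold, so fewer constructive edits are reverted by mistake, at the cost of letting more vandalism through.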

Additional rules

  • The bot will not revert edits made by local sysops or rollbackers.
  • The bot will not revert any registered user with more than a certain number of edits.
  • The bot will not revert any edit made to a page on the exclusion list. This currently includes Wikipedia:Sandbox, Wikipedia:Introduction, and Wikipedia:Student tutorial.
  • The bot will not revert a page back to its own revision if it last reverted that page less than 24 hours ago. Exceptions to this rule include Wikipedia, Simple English Wikipedia, WP:AN, and WP:ST.
    • As of 16 December 2019, the bot will instead not revert the same user/page combination more than once every 24 hours. This means that the bot may revert the same page multiple times within a 24-hour window if the vandals are different.
    • Due to issues with false positives, the bot will no longer revert any edit outside of article space.
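
The rules above, in their updated form, can be read as a set of checks run before any revert. The sketch below is one possible encoding and not the bot's real code: the `Edit` record, the `may_revert` function, the group names, and the edit-count limit of 500 are assumptions (the source only says "a certain number"). Namespace 0 is the article namespace in MediaWiki.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed values for illustration only.
EXCLUDED_PAGES = {"Wikipedia:Sandbox", "Wikipedia:Introduction", "Wikipedia:Student tutorial"}
TRUSTED_GROUPS = frozenset({"sysop", "rollbacker"})
MAX_EDIT_COUNT = 500  # placeholder; the real limit is not stated
REVERT_COOLDOWN = timedelta(hours=24)
ARTICLE_NAMESPACE = 0


@dataclass
class Edit:
    user: str
    page: str
    namespace: int
    user_groups: frozenset
    user_edit_count: int
    is_registered: bool
    timestamp: datetime


def may_revert(edit, last_reverts):
    """Return True if none of the additional rules forbids a revert.

    last_reverts maps (user, page) to the time of the bot's most
    recent revert of that user on that page.
    """
    if edit.namespace != ARTICLE_NAMESPACE:
        return False  # only article space is patrolled
    if edit.page in EXCLUDED_PAGES:
        return False
    if TRUSTED_GROUPS & edit.user_groups:
        return False  # sysops and rollbackers are never reverted
    if edit.is_registered and edit.user_edit_count > MAX_EDIT_COUNT:
        return False
    previous = last_reverts.get((edit.user, edit.page))
    if previous is not None and edit.timestamp - previous < REVERT_COOLDOWN:
        return False  # same user/page combination within 24 hours
    return True
```

In this sketch an edit would be reverted only if `may_revert` returns True and the classifier's probability is above the threshold.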