Wikipedia:Simple English Wikipedia/Technical evaluation of simplicity


This page began as a discussion on Wikipedia:Simple talk. It was moved to this page at 16:20, 21 August 2016 (UTC). This discussion itself is not in Simple English. StevenJ81 (talk)

How simple is this wiki in comparison to English Wikipedia really?

I teach an introduction to web science course on Wikiversity, and we use the Simple English Wikipedia dataset quite a lot to teach our students text modeling techniques on the web.

Today I was trying to create a lesson on the topic of formulating a research hypothesis.

So my hypothesis was that Simple English Wikipedia is easier to understand than English Wikipedia.

I reformulated this as: fewer words are needed to understand more of Simple English Wikipedia than of English Wikipedia.

I wanted to make this hypothesis plausible by plotting the cumulative distribution function of both Simple English Wikipedia and English Wikipedia.

The problem is that, when you look at the graphic I provide here, the hypothesis seems to be just wrong. Unless you know about 100k words, you will understand more of English Wikipedia than of Simple English Wikipedia when knowing the same number of words.

Figure: the CDF of the word rank-frequency plot of articles on English and Simple English Wikipedia as of 2015
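For readers who want to reproduce this kind of curve, here is a minimal sketch of a cumulative coverage calculation. It is not the code used for the graphic; the regex tokenizer and the name `coverage_by_rank` are assumptions made for illustration.

```python
from collections import Counter
import re


def coverage_by_rank(text):
    """Return a list where entry k-1 is the share of all tokens
    covered by the k most frequent words of the text."""
    # Assumption: a "word" is a lowercase run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    cdf = []
    running = 0
    # Walk the words from most to least frequent, accumulating coverage.
    for _, freq in counts.most_common():
        running += freq
        cdf.append(running / total)
    return cdf


sample = "the cat sat on the mat and the dog sat too"
cdf = coverage_by_rank(sample)
```

Plotting `cdf` against word rank for each wiki gives two curves; the corpus whose curve lies higher needs fewer known words to cover the same share of running text.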

I already discussed this quite extensively with a colleague. Apart from the simplifications and systematic problems of this experiment and graph (it is not a real basis for a statistical test), we agree that the graphic should have looked exactly the other way around. Since we cannot quickly find any explanation, we thought you folks might have an idea about this result.

--Renepick (talk) 13:05, 18 August 2016 (UTC)

  • Word frequency is relevant, but it takes no account of sentence length and complexity. Also, one has to limit the tests to the prose areas. Obviously many or most of our infoboxes and other devices are direct copies, or almost direct, of those on En wiki. Try readability measures on the prose sections as a better approach, supported by research evidence. Having said all that, I know many of our pages were put up by editors whose ability to write good English simply was not good enough. As an "anyone can edit" pedia, the openness runs against efforts to keep the prose genuinely simple and yet accurate. Macdonald-ross (talk) 13:54, 18 August 2016 (UTC)
Just for clarification: infoboxes and tables are dropped in my dataset, so it should consist mainly of prose. I am currently building two things. First, a model that focuses not on word frequency but on the probability of knowing all words in a sentence (given a certain vocabulary size); I would hope that Simple English does much better than English Wikipedia on that. Second (I don't know if I will find the time), a model where I only compare the subset of English articles that have a corresponding Simple English article. Still, I am open to more ideas about why this result looks unexpected. --Renepick (talk) 14:42, 18 August 2016 (UTC)
We try not just to use common words, but to write more simply and yet still be accurate. It would be easy to design an experiment to test the relative readability of a chosen set of articles from En wiki and their equivalents written or adapted by our more experienced editors. In fact any teacher with a classroom of students could run such an experiment, and it could also be done by using readability measures.
As far as I can remember, the statistical work of George Zipf showed that all samples of all languages produce frequency curves of the same shape (he used log-log graphs). It doesn't follow that all samples of languages are of the same difficulty to read! In other words, your graphs may not measure what you think they measure. (See George Miller's introduction to Zipf, George K., The psycho-biology of language, MIT Press, 1968 reprint.) Macdonald-ross (talk) 16:05, 18 August 2016 (UTC)
  • Many articles about science and mathematics are extremely difficult to simplify by using easier words. Instead, those articles will often have additional explanatory material either in parentheses, or as additional blue links that the EnWiki equivalent article doesn't have, or sometimes as links to Wiktionary. (See, for example, Pi.) Additionally, articles about people, places and organizations will, naturally, use the exact same names for those people, places and organizations. Any analysis that includes these articles will be looking at additional "words" beyond the simple words we would like to use on this project. It would not make sense to try to simplify the names "Barack Obama" or "Bümpliz, Switzerland" or "Parliament of the United Kingdom", since these are the correct names for these articles. I hope this helps as well. (Minor point: the OP's link to the Wikiversity page doesn't work correctly from Simple, as the software will try to link to a non-existent "simple.wikiversity.org" page instead of "en.wikiversity.org" as it should. The correct link can be found here.) Etamni | ✉   18:16, 18 August 2016 (UTC)
@Etamni: This reminds me of a problem that I was having on the 7-Eleven article. I was simplifying the part that explains that it used to be called "Tote'm" because customers toted their groceries. But I didn't want to use the word "tote" because I thought it was too complicated. In this particular case, I can't just write "It was called Carry'm, this totally isn't a made up word!" because that wasn't the real name, so I asked for help on the talk page and this is what Auntof6 told me:


One of the standard ways of dealing with a complex word is to explain it in parentheses after using it. In a case like this, as you're seeing, you can't just substitute a simpler word. You could say one of the following:
  • Customers could tote (carry) their groceries.
  • Customers could carry ("tote") their groceries.
  • Customers could tote their groceries. (Tote means carry.)
Will any of those work for you?


I disagreed with the third one since the last 5 letters of this wiki's name are "pedia" and not "onary". But the first one made sense, and that's what I used.

Does this relate to what you're talking about? (Wikipedia is one of the few places where recycled advice usually works) Also, I apologize about the huge wall of text. Computer Fizz (talk) 18:30, 18 August 2016 (UTC)

I think the problem lies with your hypothesis. Take any scientific article (of sufficient length). To be credible, the author must use certain terms (with a very specific meaning). Since this is SEWP, they end up explaining those terms as well (which means that you likely need more words to convey the same message). I would guess that, not counting scientific vocabulary, the "average reader" can probably get along with a vocabulary of 3k-5k words. --Eptalon (talk) 18:50, 18 August 2016 (UTC)
First of all, thank you all for your suggestions and insights and for this discussion. I would like to emphasise that my goal was by no means to say that Simple English Wikipedia does not achieve its goal of being simple and easy to understand. Also, I am pretty familiar with Zipf's law, since my PhD is about natural language processing and language modelling. So this morning I did a little more research and stumbled upon en:Flesch Kincaid readability tests (which, ironically, I find much easier to understand than the Simple English version, Flesch_Reading_Ease) and a related measure that is easy to implement on computers: the Automated Readability Index, which I applied to the abstracts of Simple English and English Wikipedia. The result is that the abstracts of Simple English Wikipedia are in fact already much easier to read than the ones from English Wikipedia.
Data set          | Automated Readability Index | Expected school grade
Simple English    | 7.16                        | sixth grade
English Wikipedia | 9.68                        | eighth grade
I think that is a terrific result which you folks can be very proud of. Still, I will conduct the other two experiments I was talking about. A first rough (not tested or double-checked) result again suggests that you need to know fewer words in English Wikipedia to know all words in a larger fraction of sentences, which again, in my opinion, is very counterintuitive. --Renepick (talk) 09:54, 19 August 2016 (UTC)
To summarize, to make sure I understand your posts: Simple Wikipedia has better readability based on sentence and word length, but uses more unique words than the English Wikipedia. Is that the takeaway? Only (talk) 10:33, 19 August 2016 (UTC)
Almost! The first part is true. For the second part, I guess you mean the right thing, but it would be more precise to say the following: Simple English Wikipedia in fact has fewer unique words, because it has fewer words overall. But on Simple English Wikipedia one needs to know a larger number of unique words than on English Wikipedia to have the same probability of understanding a randomly picked word. I still find that last point unintuitive; I would have expected it to be the other way around. Well... --Renepick (talk) 11:55, 19 August 2016 (UTC)
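For anyone who wants to try the measure themselves, here is a minimal sketch of the Automated Readability Index using its standard formula, 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. It is not the implementation used for the numbers above; the sentence splitter, the word pattern and the function name are simplifying assumptions.

```python
import re


def automated_readability_index(text):
    """Compute the ARI of a text with a deliberately naive tokenizer."""
    # Assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: words are runs of letters, digits or apostrophes,
    # and every character of such a run counts as a character.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = automated_readability_index("The cat sat. The dog ran.")
```

Very short, simple texts can score below zero, as this toy example does; the index is only meaningful on longer prose such as article abstracts.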

OK, I have finished the experiment where I checked how many sentences are understandable, in the sense that all their words are known (no grammar was checked), given a certain vocabulary size. It demonstrates that Simple English Wikipedia is indeed easier to understand. Still, I was surprised to see that one needs about 18k words to be able to understand one out of two sentences. @Eptalon, that is not the 3-5k that you suggested (though of course my corpus includes all words, whereas you explicitly omitted scientific vocabulary from your suggestion).

We counted the words in the abstracts of Simple English and English Wikipedia as of the 1 August 2016 data dump, and then ranked them by frequency. We asked ourselves what percentage of sentences can be understood given a certain vocabulary. One can see that every second sentence in Simple English can be understood given a vocabulary of around 18,000 words. For English Wikipedia, around 39,000 words are needed to be able to understand every second sentence. So Simple English Wikipedia does indeed seem to be easier to understand.

OK, I have updated the image at the top. I used my own parser to redo the experiment; I had already written it for the second experiment with the sentences, so it was easy to reuse. In that sense I did not depend on other people's work for extracting (and defining) words. Apparently my student assistants had used two different strategies to process the Simple English and English Wikipedia dumps. With this fixed, the CDF of the word rank-frequency diagram also looks as expected, showing that Simple English Wikipedia is indeed easier to understand on a word level as well (the Simple English curve always lies above the one from English Wikipedia). I apologize for starting this discussion and taking your time. One can also see from this diagram that, with the number of words suggested by @Eptalon, most of the words are indeed covered. Naturally, I might add, you need more words to reach a similarly high understanding of sentences. --Renepick (talk) 14:38, 19 August 2016 (UTC)
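The sentence experiment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser that produced the figures: the tokenizer, the function names, and the choice to build the vocabulary from the same sentences that are scored are all assumptions.

```python
from collections import Counter
import re


def tokenize(sentence):
    # Assumption: a "word" is a lowercase run of letters or apostrophes.
    return re.findall(r"[a-z']+", sentence.lower())


def sentence_coverage(sentences, vocab_size):
    """Share of sentences whose words ALL fall within the
    vocab_size most frequent words of the corpus."""
    tokenized = [tokenize(s) for s in sentences]
    tokenized = [t for t in tokenized if t]  # drop empty sentences
    if not tokenized:
        return 0.0
    counts = Counter(w for t in tokenized for w in t)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    understood = sum(1 for t in tokenized if all(w in vocab for w in t))
    return understood / len(tokenized)


corpus = ["the cat sat", "the dog sat", "the bird flew away"]
low = sentence_coverage(corpus, 2)    # only "the" and "sat" are known
full = sentence_coverage(corpus, 7)   # every distinct word is known
```

Sweeping `vocab_size` and finding where the returned share first reaches 0.5 gives the kind of "every second sentence" threshold quoted above. Because a single unknown word makes a whole sentence count as not understood, this threshold is much higher than the one for single-word coverage.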

I'm thinking it might be worth moving this discussion to a Wikipedia-space page rather than merely archiving it, as it's a subject we get asked about from time to time. StevenJ81 (talk) 15:24, 19 August 2016 (UTC)
If you send me the link to that Wikipedia-space page, I will restate the content of the discussion there. Once the new course material goes online on Wikiversity, the source code that was used for the calculations will also be published, and I would cross-post it on that wiki page for transparency, so that the experiments can be repeated or extended. --Renepick (talk) 17:13, 19 August 2016 (UTC)
Yes, that's a good idea. This is an important topic with many angles to it. Macdonald-ross (talk) 07:49, 20 August 2016 (UTC)
Another thing that comes to my mind is the definition of "word". As an example, take the verb "have": "have", "has", "had" and "having" are all forms of this word. In your analysis, are you looking at "word forms" (as in: the four words cited are four words) or are you looking at "words" (all four can be reduced to one form)? The first is easier, but it obviously increases the word count: usually four forms for each verb, two forms for each noun. The second approach would be more accurate, but requires an additional step of grammatical analysis. Are you taking into account spelling differences (traveler vs. traveller, harbor vs. harbour)? English is an official language in South Africa, Belize, India and several other countries; I would expect to find 3-5 different spellings for each word. So in the graphs you show, with what I assume to be a cumulative distribution and a median, the absolute numbers are irrelevant. In order not to falsify your result, you should also have a minimum article length; very short articles are likely stubs. If possible, exclude lists (usually they have a category)... --Eptalon (talk) 11:08, 20 August 2016 (UTC)
Everything that Eptalon mentioned in his last post is a valid point! Obviously I came from a different perspective, where for example word forms and spellings do not really make a difference for finding evidence for the hypothesis that Simple English Wikipedia is in fact easier to understand than English Wikipedia. There are algorithmic solutions to most of the problems mentioned by Eptalon, but of course they make the entire task and the models being built more difficult, and they raise the possibility of systematic errors (by buying in assumptions). So yes, please be very careful with the absolute numbers given in my results. They can only be indicative when comparing English Wikipedia to Simple English Wikipedia. Yet, as mentioned before, I think they are a very nice result demonstrating that Simple English Wikipedia is in fact achieving something, and that everyone working on this project can be proud of the success! -- Renepick (talk) 11:31, 20 August 2016 (UTC)
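To make the difference between counting "word forms" and counting "words" concrete, here is a small sketch. The lookup table `LEMMAS` is a hypothetical toy mapping written for this example only; a real analysis would need a proper lemmatizer and a spelling-variant list, neither of which is described in this discussion.

```python
from collections import Counter

# Hypothetical toy table (illustration only): maps inflected forms and
# spelling variants to a single base form.
LEMMAS = {
    "has": "have",
    "had": "have",
    "having": "have",
    "traveller": "traveler",
    "harbour": "harbor",
}


def count_types(tokens, lemmas=None):
    """Count distinct vocabulary entries, optionally merging forms
    through a lemma/spelling lookup table."""
    lemmas = lemmas or {}
    return Counter(lemmas.get(t, t) for t in tokens)


tokens = "have has had having traveler traveller harbor harbour".split()
forms = count_types(tokens)            # every surface form is its own entry
merged = count_types(tokens, LEMMAS)   # forms and variants are merged
```

Here the eight surface forms collapse to three entries once merged, which shows how strongly the absolute vocabulary sizes depend on the chosen definition of "word".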