Store all language data on each edit.

Issue #81 resolved
Peeter Tinits created an issue

Updating the requirements to match the results of the discussion:

1) Store all language proficiencies defined by the user.

2) Store languages with "don't show" as well. Maybe mark them "noshow" or something similar.
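
A minimal sketch of what such a per-edit record might look like, written as a Python literal; the field and value names ("proficiencies", "level", "noshow") are placeholders, not the tool's actual schema:

    # Hypothetical per-edit record; names are illustrative only.
    example_record = {
        "proficiencies": {
            "et": {"level": "native", "noshow": False},
            "ru": {"level": "basic", "noshow": True},       # defined, but set to "don't show"
            "de": {"level": "undefined", "noshow": False},  # never touched by the user
        }
    }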

Comments (12)

  1. Peeter Tinits reporter
    • marked as bug

    Ok, I heard about the link:

    http://mwtranslate2.keeleleek.ee/all

    I see from the data that language skills were never forwarded to the corpus? They need to be added. The interface language would definitely also be useful for the corpus, as would whether they used the introduction-only filter, and maybe also the things they automated (I don't have a clear understanding of how that works yet).

  2. Peeter Tinits reporter

    Ok, I see the proficiencies now; they just weren't present in the first submissions.

    In the current state two updates would be needed: 1) store all language proficiencies in each upload, not just the ones that were shown in the interface; 2) also store the language proficiencies that were never changed and are undefined at first (I'm not sure which of these two didn't work for now).

    3) I can't find the username; surely we mean to collect those too? Of course they can potentially be retrieved from the entry in Wikipedia, but only if a script exists that does it and is easy to use. For ease of use, I would just store them as well.

  3. Kristian K

    I don't understand what you mean in:

    1) The user's current language skills are saved under each edit independently. That should be the same as what you mean by "each upload"? What do you mean by "shown in the interface"? Do you mean the source article's language? Those are saved under "fromIds".

    2) How can you store something that is undefined?

    3) The username is available at the edit commit in Wikipedia, so there is no need to duplicate this information. "For ease of use" you would end up duplicating all information. Since all information is available anyway, it is easier to maintain the separation with as little duplication as possible.

  4. Peeter Tinits reporter

    1) Ok, I didn't differentiate each upload from each edit, but I assume the data is collected on upload? The point was that if you add a language, you can store it as "either/don't show", in which case it is not on the interface. This language data should be stored too, regardless of whether it was actually used.

    2) You store it as undefined, or unspecified. The point being that when you work with the data, you may assume an NA means it is undefined, but what if some error has occurred instead? Storing the value explicitly would easily protect against that.

    3) Ok, we need to make the scripts that retrieve the entire dataset public at some point. I have some Python stubs using mwclient available; I might be able to adapt them.
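
    For reference, a minimal sketch of the kind of mwclient stub meant here, assuming we only want to pull a user's recent contributions from Wikipedia; the site and username are placeholders:

        import mwclient

        # Connect to a Wikipedia instance (placeholder host; newer mwclient
        # versions default to https).
        site = mwclient.Site("en.wikipedia.org")

        # List a user's recent contributions; each item is a dict with keys
        # such as "title" and "timestamp".
        for contrib in site.usercontributions("ExampleUser", limit=10):
            print(contrib["title"], contrib["timestamp"])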

  5. Andrjus Frantskjavitsius repo owner

    Data is collected on each upload.

    1) I'm not sure it is a good idea to collect the "Don't show" language info. We should collect information that can actually be useful.

    2) If there is an error/exception when retrieving proficiencies, then it will also fail to retrieve most of the record data. Everything (almost) is done in one pass.

    3) The username is actually always available, but it's hashed. @keeleleek, it was your idea... do you still think we need to hash usernames?
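
    The thread doesn't spell out the actual hashing scheme, but as an illustration of how a one-way hash still supports analysis (the same user always maps to the same value) while hiding the plain username, it could be something like:

        import hashlib

        def hash_username(username):
            # Same username always yields the same digest, so edits by one user
            # can still be grouped; recovering the name requires hashing candidate
            # usernames (e.g. from Wikipedia's public data) and comparing.
            return hashlib.sha256(username.encode("utf-8")).hexdigest()

        print(hash_username("ExampleUser"))  # stable pseudonymous identifier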

  6. Peeter Tinits reporter

    1) It is useful because it keeps the data stable for authors. For analysis we will be aggregating over authors anyway.

    2) Ok, it seems that for clarity we should include data on all languages. If needed, we could have a separate category for Undefined and untouched, as it now seems to be. But I'm not sure how it's working: there are languages that are included as source or destination but which don't have any data attached, and then there are languages marked Undefined. I'm not sure how this happens.

    3) So each user gets a unique hash? Ok, this is still good for analysis, and may make some people happier about privacy, although it's Wikipedia so it's public anyway... Btw, as a future-future feature: there could be a function to show how many articles you have translated, or why not even at some point show you graphs of what you have done, because I think even very little feedback can keep you going at translating better. But there is no time or significant need to implement it at this stage. Could be a thing if there's ever extra funding.

  7. Kristian K

    1) Following the principle of "there is no better data than more data", it seems obvious we should collect all parameters of relevance. I think everything in the languages section is relevant.

    2) You store the information in a clear and iconic way: if it has the value "don't show", it could be saved as "don't show" or as "value6". The latter case would need further pointers in the documentation. At no point should ambiguous values be saved, so it should never be the case that "value6" can mean both "Undefined" and "don't show" (a sketch of such a mapping follows at the end of this comment).

    3) This point already contains many questions:

    3a) Allowing the data to be accessed by scripts. Let the user write the scripts, or be the user who writes a script yourself. Wikipedia already has well-defined APIs for its data. At the moment our data is available in JSON. Our data is tabular and could thus simply be made available in a tabular format (e.g. .csv or .odt). Then we wouldn't need any API, since most analytics tools already have tabular-data reading functionality built in. Don't anticipate the needs of other users than yourself.

    3b) The usernames are hashed because our data is publicly available under a free license. We haven't committed ourselves to following Wikimedia's Privacy Policy or Terms of Use. Hashing serves the identity function that is needed for analysis. The hashes are decodable and the usernames can be retrieved from Wikipedia's data.

    3c) Future-future feature. This feature is already available if you aggregate our dataset with Wikipedia's dataset, extract the number of user contributions, and take the ratio of the two. I question whether users need feedback on how much they have translated vs. how much they have contributed to Wikipedia. We need to divide our efforts between what is relevant for the translation task and what is relevant for the free encyclopedia. We need to respect and use the work done by Wikimedia; re-inventing the wheel is stupid. If feedback on the translation task is deemed favorable, it should be done together with the ContentTranslation team. Thus users would be free to choose which tool they use and the collected "credit" would be shared between the tools. This is fundamental to software freedom principles -- the freedom to use the program you choose.
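
    A small sketch of the kind of unambiguous mapping meant in 2), with placeholder value names; the point is only that each stored code has exactly one documented meaning:

        from enum import Enum

        class LanguageSetting(Enum):
            # Each stored string has exactly one meaning; nothing like "value6"
            # that could stand for either of two states.
            UNDEFINED = "undefined"   # user never changed the default
            DONT_SHOW = "dont_show"   # user explicitly chose "don't show"
            SHOWN = "shown"           # language visible in the interface

        # Records would carry the readable value itself:
        print(LanguageSetting.DONT_SHOW.value)  # -> "dont_show"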

  8. Peeter Tinits reporter

    1) Yes.

    2) Yes.

    3a) I didn't have an API in mind for us. We just need a (e.g. Python) script that downloads the "corpus" data, populates it with data from Wikipedia, and makes it into a nice csv (a rough sketch follows at the end of this comment). For every user to make their own script seems a waste of everyone's time, plus first they would have to learn the Wikipedia API and whatnot. Not having it available will make it much less usable.

    3b) Ok, sure. I have no idea about our data policies or the requirements for them.

    3c) I don't understand what you are dividing. The raw number of contributions is nice too. The function of this feedback is not really to show how much they have contributed, but that they have done something, and to motivate them to do a bit more. Even raw numbers would allow you to compete with your neighbour. This number is easy to retrieve from either the corpus or Wikipedia. But the point is to show it on the interface and keep it visible. The latter part has a lot of things packed into it and seems like a separate issue.
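
    A rough sketch of such a script, using only the standard library; the /all endpoint is the one linked earlier in this thread, and the field names ("user", "fromIds", "proficiencies") are assumptions about the corpus JSON rather than its actual layout:

        import csv
        import json
        import urllib.request

        CORPUS_URL = "http://mwtranslate2.keeleleek.ee/all"

        # Download the corpus JSON (assumed here to be a list of per-edit records).
        with urllib.request.urlopen(CORPUS_URL) as response:
            records = json.loads(response.read().decode("utf-8"))

        # Flatten a few assumed fields into a csv for analysis tools.
        with open("corpus.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["user", "fromIds", "proficiencies"])
            for record in records:
                writer.writerow([
                    record.get("user", ""),
                    ";".join(record.get("fromIds", [])),
                    json.dumps(record.get("proficiencies", {})),
                ])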

  9. Andrjus Frantskjavitsius repo owner

    Proficiencies for all languages will be sent (including unspecified).

    "Don't show" languages can be deduced.
