The Dirty Work of Data Mining is a Brave New World of Confusion.
You might not even know what Data Mining is. I didn’t. Time to live and learn!
One of the highlights for me this year was the publication of Cally Phillips’ work Discovering Crockett’s Edinburgh. Following on from the two volumes of Discovering Crockett’s Galloway already published, I knew it would be a well-researched, informative and interesting work. I believe there is currently no better way to find Crockett ‘places’ and to relate them back to his literary works.
For the uninitiated, Crockett first went to Edinburgh in 1876 as a bursary student and he lived there, on and off, for the next decade. Edinburgh features in around a third of his seventy-plus literary works, and Phillips’ book takes the reader on a number of journeys through place and time. With it you can explore and re-tread the steps of the young Crockett and his characters throughout five centuries of Edinburgh – either for real if you’re in Edinburgh, or virtually if you’re not.
What a brilliant thing to be able to do.
So what about Data Mining? A text by any other name does NOT smell as sweet, believe me. I was recently made aware of an app which claims to offer people the opportunity to do a similar thing to Discovering Crockett’s Edinburgh, not just for Crockett, but for a plethora of Scots writers with Edinburgh connections. Right on your smartphone (or computer – I don’t have a smartphone!). The Barrie aficionado who alerted me to the app warned me, though, that the Barrie links were far from accurate. I went to look at the shiny new toy. Amazed to find Crockett on the ‘app’ – since he’s barely known of in Edinburgh – I went straight to his author name.
Massive disappointment. Now I know how Cally Phillips felt when she first started the Crockett ‘project.’ In 2012 she ‘discovered’ Crockett in the realms of Project Gutenberg. The quality of the digital texts was so poor (mostly unreadable) that she turned to and produced newly edited versions for ebook and paperback. In the process she became a publisher and is now one of the leading scholarly authorities on Crockett.
The LitLong app (yes, time to name and shame) sadly relies on both Wikipedia – which is horribly inaccurate regarding Crockett – and Project Gutenberg texts. It is, dear reader, worse than useless, at least for Crockett. I haven’t had the heart to check it out for other authors.
But I did have a look at how it is compiled. It’s called Text Mining, which is effectively data mining. (I leave you to draw analogies to other forms of data mining.)
Be afraid. Be very afraid. Here is the description from the app:
You can use LitLong to explore Edinburgh as a literary setting. Browse the map and zoom in and out to see how locations around the city have featured in literature. As you zoom further in, more pins will appear.
Click on a pin to see excerpts of literature that mention that location. From there you can select a particular excerpt, save it to your library, add it to a path, or read more about the selected book and its author.
Sounds great. Except you really can’t. Not if you want accuracy. Not with Crockett.
LitLong uses natural language processing technology informed by literary scholars’ input in order to text mine literary works set in Edinburgh and to visualise the results in accessible ways.
The problem is that there appears to be no literary scholars’ input in the Crockett selection.
The explanation continues:
What have we made?
We have created a very large database of place-name mentions in more than 600 books that use Edinburgh as a setting. We have then extracted the sentences immediately surrounding each mention and included those as an excerpt in our database. The data has then been mapped onto the city via the place-name mentions, and can be explored through a mobile app and online interface. With LitLong, you can walk your own paths through the resonant locations of literary Edinburgh.
Except you really can’t. Not with meaning. Not for Crockett. It’s more like a super drunken stumble at best.
Our aim in creating LitLong was to find out what the topography of a literary city such as Edinburgh would look like if we allowed digital reading to work on a very large body of books. Edinburgh has a justly well-known literary history, cumulatively curated down the years by its many writers and readers. This history is visible in books, maps, walking tours and the city’s many literary sites and sights.
Do we feel that perhaps they have just over-extended? Crockett isn’t really mainstream now, is he? But for me this is no excuse when they claim their desire to go beyond the mainstream:
But might there be other voices to hear in the chorus? Other, less familiar stories? By letting the algorithms do the reading, we’ve tried to set that familiar narrative of Edinburgh’s literary history in the less familiar context of hundreds of other works.
Failed on that score. A little knowledge is a dangerous thing. Well, you know, a vast number of random words is even more dangerous.
How did we do it?
To create LitLong:Edinburgh we have used text-mining and georeferencing on extremely large and diverse collections of digitised books made available to us by – among others – the British Library, the National Library of Scotland and the Hathi Trust. In addition, some publishers and authors have shared their lists with us. We searched these collections for texts which, in the range and frequency of their use of place-names, showed all the signs of making Edinburgh their setting. A combination of algorithmic and manual curation then filtered these texts for ones that matched our criteria, giving us a dataset of hundreds of narrative works which explore the city or use it as a backdrop for their action. The Edinburgh places mentioned in these texts were then georeferenced using a bespoke gazetteer created to register the very different ways in which place might be named in fiction or memoir.[i]
Sounds great? But it hasn’t worked. There is nothing like enough ‘human’ or literary input into this project.
Issues with the Crockett entries include:
Texts are sometimes inaccurately labelled (the American editions are used, and these often have different titles from the British versions).
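For the technically minded, here is a minimal sketch of the sort of keyword matching that description implies: find a place-name in a text, keep the sentences around it as an excerpt, and attach map coordinates. It is emphatically not LitLong’s code. The gazetteer, the approximate coordinates, the function names and the sample text are all invented for illustration.

```python
import re

# Hypothetical mini-gazetteer: place-name -> (latitude, longitude).
# Coordinates are rough illustrative values, not LitLong's data.
GAZETTEER = {
    "Pleasance": (55.948, -3.182),
    "Meadows": (55.941, -3.192),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text, gazetteer, context=1):
    """Return one record per place-name mention.

    Each excerpt is the matching sentence plus `context` sentences
    on either side, joined with single spaces.
    """
    sentences = split_sentences(text)
    mentions = []
    for i, sentence in enumerate(sentences):
        for name, coords in gazetteer.items():
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, sentence, re.IGNORECASE):
                start = max(0, i - context)
                end = min(len(sentences), i + context + 1)
                mentions.append({
                    "place": name,
                    "coords": coords,
                    "excerpt": " ".join(sentences[start:end]),
                })
    return mentions


# Invented sample text (not a Crockett quotation).
sample = (
    "He walked to school. "
    "The boys gathered in the Pleasance at dusk. "
    "Nobody saw them leave."
)

for mention in find_mentions(sample, GAZETTEER):
    print(mention["place"], mention["coords"], mention["excerpt"])
```

Note what the sketch cannot do: it has no idea whether ‘Pleasance’ is the setting of a scene, a passing mention, or part of some other phrase altogether. That judgement is the human part.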
Crockett biographical information is very incomplete. Wikipedia editing is not a skill I possess, and until more Wiki-editors know more about Crockett it will not be updated accurately or comprehensively. Don’t hold your breath. Academics are still well out of step with Crockett, holding on to outmoded and ill-conceived notions of ‘Kailyard’ etc. Crockett needs a Wiki-advocate.
Actual texts. If you are happy reading online it’s not too bad. When you try to download the problems commence. OCR is poor on many of the titles.
The excerpts rarely give any real flavour or reason as to why they are attributed to a particular place – in stark contrast to Phillips’ work which integrates and weaves the stories, characters, places and Crockett himself into one meta-narrative.
Sometimes you get what you pay for. The LitLong app is free. Which simply disproves a cliché – the best things in life are NOT always free. You may have to pay for Discovering Crockett’s Edinburgh. It’s well worth it. Beyond that, I would recommend that if you want to read Crockett, you read from a reputable source. Like Ayton Publishing’s ‘Galloway Collection’, available from www.unco.scot, Amazon and elsewhere.
I am now looking out for a digital version of Discovering Crockett’s Edinburgh. It won’t be free but it will be worth every penny. And it’s what I will carry with me when I go Crocketeering in Edinburgh.
So, Data mining. What do we think? It may (or may not) be a clever way to chew up and spit out industrial levels of words. But words without meaning… where is the point?
The lesson to be learned is that keywords are not the same as literary analysis, critique or research. Data mining of this level cannot take the place of a human being, and as humans we should be very wary of this kind of activity. We have got used to the idea that there are apps for everything. Please think twice when you
If this represents a wonderful new way of ‘mining’ data then I fear for us all. Obviously as far as literature goes, dredging up data from the inner workings of digital archives leaves a lot to be desired. Perhaps we can take some solace in the fact that the LitLong app proves there is a definite need for the kind of skilled, painstaking research that Cally Phillips undertook in Discovering Crockett’s Edinburgh. But if you are introduced to Crockett (or Barrie, and doubtless others) via this app, I’d suggest it neither does what it says on the tin nor does credit to some unco Scots writers.
Lest you think I’m just being shirty, here is a wee comparative analysis:
I’ve picked an ‘average’ map point.
Place your pin on The Pleasance.
LitLong credits four books to this location:
BOG-MYRTLE AND PEAT Samuel Rutherford Crockett, 1895
CLEG KELLY, ARAB OF THE CITY Samuel Rutherford Crockett, 1896
THE STICKIT MINISTER Samuel Rutherford Crockett, 1893
THE DEW OF THEIR YOUTH Samuel Rutherford Crockett, 1910 (actually 1909)
The excerpts are of variable use and interest. But at least all the books can be read online, if that’s your thing.
In Discovering Crockett’s Edinburgh (DCEd) you have a whole chapter dedicated to the Pleasance (and Cowgate) since it is one of the most important of Crockett’s Edinburgh locations. It provides excerpts as well as critical analysis and reflection from Kit Kennedy, Lads’ Love, The Stickit Minister’s Wooing, Cleg Kelly and Kid McGhie.
The Dew of their Youth and Bog Myrtle and Peat references found in LitLong are dealt with in Chapter 6 of DCEd, titled ‘Student Characters.’
The Stickit Minister excerpt is from a Cleg Kelly story, which is given in more detail in DCEd.
The LitLong app doesn’t even mention nearby St Leonard’s Street, which is perhaps one of the most important locations in Crockett’s Edinburgh. Not least because it’s where he lived for 10 years!
DCEd has another complete chapter set here and guides the reader or explorer to walk from St Leonard’s down to the Old Town in the company of Crockett and his characters. This is considerably more enlightening and entertaining than anything offered by the app. Perhaps I should coin the phrase ‘an app is only as good as its map’, and the ‘map’ offered by LitLong for Crockett’s work is, I’m sorry to say, feeble! In DCEd St Leonard’s Street is a hub from which you can go in many directions to find Crockett locations.
I haven’t been comprehensively through all the LitLong listings, but there are many which are misplaced – one places a ‘South Side of Edinburgh’ in the middle of the Meadows when it is a Sunday School ‘southside’ of the Pleasance.
In conclusion, though, I suggest you don’t rely on data or text mining and georeferencing combined to experience Crockett’s literary Edinburgh. For some things, real human beings putting in real hours of work will offer a much better result. 10/10 to Cally Phillips’ book; 2/10 to the LitLong app.
[i] I have quoted ACCURATELY from the LitLong website – if only they could quote as accurately from Crockett’s work!