Over the past three months, we've been testing different machine learning and machine reading techniques to help us better understand large volumes of data. We've aimed this work specifically at foresight, or 'horizon scanning', because we feel this is an area of analysis that could benefit greatly from a more thorough, data-driven approach.
To conduct a meta-analysis of foresight material, we designed a system that uses crawlers, machine reading, cloud-based databasing and data visualisation scripts. Using these tools, we implemented a three-stage process of analysis: searching, mapping and analysis.
The scan of scans
Following this process, we developed a detailed search set of data. This widened the initial list of 20 starting sources to around 150 sources in a later iteration. This growth in starting data was made possible first by good engagement with experts in foresight, and then by the use of crawlers that brought back a large range of document sources, which were then data mined and assessed for viability and suitability.
The search phase gathered around 1,200 research reports (either PDF files or HTML files converted into PDFs). Of these, 1,050 were deemed relevant to foresight; the remaining 150 were duplicates, write-protected PDFs or out of scope.
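The triage described above can be illustrated with a minimal sketch. It is not the process we actually ran: the function name, the use of None to stand for an unreadable (for example write-protected) file, the SHA-256 duplicate check and the caller-supplied in_scope test are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def triage_reports(reports, in_scope):
    """Sort gathered reports into kept names and counts of rejections.

    reports:  iterable of (name, content) pairs, where content is the
              extracted text, or None if the file could not be read
              (hypothetical representation of a write-protected PDF).
    in_scope: caller-supplied function that decides relevance.
    """
    seen = set()          # hashes of every readable text met so far
    kept = []
    rejected = Counter()
    for name, content in reports:
        if content is None:
            rejected["unreadable"] += 1
            continue
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            # Exact duplicates only; near-duplicates would need fuzzy matching.
            rejected["duplicate"] += 1
            continue
        seen.add(digest)
        if not in_scope(content):
            rejected["out_of_scope"] += 1
            continue
        kept.append(name)
    return kept, rejected
```

Keeping a count for each rejection reason means the totals can be reconciled afterwards, in the same way that the 1,050 relevant and 150 rejected reports add up to the 1,200 gathered.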
Having selected the relevant source literature, we split each research document into its constituent strings. This was done using bespoke machine reading processes that tag each source document with metadata (title, author and source organisation) and then split the document into strings, each carrying the appropriate metadata.
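As a rough illustration of this splitting step, the sketch below breaks a document into strings and attaches the document's metadata to each one. It assumes a 'string' is roughly a sentence and uses a naive punctuation-based splitter; the SourceString class and split_document function are hypothetical names, not part of the bespoke tooling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceString:
    """One string of text plus the metadata of its parent document."""
    text: str
    title: str
    author: str
    organisation: str


def split_document(text, title, author, organisation):
    """Split text after '.', '!' or '?' followed by whitespace.

    This is a naive sentence splitter (an assumption); it will break
    wrongly on abbreviations such as 'e.g.'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [SourceString(p, title, author, organisation) for p in parts if p]
```

Because every string carries its own metadata, any later finding can be traced back to the report, author and organisation it came from.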
Once split into strings, all the data was held in a bespoke Django database navigated through a graphical user interface (GUI). Using this GUI, we produced an initial data visualisation illustrating the keyword frequencies in each source document.
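The keyword frequencies behind such a visualisation could be computed along the following lines. This is a sketch under stated assumptions: the input is a plain dictionary mapping each document title to its strings (rather than a database query), and the stopword list and the keyword_frequencies name are illustrative only.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def keyword_frequencies(strings_by_document, top_n=10):
    """Return the top_n (keyword, count) pairs for each document."""
    result = {}
    for title, strings in strings_by_document.items():
        counts = Counter()
        for string in strings:
            # Lowercase, then keep runs of ASCII letters as tokens.
            tokens = re.findall(r"[a-z]+", string.lower())
            counts.update(t for t in tokens if t not in STOPWORDS)
        result[title] = counts.most_common(top_n)
    return result
```

The per-document lists of (keyword, count) pairs can then be passed to whichever charting tool is in use.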
With all the data gathered and hosted on the main database, there were around 11,000,000 source strings that could then be analysed. We conducted two forms of mapping on this data and the associated metadata.
1. Topic modelling to determine the most frequently occurring themes and concepts in the full dataset.
2. Expert mapping to identify the key contributing authors in the data, for future testing.
Topic modelling yielded a rich picture of the data contained in the document set and gave some indication of the overall concepts and the interconnections between documents. The data was mapped using Gephi, an open-source graph visualisation platform.
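One simple way to get term data into a graphing tool is sketched below. Note that it uses term co-occurrence counting as a stand-in, not a full topic model, and it is not the method used in this project. The function names are hypothetical, and the output is a plain edge list with Source, Target and Weight columns, which is a layout Gephi can import as an edge table.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(token_lists, min_weight=1):
    """Count how often each pair of terms appears in the same string."""
    weights = Counter()
    for tokens in token_lists:
        # Sort the unique terms so each pair has one canonical order.
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


def edges_to_csv(edges):
    """Render the edges as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()
```

Raising min_weight prunes weak links, which matters at the scale of millions of strings, where an unfiltered graph quickly becomes unreadable.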
The detailed topic map was then used to produce a simplified map summarising the most frequently reported themes and concepts in the dataset. This higher-level ‘topic’ map combines the most frequently occurring terms in the data with low-frequency ‘emerging’ terms of potential strategic relevance to foresight analysts and policy planners.
Although this is a stylised representation of the data collected in stages 1-2, it is valuable as a tool for building structured foresight exercises and scenario development around data that can be evidenced and accessed for further policy and decision making.
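The idea of combining high-frequency terms with low-frequency 'emerging' terms can be sketched as a simple banding of term counts. The thresholds, the summarise_terms name and the example counts in the comments are all hypothetical; in practice the emerging candidates would still need an analyst's judgement about strategic relevance.

```python
from collections import Counter


def summarise_terms(counts, top_n=3, emerging_min=2, emerging_max=5):
    """Split terms into 'core' (most frequent) and 'emerging' candidates.

    counts:       Counter mapping term -> number of occurrences.
    top_n:        how many of the most frequent terms count as core.
    emerging_min: lower bound, to drop one-off noise such as typos.
    emerging_max: upper bound on what still counts as low frequency.
    """
    ranked = counts.most_common()
    core = [term for term, _ in ranked[:top_n]]
    emerging = sorted(
        term
        for term, count in ranked[top_n:]
        if emerging_min <= count <= emerging_max
    )
    return {"core": core, "emerging": emerging}
```

The lower bound is a design choice: without it, the emerging list fills up with terms that appear only once, most of which are extraction errors rather than weak signals.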
Having determined the themes in the data, we then used the data collected in the search phase to map the contributing network of experts and source organisations. Once these are identified, it is then possible to contact the sources and ask them to comment on the findings (especially the high-level map), adding their analysis and insights to the initial maps. A sample high-level map of contributing experts is below.
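Separately from the map itself, the underlying expert-mapping step can be sketched from the metadata alone. The sketch assumes each string carries an (author, organisation, title) record, as in the metadata tagging described earlier; the expert_network name and the record layout are illustrative, not the actual implementation.

```python
from collections import Counter


def expert_network(records):
    """Build a ranking of authors and author-organisation links.

    records: iterable of (author, organisation, title) tuples, one per
             string, so the same document appears many times.
    Returns (ranking, edges): ranking is a list of (author, documents)
    sorted by document count, and edges maps (author, organisation) to
    the number of distinct documents linking them.
    """
    unique = set(records)  # collapse the per-string repeats
    author_docs = {(author, title) for author, _, title in unique}
    docs_per_author = Counter(author for author, _ in author_docs)
    edges = Counter((author, org) for author, org, _ in unique)
    ranking = sorted(docs_per_author.items(), key=lambda item: (-item[1], item[0]))
    return ranking, dict(edges)
```

Counting distinct documents rather than strings stops a single very long report from dominating the ranking.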
Using data-driven search processes for gathering and modelling topics and expert networks can improve current processes for thinking about the future. Following a structured, auditable process is likely to increase confidence and accountability in foresight analysis. Such processes also represent an important benchmark for making foresight more ‘quantitative’, as they allow metrics to be generated around the scale and range of data collected. Such metrics can form the basis of confidence and probabilistic assessments, which could increase the rigour of foresight analysis and move the discipline away from current techniques that are often difficult to quantify and subject to considerable levels of bias and groupthink.
To take this work forward, a useful next exercise would be to assess current foresight processes and benchmark them to see what markers and metrics can be used to test predicted outcomes. Processes that use machine reading and data visualisation yield a large amount of 'hard data', and it would be useful to better understand how this could be applied to improving long-term prediction.
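As an example of the kind of metric this makes possible, the sketch below reports how widely a theme is evidenced across the collected documents and source organisations. The choice of metric, the coverage_metrics name and the simple substring match are all assumptions for illustration; they are not a validated measure of confidence.

```python
def coverage_metrics(records, term):
    """Measure how widely a term is evidenced across the dataset.

    records: iterable of (organisation, title, text) tuples, one per string.
    term:    the theme to look for (case-insensitive substring match).
    """
    records = list(records)
    documents = {(org, title) for org, title, _ in records}
    hits = {
        (org, title)
        for org, title, text in records
        if term.lower() in text.lower()
    }
    organisations = {org for org, _ in documents}
    hit_organisations = {org for org, _ in hits}
    return {
        "documents": len(documents),
        "organisations": len(organisations),
        # Share of documents, and of organisations, mentioning the term.
        "document_share": len(hits) / len(documents) if documents else 0.0,
        "organisation_share": (
            len(hit_organisations) / len(organisations) if organisations else 0.0
        ),
    }
```

A theme reported by many independent organisations might reasonably be weighted differently from one that appears often but only in a single source, which is why the two shares are kept separate.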