The Problem of Learning Analytics and AI
For some time now, I have been wanting to write about some of the problems I observed during my time in the Learning Analytics world (which also crosses over into Artificial Intelligence, Personalization, Sentiment Analysis, and many other areas as well). I’m hesitant to do so because I know the pitchforks will come out, so I guess I should point out that all fields have problems.
Even my main field of instructional design is far from perfect. Examining issues within a field is (or should be) a healthy part of its growth. So this will probably be a series of blog posts as I look at publications, conferences, videos, and other aspects of the LA/PA/ML/AI etc. world that are in need of a critical examination. I am not the first or only person to do this, but I have noticed a resistance by some in the field to considering these viewpoints, so hopefully adding more voices to the critical side will bring more attention to these issues.
But first I want to step back and start with the basics. At the core of all analytics, machine learning, AI, etc. are two things: surveillance and algorithms. Most people wouldn’t put it this way, but let’s face it: that is how it works. Programs collect artifacts of human behavior by looking for them, and then process those artifacts through algorithms. Surveillance, then algorithms – that is the core of all of it.
At the most basic level, the surveillance part is a process of downloading a copy of data from a database that was intentionally recording it. That data is often a combination of click-stream data, assignment and test submissions, discussion forum comments, and demographic data. All of this is surveillance, and in many cases this is as far as it goes. A LOT of the learning analytics world is based on click-stream data, especially where the focus is on predictive analytics. But a growing number of examples add more invasive forms of surveillance that rely on video recordings, eye and motion detection, biometric scans, and health monitoring devices. The surveillance is getting more invasive.
I would also point out that none of this is accidental. People in the LA and AI fields like to say that digital things “generate” data, as if it is some unintentional by-product of being digital: “We turned on this computer, and to our surprise, all this data magically appeared!”
Data has to be intentionally created, extracted, and stored to exist in the first place. In fact, there usually is no data in any program until programmers decide they need it.
They will then create a variable to store that data for use within the program. This is the moment where bias is introduced. The reason certain data – names, for example – are collected and others aren’t has to do with a bias towards controlling who has access and who doesn’t. Then that variable is given a name – it could be “XD4503” for all the program cares. But to make it easier for programmers to work together, they create variable names that can be understood by everyone on the team: “firstName,” “lastName,” etc.
Of course, this designation process introduces more bias. What about cultures that have one name, or four names? What about those that have two-part names, like the “al” that is common in Arabic names but isn’t really used for alphabetizing purposes? What about cultures that use their surname as their first name? What about random outliers? When I taught eighth grade, I had two students who were twins, and their parents gave them both nearly identical sets of five names. The only difference between the two was that the third name was “Jevon” for one and “Devon” for the other. So much of the data that is created – as well as how it is named, categorized, stored, and sorted – is biased towards certain cultures over others.
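To make this concrete, here is a minimal sketch of how a schema bakes those assumptions in. All of the field names and example records below are hypothetical – the point is only that the data exists because someone declared these exact slots, and every name then gets forced into them whether it fits or not.

```python
# Hypothetical student-record schema: the data only exists because a
# programmer declared these two fields and nothing else.
from dataclasses import dataclass

@dataclass
class StudentRecord:
    first_name: str   # assumes exactly one given name
    last_name: str    # assumes exactly one family name, sortable as-is

def sort_key(record: StudentRecord) -> str:
    # Naive alphabetization: mishandles Arabic "al-" prefixes, mononyms,
    # and cultures where the family name comes first.
    return record.last_name.lower()

students = [
    StudentRecord("Aaliyah", "al-Rashid"),  # "al-" often ignored when sorting
    StudentRecord("Sukarno", ""),           # mononym: no "last name" at all
    StudentRecord("Wei", "Zhang"),          # family name may be said first
]

# The schema forces every name into two slots, whether it fits or not.
for s in sorted(students, key=sort_key):
    print(s.first_name, s.last_name)
```

Notice that the bias isn’t in any one line of logic – it is in the decision, made before any data existed, that a name is two strings sorted by the second one.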
Also note here that there is usually nothing that causes this data to leave the program utilizing it. In order for some outside process or person to see this data, programmers have to create a method for displaying that data and/or storing it in a database.
Additionally, any click-stream, video, or biometric data that is stored has to be specifically and intentionally captured in ways that can be stored. For example, a click in itself is really just an action that makes a website execute some function. It disappears after that function happens – unless someone creates a mechanism for recording what was clicked on, when it was clicked, which user was logged in to do the clicking, and so on.
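The distinction can be sketched in a few lines. This is not how any particular platform works – the function and field names here are all hypothetical – but it shows the gap between a click that just runs a function and a click that someone has deliberately chosen to record.

```python
# Hypothetical click handler: nothing lands in click_log unless a
# programmer explicitly writes the recording branch below.
from datetime import datetime, timezone

click_log = []  # the "surveillance" side: empty until code fills it

def handle_click(user_id: str, element: str, record: bool = False) -> str:
    # The click's actual job: execute some function and return a result.
    result = f"opened {element}"
    # The click would vanish here -- unless a recording mechanism exists.
    if record:
        click_log.append({
            "user": user_id,
            "element": element,
            "time": datetime.now(timezone.utc).isoformat(),
        })
    return result

handle_click("student42", "week-3-quiz")               # nothing stored
handle_click("student42", "week-3-quiz", record=True)  # deliberately logged
print(len(click_log))  # 1
```

The `record=True` branch is the whole point: who clicked, on what, and when only become "data" because someone wrote the code to capture them.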
All of this is to say that none of this is coincidental, accidental, or unplanned. There is a specific plan and purpose for every piece of data that is created and collected outside of the program utilizing it. None of the data had to be collected just because it was magically “there” when the digital tools were turned on. The choice was made to create the data through surveillance, and then store it in a way that it could be used – perpetually if needed.
Therefore, different choices could be made to not create and collect data if the people in control wanted it that way. It is not inevitable that data has to be generated and collected.
Of course, most of the few people who will read this blog already know all of this. I state it here for anybody who might still be thinking that the problems with analytics and AI are created during the design of the end-user products. For example, some believe that the problems AI proctoring has with prejudice and discrimination started when the proctoring software was created… but really that is only the continuation of problems that started when the data these AI systems utilize was intentionally created and stored.
I think that the basic fundamental lens or mindset or whatever you want to call it for publishing research or presenting at conferences about anything from Learning Analytics to AI has to be a critical one rooted in justice. We know that surveillance and algorithms can be racist, sexist, ableist, transphobic – and the list of prejudices goes on. Where people are asking the hard questions about these issues, that is great. Where the hard questions seem to be missing, or people are not digging deep enough to see the underlying biases, I want to blog about it. I have also noticed that the implementation of LA/ML/AI tools in education too often lacks input from the instructional design / learning sciences / etc. fields – so that will probably come up in these posts as well.
While this series of posts is not connected to the Teach-In Against Surveillance, I was inspired to get started on this project based on reflecting on why I am against surveillance. Hopefully you will join the Teach-In tomorrow, and hopefully I will get the next post on the Empowering Learners for the Age of AI conference written in this lifetime. :)
Author: Matt Crosslin
Matt is currently an Instructional Designer II at Orbis Education and a Part-Time Instructor at the University of Texas Rio Grande Valley. Previously he worked as a Learning Innovation Researcher with the UT Arlington LINK Research Lab. His work focuses on learning theory, heutagogy, and learner agency. Matt holds a Ph.D. in Learning Technologies from the University of North Texas, a Master of Education in Educational Technology from UT Brownsville, and a Bachelor of Science in Education from Baylor University. His research interests include instructional design, learning pathways, sociocultural theory, heutagogy, virtual reality, and open networked learning. He has a background in instructional design and teaching at both the secondary and university levels and has been an active blogger and conference presenter. He also enjoys networking and collaborative efforts involving faculty, students, administration, and anyone involved in the education process.