Laying tracks on an infinitely long road
At the end of my time at Lambda School, I got the chance to work with Human Rights First, a non-profit organization. I joined them in building “Blue Witness”, a web application that will be used to increase visibility and awareness of police violence. The centralization of news through media networks owned by larger corporations comes with inherent bias. As the world’s networks grow stronger, people become more connected, the power of the individual is highlighted, and the inflow of data grows exponentially. Blue Witness became an opportunity to leverage deep learning models and Twitter data to create a system that spreads information more freely in the interest of the individual. I believe this is a small part of a much larger pattern of integrating artificial intelligence into our social networks. This is my short experience as a gandy dancer on this long road.
While working on the data science team, we redesigned the rails that information flows along: scraping Twitter for potential incidents of police violence, passing them through a deep learning model to extract information, and storing them in a database the web application can access. A data labeling app was created to make feeding training data to the model more efficient. A word-stem search was created to categorize and label incoming data. A Twitter bot was built to retrieve more information from Twitter users when necessary. All of this was routed to a live PostgreSQL database that updates periodically. The sketch below shows roughly how these pieces fit together.
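Here is a minimal sketch of that scrape, classify, store flow. The function names, the table schema, and the DATABASE_URL environment variable are illustrative placeholders rather than the project’s actual code; the scraping and model calls are stubbed and filled in later.

```python
# Sketch of the Blue Witness data flow: scrape -> classify -> store.
# Function names, the table schema, and the environment variable are
# illustrative placeholders, not the project's actual code.
import os

import psycopg2


def scrape_candidate_tweets():
    """Pull tweets that keyword filters flag as possible incidents (stubbed)."""
    return [{"tweet_id": "123", "text": "example report text"}]


def rank_with_model(text):
    """Ask the deep learning model to rank the described force (stubbed)."""
    return 3


def store_incident(conn, tweet):
    """Write one candidate incident to the live PostgreSQL database."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO incidents (tweet_id, text, force_rank) "
            "VALUES (%s, %s, %s) ON CONFLICT (tweet_id) DO NOTHING",
            (tweet["tweet_id"], tweet["text"], tweet["force_rank"]),
        )
    conn.commit()


if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    for tweet in scrape_candidate_tweets():
        tweet["force_rank"] = rank_with_model(tweet["text"])
        store_incident(conn, tweet)
    conn.close()
```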
The Troubles of Section Hands, Traqueros, and Platelayers
One of the challenges of A.I. is that it is full of contradictions. It’s an attempt to fit the problems of an entropic world into a binary system. The machines are just about everything except intelligent. Our success in using them relies on optimizing for a specific problem, yet architecting a brand new neural network for every specific problem is infeasible, so the most useful architectures have to be powerful, sophisticated, and able to generalize well. When this isn’t executed properly, what you often end up with is similar to a person using a two-handed hammer drill on a Venetian lampshade (or a tack hammer on a rail pin).
For Blue Witness, Google’s BERT model was used, and much thought had to go into molding it to fit our goals. BERT (Bidirectional Encoder Representations from Transformers) is an open-source, pre-trained language model that can be fine-tuned for natural language processing (NLP) tasks like interpreting text. In the context of the Blue Witness project, it was used to search tweets for reports of police violence.
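The project’s serving code isn’t reproduced here; the sketch below assumes the Hugging Face transformers library and a hypothetical fine-tuned checkpoint, just to show the general shape of scoring a single tweet (and to fill in the model stub from the earlier sketch).

```python
# Minimal sketch of scoring a tweet with a fine-tuned BERT classifier.
# The checkpoint path is hypothetical; the real project's training details
# and rank scale may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "saved_models/blue_witness_bert"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()


def rank_with_model(text: str) -> int:
    """Return the model's predicted force rank for one tweet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=-1).item())


print(rank_with_model("Officers fired tear gas into the crowd last night."))
```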
An existing dataset labeled with ranks of violence was used to train the model initially, but, as with many deep learning models, more high-quality data was needed. To tackle this problem, we created a web app that gave members of the development team and the non-profit organization the ability to add training data. Upon accessing the web app, a user would receive clear directions on how to label and submit data for training the model.
The form the data came in also presented challenges. Tweets have no required format, so valuable information like the date and time of a reported incident may not be present. To deal with this, we created a Twitter bot that would interface between the original posters of the tweets and the trusted admins on the Blue Witness web application. Using a Twitter developer account, we were able to automate the process of checking for tweets mentioning the bot’s account. While searching through these mentions, if the ID of the replier matched our database, the reply would get stored, then made visible to an admin interacting with the web application. This allowed us to automate replying on Twitter and storing new information while keeping the interactions personal, since they are controlled by an admin. This is especially important considering the high probability that these incidents are traumatic.
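A rough sketch of the mention-checking loop is below, assuming the Tweepy library and the Twitter v1.1 API. The credentials, the since_id bookkeeping, and the database helpers (is_known_poster, save_reply) are illustrative placeholders, not the bot’s actual code.

```python
# Sketch of checking the bot's mentions and storing replies from posters
# already in our database, for an admin to review later.
import os

import tweepy

auth = tweepy.OAuthHandler(os.environ["API_KEY"], os.environ["API_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"])
api = tweepy.API(auth)


def check_mentions(last_seen_id, is_known_poster, save_reply):
    """Store replies from original posters the bot has asked for details."""
    for mention in tweepy.Cursor(
        api.mentions_timeline, since_id=last_seen_id, tweet_mode="extended"
    ).items():
        # Keep the reply only if the replier's ID is already in our database,
        # i.e. someone the bot reached out to; an admin reviews it afterward.
        if is_known_poster(mention.user.id):
            save_reply(mention.user.id, mention.id, mention.full_text)
        last_seen_id = max(last_seen_id, mention.id)
    return last_seen_id
```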
The development teams were using the original dataset as a blueprint for designing the web application. The “tags” from that dataset became categories users could select to filter the data. These tags were not present in the incoming Twitter data, so I built a tag maker. The tag maker takes in any text and a list of possible tags to search for. It uses a word stemmer from the NLTK (Natural Language Toolkit) library. With this, the tag maker can match any inflected form of a word (past tense, plural, gerund, and so on) against the tag list. We no longer had to accept the tags that came with the original dataset, either. Instead, the team and HRF now have full autonomy over how incoming data is categorized, simply by supplying their own tag list. A sketch of the approach follows.
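This is a minimal version of the idea, not the project’s exact implementation: stem both the tag list and the incoming text with NLTK’s Porter stemmer, then keep any tag whose stem appears in the text. The example tag list is made up for illustration.

```python
# Sketch of the tag maker: match stems of candidate tags against the
# stems of the words in a tweet.
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def make_tags(text, possible_tags):
    """Return the subset of possible_tags whose stems occur in the text."""
    text_stems = {
        stemmer.stem(token) for token in re.findall(r"[a-z']+", text.lower())
    }
    return [tag for tag in possible_tags if stemmer.stem(tag.lower()) in text_stems]


print(make_tags(
    "Protesters say officers were shoving people and fired rubber bullets.",
    ["shove", "shoot", "tear-gas", "baton"],
))
# -> ['shove']  ("shoving" reduces to the same stem as "shove")
```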
Lastly, we wanted to make sure the work we did was easy to understand and iterate on. When I first came into this project, I had a hard time understanding some of the code; more detailed descriptions and comments would have been useful. I made it a point to write concise code with more verbose comments. I converted multiple functions in a Python file into a runnable script. I left clear information on the limitations of the tag maker when dealing with abstract concepts. The next gandy dancer to pick up where I left off will have their own roadmap: instructions on the tools, their dimensions, and their faults or limitations.
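For illustration only, turning a file of functions into a runnable script looks something like the pattern below; the module name (tag_maker) and flag names are placeholders rather than the project’s real layout.

```python
# Illustrative pattern: exposing the tag maker as a command-line script.
import argparse

from tag_maker import make_tags  # hypothetical module holding the function above


def main():
    parser = argparse.ArgumentParser(description="Apply word-stem tags to text.")
    parser.add_argument("text", help="Text to tag, e.g. a tweet")
    parser.add_argument("--tags", nargs="+", required=True, help="Candidate tag list")
    args = parser.parse_args()
    print(make_tags(args.text, args.tags))


if __name__ == "__main__":
    main()
```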
The Long Journey Ahead
“The road must be built, and you are the man to do it. Take hold of it yourself. By building the Union Pacific, you will be the remembered man of your generation.”
- President Abraham Lincoln to Oakes Ames, 1865
My time working on these rails came to an end before we completed the route. While we were able to rank more tweets to train the BERT model with, much more data will be needed in the future. We were able to bring the Twitter bot to life in development, but to implement it in production, future teams will have to collaborate to make the functionality available to web users. Even though most of the concrete work is done, there are abstract problems that still have to be considered. Given the checkered history of the United States, adding awareness and visibility to police violence requires attention to race and gender. The data, tags, and model predictions will have to be revisited to check for underrepresented groups of people.
The work we have done is only the beginning. The tools to build these solutions are freely accessible to anyone with a computer. The will of people to create, iterate, and improve will continue to present itself in this mission. Instead of limiting the scope of the project to Twitter, other social media outlets can be included. As the project proves successful, there will be good reason to grow the scope beyond the U.S. alone. Once the application is fully deployed, there will be new data to collect. The interactions of the Twitter bot can be measured for best practices, and fully automating them may become possible, which would improve efficiency. The process of creating new training data for the model may be outsourced to parties invested in the mission of Human Rights First. The project seems so far along, yet there are still so many possibilities.
These days, the constant inflow of data is a beckoning call for data scientists to analyze it and build new products and solutions. Taking already-built deep learning architectures and pointing them at incoming data is just the beginning. This concoction of social media and A.I. interpretation can only evolve into abstractions of higher complexity. If you allow your imagination to wander, and recognize these projects and tasks as a smaller part of a repeating pattern, it’s easy to see that these rails can lead to many places. The finished product today will become the building blocks for greater missions tomorrow. These roads will create networks enabling others to connect larger ideas and provide value in new ways. As developers and digital architects, it is important to recognize the responsibilities we are undertaking and to make sure these tools and functions are built on a strong ethical foundation.