Summary of the 20th IEEE International Conference on Information Reuse and Integration for Data Science (IRI 2019)

From July 30th to August 1st I participated in the 20th IEEE International Conference on Information Reuse and Integration for Data Science (IRI 2019) in Los Angeles, USA. Here are quick summaries of the presentations that I attended during the event.

 

Day 1: Tuesday, July 30th

Keynote 1:

After the official opening, the conference started with the keynote Some New Data Challenges for Data Science by Huan Liu (Arizona State University, USA). His slides are available here.

Prof. Liu introduced the topic saying that nowadays abundant data is ubiquitous (“data is the new oil”). In this context, Data Science emerges from Computer Science, Statistics, Information Sciences, etc. The recent success of AI is due to its use of data. For machine learning to work, we need data. A new source of Big Data is social media. He has been working in this area for some time and has published a successful book on the topic (available for free).

He then presented three challenges for working with social media data: (1) social media data seems really big, but it isn’t, and we want to make it bigger; (2) the privacy vs. utility trade-off; (3) how can we evaluate the results without a ground truth? (Also, data labeling is very time-consuming and usually done by the researchers themselves, introducing a conflict of interest.)

Regarding the first challenge (social media is not big), social media data suffers from the curse of dimensionality: the data has many dimensions, creating an exponential number of regions into which it can be categorized. To mitigate the problem, we need to identify relevant, redundant and irrelevant features using feature selection in order to reduce the number of dimensions. His group developed a Python library called scikit-feature for this purpose. In social media, companies want to make money from regular people, who produce little data. A few people (e.g., celebrities) produce a lot of data, but most people produce thin or sparse data (short head, long tail), on which machine learning won’t work. There are new opportunities in social media data research regarding multiple facets (posts, profile, linked information) and platforms (connecting users across social media sites).
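To make the feature-selection idea concrete, here is a minimal, self-contained sketch of a filter-style selector that ranks features by variance and drops near-constant (irrelevant) dimensions. This is an illustrative stand-in written for this summary, not the scikit-feature API; function names and data are hypothetical.

```python
# Toy filter-style feature selection: rank features (columns) by variance
# and keep the top k, discarding low-information, near-constant dimensions.
# Illustrative stand-in only -- not the scikit-feature library's API.

def variance(column):
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def select_top_k_features(rows, k):
    """Return the indices of the k highest-variance columns of `rows`."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    scores = [(variance(col), j) for j, col in enumerate(columns)]
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])

# Feature 0 is constant (irrelevant); features 1 and 2 actually vary.
data = [[1.0, 0.2, 5.0],
        [1.0, 0.9, 1.0],
        [1.0, 0.4, 9.0]]
print(select_top_k_features(data, 2))  # keeps the two varying columns
```

Real feature-selection methods use richer relevance scores (mutual information, Fisher score, etc.) and also account for redundancy between features, but the filter-and-rank shape is the same.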

Regarding the second challenge (utility vs. privacy), people’s data traces (e.g., browser histories) can help personalize their experience. However, the availability of these traces can make users vulnerable to attacks. In essence, this is the trade-off between privacy and utility. For instance, current solutions to protect privacy (e.g., anonymization) harm utility. He presented his proposal for protecting privacy without losing utility, called the PBooster Algorithm.

Finally, regarding the third challenge (no ground truth), prof. Liu motivated the problem with the question of migration between social media websites and referred to a CACM 2015 article from his research group for people interested in a solution they propose. He finished the talk mentioning his research group also made available a social computing repository, two recent surveys on the topic and books on the subject.

During the questions session, prof. Liu mentioned the role of knowledge base approaches when machine learning approaches are not suitable. Knowledge graphs are a hot topic at the moment.

 

Session A11: Machine Learning and AI I

The first paper was a work from UFSJ in Brazil (the authors couldn’t make it, so it was presented by a colleague), called Combining Data Mining Techniques for Evolutionary Analysis of Programming Languages. The goal of the work was to examine whether the changes in programming languages as they evolve have a positive or negative impact on the community. The proposal is a framework with three components: topic modeling (TM), sentiment analysis (SA) and data visualization (DV), which were explained in more detail during the presentation. TM is applied to the documentation of the language (using NMF), SA to the community feedback from forums and discussion lists (using VADER), and the results of both are sent to the DV component, which produces a sparkline bar chart (temporal visualization). The proposal was evaluated by analyzing Python Enhancement Proposals and community feedback from 2000 to 2017.

Next, Taghi Khoshgoftaar (Florida Atlantic University, USA) presented Evaluating Model Predictive Performance: A Medicare Fraud Detection Case Study. The contributions of the work are a unique approach for Medicare data preparation and a robust comparison of cross-validation vs. validating model performance with a new, unseen dataset. The work is motivated by the billions of dollars issue of Medicare fraud in the U.S. To prepare data, features were selected and then the dataset with the features was joined with another dataset in which the fraud vs. non-fraud information was found. Cross-validation was done on a single dataset by splitting it into k-folds. The approach was evaluated and ANOVA and Tukey’s HSD tests indicated that results are significantly better than the baseline.
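As a refresher on the cross-validation half of that comparison, here is a minimal sketch of how a single dataset is split into k folds, with each fold held out once for evaluation. This is generic k-fold logic written for illustration, not the paper’s Medicare pipeline.

```python
# Minimal k-fold cross-validation sketch: split one dataset of n samples
# into k folds; each round trains on k-1 folds and evaluates on the
# held-out fold, rotating the held-out fold. Illustrative only.

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Distribute any remainder across the first n_samples % k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))    # 5 rounds
print(folds[0][1])   # first held-out fold: [0, 1]
```

The paper’s point is that scores from such folds come from the same underlying dataset, which is why validating on a genuinely new, unseen dataset can give a different (and more honest) picture of model performance.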

Finally, M. Brian Blake (Drexel University, USA) presented Evaluation of a Reusable Technique for Refining Social Media Query Criteria for Crowd-Sourced Sentiment for Decision Making. The paper tackles the following research question: is there a statistically supported reusable technique that facilitates data consumers in selecting social media filtration criteria that reduce the gap between the consumer’s objective and the system’s output? The work proposes a customized statistical approach and model for interpreting the influence of bots in social media and analyzing social media data according to origin (who, where), originality (original, retweet, mention), sentiment (positive, negative) and participation (low, medium, high). The experiment was conducted on three datasets based on the following Twitter hashtags: #presidentialdebate (2016), #election2016 (election night) and #notmypresident (post-election protest, 2016). Experiment results show, in particular, that participation had the strongest statistical relationship with sentiment and a perfect correlation with retweet.

 

Keynote 2:

The second keynote was given by Matthew C. Stafford (U.S. Air Force’s Education and Training Command) and was titled “Chameleons” – Actors who can “Play any Part”: your Data can have a Starring Role Too!. Dr. Stafford uses Adaptive Learning in an Air Force Learning Services Ecosystem. The role of data in that ecosystem is huge, creating many opportunities. A 2019 bill mentions that all U.S. government agencies will need to identify the data they will handle and all the analyses they intend to perform with it. There’s a huge demand for Data Scientists in the government.

Particularly in the Air Force, there is a big demand for new training because airlines poach pilots and maintainers, given that the private sector offers better salaries than the government. It usually takes 10-12 months to train a fighter pilot and the challenge is to shorten that period. Applying adaptive learning, they built an AI Coach that analyzes workload and performance data from trainees and adapts the rigor of the learning environment for each of them. The cost and duration of training were cut in half. A huge amount of data is collected, but only 15-20% is being used, so there are many opportunities there.

Dr. Stafford switched the context to Education in Academia, in which the norm is to assess students with only a midterm and a final exam: we need more touchpoints. Further, we should assess not only whether students understand the material, but also how they feel about it. There are many isolated silos in academia (registrar, finance, housing, health, social, academic, sports, etc.). We tend to focus only on the academic silo, and that does not have all the data we need.

On the topic of Machine Learning, Dr. Stafford sees exciting things, but argues we need more of the human side. He recommended that Data Scientists talk to the people (in Government, for instance) who have the data and understand together how data science can help them, excite them and get access to the data. As a final recommendation: don’t fight bureaucracy head on, go around it. Bureaucracy exists to survive and thrive; one needs to learn how to work with it in order to improve the world.

 

Session A23: Novel Data Mining and Machine Learning Applications I:

This session started with my presentation of our paper GO-FOR: a Goal-Oriented Framework for Ontology Reuse.

Next, Sampath Jayarathna (Old Dominion University, USA) presented A Sentiment Classification in Bengali and Machine Translated English Corpus. The basic idea of the work is to reuse sentiment analysis resources that already exist for mainstream languages in other languages. Working on the Bengali language, they applied machine learning techniques instead of lexicon-based methods for cross-lingual sentiment analysis, given the lack of lexicon-based tools for that language. The proposal is based on a simple translation to English followed by the use of the available tools, and therefore has high potential to generalize to other languages.
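The translate-then-reuse idea can be sketched in a few lines. Everything below is a toy stand-in written for this summary: the word-by-word dictionary translator and the keyword-based English classifier are hypothetical placeholders for the paper’s machine translation step and trained English models.

```python
# Sketch of the cross-lingual idea: translate non-English text to English,
# then reuse tooling that already exists for English. Both the translator
# and the classifier below are trivial stand-ins, not the paper's models.

def translate_to_english(text, dictionary):
    """Word-by-word stub translation; a real system would use an MT service."""
    return " ".join(dictionary.get(word, word) for word in text.split())

def english_sentiment(text):
    """Toy keyword classifier standing in for a trained English model."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    score = sum((w in positive) - (w in negative) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Hypothetical mini-dictionary for illustration only.
bn_to_en = {"bhalo": "good", "kharap": "bad"}
print(english_sentiment(translate_to_english("bhalo", bn_to_en)))   # positive
print(english_sentiment(translate_to_english("kharap", bn_to_en)))  # negative
```

The appeal of this architecture is that only the translation step is language-specific; everything downstream of it is reused unchanged.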

Then, Sandeep Reddivari (University of North Florida, USA) presented Software Quality Prediction: an Investigation based on Machine Learning. The work is motivated by the cost of maintenance in low quality software, so they want to predict software quality. They experimented with eight machine learning techniques in the context of reliability (number of defects) and maintainability (number of changes) and compared 9 object-oriented metrics over the source code (obtained with PROMISE) using J48, Random Forest, Naive Bayes, Bayesian Network, PART, KNN, SVM and ANN. Well-known open source projects were used in the experiments, which showed that Random Forest had the best results, followed by PART, J48 and KNN.

Prof. Reddivari went on to present another paper, entitled Enhancing Software Requirements Cluster Labeling Using Wikipedia. Clustering has been successfully used in Software Engineering, particularly in reverse engineering, program comprehension, traceability and source code visualization. There is some use also in Requirements Engineering, including previous work from himself. In this paper, he presents a review of the automated cluster labeling methods for requirements documents, proposes a framework for enhancing cluster labeling using Wikipedia and reports experiments over three datasets (iTrust, eTour and CM-1). The process consists of pre-processing requirements documents, clustering them and labeling the clusters automatically using Wikipedia. Then the labels are compared with labels produced by domain experts.

 

Panel I: On Expanding the Impact of Data Science on the Theory of Intelligence and its Applications

The panel was composed of Stuart H. Rubin (SPAWAR Systems Center Pacific, USA), Matthew C. Stafford (U.S. Air Force’s Education and Training Command), Taghi Khoshgoftaar (Florida Atlantic University, USA) and Chengcui Zhang (University of Alabama at Birmingham, USA).

I couldn’t really follow the discussions as I’m not an expert on these topics. One of the things they discussed was the limitations of Deep Learning/Neural Networks (it fails once every thousand runs, would you trust it to drive your car?; it doesn’t do a good job playing chess; simply inverting the colors in a simple image can break it; etc.). Panelists seem to agree that a combination of neural network techniques with symbolic AI techniques shows some promise.

Another thing that was mentioned was the current trend of trying old methods now that we have better computers, and whether the promise of Quantum Computing would solve current challenges. Comments went in the direction of problems also becoming harder, so increasing processing power alone won’t solve this. We will also need new algorithms.

Dr. Stafford mentioned again the importance of putting together Machine Learning and Human Learning. The human aspects of the application of AI are being overlooked and it’s very important to pay attention to them.

 

Day 2: Wednesday, July 31st

The program for Wednesday’s morning was all about neural networks, so I decided not to attend.

 

Invited Presentation:

After lunch, Stuart H. Rubin (SPAWAR Systems Center Pacific, USA) presented a paper called The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by Jonathan Frankle and Michael Carbin, originally presented at the International Conference on Learning Representations (ICLR), in May 2019. This was inserted in the program after a keynote speaker canceled last minute.

The hypothesis is that dense, randomly-initialized feed-forward neural networks contain subnetworks the authors call “winning tickets” that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations with 10-20% of the size of the full network. Therefore, you would need far fewer resources to run them.

Dr. Rubin presented the paper in more detail. He also made his own proposal on an improved way to find the winning ticket. He said that this is a good opportunity to try to improve neural networks in order to overcome its limitations.

 

Session B22: Databases and Access

Via Skype, Luiz Henrique Zambom Santana (UFSC, Brazil) presented A Middleware for Polyglot Persistence of RDF Data into NoSQL Databases. It is becoming common to have multiple persistence technologies used in the same project, creating a series of challenges. They propose WA-RDF, a workload-aware middleware for storing and querying RDF data in multiple NoSQL database nodes. Experiments showed that using WA-RDF to manage the partitions is better than delegating it to the NoSQL databases, reducing the response time.

Ravi Sandhu (University of Texas at San Antonio, USA) presented On the Feasibility of Attribute-based Access Control Policy Mining. Attribute-based access control (ABAC) means looking at attributes of users and of the objects they want to access, and determining access based on them. Creating these access rules manually is infeasible due to the number of attributes, and for automatic generation of ABAC rules there is the question of whether creating such rules is even feasible. They propose a partition-based strategy for determining this feasibility that consists of building a partition set and verifying whether the set is conflict-free w.r.t. writing.
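To make the ABAC model concrete: an access decision evaluates rules expressed as predicates over user and object attributes, granting access if any rule matches. The sketch below illustrates only this basic decision shape; the rules and attribute names are hypothetical examples, not from the paper (which is about mining such rules, a much harder problem).

```python
# Minimal attribute-based access control (ABAC) sketch: each rule is a
# predicate over (user, object, action); access is granted if any rule holds.
# Rules and attributes are hypothetical examples for illustration.

def is_permitted(user, obj, action, rules):
    """Grant access if any rule's conditions all hold for this request."""
    return any(rule(user, obj, action) for rule in rules)

rules = [
    # Doctors may read records from their own department.
    lambda u, o, a: a == "read" and u["role"] == "doctor"
                    and u["dept"] == o["dept"],
    # Record owners may always read their own record.
    lambda u, o, a: a == "read" and u["id"] == o["owner"],
]

alice = {"id": 1, "role": "doctor", "dept": "cardiology"}
record = {"owner": 2, "dept": "cardiology"}
print(is_permitted(alice, record, "read", rules))   # True: same department
print(is_permitted(alice, record, "write", rules))  # False: no rule covers write
```

Policy mining works in the opposite direction: given a log or matrix of permitted accesses plus the attribute data, infer a rule set like the one above, which is where the feasibility question arises.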

Chetna Suryavanshi (Utah State University, USA) presented Renovating Database Applications with Query AutoAwesome. Google AutoAwesome is a tool that automatically enhances photos in Google Photos. Analogously, Query AutoAwesome provides automatic enhancements to SQL queries in database applications. It enhances queries in terms of alternative literals, query combination, column padding or group padding.

 

Day 3: Thursday, August 1st

Panel II: Integration of Artificial Intelligence and Security

Bhavani Thuraisingham (University of Texas at Dallas, USA) introduced the panel, motivating the need for security and privacy concerns in Artificial Intelligence. She divided this in three topics: AI for cybersecurity, cybersecurity for AI and AI causing security/privacy issues. She then introduced the panelists and each of them had a few minutes to talk.

Stuart H. Rubin (SPAWAR Systems Center Pacific, USA) argues for making code more secure by creating variations of it and using automatic programming techniques to reduce costs. Such techniques still need further research, so there are opportunities there.

Ravi Sandhu (University of Texas at San Antonio, USA) talked about the importance of taking into account the characteristics of the domain, and about experiments conducted with scaling up systems to make them more robust, with good results. He closed his speech saying that most systems will have AI, so he’s not sure if security for AI is so much different from security for other kinds of systems.

Danda Rawat (Howard University, USA) discussed the matter of trust when we delegate to AI things like configuring security measures in our systems.

Brian Ricks (PhD candidate at the University of Texas at Dallas, USA) talked about signature-based vs. anomaly-based approaches to cybersecurity, the former being more applicable in practice due to a smaller number of false positives, but the latter being more successful in a few specific cases and worth looking into.

Discussion followed with a few questions, but frankly I lost interest.

 

Session C12: Data Modeling and Knowledge Base

Stuart H. Rubin (SPAWAR Systems Center Pacific, USA) presented DDM: Data-Driven Modeling of Physical Phenomenon with Application to METOC. The work addresses representation, acquisition, and randomization of experiential knowledge for autonomous systems in expert reconnaissance (e.g., weather prediction). Based on cases from experience, rules are derived, creating a search space. Random and symmetric searches are used to perform fuzzy predicate matching in order to derive predictions (e.g., it will rain) based on current sensor information (e.g., barometer measurements are rising and it’s warm).

Nisansa de Silva (University of Oregon, USA) presented An Overview of utilizing Knowledge Bases in Neural Networks for Question Answering. Question answering involves language understanding, domain knowledge and answer production. This work consists of a survey on the second item (use of domain knowledge) with a particular focus on neural question answering.

There was a third paper in the session, entitled Scalable analysis of open data graphs. However, the session chair did not control the time of the first two presentations, and those took the entire time of the session.

 

Session C21: Information Systems

Brian Ricks (PhD candidate at the University of Texas at Dallas, USA) presented Mimicking Human Behavior in Shared-Resource Computer Networks. The work is motivated by the fact that sometimes there is a lack of relevant datasets to perform your research on, in particular in the domain of computer networks. The goal of the work is to automate user behavior that resembles the intended target behavior you want to analyze. This work built upon the eMews framework presented in IRI 2018 (best paper award that year), trying to solve a limitation that the behavior was static.

Juliana Fernandes (Masters student at UNIRIO, Brazil) presented A Conceptual Model for Systems-of-Information Systems. The purpose of the paper was to establish a conceptual model to support researchers and practitioners in recognizing systems of information systems (SoIS), based on a literature review on the topic. The model was evaluated by instantiating three cases from the literature. Such a model could help researchers and practitioners in the design of SoISs.

Bharat S. Rawal (Penn State Abington, USA) presented A Comparative Study of System Virtualization Performance. They performed an experiment to measure throughput, jitter, response time and packet loss between two clients running Windows and Linux virtualized systems. The conclusions are that virtualized systems have slightly lower performance relative to non-virtualized systems due to overhead, that there was a 12%-20% performance degradation with heterogeneous host-guest OS combinations depending on the metric, and that using Linux as the host/guest OS results in better performance than Windows.

 

My opinion

I never do this, but this conference kind of deserves it: this was the saddest conference I’ve ever been to.

Registration was US$ 780 (the most expensive I’ve ever seen) to have the conference at the Sheraton (note: lunch was not included). Keynotes and panels (no parallel activity), held in a room for 70 people, usually had 15, never more than 20 people in them. When sessions were held in parallel, it was even less. I presented my paper to 6 other people, one being the session chair and two others presenting their works in that same session. It seemed like the only people in the conference were authors/presenters and organizers. They could have easily done this on a university campus and cut their costs drastically, attracting students and other people to the presentations.

The call for papers lists 12 topics, but almost the entire conference was about just one of them: Machine Learning and AI (particularly, neural networks). You can check the program for yourself to see what I’m talking about. If the papers accepted by the program committee were all about this topic, OK… However, having all the keynotes, invited talks and panels strictly about this, why not change the name of the conference and adjust the call for papers, if that’s all you want to talk about?

Even though the conference took place at an overpriced and way-more-than-necessary venue, it still looked very amateurish. Organizers would constantly mention people who couldn’t make it to the conference. One author from Brazil was told that they could not present their paper remotely and asked a colleague (who was not an expert on the work) to do it; then at another session an author did present his paper via Skype (coincidentally, both were Brazilian). One of the invited talks basically consisted of the conference general chair presenting someone else’s paper that was recently presented at a different conference (and, by the way, he got the authors wrong during his talk). Authors were asked to provide short bios that were never used. Wi-Fi was provided only to people staying at the conference hotel.

I guess all this reflected on people, as you would clearly see that most of them would attend the conference only on the days they were presenting something. Even I skipped a morning, which I never do (ever since my PhD supervisor advised me not to, for the sake of my academic career). The whole thing looked like a dying community spending a lot of grant money to get together and discuss a single topic, which they could have done spending a lot less.

Last, and probably least, what was that awards ceremony during the banquet? A dozen outstanding services awards were given to basically everyone involved in the organization (including the conference webmaster). Really weird…