The article explains how researchers collect, label, and link Android vulnerabilities, their fixes, and the underlying vulnerability-inducing code to create a dataset for evaluating security-bug classifiers.

Inside the Data Pipeline Behind Classifying Android Security Flaws

2025/11/19 15:00

ABSTRACT

I. INTRODUCTION

II. BACKGROUND

III. DESIGN

  • DEFINITIONS
  • DESIGN GOALS
  • FRAMEWORK
  • EXTENSIONS

IV. MODELING

  • CLASSIFIERS
  • FEATURES

V. DATA COLLECTION

VI. CHARACTERIZATION

  • VULNERABILITY FIXING LATENCY
  • ANALYSIS OF VULNERABILITY FIXING CHANGES
  • ANALYSIS OF VULNERABILITY-INDUCING CHANGES

VII. RESULT

  • N-FOLD VALIDATION
  • EVALUATION USING ONLINE DEPLOYMENT MODE

VIII. DISCUSSION

  • IMPLICATIONS ON MULTI-PROJECTS
  • IMPLICATIONS ON ANDROID SECURITY WORKS
  • THREATS TO VALIDITY
  • ALTERNATIVE APPROACHES

IX. RELATED WORK

CONCLUSION AND REFERENCES


V. DATA COLLECTION

This section describes how the vulnerability dataset is collected and generated for evaluating the accuracy of classifier models. The dataset consists of a list of source code changes, each labeled as either a ViC or an LNC (a category that includes VfCs). This binary labeling aligns with the goal of this study: building a classifier that accurately differentiates ViCs from LNCs.
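As a concrete illustration of the labeling scheme, each dataset entry can be thought of as a code change plus a binary label. The field names below are hypothetical, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one labeled change; the field names are
# illustrative assumptions, not the paper's schema.
@dataclass(frozen=True)
class LabeledChange:
    change_id: str    # gerrit change ID
    commit_hash: str  # git commit hash
    label: str        # "ViC" or "LNC" (the LNC set includes VfCs)

def is_vic(change: LabeledChange) -> bool:
    """The classifier's target: distinguish ViCs from LNCs."""
    return change.label == "ViC"

example = LabeledChange("Iabc123", "deadbeef", "ViC")
print(is_vic(example))  # → True
```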

The data collection process (depicted in Figure 2) involves three key steps:

(1) selecting all critical vulnerabilities found in the target AOSP codebase,

(2) associating each vulnerability with its corresponding fixes, i.e., VfCs; and

(3) locating the ViC(s) for each VfC.
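Under stated assumptions, the three steps can be sketched as a small pipeline. Every function name, field name, and input below is a hypothetical stand-in for the paper's actual tooling:

```python
# Hypothetical sketch of the three-step collection pipeline.

def select_target_cves(asb_cves, excluded_kinds):
    """Step 1: keep AOSP CVEs, dropping the excluded categories
    (internal-only, vendor/ODM extensions, upstream Linux kernel)."""
    return [c for c in asb_cves if c["kind"] not in excluded_kinds]

def associate_vfcs(cve, bug_to_changes):
    """Step 2: map a CVE's bug reports to posted change IDs (VfCs)."""
    return [ch for bug in cve["bugs"] for ch in bug_to_changes.get(bug, [])]

def locate_vics(vfc):
    """Step 3: placeholder for the blame-based ViC search."""
    raise NotImplementedError

cves = [{"id": "CVE-2023-0001", "kind": "aosp", "bugs": ["b/111"]},
        {"id": "CVE-2023-0002", "kind": "upstream-kernel", "bugs": []}]
targets = select_target_cves(cves, {"upstream-kernel", "vendor"})
print([c["id"] for c in targets])  # → ['CVE-2023-0001']
```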

Selecting Target Vulnerabilities. As depicted in Figure 2, this study leverages the CVE (Common Vulnerabilities and Exposures) database, maintained by the National Cybersecurity FFRDC (NCF), to select the target vulnerabilities. Specifically, it focuses on the CVEs found in the target AOSP codebase (namely, AOSP CVEs) and published on the AOSP Security and Update Bulletins (ASB). This study excludes some types of CVEs to remain focused. First, CVEs self-discovered and fixed internally by Google during new Android dessert releases (e.g., v14) are omitted due to the lack of publicly available details. Second, vulnerabilities found in the proprietary extensions from silicon vendors, ODMs (e.g., Qualcomm), and the Google Play services are not considered because they fall outside upstream AOSP development. Third, CVEs of the upstream Linux kernel (e.g., mainline, stable, and long-term releases) are excluded, although AOSP-specific Linux kernel CVEs are included (e.g., ones found in the Android common kernel extensions). This is because they often involve code developed by Google, silicon vendors, and ODMs, and are not strictly tied to a specific AOSP platform version.

Associating Vulnerabilities and Fixes. For each of the selected CVEs, this step locates the associated VfC(s). It begins by identifying all the relevant bug report(s) linked from a given CVE issue. We note that every target CVE issue published on the AOSP security bulletins has one or more associated bug reports stored in an issue tracking service (e.g., the Google issue tracker, aka Buganizer). Conversely, multiple CVEs can sometimes share the same bug report if their fixes are identical or closely related.

Bug reports offer valuable insights into the vulnerability fixing process (e.g., key discussions held while reproducing or fixing the issue). In the vast majority of cases, bug reports contain information about all or a subset of their VfCs. The link is explicit when a VfC lists a bug report ID in its code change description (e.g., Bug: or Fixes: in the gerrit change description), because the VfC's submission event is then posted on the bug report.

Our BugID2GerritID script automates the process of finding VfCs. It takes a list of bug IDs as input, scans the content of those bug reports, and returns any posted change IDs. Because code changes can be cherry-picked to other branches, a single change can exist across multiple branches. At this stage, the script does not yet differentiate between the original change and its cherry-picks; it gathers the change IDs (i.e., gerrit IDs) of all relevant changes.
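A minimal sketch of the BugID2GerritID idea, assuming bug-report text can be fetched and that change IDs follow gerrit's Change-Id format (an "I" followed by 40 hex digits). The `fetch_report` callable is a hypothetical accessor, not the paper's actual interface:

```python
import re

# Gerrit Change-Id format: "I" followed by 40 hex digits.
CHANGE_ID_RE = re.compile(r"\bI[0-9a-f]{40}\b")

def bugid2gerritid(bug_ids, fetch_report):
    """Scan each bug report's text and return posted change IDs.
    Cherry-picks share a Change-Id, so duplicates collapse to one."""
    found = {}
    for bug in bug_ids:
        text = fetch_report(bug)
        found[bug] = sorted(set(CHANGE_ID_RE.findall(text)))
    return found

# Hypothetical bug-report content for illustration.
reports = {"b/123": "Fixed by change I" + "a" * 40 + " (cherry-picked)."}
print(bugid2gerritid(["b/123"], reports.get))
```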

While VfCs for CVEs or other important security issues usually reference their bug report in their gerrit description, in practice this is not always the case, depending on the development protocol used. If the script finds no gerrit ID, a manual review is triggered for all such bug reports to find the associated, implicit VfCs. Rarely, a bug report has no VfC at all when the externally reported issue does not exist in the internal repository (e.g., it was already resolved). Occasionally, such manual analyses reveal relevant gerrit changes or commits (e.g., via URLs) linked to the VfCs. In those cases, the GerritID2ChangeIDandCommitHash script is used to extract the specific VfC IDs and commit hashes from the gerrit IDs. Importantly, commits sharing the same change ID indicate cherry-picks of the original change.
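The cherry-pick relationship described above can be recovered by grouping commits on their shared change ID. The tuples below are illustrative inputs, not the paper's data format:

```python
from collections import defaultdict

# Commits sharing a gerrit Change-Id are cherry-picks of one logical change.
def group_cherry_picks(commits):
    """Group (change_id, branch, commit_hash) tuples by change ID."""
    groups = defaultdict(list)
    for change_id, branch, commit_hash in commits:
        groups[change_id].append((branch, commit_hash))
    return dict(groups)

commits = [("Iaaa", "main", "c1"),
           ("Iaaa", "android13-release", "c2"),  # cherry-pick of c1
           ("Ibbb", "main", "c3")]
print(group_cherry_picks(commits)["Iaaa"])
# → [('main', 'c1'), ('android13-release', 'c2')]
```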

Locating Vulnerability-inducing Changes (ViCs) for each Vulnerability-fixing Change (VfC). The primary objective of Vulnerability Prevention (VP) is to maximize the accurate identification of ViCs. However, the two preceding steps only identify CVEs and VfCs. Thus, we introduce a technique that identifies ViCs from a given VfC. The identified ViCs undergo manual analysis to remove irrelevant code changes, resulting in a refined ViC set used to evaluate VP classifiers and features.

Table I presents the algorithm for finding ViCs. It first identifies all the changed lines (i.e., additions and deletions) by using the git show command and parsing its output. For each identified, changed source code line, our Blame script filters out extraneous lines (e.g., empty lines, headers, and comments) in order to retain only the relevant vulnerability-fixing lines (VfLs). For a deleted line or a sequence of deleted lines, the script checks when each deleted line was added or last modified. The emphasis on the addition and last modification helps pinpoint potential ViCs because those code changes could have addressed the vulnerability but failed to do so. We note that automatically and accurately determining whether a target vulnerability originates from the last modification or from prior changes (if such changes exist) remains a challenge. Thus, this study relies on manual reviews for such cases.
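The deleted-line half of this algorithm might be sketched as follows, assuming a local git checkout. The filtering regex and function names are assumptions rather than the paper's exact tooling; the blame call uses standard `git blame -L` to ask which commit last touched each deleted line in the VfC's parent:

```python
import re
import subprocess

# Lines that are not vulnerability-fixing lines (VfLs): blanks, comments,
# and header includes. The exact filter rules here are assumptions.
EXTRANEOUS = re.compile(r"^\s*($|//|/\*|\*|#include\b)")

def is_relevant(line):
    """Keep only VfLs, dropping empty lines, comments, and headers."""
    return not EXTRANEOUS.match(line)

def vic_candidates_for_deletions(repo, vfc_commit, path, deleted_linenos):
    """For each deleted line, ask `git blame` on the VfC's parent which
    commit added or last modified it; those commits are ViC candidates."""
    candidates = set()
    for lineno in deleted_linenos:
        out = subprocess.run(
            ["git", "-C", repo, "blame", "-l",
             "-L", f"{lineno},{lineno}", f"{vfc_commit}^", "--", path],
            capture_output=True, text=True, check=True).stdout
        candidates.add(out.split()[0])  # blamed commit hash
    return candidates
```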

When deleted lines are replaced by newly added lines, the added lines are typically more complex (e.g., in terms of the number of lines), implementing tailored error-checking rules and error-handling routines that can prevent the corresponding vulnerability at runtime. The tool does not classify such a case as a modification because it is challenging to determine whether it is a sequence of deletions and additions or a true modification.

For an added line or consecutively added lines in VfLs, our Blame script analyzes when the next valid line was last modified. Here, a next valid line means, for example, a line that is neither empty nor a comment. This targets the common case where an error-checking routine is added right before a checked variable is used. By examining the addition or last-modification time of the subsequent line, the tool identifies potential ViCs where the initial error checks for those variable(s) might have been missed.
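A sketch of this added-line heuristic, under the assumption that the fix inserts an error check just before the unchecked use. The helper and the example file contents are hypothetical; blaming the returned line number would then name the ViC candidate:

```python
def next_valid_lineno(lines, start):
    """First line at or after `start` that is neither blank nor a comment
    (the assumed filter; real tooling may use richer rules)."""
    for i in range(start, len(lines)):
        text = lines[i].strip()
        if text and not text.startswith(("//", "/*", "*")):
            return i
    return None

# Hypothetical file state after a fix that added a NULL check.
file_after_fix = [
    "if (buf == NULL) {   // added by the VfC",
    "    return -EINVAL;  // added by the VfC",
    "}",
    "",
    "memcpy(dst, buf, n);  // pre-existing use: blame this line",
]
print(next_valid_lineno(file_after_fix, 3))  # → 4
```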

If multiple ViCs are identified for a single VfC, the script lists them all. While the analysis in this study mostly relies on such script-based automated techniques for locating ViCs, valuable insights for locating ViC(s) are sometimes found in the discussions posted on the bug reports or in the descriptions of VfCs. The tools and their algorithms are continuously refined through an iterative validation process over the discovered ViCs.

:::info Author:

  1. Keun Soo Yim

:::

:::info This paper is available on arxiv under the CC BY 4.0 (Attribution 4.0 International) license.

:::

