This research investigates unethical behavior in open-source software (OSS) projects by analyzing the software artifacts that are affected. Building on previous taxonomies, we categorize and refine them to identify 18 types of artifacts that can reflect ethical transgressions, including source code, configuration files, licenses, project-level features, and GitHub interactions.

18 Ways Unethical Behavior Creeps Into Open-Source Software

2025/09/22 18:02

Abstract and 1. Introduction

  2. Background and Related Work

  3. Study of Unethical Behavior in OSS

    3.1 RQ1: Types of unethical behavior

    3.2 RQ2: Affected software artifacts

  4. Methodology

    4.1 Modeling via SWRL rules

    4.2 Automatic detection of unethical behavior

  5. Evaluation

  6. Discussion and Implications

  7. Threats to Validity

  8. Conclusion and References

3.2 RQ2: Affected software artifacts

We define affected software artifacts as objects in software repositories that violate ethical principles. To derive the set of affected software artifacts, we started with the 19 categories from the taxonomy of a prior study [74] and categorized the artifacts found in our study according to them. After removing categories for which no artifact was found, we obtained eight categories: (1) source code, (2) script, (3) configuration, (4) database (data), (5) image, (6) prose, (7) legalese, and (8) other. For the prose category (i.e., plain text files), we found only two concrete types (README/CONTRIBUTING.md and CHANGELOG), so we separated them into two categories. As the category “other” in the prior study [74] is too broad, we split it into 10 new categories following the aforementioned thematic analysis steps: (1) external application programming interface (API), (2) user interface (UI), (3) project, (4) release history, (5) software feature, (6) product name, (7) operating system (OS), (8) website, (9) PR/Issue code review, and (10) PR/Issue comment. We derived “PR/Issue code review” and “PR/Issue comment” from prior work [58]. Our newly introduced categories aim to preserve the hierarchy of artifacts (Project > Software feature [50] > Source code). For the 28 cases (8.9%) in which the two authors assigned different categories, they met to discuss the issues and resolve the disagreement.

Finally, we obtained 18 types of affected software artifacts: (1) project, (2) software feature, (3) source code, (4) external API, (5) legalese, (6) product name, (7) release history, (8) UI, (9) configuration file, (10) PR/Issue code review, (11) PR/Issue comment, (12) README / CONTRIBUTING.md, (13) CHANGELOG, (14) data, (15) image, (16) OS, (17) website, and (18) script (i.e., source code in languages executed by an interpreter). As some of these artifacts are less self-explanatory, we explain them below:

Project: The affected artifacts involve more than one type of artifact within the entire repository.

Software feature: Functional or non-functional requirements of a system [50, 57]. An example is the ability to unsubscribe from a service.

Source code: Source files (excluding scripts, binary code, and build code) that belong to the current repository (internal).

External API: An API from a third-party (external) library or service.

Legalese: Licenses, copyright notices, or patents.

Product name: The product, project, or app name.
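The refinement described above (eight surviving categories, with “prose” replaced by two concrete types and “other” split into ten) can be checked with a short sketch. This is only an illustration of the counting, not the authors' tooling; the helper name `refine` and the exact label strings are assumptions made for this example.

```python
# Eight categories that remained from the prior taxonomy [74].
base_categories = [
    "source code", "script", "configuration", "data",
    "image", "prose", "legalese", "other",
]

# Categories that were replaced by finer-grained ones.
splits = {
    "prose": ["README / CONTRIBUTING.md", "CHANGELOG"],
    "other": [
        "external API", "UI", "project", "release history",
        "software feature", "product name", "OS", "website",
        "PR/Issue code review", "PR/Issue comment",
    ],
}


def refine(categories, splits):
    """Replace each split category by its sub-categories; keep the rest."""
    refined = []
    for category in categories:
        refined.extend(splits.get(category, [category]))
    return refined


artifact_types = refine(base_categories, splits)
print(len(artifact_types))  # 6 unchanged + 2 from prose + 10 from other
```

Six categories carry over unchanged, so the total is 6 + 2 + 10 = 18, matching the number of artifact types reported.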


The third column in Table 1 presents the affected artifacts for each unethical behavior. Each number in the column denotes the number of GitHub issues involving a certain type of artifact (e.g., “19 Projects” means that there are 19 issues in which a project is the artifact affected by S2). In theory, one issue might discuss multiple artifacts, but we found that each issue discusses only one artifact because (1) developers prefer discussing ethical concerns for one type of artifact in one issue, and (2) some categories are hierarchical (e.g., “project” includes multiple types of artifacts). Overall, Table 1 shows that source code is still the most common type of artifact involved in unethical behavior (i.e., it appears in eight types of unethical behavior).
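The tallying behind such a column can be sketched as follows, assuming each labeled issue carries exactly one behavior and one artifact, as reported above. The issue records below are toy values invented for illustration, not data from the study, and the behavior identifiers other than S2 are placeholders.

```python
from collections import Counter, defaultdict

# Toy labels, NOT data from the study:
# (issue id, unethical-behavior id, affected artifact type)
labeled_issues = [
    (1, "S2", "project"),
    (2, "S2", "project"),
    (3, "S2", "source code"),
    (4, "S1", "source code"),
    (5, "S3", "legalese"),
]

# Per behavior: how many issues involve each artifact type.
counts = defaultdict(Counter)
# Per artifact type: which behaviors it appears in.
behaviors_per_artifact = defaultdict(set)

for _issue_id, behavior, artifact in labeled_issues:
    counts[behavior][artifact] += 1
    behaviors_per_artifact[artifact].add(behavior)

# The artifact type that appears in the largest number of behavior types.
most_widespread = max(
    behaviors_per_artifact, key=lambda a: len(behaviors_per_artifact[a])
)
```

With the toy labels, `counts["S2"]["project"]` corresponds to an entry such as “2 Projects”, and `most_widespread` is the artifact type that spans the most behavior types.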


4 METHODOLOGY

Our study shows that diverse types of unethical behavior exist in OSS projects, and they usually involve diverse types of software artifacts. The diversity and the complexity of the rules governing the ethics-related activities in GitHub motivate the need for a modeling approach that can abstract this complexity and facilitate its automatic detection. In Section 4.1, we describe how we model unethical behavior using SWRL rules. Then, we explain the architecture of Etor that uses SWRL rules for automatic detection in Section 4.2.

Table 2: GitHub attributes and types for auto-detection

4.1 Modeling via SWRL rules

We propose using SWRL rules to represent unethical behavior in an OSS project together with the publicly available data in GitHub. SWRL rules allow us to model affected software artifacts as hierarchies of classes and properties, capturing the relationships between affected software artifacts and stakeholders. Table 2 shows the GitHub attributes used in our modeling. The columns “Attribute” and “Type” give each attribute and its type. We model each OSS project as GHRepository. By referring to the GitHub Repositories API [19], we selected 11 data properties (e.g., latestRelease and licenseFile) that belong to a GHRepository, excluding properties that are irrelevant to unethical behavior (e.g., avatar_url, which points to the icon for a repository). Apart from GHRepository, we introduce six classes to model data properties of a repository: (1) GHUser, (2) GHCommit, (3) GHContent, (4) GHIssue, (5) GHPullRequest, and (6) GHRelease. While GitHub users (GHUser) usually play different roles in OSS projects, we only model (1) contributors (users who are official contributors of a repository) and (2) issue owners (users who report an issue). For GHIssue, we reuse GitHub's own convention by modeling a PR (GHPullRequest) as a subclass of GHIssue (i.e., the GitHub Issue Search API searches both issues and PRs, essentially treating a PR as a type of GitHub issue). Figure 3 shows the OWL ontology for our model, where GHRepository is the main class and the arrows denote the relationships between the classes. Specifically, GHIssue − GHPullRequest represents the subclass relation, whereas the other arrows denote hasA relations (e.g., GHIssue − GHUser means that each issue has a user who reports the issue).
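The class structure described above can be approximated in plain Python to make the subclass and hasA relations concrete. This is a minimal sketch, not the OWL ontology or the SWRL rules themselves: only the class names and the properties latestRelease and licenseFile come from the text, while the fields `login`, `number`, `owner`, `name`, `contributors`, and `issues` are hypothetical names chosen for illustration. GHCommit, GHContent, and GHRelease are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GHUser:
    login: str  # hypothetical attribute name


@dataclass
class GHIssue:
    number: int    # hypothetical attribute name
    owner: GHUser  # hasA relation: the user who reported the issue


@dataclass
class GHPullRequest(GHIssue):
    """A PR is modeled as a subclass of GHIssue, mirroring GitHub's convention."""


@dataclass
class GHRepository:
    name: str                            # hypothetical attribute name
    latestRelease: Optional[str] = None  # data property named in the text
    licenseFile: Optional[str] = None    # data property named in the text
    # hasA relations (field names are assumptions)
    contributors: List[GHUser] = field(default_factory=list)
    issues: List[GHIssue] = field(default_factory=list)


# Usage with made-up values.
alice = GHUser(login="alice")
bob = GHUser(login="bob")
repo = GHRepository(
    name="example/repo",
    licenseFile="LICENSE",
    contributors=[alice],
    issues=[GHIssue(number=1, owner=bob), GHPullRequest(number=2, owner=alice)],
)

# Because a PR is an issue, one pass over repo.issues covers both.
issue_owner_logins = [issue.owner.login for issue in repo.issues]
pull_requests = [i for i in repo.issues if isinstance(i, GHPullRequest)]
```

Modeling GHPullRequest as a subclass means any check written against GHIssue applies to PRs without extra code, which is the practical benefit of following GitHub's convention.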


:::info Authors:

(1) Hsu Myat Win, Southern University of Science and Technology, China (11960003@mail.sustech.edu.cn);

(2) Haibo Wang, Southern University of Science and Technology, China (wanghb2020@mail.sustech.edu.cn);

(3) Shin Hwei Tan, a corresponding author from Southern University of Science and Technology, China (tansh3@sustech.edu.cn).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
