The Programmer's Guide

Last updated 8 months ago

How Java String and char Work Internally

In Java, String and char types are designed to work with Unicode characters, and they handle text internally using the UTF-16 encoding.

Internal Representation of Strings in Java

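  • UTF-16 Encoding: Internally, Java uses UTF-16 encoding for its String and char types. (Since Java 9, the JVM may store strings that contain only Latin-1 characters in a more compact form, but the String API still exposes UTF-16 code units.)

    • Each char in Java is a 16-bit UTF-16 code unit, which is not necessarily a complete Unicode character.

    • Characters in the Basic Multilingual Plane (BMP) are represented by a single 16-bit char value.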

    • Characters outside the BMP (such as many emoji and some ancient scripts) are represented using a pair of char values known as surrogate pairs.
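The difference between code units and characters can be checked with a small sketch. The class and method names here (SurrogatePairDemo, codeUnits, codePoints) are illustrative and not part of the original page; the emoji U+1F600 is written with escapes so the example does not depend on the source file encoding.

```java
final class SurrogatePairDemo {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "A";               // U+0041, inside the BMP
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP (surrogate pair)

        System.out.println(codeUnits(bmp) + " / " + codePoints(bmp));     // 1 / 1
        System.out.println(codeUnits(emoji) + " / " + codePoints(emoji)); // 2 / 1

        // The two char values of the pair are a high and a low surrogate.
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));   // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));    // true
        System.out.printf("U+%X%n", emoji.codePointAt(0));                // U+1F600
    }
}
```

When iterating over text that may contain such characters, working with code points (for example via codePoints()) avoids splitting a surrogate pair.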

Default Character Set for I/O Operations

  • Default Charset: The default character set is used when we perform I/O operations without specifying an explicit charset.

    • It affects methods like String.getBytes() and new String(byte[]) when no charset is provided.

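    • The default charset has traditionally been determined by the operating system's locale settings and can vary between machines. Since JDK 18 (JEP 400), UTF-8 is the default on all platforms unless it is overridden with the file.encoding system property.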

How to Find the Default Character Set?
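One way to look up the default charset is Charset.defaultCharset(). The sketch below is a minimal example; the class name DefaultCharsetDemo is illustrative, and the printed values depend on the JDK version and the machine, so no output is shown.

```java
import java.nio.charset.Charset;

final class DefaultCharsetDemo {

    /** Name of the charset used when no explicit charset is passed to I/O methods. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // The file.encoding system property usually reflects the same setting.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```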

How They Work Together

  1. Internal String Representation (UTF-16):

    • When we create a String in Java, it is stored in memory using UTF-16 encoding.

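    • For example: String text = "Hello, 世界!";

    • In memory, each of the 10 characters in "Hello, 世界!" is represented by a single 16-bit code unit, because all of them lie in the BMP. A character outside the BMP would need two code units.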

  2. Encoding and Decoding for I/O (Default Charset):

    • When we write this String to a file or send it over a network, it is converted from the internal UTF-16 representation to a byte sequence.

    • If we call getBytes() without specifying a charset, it uses the default charset (e.g., UTF-8 on many systems).

  3. Displaying Strings

    • System.out.print: When we print a String using System.out.print or System.out.println, the string is converted from its internal UTF-16 form to bytes using the encoding of the output stream (traditionally the default charset, often UTF-8) and sent to the console.

      • The console then decodes these bytes with its own encoding (often UTF-8) and renders them as characters on the screen. If the two encodings differ, non-ASCII characters may be displayed incorrectly.
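The conversion from the internal representation to bytes can be sketched as follows. The class and method names (EncodeForIoDemo, utf8Length, utf16Length) are illustrative, and the charset is passed explicitly so the result does not depend on the platform default. The two Chinese characters are written as escapes (U+4E16, U+754C).

```java
import java.nio.charset.StandardCharsets;

final class EncodeForIoDemo {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies when encoded as UTF-16 (big-endian, no BOM). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "Hello, \u4E16\u754C!"; // "Hello, 世界!"

        System.out.println(text.length());     // 10 UTF-16 code units
        System.out.println(utf8Length(text));  // 14: 8 ASCII bytes + 2 x 3 bytes
        System.out.println(utf16Length(text)); // 20: 10 code units x 2 bytes
    }
}
```

The same string therefore has different byte lengths depending on the target encoding, which is why the charset should be chosen deliberately when writing to a file or a network stream.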

Example with character "A"

This example walks through three stages: internal representation, conversion to bytes, and display on the console.

The step-by-step diagram for this example is reproduced in the listings at the end of this page.

Detailed Explanation with Example Character 'A'

  1. Internal Representation (UTF-16):

    • When we declare a String in Java with the character 'A', it is stored using UTF-16 encoding.

    • Example: String example = "A";

    • Internally, the character 'A' is represented by a single 16-bit code unit: 0x0041.

  2. Conversion to Bytes (UTF-8):

    • When we print the String, Java converts the UTF-16 encoded String to bytes using the default charset (often UTF-8).

    • In UTF-8, the character 'A' (U+0041) is represented by a single byte: 0x41.

  3. Display on Console (UTF-8 Decoding):

    • The console receives the byte 0x41.

    • It decodes the byte using UTF-8 and displays the character 'A'.
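The three stages for 'A' can be reproduced in code. This is a small sketch with illustrative names (LetterADemo, hex); it uses UTF-8 explicitly instead of relying on the default charset.

```java
import java.nio.charset.StandardCharsets;

final class LetterADemo {

    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String example = "A";

        // 1. Internal representation: a single UTF-16 code unit.
        System.out.printf("0x%04X%n", (int) example.charAt(0)); // 0x0041

        // 2. Conversion to bytes: a single UTF-8 byte.
        byte[] utf8 = example.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8));                          // 41

        // 3. Decoding, as a UTF-8 console would do it.
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // A
    }
}
```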

How Text Editors Handle Encoding

  1. Internal Representation:

    • Text editors internally represent the text as a sequence of characters, often using an internal encoding like UTF-16 (as in Java's String class) or UTF-32 (common in some programming languages).

    • This internal representation allows the editor to handle and display the text correctly, regardless of the external file encoding.

  2. Conversion to Bytes:

    • When we save the file, the editor converts this internal character representation to a byte sequence according to the specified encoding.

    • This process involves mapping each character to its corresponding byte (or bytes) in the target encoding. For example, in UTF-8, characters can be represented by one to four bytes.

  3. Writing to Disk:

    • The resulting byte sequence is then written to the file on disk. The file's encoding determines how the text is stored.

    • If the file is saved with a BOM (Byte Order Mark), this marker is written at the beginning of the file to indicate the encoding.

Example Scenario

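Imagine we have a three-line text in an editor: a greeting, a line of accented Latin characters, and a line of Cyrillic characters. The exact text is reproduced in the listings at the end of this page.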

Saving as UTF-8

  1. Internal Representation:

    • The text is represented internally, say, using UTF-16.

  2. Conversion to UTF-8:

    • Each character is converted to UTF-8 bytes.

    • For example, "H" is 0x48, "á" is 0xC3 0xA1, and "А" is 0xD0 0x90.

  3. Writing to Disk:

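    • The resulting byte sequence is written to the file. Shown per line of text (0A is the line feed that ends a line):

      • Hello, World! → 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 0A

      • Accented characters: á, é, í, ó, ú, ü, ñ, ç → 41 63 63 65 6E 74 65 64 20 63 68 61 72 61 63 74 65 72 73 3A 20 C3 A1 2C 20 C3 A9 2C 20 C3 AD 2C 20 C3 B3 2C 20 C3 BA 2C 20 C3 BC 2C 20 C3 B1 2C 20 C3 A7 0A

      • Cyrillic characters: А, Б, В, Г, Д, Е, Ж, З, И, Й → 43 79 72 69 6C 6C 69 63 20 63 68 61 72 61 63 74 65 72 73 3A 20 D0 90 2C 20 D0 91 2C 20 D0 92 2C 20 D0 93 2C 20 D0 94 2C 20 D0 95 2C 20 D0 96 2C 20 D0 97 2C 20 D0 98 2C 20 D0 99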
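A byte sequence like this can be reproduced with a short hex dump. The sketch below uses illustrative names (HexDumpDemo, toHex) and writes the non-ASCII characters as escapes so it compiles regardless of the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HexDumpDemo {

    /** Encodes the text with the given charset and returns the bytes as hex. */
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "H", "á" (U+00E1) and Cyrillic "А" (U+0410)
        System.out.println(toHex("H", StandardCharsets.UTF_8));      // 48
        System.out.println(toHex("\u00E1", StandardCharsets.UTF_8)); // C3 A1
        System.out.println(toHex("\u0410", StandardCharsets.UTF_8)); // D0 90

        // The same "á" is a single byte in ISO-8859-1.
        System.out.println(toHex("\u00E1", StandardCharsets.ISO_8859_1)); // E1
    }
}
```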

Opening the File with a Different Encoding (ISO-8859-1)

  1. Reading Bytes:

    • The editor reads the byte sequence from the file.

  2. Conversion to Characters:

    • The editor interprets the bytes using ISO-8859-1 encoding rules.

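    • ISO-8859-1 assigns a character to every byte value, so nothing is rejected. Instead, each byte of a multibyte UTF-8 sequence is shown as a separate, incorrect character. For example, "á" (C3 A1) is displayed as "Ã¡".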
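The effect of reading UTF-8 bytes as ISO-8859-1 can be simulated in Java. This is a sketch with illustrative names (MojibakeDemo, readUtf8BytesAsLatin1), not a description of how any particular editor is implemented.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1. */
    static String readUtf8BytesAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = readUtf8BytesAsLatin1("\u00E1"); // "á"

        // One character went in, two different characters come out.
        System.out.println(garbled.length());               // 2
        System.out.printf("U+%04X U+%04X%n",
                (int) garbled.charAt(0), (int) garbled.charAt(1)); // U+00C3 U+00A1
    }
}
```

U+00C3 and U+00A1 are "Ã" and "¡", which is the typical garbled form of "á" seen in mis-decoded files.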

Mismatched Encoding

When we save a file in a specific encoding and then try to open it in a different encoding, the interpretation of the byte sequences in the file can lead to various issues and unexpected results. Here’s what typically happens:

  1. Character Misinterpretation:

    • Characters might be displayed incorrectly because the byte sequences are interpreted according to the rules of the wrong encoding.

    • For example, a file saved in UTF-8 might contain multibyte sequences for certain characters, but if opened as ISO-8859-1 (Latin-1), those byte sequences may be interpreted as different, often nonsensical, characters.

  2. Data Corruption:

    • Special characters, symbols, and characters from non-Latin scripts are particularly prone to corruption. You might see replacement characters like � or completely incorrect characters.

    • For example, Chinese characters saved as UTF-8 turn into garbled text when the file is opened as ISO-8859-1.

  3. Loss of Information:

    • If the file is saved in a limited encoding like ISO-8859-1 and contains characters not supported by that encoding, those characters may be replaced or lost. When reopened in a different encoding, the original characters cannot be recovered.

  4. Readability Issues:

    • Text might become unreadable, especially if it contains special symbols, accented characters, or non-Latin alphabets.

    • Plain ASCII characters (0-127) generally remain readable because they are commonly shared across many encodings.
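The loss-of-information and replacement-character cases can be sketched in Java as well. The names below (LossyEncodingDemo, saveAndReloadAsLatin1, fitsInLatin1, decodeAsUtf8) are illustrative. The sketch relies on the documented behaviour of String.getBytes(Charset) and new String(byte[], Charset), which substitute a replacement for unmappable or malformed input.

```java
import java.nio.charset.StandardCharsets;

final class LossyEncodingDemo {

    /** Simulates saving text as ISO-8859-1 and reading it back with the same encoding. */
    static String saveAndReloadAsLatin1(String text) {
        // Characters that ISO-8859-1 cannot represent are encoded as '?'.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    /** Checks up front whether every character can be represented in ISO-8859-1. */
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    /** Decodes raw bytes as UTF-8; malformed input becomes U+FFFD. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String chinese = "\u4E16\u754C"; // "世界"

        System.out.println(fitsInLatin1(chinese));          // false
        System.out.println(saveAndReloadAsLatin1(chinese)); // ??

        // 0xFC is "ü" in ISO-8859-1 but is not valid UTF-8 on its own.
        String decoded = decodeAsUtf8(new byte[] {(byte) 0xFC});
        System.out.println(decoded.equals("\uFFFD"));        // true
    }
}
```

Checking canEncode before saving is one way to detect that a target encoding would lose characters.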

Preventing Encoding Issues

  1. Specify Encoding Explicitly:

    • Always specify the encoding when saving and opening files, especially when dealing with non-ASCII characters.

    • Many text editors allow us to set the encoding when saving a file. For example, VS Code and Sublime Text both offer a "Save with Encoding" option.

  2. Use BOM for UTF Encodings:

    • Use a Byte Order Mark (BOM) for UTF-16 and UTF-32 files to indicate the encoding explicitly.

    • UTF-8 BOM can also be used, although it’s less common and not recommended for compatibility reasons.

  3. Consistent Encoding:

    • Ensure that both the producer and consumer of the file agree on the encoding.

    • Standardize on a widely supported encoding like UTF-8, which can handle a broad range of characters.

  4. File Metadata and Headers:

    • Use metadata and headers in protocols (like HTTP headers or HTML meta tags) to specify the encoding.

    • Example: <meta charset="UTF-8"> in HTML files.
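In Java code, the first guideline translates to passing the charset explicitly on both the write and the read side. The following is a minimal sketch assuming Java 11 or later (for Files.writeString and Files.readString); the class and method names (ExplicitCharsetIoDemo, roundTrip) and the temporary file prefix are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetIoDemo {

    /** Writes the text to a temporary file as UTF-8 and reads it back as UTF-8. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                // Producer and consumer agree on UTF-8 explicitly.
                Files.writeString(file, text, StandardCharsets.UTF_8);
                return Files.readString(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E1, \u00F1, \u0410, \u4E16\u754C"; // "á, ñ, А, 世界"
        System.out.println(text.equals(roundTrip(text)));      // true
    }
}
```

Because the charset is named on both sides, the result does not depend on the default charset of the machine that runs the code.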

Defining the encoding properties in a Spring Boot pom.xml file

In a Spring Boot application, the pom.xml file often contains properties that define the character encoding. They are usually set to UTF-8 so that the project uses UTF-8 for source files, resources, and generated reports. The relevant properties are project.build.sourceEncoding and project.reporting.outputEncoding.

Explanation of Properties

  1. project.build.sourceEncoding:

    • This property defines the character encoding for the source code files.

    • Setting it to UTF-8 ensures that the Java source files are read and compiled using UTF-8 encoding.

  2. project.reporting.outputEncoding:

    • This property defines the character encoding for the output of reporting tasks (e.g., site generation).

    • Setting it to UTF-8 ensures that the generated reports are encoded in UTF-8.

The Maven Compiler Plugin picks this property up automatically: its encoding parameter defaults to ${project.build.sourceEncoding}, so no explicit plugin configuration is required. Configuring the plugin explicitly is optional and mainly makes the setting visible in the build section.

Defining the encoding once in the <properties> section also lets other plugins and configurations reuse the same value consistently.

Why UTF-8?

  • Consistency: Ensures that all parts of the project, including source code and reports, use the same character encoding.

  • Compatibility: UTF-8 is a widely used encoding that supports all Unicode characters, making it a good choice for internationalization.

  • Best Practice: Using UTF-8 helps avoid issues with character encoding mismatches, especially in projects that might involve multiple developers and different systems.
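
To make the mismatch problem concrete, here is a small sketch (the class and method names are hypothetical) that encodes text as UTF-8 and then decodes the same bytes with a different charset, which is roughly what happens when one tool writes UTF-8 and another assumes ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchExample {

    // Encodes the text as UTF-8, then decodes those bytes with the given charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        String original = "é"; // U+00E9, two bytes in UTF-8: 0xC3 0xA9

        // Matching charset: the original character comes back.
        String matched = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);
        // Mismatched charset: each byte becomes its own character.
        String mismatched = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);

        System.out.println("Decoded as UTF-8:      " + matched);
        System.out.println("Decoded as ISO-8859-1: " + mismatched);
        System.out.println("Lengths: " + matched.length() + " vs " + mismatched.length());
    }
}
```

The mismatched result contains two characters (U+00C3 and U+00A9) instead of one, which is the familiar "Ã©" corruption seen when encodings disagree.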

Impact on a Spring Boot Application

Setting these properties in the pom.xml file ensures that:

  • The Maven compiler plugin uses UTF-8 to read and compile Java source files.

  • Any generated reports (e.g., Javadoc, site reports) are encoded in UTF-8.

  • Resources such as application.properties or application.yml are filtered and copied by the Maven Resources Plugin using the specified encoding, which avoids corrupting special characters during the build.
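
The effect of the reading charset on a properties file can be illustrated with plain java.util.Properties. This is only a sketch under stated assumptions: the class name, the greeting key, and its value are hypothetical, and the code does not show how Spring Boot itself loads configuration files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingExample {

    // Parses properties content from raw bytes, decoding them with the given charset,
    // and returns the value of the (hypothetical) "greeting" key.
    static String loadGreeting(byte[] content, Charset readAs) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(content), readAs)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        // Simulates a properties file that was saved as UTF-8.
        byte[] content = "greeting=Olá".getBytes(StandardCharsets.UTF_8);

        System.out.println("Read as UTF-8:      " + loadGreeting(content, StandardCharsets.UTF_8));
        System.out.println("Read as ISO-8859-1: " + loadGreeting(content, StandardCharsets.ISO_8859_1));
    }
}
```

Only the first call returns the value as it was written. Note that Properties.load(InputStream), the overload without a Reader, always assumes ISO-8859-1, so passing a Reader with an explicit charset is the safer choice for UTF-8 files.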

Example with Maven Compiler Plugin

In this example's pom.xml, the maven-compiler-plugin is explicitly configured with the source encoding property, so it reads and compiles Java source files as UTF-8.

String text = "Hello, 世界!";
1. Internal Representation (UTF-16)
   --------------------------------
   Character: 'A'
   Unicode Code Point: U+0041
   UTF-16 Encoding: 0041

   [ char ]     [ char ]
     0x0041

2. Conversion to Bytes (UTF-8)
   ----------------------------
   Character: 'A'
   UTF-16 Code Unit: 0x0041
   UTF-8 Encoding: 41

   [ byte ]
     0x41

3. Display on Console (UTF-8 Decoding)
   -----------------------------------
   Console receives byte 0x41
   Decodes as 'A' using UTF-8

   [ Display ]
      'A'
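
The three stages in the diagram above can be reproduced with a short Java sketch. The class name and the toHex helper are hypothetical; the program prints the code point of 'A', the number of UTF-16 code units the String holds, and the bytes produced when the String is encoded.

```java
import java.nio.charset.StandardCharsets;

class CharRepresentationExample {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A";

        // 1. Internal representation: one UTF-16 code unit holding U+0041.
        System.out.printf("Code point: U+%04X%n", text.codePointAt(0));
        System.out.println("UTF-16 code units: " + text.length());
        System.out.println("UTF-16BE bytes: " + toHex(text.getBytes(StandardCharsets.UTF_16BE)));

        // 2. Conversion to bytes for output: a single byte in UTF-8.
        System.out.println("UTF-8 bytes: " + toHex(text.getBytes(StandardCharsets.UTF_8)));

        // 3. Decoding those bytes as UTF-8 gives the character back.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("Decoded: " + new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Swapping "A" for a non-ASCII character such as "世" shows the difference between the encodings more clearly: it is still one UTF-16 code unit, but three bytes in UTF-8.
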
Hello, World!
Accented characters: á, é, í, ó, ú, ü, ñ, ç
Cyrillic characters: А, Б, В, Г, Д, Е, Ж, З, И, Й
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>example-project</artifactId>
    <version>1.0.0</version>

    <properties>
        <java.version>11</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <!-- Dependencies here -->
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <!-- Other plugins -->
        </plugins>
    </build>

</project>

Best Practices and Concepts

import java.nio.charset.Charset;

public class EncodingDecodingFiles {
    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("Default Charset: " + defaultCharset);
    }
}