By Paul Wicks


2008-10-13 06:10:15 8 Comments

StringTokenizer? Convert the String to a char[] and iterate over that? Something else?

14 comments

@Enyby 2018-12-24 10:54:13

If you need performance, then you must test on your environment. No other way.

Here is some example code:

int tmp = 0;
String s = new String(new byte[64*1024]);
{
    long st = System.nanoTime();
    for(int i = 0, n = s.length(); i < n; i++) {
        tmp += s.charAt(i);
    }
    st = System.nanoTime() - st;
    System.out.println("1 " + st);
}

{
    long st = System.nanoTime();
    char[] ch = s.toCharArray();
    for(int i = 0, n = ch.length; i < n; i++) {
        tmp += ch[i];
    }
    st = System.nanoTime() - st;
    System.out.println("2 " + st);
}
{
    long st = System.nanoTime();
    for(char c : s.toCharArray()) {
        tmp += c;
    }
    st = System.nanoTime() - st;
    System.out.println("3 " + st);
}
System.out.println("" + tmp);

On Java online I get:

1 10349420
2 526130
3 484200
0

On Android x86 API 17 I get:

1 9122107
2 13486911
3 12700778
0

@i_am_zero 2017-12-10 06:44:28

In Java 8 we can solve it as:

String str = "xyz";
str.chars().forEachOrdered(i -> System.out.print((char)i));
str.codePoints().forEachOrdered(i -> System.out.print(Character.toChars(i))); // a (char) cast would truncate supplementary code points

The method chars() returns an IntStream as mentioned in doc:

Returns a stream of int zero-extending the char values from this sequence. Any char which maps to a surrogate code point is passed through uninterpreted. If the sequence is mutated while the stream is being read, the result is undefined.

The method codePoints() also returns an IntStream as per doc:

Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream.

How is char and code point different? As mentioned in this article:

Unicode 3.1 added supplementary characters, bringing the total number of characters to more than the 2^16 characters that can be distinguished by a single 16-bit char. Therefore, a char value no longer has a one-to-one mapping to the fundamental semantic unit in Unicode. JDK 5 was updated to support the larger set of character values. Instead of changing the definition of the char type, some of the new supplementary characters are represented by a surrogate pair of two char values. To reduce naming confusion, a code point will be used to refer to the number that represents a particular Unicode character, including supplementary ones.

Finally why forEachOrdered and not forEach ?

The behaviour of forEach is explicitly nondeterministic, whereas forEachOrdered performs an action for each element of this stream in the encounter order of the stream, if the stream has a defined encounter order. So forEach does not guarantee that the order will be kept. Also check this question for more.

For difference between a character, a code point, a glyph and a grapheme check this question.
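To make the char versus code point distinction above concrete, here is a minimal sketch. The class and method names are made up for illustration, and the test string uses U+1F600 written as an explicit surrogate pair so the source file's encoding does not matter.

```java
public class CharsVsCodePoints {
    // Number of UTF-16 code units (Java chars) in the string.
    static long countChars(String s) {
        return s.chars().count();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static long countCodePoints(String s) {
        return s.codePoints().count();
    }

    // Rebuilds the string from its code points, in encounter order.
    static String rebuild(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEachOrdered(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "x" followed by U+1F600 as a high/low surrogate pair.
        String s = "x\uD83D\uDE00";
        System.out.println("chars: " + countChars(s));
        System.out.println("code points: " + countCodePoints(s));
        System.out.println("round trip ok: " + rebuild(s).equals(s));
    }
}
```

The two counts differ only when the string contains supplementary characters; for plain BMP text, chars() and codePoints() see the same values.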

@devDeejay 2017-03-15 09:39:00

This Example Code will Help you out!

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class Solution {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 10);
        map.put("b", 30);
        map.put("c", 50);
        map.put("d", 40);
        map.put("e", 20);
        System.out.println(map);

        Map sortedMap = sortByValue(map);
        System.out.println(sortedMap);
    }

    public static Map sortByValue(Map unsortedMap) {
        Map sortedMap = new TreeMap(new ValueComparator(unsortedMap));
        sortedMap.putAll(unsortedMap);
        return sortedMap;
    }

}

class ValueComparator implements Comparator {
    Map map;

    public ValueComparator(Map map) {
        this.map = map;
    }

    public int compare(Object keyA, Object keyB) {
        Comparable valueA = (Comparable) map.get(keyA);
        Comparable valueB = (Comparable) map.get(keyB);
        return valueB.compareTo(valueA);
    }
}

@Hawkeye Parker 2016-11-05 23:59:27

Elaborating on this answer and this answer.

Above answers point out the problem of many of the solutions here which don't iterate by code point value -- they would have trouble with any surrogate chars. The java docs also outline the issue here (see "Unicode Character Representations"). Anyhow, here's some code that uses some actual surrogate chars from the supplementary Unicode set, and converts them back to a String. Note that .toChars() returns an array of chars: if you're dealing with surrogates, you'll necessarily have two chars. This code should work for any Unicode character.

    String supplementary = "Some Supplementary: 𠜎𠜱𠝹𠱓";
    supplementary.codePoints().forEach(cp -> 
            System.out.print(new String(Character.toChars(cp))));

@Touko 2011-03-08 14:30:48

If you have Guava on your classpath, the following is a pretty readable alternative. Guava even has a fairly sensible custom List implementation for this case, so this shouldn't be inefficient.

for(char c : Lists.charactersOf(yourString)) {
    // Do whatever you want     
}

UPDATE: As @Alex noted, with Java 8 there's also CharSequence#chars to use. Even though the type is IntStream, it can be mapped to chars like:

yourString.chars()
        .mapToObj(c -> Character.valueOf((char) c))
        .forEach(c -> System.out.println(c)); // Or whatever you want

@sabujp 2019-07-28 01:48:11

If you need to do anything complex then go with the for loop + guava since you can't mutate variables (e.g. Integers and Strings) defined outside the scope of the forEach inside the forEach. Whatever is inside the forEach also can't throw checked exceptions, so that's sometimes annoying also.
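A small sketch of the limitation described in this comment, with hypothetical names. Locals captured by a lambda must be effectively final, so a plain `count++` inside forEach does not compile; the usual options are to let the stream compute the result or to mutate a holder object.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CountInLambda {
    // Plain loop: a local counter can be updated freely.
    static int countWithLoop(String s, char target) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Stream version: let the pipeline produce the result instead of
    // updating a captured local.
    static int countWithStream(String s, char target) {
        return (int) s.chars().filter(c -> c == target).count();
    }

    // Workaround when a side effect is unavoidable: the reference is
    // effectively final, but the object it points to is mutable.
    static int countWithHolder(String s, char target) {
        AtomicInteger count = new AtomicInteger();
        s.chars().forEach(c -> {
            if (c == target) {
                count.incrementAndGet();
            }
        });
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithLoop("banana", 'a'));
        System.out.println(countWithStream("banana", 'a'));
        System.out.println(countWithHolder("banana", 'a'));
    }
}
```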

@jjnguy 2008-10-13 06:13:16

I use a for loop to iterate the string and use charAt() to get each character to examine it. Since the String is implemented with an array, the charAt() method is a constant time operation.

String s = "...stuff...";

for (int i = 0; i < s.length(); i++){
    char c = s.charAt(i);        
    //Process char
}

That's what I would do. It seems the easiest to me.

As far as correctness goes, I don't believe that exists here. It is all based on your personal style.

@Uri 2008-10-13 06:25:46

Does the compiler inline the length() method?

@jjnguy 2008-10-13 06:28:26

I dunno. I usually don't optimize my code. But it can't hurt to pull the length into a variable and use that instead. My guess is that the compiler in-lines the call though.

@ddimitrov 2008-10-13 06:50:06

@Uri, the Java compiler does not do optimization. For HotSpot the JVM will inline it pretty soon at runtime. There are other JVM implementations (i.e. some of the J2ME VMs used in phones) that do not do runtime optimizations.

@Dave Cheney 2008-10-13 08:04:39

It might inline length(), that is, hoist the method behind that call up a few frames, but it's more efficient to do this: for(int i = 0, n = s.length() ; i < n ; i++) { char c = s.charAt(i); }

@slim 2008-10-13 08:13:44

Cluttering your code for a tiny performance gain. Please avoid this until you decide this area of code is speed-critical.

@jjnguy 2008-10-13 14:18:27

I usually don't optimize my code unless readability isn't sacrificed.

@Gabe 2011-03-24 01:04:13

Note that this technique gives you characters, not code points, meaning you may get surrogates.

@ikh 2014-06-20 10:22:09

charAt is not O(1) - it's O(N) for surrogates.

@LarsH 2016-12-27 17:06:27

@slim: Which clutter are you advising people to avoid -- caching the length using n? Or using an i loop instead of the for-each construct?

@slim 2016-12-28 09:09:47

@larsH in this case I was talking about the n but I would also usually code a construct that didn't use the i either.

@antak 2018-11-01 06:45:55

@ikh charAt is not O(1): How is that so? The code for String.charAt(int) is merely doing value[index]. I think you're confusing charAt() with something else that gives you code points.

@Inder malviya 2019-10-13 06:05:11

What if the length of the String is more than the range of int?

@Alex 2015-01-06 10:38:56

If you need to iterate through the code points of a String (see this answer) a shorter / more readable way is to use the CharSequence#codePoints method added in Java 8:

for(int c : string.codePoints().toArray()){
    ...
}

or using the stream directly instead of a for loop:

string.codePoints().forEach(c -> ...);

There is also CharSequence#chars if you want a stream of the characters (although it is an IntStream, since there is no CharStream).

@sk. 2008-12-11 23:04:09

Note most of the other techniques described here break down if you're dealing with characters outside of the BMP (Unicode Basic Multilingual Plane), i.e. code points that are outside of the \u0000-\uFFFF range. This will only happen rarely, since the code points outside this are mostly assigned to dead languages. But there are some useful characters outside this, for example some code points used for mathematical notation, and some used to encode proper names in Chinese.

In that case your code will be:

String str = "....";
int offset = 0, strLen = str.length();
while (offset < strLen) {
  int curChar = str.codePointAt(offset);
  offset += Character.charCount(curChar);
  // do something with curChar
}

The Character.charCount(int) method requires Java 5+.

Source: http://mindprod.com/jgloss/codepoint.html

@Prof. Falken supports Monica 2011-05-06 12:21:18

I don't get how you use anything but the Basic Multilingual Plane here. curChar is still 16 bits, right?

@sk. 2011-05-06 19:15:47

You either use an int to store the entire code point, or else each char will only store one out of the two surrogate pairs that define the code point.

@Prof. Falken supports Monica 2011-05-06 20:59:37

I think I need to read up on code points and surrogate pairs. Thanks!

@Jason S 2014-07-10 16:08:40

+1 since this seems to be the only answer that is correct for Unicode chars outside of the BMP

@Emmanuel Oga 2014-10-12 09:13:25

Wrote some code to illustrate the concept of iterating over codepoints (as opposed to chars): gist.github.com/EmmanuelOga/…

@Ciro Santilli 新疆改造中心法轮功六四事件 2015-05-07 15:46:28

Important point, and specifically asked at: stackoverflow.com/questions/1527856/…

@ 2008-12-11 21:08:23

I agree that StringTokenizer is overkill here. Actually I tried out the suggestions above and took the time.

My test was fairly simple: create a StringBuilder with about a million characters, convert it to a String, and traverse each of them with charAt() / after converting to a char array / with a CharacterIterator a thousand times (of course making sure to do something on the string so the compiler can't optimize away the whole loop :-) ).

The result on my 2.6 GHz Powerbook (that's a mac :-) ) and JDK 1.5:

  • Test 1: charAt + String --> 3138msec
  • Test 2: String converted to array --> 9568msec
  • Test 3: StringBuilder charAt --> 3536msec
  • Test 4: CharacterIterator and String --> 12151msec

As the results are significantly different, the most straightforward way also seems to be the fastest one. Interestingly, charAt() of a StringBuilder seems to be slightly slower than the one of String.

BTW I suggest not to use CharacterIterator as I consider its abuse of the '\uFFFF' character as "end of iteration" a really awful hack. In big projects there's always two guys that use the same kind of hack for two different purposes and the code crashes really mysteriously.

Here's one of the tests:

    int count = 1000;
    ...

    System.out.println("Test 1: charAt + String");
    long t = System.currentTimeMillis();
    int sum=0;
    for (int i=0; i<count; i++) {
        int len = str.length();
        for (int j=0; j<len; j++) {
            if (str.charAt(j) == 'b')
                sum = sum + 1;
        }
    }
    t = System.currentTimeMillis()-t;
    System.out.println("result: "+ sum + " after " + t + "msec");
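On the CharacterIterator caveat mentioned in this answer: the sketch below (class and method names are hypothetical) shows what goes wrong when the text itself contains '\uFFFF', which is the value of CharacterIterator.DONE. The usual sentinel loop stops early, while a loop that compares the index against the end index does not depend on the sentinel.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class DoneSentinelDemo {
    // Counts chars the way the common CharacterIterator loop does:
    // stop as soon as the iterator returns DONE.
    static int countWithSentinelLoop(String s) {
        int count = 0;
        CharacterIterator it = new StringCharacterIterator(s);
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            count++;
        }
        return count;
    }

    // Index-based variant: ends when the index reaches the end index,
    // regardless of which char values appear in the text.
    static int countWithIndexLoop(String s) {
        int count = 0;
        CharacterIterator it = new StringCharacterIterator(s);
        for (it.first(); it.getIndex() < it.getEndIndex(); it.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uFFFFb"; // three chars, the middle one equals DONE
        System.out.println("sentinel loop: " + countWithSentinelLoop(s));
        System.out.println("index loop: " + countWithIndexLoop(s));
    }
}
```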

@Emmanuel Oga 2014-10-11 07:48:00

This has the same problem outlined here: stackoverflow.com/questions/196830/…

@Dave Cheney 2008-10-13 08:06:23

Two options

for(int i = 0, n = s.length() ; i < n ; i++) { 
    char c = s.charAt(i); 
}

or

for(char c : s.toCharArray()) {
    // process c
}

The first is probably faster, the second is probably more readable.

@Dennis 2012-02-29 17:43:55

plus one for placing the s.length() in the initialization expression. If anyone doesn't know why, it's because that is only evaluated once where if it was placed in the termination statement as i < s.length(), then s.length() would be called each time it looped.

@Rhyous 2012-05-15 15:02:59

I thought compiler optimization took care of that for you.

@Matthias 2014-08-14 10:30:33

Any further thoughts on this? Can we reasonably expect compiler optimization to take care of avoiding the repeated call to s.length(), or not?

@prasopes 2014-10-09 08:38:47

@Matthias You can use the Javap class disassembler to see that the repeated calls to s.length() in for loop termination expression are indeed avoided. Note that in the code OP posted the call to s.length() is in the initialization expression, so the language semantics already guarantees that it will be called only once.

@Emmanuel Oga 2014-10-11 07:47:04

@Isaac 2014-12-25 09:09:22

@prasopes Note though that most java optimizations happen in the runtime, NOT in the class files. Even if you saw repeated calls to length() that doesn't indicate a runtime penalty, necessarily.

@Lasse 2015-09-20 10:45:30

@DaveCheney, why would you define 'n = s.length()' instead just have '(int i = 0; i<s.length(); i++){' ?

@Steve 2015-11-23 04:11:58

@Lasse, the putative reason is for efficiency - your version calls the length() method on every iteration, whereas Dave calls it once in the initializer. That said, it is very likely the JIT ("just in time") optimizer will optimize the extra call away, so it's likely only a readability difference for no real gain.

@DavidS 2015-12-18 18:53:04

And in my opinion, @Steve, it's actually less readable because (1) it's unconventional so it will distract people reading your code (as it did Lasse and many of the commenters), and (2) it moves the declaration away from its use.

@Sergey Dirin 2018-01-24 06:21:16

I don't get why the first would be faster. I thought foreach was the best-optimized solution for performance, isn't it?

@MattMerr47 2018-08-09 15:09:01

toCharArray copies the contents of the String into a new array, which you avoid using charAt with a regular for loop.
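A quick way to see the copy for yourself (a sketch with made-up names): every call to toCharArray() hands back a new array, and writing to that array never touches the String.

```java
import java.util.Arrays;

public class ToCharArrayCopy {
    // Each call allocates and fills a new array, so two calls never
    // return the same array object.
    static boolean returnsFreshArray(String s) {
        return s.toCharArray() != s.toCharArray();
    }

    // Overwrites the copy, then returns the original String, which
    // is unaffected by the writes.
    static String originalAfterOverwritingCopy(String s) {
        char[] copy = s.toCharArray();
        Arrays.fill(copy, '#');
        return s;
    }

    public static void main(String[] args) {
        System.out.println(returnsFreshArray("hello"));
        System.out.println(originalAfterOverwritingCopy("hello"));
    }
}
```

Whether the extra allocation matters depends on string size and how hot the loop is, so measure before choosing one form over the other.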

@Bruno Ely 2019-02-08 22:24:54

Also regarding the s.length() call in the initializer, that's also textbook premature optimization that hinders readability... isn't it?

@Alan Moore 2008-10-13 12:24:48

StringTokenizer is totally unsuited to the task of breaking a string into its individual characters. With String#split() you can do that easily by using a regex that matches nothing, e.g.:

String[] theChars = str.split("|");

But StringTokenizer doesn't use regexes, and there's no delimiter string you can specify that will match the nothing between characters. There is one cute little hack you can use to accomplish the same thing: use the string itself as the delimiter string (making every character in it a delimiter) and have it return the delimiters:

StringTokenizer st = new StringTokenizer(str, str, true);

However, I only mention these options for the purpose of dismissing them. Both techniques break the original string into one-character strings instead of char primitives, and both involve a great deal of overhead in the form of object creation and string manipulation. Compare that to calling charAt() in a for loop, which incurs virtually no overhead.
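For anyone curious what the two tricks above actually return, here is a small sketch (names are hypothetical). It assumes Java 8 or later for the split() behavior, since older versions handled a zero-width match at the start of the input differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class OneCharStrings {
    // Splits on a regex that matches the empty string between characters.
    // Assumes Java 8+ semantics (no empty leading element).
    static List<String> viaSplit(String str) {
        return Arrays.asList(str.split("|"));
    }

    // Uses the string itself as the delimiter set and asks for the
    // delimiters back, so each character becomes a one-character token.
    static List<String> viaTokenizer(String str) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(str, str, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(viaSplit("abc"));
        System.out.println(viaTokenizer("abc"));
    }
}
```

Both produce a String object per character, which is the overhead the answer is warning about.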

@Bruno De Fraine 2008-10-13 06:38:20

There are some dedicated classes for this:

import java.text.*;

final CharacterIterator it = new StringCharacterIterator(s);
for(char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
   // process c
   ...
}

@ddimitrov 2008-10-13 06:58:43

Looks like an overkill for something as simple as iterating over immutable char array.

@slim 2008-10-13 08:11:22

I don't see why this is overkill. Iterators are the most java-ish way to do anything... iterative. The StringCharacterIterator is bound to take full advantage of immutability.

@jjnguy 2008-10-13 15:57:04

If I were using an iterator I would have used a foreach loop then.

@Bruno De Fraine 2008-10-14 08:00:00

@jjnguy: foreach is only possible for java.lang.Iterable's

@Rob Gilliam 2010-02-04 08:39:12

Agree with @ddimitrov - this is overkill. The only reason to use an iterator would be to take advantage of foreach, which is a bit easier to "see" than a for loop. If you're going to write a conventional for loop anyway, then might as well use charAt()

@ceving 2013-06-18 09:04:10

Using the character iterator is probably the only correct way to iterate over characters, because Unicode requires more space than a Java char provides. A Java char contains 16 bits and can hold Unicode characters up to U+FFFF, but Unicode specifies characters up to U+10FFFF. Using 16 bits to encode Unicode results in a variable-length character encoding. Most answers on this page assume that the Java encoding is a constant-length encoding, which is wrong.

@Bruno De Fraine 2013-06-27 12:39:42

@ceving It does not seem that a character iterator is going to help you with non-BMP characters: oracle.com/us/technologies/java/supplementary-142654.html
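A small check of this point, as a sketch with a hypothetical class name: StringCharacterIterator walks the string one char at a time, so a supplementary character such as U+1F600 comes back as two separate surrogate halves.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class IteratorSurrogates {
    // Counts how many of the chars returned by the iterator are
    // surrogate halves rather than complete characters.
    static int surrogateHalvesSeen(String s) {
        int halves = 0;
        CharacterIterator it = new StringCharacterIterator(s);
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            if (Character.isSurrogate(c)) {
                halves++;
            }
        }
        return halves;
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600 written as a surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println("surrogate halves: " + surrogateHalvesSeen(s));
        System.out.println("code points: " + s.codePointCount(0, s.length()));
    }
}
```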

@Eugene Yokota 2008-10-13 06:34:57

See The Java Tutorials: Strings.

public class StringDemo {
    public static void main(String[] args) {
        String palindrome = "Dot saw I was Tod";
        int len = palindrome.length();
        char[] tempCharArray = new char[len];
        char[] charArray = new char[len];

        // put original string in an array of chars
        for (int i = 0; i < len; i++) {
            tempCharArray[i] = palindrome.charAt(i);
        } 

        // reverse array of chars
        for (int j = 0; j < len; j++) {
            charArray[j] = tempCharArray[len - 1 - j];
        }

        String reversePalindrome =  new String(charArray);
        System.out.println(reversePalindrome);
    }
}

Put the length into int len and use for loop.

@Emmanuel Oga 2014-10-11 07:49:04

I'm starting to feel a bit spammerish... if there's such a word :). But this solution also has the problem outlined here: stackoverflow.com/questions/196830/…

@Alan 2008-10-13 06:26:23

I wouldn't use StringTokenizer, as it is one of the legacy classes in the JDK.

The javadoc says:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

@ddimitrov 2008-10-13 06:56:17

StringTokenizer is a perfectly valid (and more efficient) way of iterating over tokens (i.e. words in a sentence). It is definitely overkill for iterating over chars. I am downvoting your comment as misleading.

@Powerlord 2008-10-13 14:44:30

ddimitrov: I'm not following how pointing out that StringTokenizer is not recommended INCLUDING a quotation from the JavaDoc (java.sun.com/javase/6/docs/api/java/util/StringTokenizer.html) for it stating as such is misleading. Upvoted to offset.

@Alan 2008-10-13 22:23:53

Thanks Mr. Bemrose ... I take it that the cited block quote should have been crystal clear, where one should probably infer that active bug fixes won't be committed to StringTokenizer.
