By PGODULTIMATE


2017-12-11 00:39:11 8 Comments

There was some confusion with the code I had posted in my previous version of this question and there was some good advice from @Oscar Smith.

The explanation is the same; please give improvements for speed in this new version:

#!/usr/bin/python

import numpy as np
import time

chosenNum = 429
numRuns = 10000
numCapturedTanks = 7
numGuessN = []
guesses = []
percentErrors = []
STDarray = []
start_time = time.time()
STDtimes = []

def getAverageStdTime(timetaken): # gets the average time it took to calculate standard deviations
  STDtimes.append(timetaken)
  if (len(STDtimes) == numRuns):
    print ("Average List of Standard Devations Generation Time: " + str(round(np.mean(STDtimes),2)) + " seconds")

def createListOfStandardDeviations(start,end):
    for y in range(start,int(end)):
        tankSerialNumbersSimulated = np.random.randint(1, y + 1, size=numCapturedTanks) #from Oscar Smith
        simulatedSTD = np.std(tankSerialNumbersSimulated)
        STDarray.append(simulatedSTD)

def getAllGuesses():
    print ("Your guesses are: " + str(guesses))

def getAvgPercentError():
    numCorrect = 0
    closestNumber = 0
    for x in range(len(guesses) - 1):
        percentError = '%.2f' % round(((np.abs(guesses[x] - chosenNum))/float(chosenNum) * 100), 2)
        percentErrors.append(float(percentError))
        if(guesses[x] == chosenNum):
            numCorrect = numCorrect + 1
        else:
            closestNumber = min(guesses, key=lambda x:abs(x-chosenNum))
    averagePercentError = np.mean(percentErrors)
    print ("The average Percent Error is: " + str(round(averagePercentError,2)) + "%")
    getAccuracy(numCorrect,closestNumber)

def getAccuracy(amountCorrect,closestNumberToActual):
    if (amountCorrect > 0):
        print ("You got the correct number " + str(amountCorrect) + " out of " + str(len(guesses)) + " times.")
    else:
        print ("Your closest number was: " + str(closestNumberToActual))
    getmode(guesses)

def getmode(inplist):
    dictofcounts = {}
    listofcounts = []
    for i in inplist:
        countofi = inplist.count(i) # count items for each item in list
        listofcounts.append(countofi) # add counts to list
        dictofcounts[i]=countofi # add counts and item in dict to get later
    maxcount = max(listofcounts) # get max count of items
    if maxcount ==1:
        print ("There is no mode for this dataset, values occur only once")
    else:
        modelist = [] # if more than one mode, add to list to print out
        for key, item in dictofcounts.items():
            if item ==maxcount: # get item from original list with most counts
                modelist.append(str(key))
        print ("Most guessed number(s):",' and '.join(modelist))
        return modelist

def getNumGuessed(givenSTD,maxNumber):
    minStd = min(STDarray, key=lambda x:abs(x-givenSTD)) #finds closest standard deviation to the given standard deviation
    for (z,this_std) in enumerate(STDarray):
        if(minStd == this_std): #find closest number to original standard deviation
            numGuessed = z + maxNumber
            return numGuessed

def main():
    print ("reached main")
    for runsRan in range(numRuns):
        tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks) #from Oscar Smith
        NumSTD = np.std(tankSerialNumbers)
        highestTankSerial = np.mean(tankSerialNumbers) + 3*NumSTD
        maxNum = np.amax(tankSerialNumbers)
        print ("Tank Serial Numbers Generated")
        print ("Standard Deviation and Range Calculated")
        ListOfStandardDeviationsStartTime = time.time()
        for _ in range(100):
            del STDarray[:]
            if (maxNum - highestTankSerial < 0):
                createListOfStandardDeviations(maxNum,highestTankSerial)
            else:
                createListOfStandardDeviations(highestTankSerial,maxNum)
            numGuessN.append(getNumGuessed(NumSTD,maxNum))
        print ("Initial List of Standard Deviations Generated")
        print ("List of Standard Devations Generation took " + str(round(time.time() - ListOfStandardDeviationsStartTime,2)) + " seconds")

        guess = int(np.mean(numGuessN))
        print ("Guess Generated " + str(runsRan + 1))
        getAverageStdTime(float(time.time() - ListOfStandardDeviationsStartTime))
        guesses.append(guess)
    getAllGuesses()
    getAvgPercentError()

main()
print ("My program took " + str(round((time.time() - start_time)/float(60),2)) + " minutes to run")

Currently, the runtime is approximately 7.26 minutes for 1,000 runs. I want to get it to run 10,000 times and at this rate, it would take too long.

If anyone is confused by what the purpose is or any part of the code, please mention specifically what is confusing and I'll explain it.

To clarify: We are given 7 serial numbers based on a value n which is unknown (this number is given only to check how good our process of determining n is and to generate the 7 random serial numbers from it).

My process first finds the standard deviation of the given serial numbers. Then I find a limit that n definitely cannot exceed (three standard deviations above the mean) and a max from the given list as what it can't be below. I then simulate what random serial numbers would be generated from the predicted n from the range I found out. I take the standard deviation of each simulation and find which one is the closest to the standard deviation of the given serial numbers and store the corresponding guessed n. I do this x times (the more the better - I used 100) to get x guessed n 's. I take the mean of those guesses to get my final guess. I then find the percent error of my guess based on the actual number.

3 comments

@mochi 2017-12-11 02:15:22

You could vectorize your entire process in numpy.

From your previous post it looks like, from a given n number of tanks, you want to estimate a true N number of total tanks. The method is:

  1. Generate m samples of n tanks
  2. Calculate the std of these m samples
  3. Find the sample with the closest std to your observed n tanks and use that as a guess to the true max, N
  4. Repeat for a certain # of times to improve accuracy

So first, let's set some constants:

import numpy as np 

true_n = 429 # Can set whatever constant here
num_tanks = 7
num_guesses = 10000
num_samples = 100

Then a function for generating data:

def simulate_data(lower,upper,size):
    return np.array([np.random.randint(1,x+1,(num_tanks,size)) for x in range(lower, upper)])

Now we can generate our n sample and calculate relevant statistics:

obs = simulate_data(1,true_n,1)
obs_mean = np.mean(obs)
obs_std = np.std(obs)
obs_max = np.max(obs)
upper = int(obs_mean + 3*obs_std)

Finally we're ready to simulate data and run your algorithm:

min_guess, max_guess = np.sort([upper,obs_max])
guesses = []
for _ in range(num_guesses):  
    obs = simulate_data(1,true_n,1)
    obs_mean = np.mean(obs)
    obs_std = np.std(obs)
    obs_max = np.max(obs)
    upper = int(obs_mean + 3*obs_std)
    sim_data = simulate_data(min_guess, max_guess, num_samples)

    sim_std = np.std(sim_data,axis=1)
    min_std_offset = np.argmin(np.abs(sim_std - obs_std),axis=1)
    guess = min_guess + min_std_offset

    guesses.append(int(np.mean(guess)))

print("actual:",true_n)
print("guess:",np.mean(guesses))

(1) This combines your data generation with the total # of samples you want to make to generate all num_samples * num_tanks samples at once, in this case, 100 rows of 7 tank samples.

(2) Then calculates the standard deviation by row(axis=1) for all 100 samples.

(3) Next find the value that generated samples with the smallest std, this happens to be the starting value + whatever offset(index) the smallest value is at. Random samples are generated from (1,guess) where guess ranges from upper bound to the max value, or from max value to upper, whichever is larger.

(4) Stores the guessed value as a guess and repeats for 10000x.

This completes 10000 guesses for me in

25.7 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you had the memory for it, you could even add an additional dimension for num_guesses and skip the for loop.

You can also vectorize the way you calculate error in the same way:

def calc_error(guesses):
    diff = np.abs(guesses - true_n)
    perc_error = diff/true_n*100
    num_correct = sum(guesses == true_n)
    closest_guess = guesses[np.argmin(diff)]

    print("num correct:",num_correct)
    print("avg percent error: %.2f" % np.mean(perc_error))
    print("closest guess was:",closest_guess)

Then you just need to convert your guesses list into a numpy array and call the error function on it:

guesses = np.array(guesses)
calc_error(guesses)

This follows your procedure now I believe, and from what I can tell finishes in a fraction of the time of your current solution.

@PGODULTIMATE 2017-12-11 02:31:19

Why are you randomizing the true number? I think you misunderstood the first OP. I'll clarify it in this post.

@mochi 2017-12-11 02:33:34

I made it random so you could get more random simulations, that's why it's only initialized once, you can set it to whatever constant N that you want.

@PGODULTIMATE 2017-12-11 02:46:16

Ok. However, I don't get how you get one result from 10,000 guesses. I want 10,000 results from 10,000 guesses.

@mochi 2017-12-11 02:48:36

Guesses is a list of num_guesses guesses, which in this case is a list of 10000 guesses. Where does it look like I'm only getting 1 result? Each iteration of the loop creates 1 guess from num_samples(in this case 100) simulated samples, by finding the max value of the simulated sample with the standard deviation closest to our observed standard deviation.

@PGODULTIMATE 2017-12-11 02:59:41

My bad with the 10,000 guesses thing. I had forgotten to write the calc_error function.

@mochi 2017-12-11 03:05:28

@PGODULTIMATE 2017-12-12 16:02:36

I think that something is weird here because the results are too accurate. I know that there should be a 5-6% error. Also, there are too many of the answers that are correct.

@mochi 2017-12-12 23:54:35

Looks like the samples need to be re-sampled in every iteration, which I guess makes sense as that makes you less susceptible to a bad initial draw of rng. If samples are resampled every time, I get ~8% error which is about the same as what I get running your code.

@mkrieger1 2017-12-12 15:01:13

Drawing the serial numbers

tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks)

This is the wrong way to generate the serial numbers, because it is possible that the same serial number is drawn more than once (this is random sampling with replacement):

>>> np.random.randint(1, chosenNum + 1, size=numCapturedTanks)
array([ 50,  10,  71, 244, 394, 375,  10]

You might not notice it, because the pool of possible serial numbers is much larger (chosenNum = 429) than the size of the sample you draw (numCapturedTanks = 7) and therefore the probability to draw the same number twice is low1.

But the serial numbers really should be unique; you need random sampling without replacement.

This is done in plain Python using random.sample,

random.sample(range(1, chosenNum + 1), numCapturedTanks)

or in NumPy using np.random.choice with replace=False:

np.random.choice(range(1, chosenNum + 1), numCapturedTanks, replace=False)

This could be further simplified if you chose to use zero-based serial numbers (i.e. for 100 tanks, the serial numbers would be 0, …, 99 instead of 1, …, 100):

random.sample(range(chosenNum), numCapturedTanks)
np.random.choice(chosenNum, numCapturedTanks, replace=False)

1Approximately 5% according to a quick Monte Carlo simulation I've run. (Source code)

Some simulation results

A closed form solution seems to be discussed here.

@Oscar Smith 2017-12-13 00:01:21

That's really annoying, as it makes vectorization way more difficult. Is there a way to make my version of createListOfStandardDeviations vectorized, fast, and correct?

@mkrieger1 2017-12-13 00:57:07

I don't see how my version makes vectorization any more or less difficult than the original version and how it should affect createListOfStandardDeviations. In both cases you get a 1-dimensional array tankSerialNumbers, only in my versions the numbers are guaranteed to be unique

@Oscar Smith 2017-12-13 01:04:59

It is complicated because np.random.choice does not let you generate a 3d array where each element has a different maximum as I did in my version of createListOfStandardDeviations. This ability to vectorize led to much of the speed boost my solution provided.

@mkrieger1 2017-12-13 01:12:23

Then I seem to not quite understand what you're proposing. Anyway, the "offending" line tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks) is outside the createListOfStandardDeviations function.

@Oscar Smith 2017-12-13 01:13:34

How can I replace these two lines to pick without replacement maxes = np.broadcast_to(np.arange(start+1, end+1), (numCapturedTanks, 100, end - start)) tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))

@mkrieger1 2017-12-13 01:23:17

I don't yet understand how it works, but see here or here.

@Oscar Smith 2017-12-13 01:40:22

That won't work because my code relies on generating different max values for each row. So the first row is in 1:5, the second is in 1:6 for example.

@Oscar Smith 2017-12-11 00:53:24

Some profiling at this point shows that almost all of the time is spent in createListOfStandardDeviations, so this is the next place to look for performance improvements. The way we can speed this up is by removing the for loop and making numpy do the work for us. In the old code, we calculated one simulatedSTD at a time, but it is possible to use a 2darray and compute the std of each row at the same time. Here is the updated code that does this.

def createListOfStandardDeviations(start,end):
    end = int(end)
    sim_num = end - start
    maxes = np.broadcast_to(np.arange(start+1, end+1), (numCapturedTanks, sim_num))
    tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))
    simulatedSTDs = np.std(tankSerialNumbersSimulated, axis=0)
    STDarray.extend(simulatedSTDs)

the main piece that is likely confusing here is the use of np.floor(np.random.uniform instead of np.randint. These do the same thing, and the switch is necessary because randint takes an int as the max, while uniform allows an ndarray instead. This change gives me a 35x speedup from before.

The next speedup we get is from making STDarray into an ndarray, and while we're at it, making it no longer a global variable. To do this, we make the last line of createListOfStandardDeviations be return np.std(tankSerialNumbersSimulated, axis=0), store it in main, and pass it into getNumGuessed. This change lets getNumGuessed become simply return maxNumber + np.argmin(np.abs(STDarray-givenSTD), axis=1) which is simpler and faster. This is a smaller boost than before, but it still gives about a 2x speedup.

At this point createListOfStandardDeviations is the slow point again, and the solution is yet more vectorization. At this point we are running this function in a loop 100 times, but it is much faster to let numpy do this. All we have to do is change the function to

def createListOfStandardDeviations(start,end):
    maxes = np.broadcast_to(np.arange(start+1, end+1), (numCapturedTanks, 100, end - start))
    tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))
    return np.std(tankSerialNumbersSimulated, axis=0)

and it will just work. It is important that getNumGuessed has the axis=0 in the argmin call or else it will not work properly. This gives another 2x speedup. Here is the main function after removing the loop.

def main():
    for runsRan in range(numRuns):
        tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks) #generates initial tank serial numbers
        NumSTD = np.std(tankSerialNumbers)
        highestTankSerial = np.mean(tankSerialNumbers) + 3*NumSTD
        maxNum = np.amax(tankSerialNumbers)
        ListOfStandardDeviationsStartTime = time.time()
        start, end = min(maxNum, highestTankSerial), int(max(maxNum, highestTankSerial))

        STDarray = createListOfStandardDeviations(start, end)
        numGuessN.extend(getNumGuessed(STDarray, NumSTD, maxNum))

        guess = int(np.mean(numGuessN)) #store actual guess
        print ("Guess Generated " + str(runsRan + 1))
        guesses.append(guess) #add guesses to a list
    getAllGuesses()
    getAvgPercentError()

The last speedup worth bothering with is making numGuessN an ndarray. This will make finding it's mean much faster, and is easy enough to do. The three changes needed are changing the declaration to numGuessN = np.array([], dtype=np.int64), and changing the `extend call to

global numGuessN
numGuessN = np.concatenate((numGuessN, getNumGuessed(STDarray, NumSTD, maxNum)))

This yields another 2x speedup for me, At this point, we have gotten well over 200x over the original, and almost all of the time is spent either calculating std, mean or generating random numbers. It is possible that another factor of 2 could be gained from here, but it would be a pain, and it would probably require more effort than it's worth.

@PGODULTIMATE 2017-12-11 01:11:12

This was a great improvement! The program took 10.22 minutes for 10,000 guesses.

@Oscar Smith 2017-12-11 01:13:54

There is still plenty of room for speedup. getNumGuessed is still quite slow for similar reasons. I'll update this answer once I have a good solution.

@PGODULTIMATE 2017-12-11 01:33:27

Since STDarray is no longer a global variable, what do I put for STDarray.extend in the createListOfStandardDeviations(start,end)?

@Oscar Smith 2017-12-11 01:36:58

At this point, the pace to gain speed is by in-lining createListOfStandardDeviationsand removing the loop around it through more vectorization

@PGODULTIMATE 2017-12-11 01:40:27

I have an error saying that STDarray is not defined in the createListOfStandardDeviations(start,end). I do not know what to put there since we are defining STDarray from this function and I have no idea what we would be extending in that case.

@Oscar Smith 2017-12-11 01:52:36

Does this make it more clear?

@PGODULTIMATE 2017-12-11 01:57:26

Yes, I made the changes as you specified but am getting an error: TypeError: 'numpy.int64' object is not iterable. This is in reference to the numGuessN.extend(getNumGuessed(STDarray, NumSTD, maxNum)) in line 81

@PGODULTIMATE 2017-12-11 02:03:44

FIxed the problem

@PGODULTIMATE 2017-12-11 04:02:18

Your method takes about 11.72 minutes to run for 10,000. This is weird because 1000 is under a minute.

@Oscar Smith 2017-12-11 04:04:54

I think that nonlinearity is true of yours as well. numGuessN never gets reset, so it grows between iterations.

@PGODULTIMATE 2017-12-11 04:40:26

I noticed another weird thing: The guesses generated the start of how they should be (quite a bit of variability), but progressively get closer and closer to the actual number indicating that each guess is being influenced by the previous guess. All guess should be independent of one another. Another weird thing is that the accuracy is insanely good at around a sample size of 100 but gets extremely off if it isn't. I would think that as sample size increases, the accuracy would increase because it would have more values to draw from.

@Oscar Smith 2017-12-11 05:01:58

Problem fixed getNumGuessed needed to use axis=1 not axis=0

@PGODULTIMATE 2017-12-11 05:10:16

Yes, it is working normally now (the average percent error has become consistent).

Related Questions

Sponsored Content

1 Answered Questions

2 Answered Questions

[SOLVED] Guess-the-number game by a Python beginner

1 Answered Questions

[SOLVED] FirstDuplicate Finder

3 Answered Questions

[SOLVED] Countdown numbers game (Solution generator)

1 Answered Questions

2 Answered Questions

1 Answered Questions

[SOLVED] Predicting random numbers based on past output

2 Answered Questions

1 Answered Questions

[SOLVED] Find the smallest regular number that is not less than N (Python)

  • 2012-02-12 23:29:25
  • finnw
  • 282 View
  • 5 Score
  • 1 Answer
  • Tags:   python

Sponsored Content